A Toy Front-End for LLVM, written in Rust
My current side project is a compiler written in Rust that emits LLVM IR. LLVM’s API is a bit daunting for noobs, and there aren’t many tutorials out there (and they’re all in C++, so it’s not always obvious how to do the same in Rust). I wish someone had held my hand as I got started with this, so here’s what I would have shown myself.
For Rust, your best option for interfacing with LLVM is through the llvm-sys
crate. A kind person on the Internet hosts docs for it
here. Of course, you should
also check out LLVM’s tutorial,
since it helps you understand how LLVM “thinks”. This post is basically a Rust
translation of a subset of LLVM’s tutorial.
You can check out the final code for this post here.
Getting a working development environment
For starters, here’s a reproducible way to get hacking with LLVM:
# `curl` is just so we can next install Rust
sudo apt-get -y install clang curl llvm-3.8-dev
curl https://sh.rustup.rs -sSf | sh
# The `llvm-sys` crate expects something called `llvm-config` on your PATH.
sudo ln -s /usr/bin/llvm-config-3.8 /usr/bin/llvm-config
If you run that on a recent Ubuntu (you may need to apt-get update
), you’ll be
ready to go. Alternatively, you could run that inside a Vagrant box, with the
following Vagrantfile
:
Vagrant.configure("2") do |config|
config.vm.box = "bento/ubuntu-16.04"
end
You can get started by running cargo init llvm-example --bin
and putting the
following (taken from llvm-sys) in src/main.rs
:
//! Construct a function that does nothing in LLVM IR.
extern crate llvm_sys as llvm;
use std::ptr;
fn main() {
unsafe {
// Set up a context, module and builder in that context.
let context = llvm::core::LLVMContextCreate();
let module = llvm::core::LLVMModuleCreateWithName(b"nop\0".as_ptr() as *const _);
let builder = llvm::core::LLVMCreateBuilderInContext(context);
// Get the type signature for void nop(void);
// Then create it in our module.
let void = llvm::core::LLVMVoidTypeInContext(context);
let function_type = llvm::core::LLVMFunctionType(void, ptr::null_mut(), 0, 0);
let function = llvm::core::LLVMAddFunction(module, b"nop\0".as_ptr() as *const _,
function_type);
// Create a basic block in the function and set our builder to generate
// code in it.
let bb = llvm::core::LLVMAppendBasicBlockInContext(context, function,
b"entry\0".as_ptr() as *const _);
llvm::core::LLVMPositionBuilderAtEnd(builder, bb);
// Emit a `ret void` into the function
llvm::core::LLVMBuildRetVoid(builder);
// Dump the module as IR to stdout.
llvm::core::LLVMDumpModule(module);
// Clean up. Values created in the context mostly get cleaned up there.
llvm::core::LLVMDisposeBuilder(builder);
llvm::core::LLVMDisposeModule(module);
llvm::core::LLVMContextDispose(context);
}
}
And in your Cargo.toml
:
[package]
name = "llvm-example"
version = "0.1.0"
authors = ["Ulysse Carion <ulysse@ulysse.io>"]
[[bin]]
name = "main"
[dependencies]
llvm-sys = "0.2"
You should get:
vagrant@vagrant:/vagrant$ cargo run
Compiling llvm-example v0.1.0 (file:///vagrant)
Running `target/debug/main`
; ModuleID = 'nop'
define void @nop() {
entry:
ret void
}
Hooray! Now we can start writing our own stuff.
A slightly less trivial program
To get started, let’s compile a program that simply sets a return code, by
ret
urning an integer from a function called main
.
Here’s how I did it: (We’re gonna need a parser at some point, so I added it
now; I used the peg
crate):
#![feature(plugin)]
#![plugin(peg_syntax_ext)]
extern crate llvm_sys as llvm;
use std::ffi::CString;
use std::fs::File;
use std::io::Read;
use std::ptr;
fn main() {
let mut input = String::new();
let mut f = File::open("in.ex").unwrap();
f.read_to_string(&mut input).unwrap();
let parsed_input = parser::program(&input).unwrap();
unsafe {
codegen(parsed_input);
}
}
peg! parser(r#"
#[pub]
program -> String
= i:int_literal "\n" { i }
int_literal -> String
= [0-9]+ { match_str.to_owned() }
"#);
unsafe fn codegen(input: String) {
let context = llvm::core::LLVMContextCreate();
let module = llvm::core::LLVMModuleCreateWithName(b"example_module\0".as_ptr() as *const _);
let builder = llvm::core::LLVMCreateBuilderInContext(context);
// In LLVM, you get your types from functions.
let int_type = llvm::core::LLVMInt64TypeInContext(context);
let function_type = llvm::core::LLVMFunctionType(int_type, ptr::null_mut(), 0, 0);
let function = llvm::core::LLVMAddFunction(module, b"main\0".as_ptr() as *const _, function_type);
let entry_name = CString::new("entry").unwrap();
let bb = llvm::core::LLVMAppendBasicBlockInContext(context, function, entry_name.as_ptr());
llvm::core::LLVMPositionBuilderAtEnd(builder, bb);
// The juicy part: construct a `LLVMValue` from a Rust value:
let int_value: u64 = input.parse().unwrap();
let int_value = llvm::core::LLVMConstInt(int_type, int_value, 0);
llvm::core::LLVMBuildRet(builder, int_value);
// Instead of dumping to stdout, let's write out the IR to `out.ll`
let out_file = CString::new("out.ll").unwrap();
llvm::core::LLVMPrintModuleToFile(module, out_file.as_ptr(), ptr::null_mut());
llvm::core::LLVMDisposeBuilder(builder);
llvm::core::LLVMDisposeModule(module);
llvm::core::LLVMContextDispose(context);
}
It works! Check it out:
vagrant@vagrant:/vagrant$ cat in.ex
42
vagrant@vagrant:/vagrant$ cargo run
Running `target/debug/main`
vagrant@vagrant:/vagrant$ lli-3.8 out.ll ; echo $?
42
Cool! And here’s what out.ll
looks like, btw:
; ModuleID = 'example_module'
define i64 @main() {
entry:
ret i64 42
}
Arithmetic
Let’s add support for adding, subtracting, multiplying, and dividing numbers. To do that, we need to extend our grammar. Let’s introduce an enum to represent the AST:
pub enum Expr {
Add(Box<Expr>, Box<Expr>),
Sub(Box<Expr>, Box<Expr>),
Mul(Box<Expr>, Box<Expr>),
Div(Box<Expr>, Box<Expr>),
Literal(String),
}
We also need to extend our grammar:
// `product` and `sum` are that way to get operator precedence
peg! parser(r#"
use super::Expr;
#[pub]
program -> Expr
= e:expression "\n" { e }
expression -> Expr
= sum
sum -> Expr
= a:product _ "+" _ b:sum { Expr::Add(Box::new(a), Box::new(b)) }
/ a:product _ "-" _ b:sum { Expr::Sub(Box::new(a), Box::new(b)) }
/ product
product -> Expr
= a:int_literal _ "*" _ b:product { Expr::Mul(Box::new(a), Box::new(b)) }
/ a:int_literal _ "/" _ b:product { Expr::Div(Box::new(a), Box::new(b)) }
/ int_literal
int_literal -> Expr
= [0-9]+ { Expr::Literal(match_str.to_owned()) }
_ = " "*
"#);
Next, let’s emit code. You can specify strings like “addtmp”, and these will be used as part of the name of the corresponding “register” in the IR.
// When you write out instructions in LLVM, you get back `LLVMValueRef`s. You
// can then use these references in other instructions.
unsafe fn codegen_expr(context: LLVMContextRef, builder: LLVMBuilderRef, expr: Expr) -> LLVMValueRef {
match expr {
Expr::Literal(int_literal) => {
let int_type = llvm::core::LLVMInt64TypeInContext(context);
llvm::core::LLVMConstInt(int_type, int_literal.parse().unwrap(), 0)
},
Expr::Add(lhs, rhs) => {
let lhs = codegen_expr(context, builder, *lhs);
let rhs = codegen_expr(context, builder, *rhs);
let name = CString::new("addtmp").unwrap();
llvm::core::LLVMBuildAdd(builder, lhs, rhs, name.as_ptr())
},
Expr::Sub(lhs, rhs) => {
let lhs = codegen_expr(context, builder, *lhs);
let rhs = codegen_expr(context, builder, *rhs);
let name = CString::new("subtmp").unwrap();
llvm::core::LLVMBuildSub(builder, lhs, rhs, name.as_ptr())
},
Expr::Mul(lhs, rhs) => {
let lhs = codegen_expr(context, builder, *lhs);
let rhs = codegen_expr(context, builder, *rhs);
let name = CString::new("multmp").unwrap();
llvm::core::LLVMBuildMul(builder, lhs, rhs, name.as_ptr())
},
Expr::Div(lhs, rhs) => {
let lhs = codegen_expr(context, builder, *lhs);
let rhs = codegen_expr(context, builder, *rhs);
let name = CString::new("divtmp").unwrap();
llvm::core::LLVMBuildUDiv(builder, lhs, rhs, name.as_ptr())
},
}
}
Now you can execute programs like 10 * 4 + 20 / 2 - 8
! Pretty cool, if you ask
me.
Variables
We’re going to take the easy road and assume that programs don’t do any annoying
stuff like referencing undefined variables. We just store variables in
registers, and store them in a HashMap<String, LLVMValueRef>
. This works
because there’s only one path through the program.
We extend the language and parser:
pub enum Expr {
Literal(String),
Ref(String),
Assign(String, Box<Expr>),
Add(Box<Expr>, Box<Expr>),
Sub(Box<Expr>, Box<Expr>),
Mul(Box<Expr>, Box<Expr>),
Div(Box<Expr>, Box<Expr>),
}
peg! parser(r#"
use super::Expr;
#[pub]
program -> Vec<Expr>
= e:(expression ** "\n") "\n" { e }
expression -> Expr
= i:identifier _ "=" _ s:sum { Expr::Assign(i, Box::new(s)) }
/ sum
sum -> Expr
= a:product _ "+" _ b:sum { Expr::Add(Box::new(a), Box::new(b)) }
/ a:product _ "-" _ b:sum { Expr::Sub(Box::new(a), Box::new(b)) }
/ product
product -> Expr
= a:ref_or_literal _ "*" _ b:product { Expr::Mul(Box::new(a), Box::new(b)) }
/ a:ref_or_literal _ "/" _ b:product { Expr::Div(Box::new(a), Box::new(b)) }
/ ref_or_literal
ref_or_literal -> Expr
= i:identifier { Expr::Ref(i) }
/ int_literal
identifier -> String
= [a-zA-Z]+ { match_str.to_owned() }
int_literal -> Expr
= [0-9]+ { Expr::Literal(match_str.to_owned()) }
_ = " "*
"#);
We then add support for the two new expressions:
unsafe fn codegen_expr(context: LLVMContextRef, builder: LLVMBuilderRef, names: &mut HashMap<String, LLVMValueRef>, expr: Expr) -> LLVMValueRef {
match expr {
// ...
Expr::Ref(name) => {
*names.get(&name).unwrap()
},
Expr::Assign(name, expr) => {
let new_value = codegen_expr(context, builder, names, *expr);
names.insert(name, new_value);
new_value
},
}
}
And then a quick update in the function codegen
:
let int_type = llvm::core::LLVMInt64TypeInContext(context);
let zero = llvm::core::LLVMConstInt(int_type, 0, 0);
let mut names = HashMap::new();
let mut return_value = zero; // return value on empty program
for expr in input {
return_value = codegen_expr(context, builder, &mut names, expr);
}
llvm::core::LLVMBuildRet(builder, return_value);
And voila! Check it out:
vagrant@vagrant:/vagrant$ cat in.ex
a = 3
b = 76
a + b
vagrant@vagrant:/vagrant$ cargo run
Running `target/debug/main`
vagrant@vagrant:/vagrant$ cat out.ll
; ModuleID = 'example_module'
define i64 @main() {
entry:
ret i64 79
}
If
Things get a little tricker with if
. The easiest way to get if
working is to
store all local variables on the stack, and let LLVM do the optimizing. In LLVM,
you create a stack variable with the alloca
instruction, and then read/write
it using load
/store
.
To do this, we again extend the langauge and grammar, by adding a new parser rule:
expression -> Expr
= if_expression
/ i:identifier _ "=" _ s:expression { Expr::Assign(i, Box::new(s)) }
/ sum
if_expression -> Expr
= "if" _ e:expression _ "{\n" _ then_body:statements _ "}" _ "else" _ "{\n" _ else_body:statements _ "}" {
Expr::If(Box::new(e), then_body, else_body)
}
And adding a new type of AST node:
pub enum Expr {
Literal(String),
Ref(String),
Assign(String, Box<Expr>),
Add(Box<Expr>, Box<Expr>),
Sub(Box<Expr>, Box<Expr>),
Mul(Box<Expr>, Box<Expr>),
Div(Box<Expr>, Box<Expr>),
If(Box<Expr>, Vec<Expr>, Vec<Expr>),
}
Finally, we emit code for an if
expression:
unsafe fn codegen_expr(context: LLVMContextRef, builder: LLVMBuilderRef, func: LLVMValueRef, names: &mut HashMap<String, LLVMValueRef>, expr: Expr) -> LLVMValueRef {
match expr {
// ...
Expr::If(condition, then_body, else_body) => {
let condition_value = codegen_expr(context, builder, func, names, *condition);
let int_type = llvm::core::LLVMInt64TypeInContext(context);
let zero = llvm::core::LLVMConstInt(int_type, 0, 0);
// `is_nonzero` is a 1-bit integer
let name = CString::new("is_nonzero").unwrap();
let is_nonzero = llvm::core::LLVMBuildICmp(builder, llvm::LLVMIntPredicate::LLVMIntNE, condition_value, zero, name.as_ptr());
// It's fine to create blocks first, and then fill them in later.
let entry_name = CString::new("entry").unwrap();
let then_block = llvm::core::LLVMAppendBasicBlockInContext(context, func, entry_name.as_ptr());
let else_block = llvm::core::LLVMAppendBasicBlockInContext(context, func, entry_name.as_ptr());
let merge_block = llvm::core::LLVMAppendBasicBlockInContext(context, func, entry_name.as_ptr());
llvm::core::LLVMBuildCondBr(builder, is_nonzero, then_block, else_block);
llvm::core::LLVMPositionBuilderAtEnd(builder, then_block);
let mut then_return = zero;
for expr in then_body {
then_return = codegen_expr(context, builder, func, names, expr);
}
llvm::core::LLVMBuildBr(builder, merge_block);
llvm::core::LLVMPositionBuilderAtEnd(builder, else_block);
let mut else_return = zero;
for expr in else_body {
else_return = codegen_expr(context, builder, func, names, expr);
}
llvm::core::LLVMBuildBr(builder, merge_block);
// Position the builder so that it's ready to work on the next
// expression.
llvm::core::LLVMPositionBuilderAtEnd(builder, merge_block);
zero
}
}
}
That’s a lot of code, but it does what you’d expect. You can now run programs like this:
a = 1
if a {
a = 42
} else {
a = 13
}
a
Which produces the following IR:
; ModuleID = 'example_module'
define i64 @main() {
entry:
%a = alloca i64
store i64 1, i64* %a
%a1 = load i64, i64* %a
%is_nonzero = icmp ne i64 %a1, 0
br i1 %is_nonzero, label %entry2, label %entry3
entry2: ; preds = %entry
store i64 42, i64* %a
br label %entry4
entry3: ; preds = %entry
store i64 13, i64* %a
br label %entry4
entry4: ; preds = %entry3, %entry2
%a5 = load i64, i64* %a
ret i64 %a5
}
However, we’re not quite finished. Currently our if “expression” always
evaluates to zero. Instead, if
should evaluate to then_return
if we execute
the “then” path, or else_return
otherwise.
How do you get LLVM to track which branch it went through? By using a “Phi”
node. You give the phi
instruction a list of (block, value)
pairs, and that
phi node will return the value corresponding to the block that was previously
executed.
Here’s how we can finish off if
. Note that we have to update then_block
and
else_block
. This is because we want the last block in the “then”/“else”
branch, and previously then_block
was the first block of the “then”/“else”.
// ...
// This is mostly the same code as before, just note the new calls to
// `LLVMGetInsertBlock`.
llvm::core::LLVMPositionBuilderAtEnd(builder, then_block);
let mut then_return = zero;
for expr in then_body {
then_return = codegen_expr(context, builder, func, names, expr);
}
llvm::core::LLVMBuildBr(builder, merge_block);
let then_block = llvm::core::LLVMGetInsertBlock(builder);
llvm::core::LLVMPositionBuilderAtEnd(builder, else_block);
let mut else_return = zero;
for expr in else_body {
else_return = codegen_expr(context, builder, func, names, expr);
}
llvm::core::LLVMBuildBr(builder, merge_block);
let else_block = llvm::core::LLVMGetInsertBlock(builder);
// Insert the phi node
llvm::core::LLVMPositionBuilderAtEnd(builder, merge_block);
let phi_name = CString::new("iftmp").unwrap();
let phi = llvm::core::LLVMBuildPhi(builder, int_type, phi_name.as_ptr());
let mut values = vec![then_return, else_return];
let mut blocks = vec![then_block, else_block];
llvm::core::LLVMAddIncoming(phi, values.as_mut_ptr(), blocks.as_mut_ptr(), 2);
phi
And there you go, an amazing compiler:
vagrant@vagrant:/vagrant$ cat in.ex
a = 1
b = 0
c = if a {
if b {
11
} else {
40
}
} else {
if b {
10
} else {
20
}
}
c + 2
vagrant@vagrant:/vagrant$ cargo run
Running `target/debug/main`
vagrant@vagrant:/vagrant$ lli-3.8 out.ll ; echo $?
42
So cool! Here’s the code we emit for that example input program:
; ModuleID = 'example_module'
define i64 @main() {
entry:
%a = alloca i64
%b = alloca i64
%c = alloca i64
store i64 1, i64* %a
store i64 0, i64* %b
%a1 = load i64, i64* %a
%is_nonzero = icmp ne i64 %a1, 0
br i1 %is_nonzero, label %entry2, label %entry3
entry2: ; preds = %entry
%b5 = load i64, i64* %b
%is_nonzero6 = icmp ne i64 %b5, 0
br i1 %is_nonzero6, label %entry7, label %entry8
entry3: ; preds = %entry
%b10 = load i64, i64* %b
%is_nonzero11 = icmp ne i64 %b10, 0
br i1 %is_nonzero11, label %entry12, label %entry13
entry4: ; preds = %entry14, %entry9
%iftmp16 = phi i64 [ %iftmp, %entry9 ], [ %iftmp15, %entry14 ]
store i64 %iftmp16, i64* %c
%c17 = load i64, i64* %c
%addtmp = add i64 %c17, 2
ret i64 %addtmp
entry7: ; preds = %entry2
br label %entry9
entry8: ; preds = %entry2
br label %entry9
entry9: ; preds = %entry8, %entry7
%iftmp = phi i64 [ 11, %entry7 ], [ 40, %entry8 ]
br label %entry4
entry12: ; preds = %entry3
br label %entry14
entry13: ; preds = %entry3
br label %entry14
entry14: ; preds = %entry13, %entry12
%iftmp15 = phi i64 [ 10, %entry12 ], [ 20, %entry13 ]
br label %entry4
}
Note the pattern the blocks have: excluding the first entry, they’re in groups
of three, with the first being a “then” branch, then an “else” branch, and then
the “merge” block (with its recognizable phi
instruction). That’s a relic of
the fact that each time we encounter an “if” expression, we append three new
blocks to main. The block triplets are ordered as they were recursively explored
in the AST.
That’s all I got! Hopefully at this point you have enough of a base to lead your own destiny.
Shout-out to Sunjay Varma for pointing out a broken link in this post.