A Toy Front-End for LLVM, written in Rust

My current side project is a compiler written in Rust that emits LLVM IR. LLVM’s API is a bit daunting for noobs, and there aren’t many tutorials out there (and they’re all in C++, so it’s not always obvious how to do the same in Rust). I wish someone had held my hand as I got started with this, so here’s what I would have shown myself.

For Rust, your best option for interfacing with LLVM is through the llvm-sys crate. A kind person on the Internet hosts docs for it here. Of course, you should also check out LLVM’s tutorial, since it helps you understand how LLVM “thinks”. This post is basically a Rust translation of a subset of LLVM’s tutorial.

You can check out the final code for this post here.

Getting a working development environment

For starters, here’s a reproducible way to get hacking with LLVM:

# `curl` is just so we can next install Rust
sudo apt-get -y install clang curl llvm-3.8-dev
curl https://sh.rustup.rs -sSf | sh

# The `llvm-sys` crate expects something called `llvm-config` on your PATH.
sudo ln -s /usr/bin/llvm-config-3.8 /usr/bin/llvm-config

If you run that on a recent Ubuntu (you may need to apt-get update), you’ll be ready to go. Alternatively, you could run that inside a Vagrant box, with the following Vagrantfile:

Vagrant.configure("2") do |config|
  config.vm.box = "bento/ubuntu-16.04"
end

You can get started by running cargo init llvm-example --bin and putting the following (taken from llvm-sys) in src/main.rs:

//! Construct a function that does nothing in LLVM IR.

extern crate llvm_sys as llvm;

use std::ptr;

fn main() {
    unsafe {
        // Set up a context, module and builder in that context.
        let context = llvm::core::LLVMContextCreate();
        let module = llvm::core::LLVMModuleCreateWithName(b"nop\0".as_ptr() as *const _);
        let builder = llvm::core::LLVMCreateBuilderInContext(context);

        // Get the type signature for void nop(void);
        // Then create it in our module.
        let void = llvm::core::LLVMVoidTypeInContext(context);
        let function_type = llvm::core::LLVMFunctionType(void, ptr::null_mut(), 0, 0);
        let function = llvm::core::LLVMAddFunction(module, b"nop\0".as_ptr() as *const _,
                                                   function_type);

        // Create a basic block in the function and set our builder to generate
        // code in it.
        let bb = llvm::core::LLVMAppendBasicBlockInContext(context, function,
                                                           b"entry\0".as_ptr() as *const _);
        llvm::core::LLVMPositionBuilderAtEnd(builder, bb);

        // Emit a `ret void` into the function
        llvm::core::LLVMBuildRetVoid(builder);

        // Dump the module as IR to stdout.
        llvm::core::LLVMDumpModule(module);

        // Clean up. Values created in the context mostly get cleaned up there.
        llvm::core::LLVMDisposeBuilder(builder);
        llvm::core::LLVMDisposeModule(module);
        llvm::core::LLVMContextDispose(context);
    }
}

And in your Cargo.toml:

[package]
name = "llvm-example"
version = "0.1.0"
authors = ["Ulysse Carion <ulysse@ulysse.io>"]

[[bin]]
name = "main"

[dependencies]
llvm-sys = "0.2"

You should get:

vagrant@vagrant:/vagrant$ cargo run
   Compiling llvm-example v0.1.0 (file:///vagrant)
     Running `target/debug/main`
; ModuleID = 'nop'

define void @nop() {
entry:
  ret void
}

Hooray! Now we can start writing our own stuff.

A slightly less trivial program

To get started, let’s compile a program that simply sets a return code, by returning an integer from a function called main.

Here’s how I did it: (We’re gonna need a parser at some point, so I added it now; I used the peg crate):

#![feature(plugin)]
#![plugin(peg_syntax_ext)]

extern crate llvm_sys as llvm;

use std::ffi::CString;
use std::fs::File;
use std::io::Read;
use std::ptr;

fn main() {
    let mut input = String::new();
    let mut f = File::open("in.ex").unwrap();
    f.read_to_string(&mut input).unwrap();

    let parsed_input = parser::program(&input).unwrap();

    unsafe {
        codegen(parsed_input);
    }
}

peg! parser(r#"
    #[pub]
    program -> String
        = i:int_literal "\n" { i }

    int_literal -> String
        = [0-9]+ { match_str.to_owned() }
"#);

unsafe fn codegen(input: String) {
    let context = llvm::core::LLVMContextCreate();
    let module = llvm::core::LLVMModuleCreateWithName(b"example_module\0".as_ptr() as *const _);
    let builder = llvm::core::LLVMCreateBuilderInContext(context);

    // In LLVM, you get your types from functions.
    let int_type = llvm::core::LLVMInt64TypeInContext(context);
    let function_type = llvm::core::LLVMFunctionType(int_type, ptr::null_mut(), 0, 0);
    let function = llvm::core::LLVMAddFunction(module, b"main\0".as_ptr() as *const _, function_type);

    let entry_name = CString::new("entry").unwrap();
    let bb = llvm::core::LLVMAppendBasicBlockInContext(context, function, entry_name.as_ptr());
    llvm::core::LLVMPositionBuilderAtEnd(builder, bb);

    // The juicy part: construct a `LLVMValue` from a Rust value:
    let int_value: u64 = input.parse().unwrap();
    let int_value = llvm::core::LLVMConstInt(int_type, int_value, 0);

    llvm::core::LLVMBuildRet(builder, int_value);

    // Instead of dumping to stdout, let's write out the IR to `out.ll`
    let out_file = CString::new("out.ll").unwrap();
    llvm::core::LLVMPrintModuleToFile(module, out_file.as_ptr(), ptr::null_mut());

    llvm::core::LLVMDisposeBuilder(builder);
    llvm::core::LLVMDisposeModule(module);
    llvm::core::LLVMContextDispose(context);
}

It works! Check it out:

vagrant@vagrant:/vagrant$ cat in.ex
42
vagrant@vagrant:/vagrant$ cargo run
     Running `target/debug/main`
vagrant@vagrant:/vagrant$ lli-3.8 out.ll ; echo $?
42

Cool! And here’s what out.ll looks like, btw:

; ModuleID = 'example_module'

define i64 @main() {
entry:
  ret i64 42
}

Arithmetic

Let’s add support for adding, subtracting, multiplying, and dividing numbers. To do that, we need to extend our grammar. Let’s introduce an enum to represent the AST:

pub enum Expr {
    Add(Box<Expr>, Box<Expr>),
    Sub(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
    Div(Box<Expr>, Box<Expr>),
    Literal(String),
}

We also need to extend our grammar:

// `product` and `sum` are that way to get operator precedence
peg! parser(r#"
    use super::Expr;

    #[pub]
    program -> Expr
        = e:expression "\n" { e }

    expression -> Expr
        = sum

    sum -> Expr
        = a:product _ "+" _ b:sum { Expr::Add(Box::new(a), Box::new(b)) }
        / a:product _ "-" _ b:sum { Expr::Sub(Box::new(a), Box::new(b)) }
        / product

    product -> Expr
        = a:int_literal _ "*" _ b:product { Expr::Mul(Box::new(a), Box::new(b)) }
        / a:int_literal _ "/" _ b:product { Expr::Div(Box::new(a), Box::new(b)) }
        / int_literal

    int_literal -> Expr
        = [0-9]+ { Expr::Literal(match_str.to_owned()) }

    _ = " "*
"#);

Next, let’s emit code. You can specify strings like “addtmp”, and these will be used as part of the name of the corresponding “register” in the IR.

// When you write out instructions in LLVM, you get back `LLVMValueRef`s. You
// can then use these references in other instructions.
unsafe fn codegen_expr(context: LLVMContextRef, builder: LLVMBuilderRef, expr: Expr) -> LLVMValueRef {
    match expr {
        Expr::Literal(int_literal) => {
            let int_type = llvm::core::LLVMInt64TypeInContext(context);
            llvm::core::LLVMConstInt(int_type, int_literal.parse().unwrap(), 0)
        },

        Expr::Add(lhs, rhs) => {
            let lhs = codegen_expr(context, builder, *lhs);
            let rhs = codegen_expr(context, builder, *rhs);

            let name = CString::new("addtmp").unwrap();
            llvm::core::LLVMBuildAdd(builder, lhs, rhs, name.as_ptr())
        },

        Expr::Sub(lhs, rhs) => {
            let lhs = codegen_expr(context, builder, *lhs);
            let rhs = codegen_expr(context, builder, *rhs);

            let name = CString::new("subtmp").unwrap();
            llvm::core::LLVMBuildSub(builder, lhs, rhs, name.as_ptr())
        },

        Expr::Mul(lhs, rhs) => {
            let lhs = codegen_expr(context, builder, *lhs);
            let rhs = codegen_expr(context, builder, *rhs);

            let name = CString::new("multmp").unwrap();
            llvm::core::LLVMBuildMul(builder, lhs, rhs, name.as_ptr())
        },

        Expr::Div(lhs, rhs) => {
            let lhs = codegen_expr(context, builder, *lhs);
            let rhs = codegen_expr(context, builder, *rhs);

            let name = CString::new("divtmp").unwrap();
            llvm::core::LLVMBuildUDiv(builder, lhs, rhs, name.as_ptr())
        },
    }
}

Now you can execute programs like 10 * 4 + 20 / 2 - 8! Pretty cool, if you ask me.

Variables

We’re going to take the easy road and assume that programs don’t do any annoying stuff like referencing undefined variables. We just store variables in registers, and store them in a HashMap<String, LLVMValueRef>. This works because there’s only one path through the program.

We extend the language and parser:

pub enum Expr {
    Literal(String),
    Ref(String),
    Assign(String, Box<Expr>),
    Add(Box<Expr>, Box<Expr>),
    Sub(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
    Div(Box<Expr>, Box<Expr>),
}

peg! parser(r#"
    use super::Expr;

    #[pub]
    program -> Vec<Expr>
        = e:(expression ** "\n") "\n" { e }

    expression -> Expr
        = i:identifier _ "=" _ s:sum { Expr::Assign(i, Box::new(s)) }
        / sum

    sum -> Expr
        = a:product _ "+" _ b:sum { Expr::Add(Box::new(a), Box::new(b)) }
        / a:product _ "-" _ b:sum { Expr::Sub(Box::new(a), Box::new(b)) }
        / product

    product -> Expr
        = a:ref_or_literal _ "*" _ b:product { Expr::Mul(Box::new(a), Box::new(b)) }
        / a:ref_or_literal _ "/" _ b:product { Expr::Div(Box::new(a), Box::new(b)) }
        / ref_or_literal

    ref_or_literal -> Expr
        = i:identifier { Expr::Ref(i) }
        / int_literal

    identifier -> String
        = [a-zA-Z]+ { match_str.to_owned() }

    int_literal -> Expr
        = [0-9]+ { Expr::Literal(match_str.to_owned()) }

    _ = " "*
"#);

We then add support for the two new expressions:

unsafe fn codegen_expr(context: LLVMContextRef, builder: LLVMBuilderRef, names: &mut HashMap<String, LLVMValueRef>, expr: Expr) -> LLVMValueRef {
    match expr {
        // ...

        Expr::Ref(name) => {
            *names.get(&name).unwrap()
        },

        Expr::Assign(name, expr) => {
            let new_value = codegen_expr(context, builder, names, *expr);
            names.insert(name, new_value);
            new_value
        },
    }
}

And then a quick update in the function codegen:

let int_type = llvm::core::LLVMInt64TypeInContext(context);
let zero = llvm::core::LLVMConstInt(int_type, 0, 0);

let mut names = HashMap::new();
let mut return_value = zero; // return value on empty program
for expr in input {
    return_value = codegen_expr(context, builder, &mut names, expr);
}
llvm::core::LLVMBuildRet(builder, return_value);

And voila! Check it out:

vagrant@vagrant:/vagrant$ cat in.ex
a = 3
b = 76
a + b
vagrant@vagrant:/vagrant$ cargo run
     Running `target/debug/main`
vagrant@vagrant:/vagrant$ cat out.ll
; ModuleID = 'example_module'

define i64 @main() {
entry:
  ret i64 79
}

If

Things get a little tricker with if. The easiest way to get if working is to store all local variables on the stack, and let LLVM do the optimizing. In LLVM, you create a stack variable with the alloca instruction, and then read/write it using load/store.

To do this, we again extend the langauge and grammar, by adding a new parser rule:

expression -> Expr
    = if_expression
    / i:identifier _ "=" _ s:expression { Expr::Assign(i, Box::new(s)) }
    / sum

if_expression -> Expr
    = "if" _ e:expression _ "{\n" _ then_body:statements _ "}" _ "else" _ "{\n" _ else_body:statements _ "}" {
        Expr::If(Box::new(e), then_body, else_body)
    }

And adding a new type of AST node:

pub enum Expr {
    Literal(String),
    Ref(String),
    Assign(String, Box<Expr>),
    Add(Box<Expr>, Box<Expr>),
    Sub(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
    Div(Box<Expr>, Box<Expr>),
    If(Box<Expr>, Vec<Expr>, Vec<Expr>),
}

Finally, we emit code for an if expression:

unsafe fn codegen_expr(context: LLVMContextRef, builder: LLVMBuilderRef, func: LLVMValueRef, names: &mut HashMap<String, LLVMValueRef>, expr: Expr) -> LLVMValueRef {
    match expr {
        // ...

        Expr::If(condition, then_body, else_body) => {
            let condition_value = codegen_expr(context, builder, func, names, *condition);
            let int_type = llvm::core::LLVMInt64TypeInContext(context);
            let zero = llvm::core::LLVMConstInt(int_type, 0, 0);

            // `is_nonzero` is a 1-bit integer
            let name = CString::new("is_nonzero").unwrap();
            let is_nonzero = llvm::core::LLVMBuildICmp(builder, llvm::LLVMIntPredicate::LLVMIntNE, condition_value, zero, name.as_ptr());

            // It's fine to create blocks first, and then fill them in later.
            let entry_name = CString::new("entry").unwrap();
            let then_block = llvm::core::LLVMAppendBasicBlockInContext(context, func, entry_name.as_ptr());
            let else_block = llvm::core::LLVMAppendBasicBlockInContext(context, func, entry_name.as_ptr());
            let merge_block = llvm::core::LLVMAppendBasicBlockInContext(context, func, entry_name.as_ptr());

            llvm::core::LLVMBuildCondBr(builder, is_nonzero, then_block, else_block);

            llvm::core::LLVMPositionBuilderAtEnd(builder, then_block);
            let mut then_return = zero;
            for expr in then_body {
                then_return = codegen_expr(context, builder, func, names, expr);
            }
            llvm::core::LLVMBuildBr(builder, merge_block);

            llvm::core::LLVMPositionBuilderAtEnd(builder, else_block);
            let mut else_return = zero;
            for expr in else_body {
                else_return = codegen_expr(context, builder, func, names, expr);
            }
            llvm::core::LLVMBuildBr(builder, merge_block);

            // Position the builder so that it's ready to work on the next
            // expression.
            llvm::core::LLVMPositionBuilderAtEnd(builder, merge_block);
            zero
        }
    }
}

That’s a lot of code, but it does what you’d expect. You can now run programs like this:

a = 1
if a {
    a = 42
} else {
    a = 13
}
a

Which produces the following IR:

; ModuleID = 'example_module'

define i64 @main() {
entry:
  %a = alloca i64
  store i64 1, i64* %a
  %a1 = load i64, i64* %a
  %is_nonzero = icmp ne i64 %a1, 0
  br i1 %is_nonzero, label %entry2, label %entry3

entry2:                                           ; preds = %entry
  store i64 42, i64* %a
  br label %entry4

entry3:                                           ; preds = %entry
  store i64 13, i64* %a
  br label %entry4

entry4:                                           ; preds = %entry3, %entry2
  %a5 = load i64, i64* %a
  ret i64 %a5
}

However, we’re not quite finished. Currently our if “expression” always evaluates to zero. Instead, if should evaluate to then_return if we execute the “then” path, or else_return otherwise.

How do you get LLVM to track which branch it went through? By using a “Phi” node. You give the phi instruction a list of (block, value) pairs, and that phi node will return the value corresponding to the block that was previously executed.

Here’s how we can finish off if. Note that we have to update then_block and else_block. This is because we want the last block in the “then”/“else” branch, and previously then_block was the first block of the “then”/“else”.

// ...

// This is mostly the same code as before, just note the new calls to
// `LLVMGetInsertBlock`.

llvm::core::LLVMPositionBuilderAtEnd(builder, then_block);
let mut then_return = zero;
for expr in then_body {
    then_return = codegen_expr(context, builder, func, names, expr);
}
llvm::core::LLVMBuildBr(builder, merge_block);
let then_block = llvm::core::LLVMGetInsertBlock(builder);

llvm::core::LLVMPositionBuilderAtEnd(builder, else_block);
let mut else_return = zero;
for expr in else_body {
    else_return = codegen_expr(context, builder, func, names, expr);
}
llvm::core::LLVMBuildBr(builder, merge_block);
let else_block = llvm::core::LLVMGetInsertBlock(builder);

// Insert the phi node
llvm::core::LLVMPositionBuilderAtEnd(builder, merge_block);
let phi_name = CString::new("iftmp").unwrap();
let phi = llvm::core::LLVMBuildPhi(builder, int_type, phi_name.as_ptr());

let mut values = vec![then_return, else_return];
let mut blocks = vec![then_block, else_block];

llvm::core::LLVMAddIncoming(phi, values.as_mut_ptr(), blocks.as_mut_ptr(), 2);
phi

And there you go, an amazing compiler:

vagrant@vagrant:/vagrant$ cat in.ex
a = 1
b = 0
c = if a {
    if b {
        11
    } else {
        40
    }
} else {
    if b {
        10
    } else {
        20
    }
}
c + 2
vagrant@vagrant:/vagrant$ cargo run
     Running `target/debug/main`
vagrant@vagrant:/vagrant$ lli-3.8 out.ll ; echo $?
42

So cool! Here’s the code we emit for that example input program:

; ModuleID = 'example_module'

define i64 @main() {
entry:
  %a = alloca i64
  %b = alloca i64
  %c = alloca i64
  store i64 1, i64* %a
  store i64 0, i64* %b
  %a1 = load i64, i64* %a
  %is_nonzero = icmp ne i64 %a1, 0
  br i1 %is_nonzero, label %entry2, label %entry3

entry2:                                           ; preds = %entry
  %b5 = load i64, i64* %b
  %is_nonzero6 = icmp ne i64 %b5, 0
  br i1 %is_nonzero6, label %entry7, label %entry8

entry3:                                           ; preds = %entry
  %b10 = load i64, i64* %b
  %is_nonzero11 = icmp ne i64 %b10, 0
  br i1 %is_nonzero11, label %entry12, label %entry13

entry4:                                           ; preds = %entry14, %entry9
  %iftmp16 = phi i64 [ %iftmp, %entry9 ], [ %iftmp15, %entry14 ]
  store i64 %iftmp16, i64* %c
  %c17 = load i64, i64* %c
  %addtmp = add i64 %c17, 2
  ret i64 %addtmp

entry7:                                           ; preds = %entry2
  br label %entry9

entry8:                                           ; preds = %entry2
  br label %entry9

entry9:                                           ; preds = %entry8, %entry7
  %iftmp = phi i64 [ 11, %entry7 ], [ 40, %entry8 ]
  br label %entry4

entry12:                                          ; preds = %entry3
  br label %entry14

entry13:                                          ; preds = %entry3
  br label %entry14

entry14:                                          ; preds = %entry13, %entry12
  %iftmp15 = phi i64 [ 10, %entry12 ], [ 20, %entry13 ]
  br label %entry4
}

Note the pattern the blocks have: excluding the first entry, they’re in groups of three, with the first being a “then” branch, then an “else” branch, and then the “merge” block (with its recognizable phi instruction). That’s a relic of the fact that each time we encounter an “if” expression, we append three new blocks to main. The block triplets are ordered as they were recursively explored in the AST.

That’s all I got! Hopefully at this point you have enough of a base to lead your own destiny.

Shout-out to Sunjay Varma for pointing out a broken link in this post.