Writing a Simple Compiler with LLVM

Welcome back, fellow developers! In our previous articles, we've explored the core components of LLVM and guided you through its installation and basic usage. Now it's time to get our hands dirty and write a simple compiler using LLVM. This article will take you through the process of building a compiler from scratch, leveraging LLVM's powerful libraries and tools. So, open up Vim (or your preferred IDE), and let's dive in!

Overview of Compiler Components

A typical compiler consists of several key components:

Lexical Analysis (Lexer): Converts the source code into tokens.
Syntax Analysis (Parser): Analyzes the tokens according to grammar rules and generates an Abstract Syntax Tree (AST).
Semantic Analysis: Checks for semantic errors and annotates the AST with type information.
Intermediate Representation (IR) Generation: Translates the AST into LLVM IR.
Optimization: Applies various optimization passes to the IR.
Code Generation: Converts the optimized IR into target machine code.

In this tutorial, we'll implement a basic compiler that covers these components for a simple arithmetic language.

Step 1: Setting Up the Project

First, create a new directory for your compiler project and initialize a CMake project:

mkdir my_compiler
cd my_compiler
mkdir src
mkdir build

Create a CMakeLists.txt file in the root directory with the following content:

cmake_minimum_required(VERSION 3.10)
project(MyCompiler)

set(CMAKE_CXX_STANDARD 14)

find_package(LLVM REQUIRED CONFIG)

include_directories(${LLVM_INCLUDE_DIRS})
add_definitions(${LLVM_DEFINITIONS})

add_executable(my_compiler src/main.cpp src/lexer.cpp src/parser.cpp src/codegen.cpp)

target_link_libraries(my_compiler LLVM)

Step 2: Implementing the Lexer

Create a file named lexer.cpp in the src directory and implement the lexical analyzer. The lexer will convert the source code into tokens.

#include <string>
#include <vector>
#include <cctype>
#include <iostream>

enum Token {
    tok_eof = -1,
    tok_identifier = -2,
    tok_number = -3,
    tok_plus = '+',
    tok_minus = '-',
    tok_mul = '*',
    tok_div = '/',
    tok_lparen = '(',
    tok_rparen = ')'
};

static std::string IdentifierStr;  // Filled in if tok_identifier
static double NumVal;              // Filled in if tok_number

int gettok() {
    static int LastChar = ' ';

    // Skip any whitespace.
    while (isspace(LastChar))
        LastChar = getchar();

    if (isalpha(LastChar)) {  // Identifier: [a-zA-Z][a-zA-Z0-9]*
        IdentifierStr = LastChar;
        while (isalnum((LastChar = getchar())))
            IdentifierStr += LastChar;
        return tok_identifier;
    }

    if (isdigit(LastChar) || LastChar == '.') {   // Number: [0-9.]+
        std::string NumStr;
        do {
            NumStr += LastChar;
            LastChar = getchar();
        } while (isdigit(LastChar) || LastChar == '.');

        NumVal = strtod(NumStr.c_str(), nullptr);
        return tok_number;
    }

    if (LastChar == '#') {
        // Comment until end of line.
        do
            LastChar = getchar();
        while (LastChar != EOF && LastChar != '\n' && LastChar != '\r');

        if (LastChar != EOF)
            return gettok();
    }

    if (LastChar == EOF)
        return tok_eof;

    // Otherwise, just return the character as its ASCII value.
    int ThisChar = LastChar;
    LastChar = getchar();
    return ThisChar;
}

Step 3: Implementing the Parser

Create a file named parser.cpp in the src directory and implement the syntax analyzer. The parser will convert the tokens into an Abstract Syntax Tree (AST).

#include "lexer.cpp"
#include <memory>

/// ExprAST - Base class for all expression nodes.
class ExprAST {
public:
    virtual ~ExprAST() = default;
};

/// NumberExprAST - Expression class for numeric literals like "1.0".
class NumberExprAST : public ExprAST {
    double Val;

public:
    NumberExprAST(double Val) : Val(Val) {}
};

/// VariableExprAST - Expression class for referencing a variable, like "a".
class VariableExprAST : public ExprAST {
    std::string Name;

public:
    VariableExprAST(const std::string &Name) : Name(Name) {}
};

/// BinaryExprAST - Expression class for a binary operator.
class BinaryExprAST : public ExprAST {
    char Op;
    std::unique_ptr<ExprAST> LHS, RHS;

public:
    BinaryExprAST(char Op, std::unique_ptr<ExprAST> LHS, std::unique_ptr<ExprAST> RHS)
        : Op(Op), LHS(std::move(LHS)), RHS(std::move(RHS)) {}
};

/// Parser - A simple recursive descent parser for the arithmetic language.

static int CurTok;
static int getNextToken() { return CurTok = gettok(); }

std::unique_ptr<ExprAST> LogError(const char *Str) {
    std::cerr << "Error: " << Str << "\n";
    return nullptr;
}

std::unique_ptr<ExprAST> ParseNumberExpr() {
    auto Result = std::make_unique<NumberExprAST>(NumVal);
    getNextToken(); // consume the number
    return std::move(Result);
}

std::unique_ptr<ExprAST> ParseExpression();

std::unique_ptr<ExprAST> ParseParenExpr() {
    getNextToken(); // eat '('
    auto V = ParseExpression();
    if (!V)
        return nullptr;

    if (CurTok != tok_rparen)
        return LogError("expected ')'");
    getNextToken(); // eat ')'
    return V;
}

std::unique_ptr<ExprAST> ParseIdentifierExpr() {
    std::string IdName = IdentifierStr;

    getNextToken(); // eat identifier

    return std::make_unique<VariableExprAST>(IdName);
}

std::unique_ptr<ExprAST> ParsePrimary() {
    switch (CurTok) {
    default:
        return LogError("unknown token when expecting an expression");
    case tok_identifier:
        return ParseIdentifierExpr();
    case tok_number:
        return ParseNumberExpr();
    case tok_lparen:
        return ParseParenExpr();
    }
}

int GetTokPrecedence() {
    if (!isascii(CurTok))
        return -1;

    switch (CurTok) {
    case '+':
    case '-':
        return 20;
    case '*':
    case '/':
        return 40;
    default:
        return -1;
    }
}

std::unique_ptr<ExprAST> ParseBinOpRHS(int ExprPrec, std::unique_ptr<ExprAST> LHS) {
    while (true) {
        int TokPrec = GetTokPrecedence();

        if (TokPrec < ExprPrec)
            return LHS;

        int BinOp = CurTok;
        getNextToken(); // eat binop

        auto RHS = ParsePrimary();
        if (!RHS)
            return nullptr;

        int NextPrec = GetTokPrecedence();
        if (TokPrec < NextPrec) {
            RHS = ParseBinOpRHS(TokPrec + 1, std::move(RHS));
            if (!RHS)
                return nullptr;
        }

        LHS = std::make_unique<BinaryExprAST>(BinOp, std::move(LHS), std::move(RHS));
    }
}

std::unique_ptr<ExprAST> ParseExpression() {
    auto LHS = ParsePrimary();
    if (!LHS)
        return nullptr;

    return ParseBinOpRHS(0, std::move(LHS));
}

Step 4: Generating LLVM IR

Create a file named codegen.cpp in the src directory and implement the code generator. The code generator will translate the AST into LLVM IR.

#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/Verifier.h"
#include "parser.cpp"

using namespace llvm;

static LLVMContext TheContext;
static IRBuilder<> Builder(TheContext);
static std::unique_ptr<Module> TheModule;
static std::map<std::string, Value *> NamedValues;

Value *LogErrorV(const char *Str) {
    LogError(Str);
    return nullptr;
}

Value *NumberExprAST::codegen() {
    return ConstantFP::get(TheContext, APFloat(Val));
}

Value *VariableExprAST::codegen() {
    Value *V = NamedValues[Name];
    if (!V)
        return LogErrorV("Unknown variable name");
    return V;
}

Value *BinaryExprAST::codegen() {
    Value *L = LHS->codegen();
    Value *R = RHS->codegen();
    if (!L || !R)
        return nullptr;

    switch (Op) {
    case '+':
        return Builder.CreateFAdd(L, R, "addtmp");
    case '-':
        return Builder.CreateFSub(L, R, "subtmp");
    case '*':
        return Builder.CreateFMul(L, R, "multmp");
    case '/':
        return Builder.Create

FDiv(L, R, "divtmp");
    default:
        return LogErrorV("invalid binary operator");
    }
}

Step 5: Putting It All Together

Finally, create a main.cpp file in the src directory to tie everything together.

#include "codegen.cpp"

int main() {
    // Initialize LLVM components
    TheModule = std::make_unique<Module>("my cool jit", TheContext);

    // Prime the first token.
    getNextToken();

    // Run the main loop.
    while (true) {
        switch (CurTok) {
        case tok_eof:
            return 0;
        case ';': // ignore top-level semicolons.
            getNextToken();
            break;
        default:
            // Parse a top-level expression
            auto Expr = ParseExpression();
            if (Expr) {
                if (auto *IR = Expr->codegen()) {
                    IR->print(errs());
                    fprintf(stderr, "\n");
                }
            } else {
                getNextToken();
            }
            break;
        }
    }

    return 0;
}

Conclusion

Congratulations! You've just written a simple compiler using LLVM. We've covered the essential components of a compiler, including lexical analysis, parsing, and code generation. This project is a great starting point for more complex compiler development.

In the next part of this series, we'll delve into advanced optimizations with LLVM, exploring how to leverage LLVM's powerful optimization passes to improve the performance of your code. Stay tuned to our blog at slaptijack.com for more in-depth tutorials and insights into LLVM and other modern software development practices. If you have any questions or need further assistance, feel free to reach out. And remember, whether you're working on a simple compiler or a complex project, always strive for optimization and efficiency. Happy coding!