From Interpreter To Compiler In Zig
Chapter 5

Compilers and VMs


Before we start writing the compiler and VM, we need to understand what they actually are and how they fit together.


What is a compiler?

A compiler translates source code from one representation to another. In part 1 we built the front-end (the lexer and parser) which turns Monkey source code into an AST. In part 2 we add a back-end.

Instead of tree-walking the AST (which is what our interpreter does), the compiler will walk the AST and emit bytecode instructions alongside a constant pool. The VM then executes that bytecode. If you’ve ever wondered how Python works under the hood, this is basically it.

```mermaid
graph LR
    SC[Source Code] --> Lexer --> Tokens
    Tokens --> Parser --> AST

    subgraph "Part 1 (Interpreter)"
        AST --> Evaluator --> R1[Result]
    end

    subgraph "Part 2 (Compiler)"
        AST --> Compiler --> BC[Bytecode + Constants]
        BC --> VM --> R2[Result]
    end
```


What is a virtual machine (VM)?

A VM is a program that emulates a computer. It has the same core components as a physical CPU: an instruction pointer that tracks the current instruction, a stack for intermediate values, and memory for variables and data structures.

Our VM is a stack machine. All operations take their operands from the stack and push results back onto the stack. Forth nerds, stand up.

Example: computing 1 + 2

Step 1               Step 2               Step 3
OpConstant 0         OpConstant 1         OpAdd
─────────────        ─────────────        ─────────────

                     ┌───────────┐
                     │     2     │
┌───────────┐        ├───────────┤        ┌───────────┐
│     1     │        │     1     │        │     3     │
└───────────┘        └───────────┘        └───────────┘

push 1           ──▶ push 2           ──▶ pop 2 & 1
                                          push 3

The VM never needs to name the operands; they are always “the top two things on the stack.”
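The steps above map almost directly to code. Here’s a minimal sketch of such a stack machine in Zig — the opcode values, function names, and fixed-size stack are illustrative, not the book’s actual implementation:

```zig
const std = @import("std");

// Toy opcodes: plain bytes. (Illustrative values, not the book's.)
const OpConstant: u8 = 0; // followed by a 1-byte constant-pool index (simplified)
const OpAdd: u8 = 1;

fn run(code: []const u8, constants: []const i64) i64 {
    var stack: [16]i64 = undefined;
    var sp: usize = 0; // next free stack slot
    var ip: usize = 0; // instruction pointer into the bytecode
    while (ip < code.len) {
        const op = code[ip];
        ip += 1;
        switch (op) {
            OpConstant => {
                // Operand is a constant-pool index; push that constant.
                stack[sp] = constants[code[ip]];
                sp += 1;
                ip += 1;
            },
            OpAdd => {
                // Operands are implicit: the top two stack values.
                const right = stack[sp - 1];
                const left = stack[sp - 2];
                sp -= 2;
                stack[sp] = left + right;
                sp += 1;
            },
            else => unreachable,
        }
    }
    return stack[sp - 1]; // result is left on top of the stack
}

pub fn main() void {
    const constants = [_]i64{ 1, 2 };
    const code = [_]u8{ OpConstant, 0, OpConstant, 1, OpAdd };
    std.debug.print("{d}\n", .{run(&code, &constants)}); // 3
}
```

Note that the loop is nothing but “fetch a byte, switch on it, move some stack slots” — that simplicity is the whole appeal.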

Stack machines vs. register machines

In a stack machine, operands are implicit. OpAdd means “pop two, push the sum.” You need more instructions because every operand has to be pushed first, but the instructions themselves are dead simple. That simplicity is why we’re building one.

A register machine names its operands explicitly: ADD R1, R2, R3 means “R3 = R1 + R2.” Fewer instructions, but the encoding gets more complex. Real CPUs (x86, ARM) are register machines. Most toy language VMs are stack machines.


What is bytecode?

Bytecode is a flat sequence of bytes encoding instructions for the VM. Each instruction starts with an opcode (one byte) that identifies the operation, optionally followed by operands.

Instruction format

[opcode: 1 byte] [operand1: N bytes] [operand2: N bytes] ...

Different opcodes have different numbers and sizes of operands:

| Opcode      | Operands                                 | Total bytes | Meaning                    |
|-------------|------------------------------------------|-------------|----------------------------|
| OpConstant  | index: 2 bytes (u16)                     | 3           | Push constant at index     |
| OpAdd       | none                                     | 1           | Pop two, push sum          |
| OpSetGlobal | index: 2 bytes (u16)                     | 3           | Pop value, store in global |
| OpSetLocal  | index: 1 byte (u8)                       | 2           | Pop value, store in local  |
| OpClosure   | index: 2 bytes (u16), count: 1 byte (u8) | 4           | Create closure             |

Operands use big-endian encoding (most significant byte first).
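As a concrete illustration, here is one way to write a u16 operand big-endian in Zig. `putU16BigEndian` is a hypothetical helper for this sketch, not part of the book’s code:

```zig
const std = @import("std");

// Write a u16 operand in big-endian order: most significant byte first.
fn putU16BigEndian(buf: []u8, value: u16) void {
    buf[0] = @intCast(value >> 8); // high byte
    buf[1] = @intCast(value & 0xff); // low byte
}

pub fn main() void {
    // Encode "OpConstant 258": one opcode byte plus a 2-byte operand.
    var ins = [_]u8{0} ** 3;
    ins[0] = 0; // pretend 0 is OpConstant's opcode byte
    putU16BigEndian(ins[1..], 258); // 258 = 0x0102
    // ins is now { 0, 1, 2 }
    std.debug.print("{any}\n", .{ins});
}
```

Decoding on the VM side is the mirror image: `(@as(u16, buf[0]) << 8) | buf[1]`.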

The constant pool

Literal values (integers, strings, compiled functions) live in a separate array called the constant pool. Bytecode references them by index with OpConstant.

Why bother? An integer is 8 bytes. A u16 index is 2 bytes. The constant pool keeps the instruction stream compact.

Constants: [42, "hello", <compiled function>]
Bytecode:  OpConstant 0   -- pushes 42
           OpConstant 1   -- pushes "hello"
           OpConstant 2   -- pushes the function
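On the compiler side, the constant pool can be as simple as an append-only array. Here’s a sketch in Zig — the `Pool` type, its fixed capacity, and the `add` method are illustrative, not the book’s API:

```zig
const std = @import("std");

// A constant is any literal the compiler hoists out of the bytecode.
const Constant = union(enum) {
    integer: i64,
    string: []const u8,
};

const Pool = struct {
    items: [256]Constant = undefined,
    len: usize = 0,

    // Append a constant and return the u16 index that OpConstant
    // will carry as its operand.
    fn add(self: *Pool, c: Constant) u16 {
        self.items[self.len] = c;
        self.len += 1;
        return @intCast(self.len - 1);
    }
};

pub fn main() void {
    var pool = Pool{};
    const i0 = pool.add(.{ .integer = 42 });
    const i1 = pool.add(.{ .string = "hello" });
    std.debug.print("{d} {d}\n", .{ i0, i1 }); // 0 1
}
```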

The compiler/VM contract

The compiler and VM have to agree on what each opcode means, how many operands it has, how wide each operand is, and the byte encoding. If either side gets this wrong, everything breaks silently. This contract lives in the code.zig module, which we build in chapter 6.
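One common way to encode such a contract is an opcode enum plus a lookup for operand widths that both sides consult. A sketch in Zig — the names below are illustrative; the real definitions arrive in chapter 6’s code.zig:

```zig
const std = @import("std");

// One byte per opcode: the enum tag IS the encoding.
const Opcode = enum(u8) {
    op_constant,
    op_add,
    op_set_global,
};

// Operand widths in bytes for each opcode. The compiler uses this to
// emit instructions; the VM uses it to know how far to advance ip.
fn operandWidths(op: Opcode) []const u8 {
    return switch (op) {
        .op_constant => &[_]u8{2}, // one u16 operand
        .op_add => &[_]u8{}, // no operands
        .op_set_global => &[_]u8{2},
    };
}

pub fn main() void {
    // OpConstant: 1 opcode byte + one 2-byte operand = 3 bytes total.
    std.debug.print("{d}\n", .{1 + operandWidths(.op_constant)[0]});
}
```

Because both sides call the same function, an encoding change only has to be made in one place.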


Why is this faster?

The tree-walking interpreter from part 1 is slow for a few reasons. Every evaluation recurses through heap-allocated tree nodes. Every node requires a switch on a tagged union. Variable names get resolved by string lookup on every single access.

The bytecode VM avoids all of that. Bytecode is a contiguous array of bytes, so the CPU cache is happy. Dispatch is a switch on a single byte. The compiler resolves names and operators once at compile time, so the VM doesn’t repeat that work. And most VM operations just shuffle values on the stack without allocating anything.
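The name-resolution point is worth making concrete. A sketch of compile-time resolution in Zig, using `std.StringHashMap` as a stand-in symbol table (illustrative, not the book’s implementation):

```zig
const std = @import("std");

pub fn main() !void {
    var symbols = std.StringHashMap(u16).init(std.heap.page_allocator);
    defer symbols.deinit();

    // Compile time: "let x = ..." assigns x a numeric global slot, once.
    try symbols.put("x", 0);

    // Every later reference to x compiles down to "OpGetGlobal 0".
    // The string "x" never reaches the VM; at run time the VM just
    // indexes into a globals array.
    const slot = symbols.get("x").?;
    std.debug.print("OpGetGlobal {d}\n", .{slot});
}
```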


What we’ll build in part 2

| Chapter | What                                           | New Opcodes                                                                                         |
|---------|------------------------------------------------|-----------------------------------------------------------------------------------------------------|
| 6       | Bytecode format, minimal compiler/VM           | OpConstant, OpAdd, OpPop                                                                            |
| 7       | Arithmetic, booleans, comparisons, prefix      | OpSub, OpMul, OpDiv, OpTrue, OpFalse, OpEqual, OpNotEqual, OpGreaterThan, OpMinus, OpBang           |
| 8       | Conditionals (if/else)                         | OpJumpNotTruthy, OpJump, OpNull                                                                     |
| 9       | Global variable bindings                       | OpSetGlobal, OpGetGlobal                                                                            |
| 10      | Strings, arrays, hashes                        | OpArray, OpHash, OpIndex                                                                            |
| 11      | Functions, locals, arguments                   | OpCall, OpReturnValue, OpReturn, OpSetLocal, OpGetLocal                                             |
| 12      | Built-in functions                             | OpGetBuiltin                                                                                        |
| 13      | Closures                                       | OpClosure, OpGetFree, OpCurrentClosure                                                              |

In chapter 6 we implement the bytecode format and get the first working compiler/VM running.