We have a parser that handles arithmetic expressions. Numbers, operators, parentheses. It works. But we’re not trying to build a calculator - we’re building a compiler for a real programming language.
Time to add the rest. Functions, variables, conditionals. Everything we need to write actual programs.
Same strategy as before: one piece at a time. We’ll start simple and build up.
Adding Identifiers
The first thing we need is identifiers - names for variables and functions. When you write factorial(n), both factorial and n are identifiers.
Lexing Identifiers
An identifier is a sequence of letters, digits, and underscores. But it can’t start with a digit - x1 is valid, 1x is not. That’s a convention most programming languages follow, and we will too.
So how do we recognize them? Same approach as numbers. We see a letter or underscore, we know an identifier is starting. Keep reading characters as long as they’re valid (letters, digits, underscores), then we’ve got our identifier.
First, we need a token type for identifiers. Open up your TokenType enum and add it:
class TokenType(Enum):
Number = auto()
Plus = auto()
Minus = auto()
Mul = auto()
Div = auto()
LParen = auto()
RParen = auto()
Identifier = auto() # New!
Eof = auto()
Now in the lexer’s main loop, add identifier recognition after number handling:
# Recognize numbers
if self.current.isdigit():
start = self.cursor
while self.current.isdigit():
self.increment()
self.push_token(TokenType.Number, self.source[start:self.cursor])
continue
# Recognize identifiers
if self.current.isalpha() or self.current == '_':
start = self.cursor
while self.current.isalnum() or self.current == '_':
self.increment()
word = self.source[start:self.cursor]
self.push_token(TokenType.Identifier, word)
continue
See the pattern? Both follow the same structure: mark where we start, keep reading valid characters, slice out the substring, create a token.
For numbers: start with a digit, keep reading digits. For identifiers: start with a letter or underscore, keep reading letters, digits, or underscores.
The difference is what counts as valid. Numbers can’t start with letters. Identifiers can’t start with digits - x1 is valid, 1x is not.
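If it helps to see that scan-and-slice pattern in isolation, here’s a tiny standalone sketch outside the lexer class (the helper name and return shape are made up purely for illustration):

def scan_identifier(source: str, start: int) -> tuple[str, int]:
    """Scan an identifier starting at `start`; return (word, cursor after it)."""
    # The caller has already checked that source[start] is a letter or underscore.
    cursor = start
    while cursor < len(source) and (source[cursor].isalnum() or source[cursor] == '_'):
        cursor += 1
    return source[start:cursor], cursor

print(scan_identifier("rate_2 + 1", 0))  # ('rate_2', 6)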
Parsing Identifiers
Now the parser needs to understand these identifier tokens. When we see x in the token stream, we need to create an AST node for it.
We need a new expression type. Open abstract.py and add:
class IdentifierExpr(Expression):
def __init__(self, value: Token):
self.value = value
Just like NumberExpr, but for identifiers instead of numbers. In the parser’s parse_primary(), add a case to handle them:
case TokenType.Identifier:
token = self.consume()
return IdentifierExpr(token)
Same pattern as numbers: consume the token, wrap it in the appropriate node type.
Let’s test it. Add this to parser.py:
if __name__ == "__main__":
from lexer import Lexer
# Test identifier parsing
source = "x + y"
lexer = Lexer(source)
tokens = lexer.lex()
parser = Parser(tokens)
ast = parser.parse_expression()
print(f"Root: {ast.op.value}")
print(f" Left: {ast.left.value.value}") # "x"
print(f" Right: {ast.right.value.value}") # "y"
Stop and think about what just happened. We added maybe 15 lines of code total - a few lines to the lexer, a new AST node, one case in the parser. And now our compiler can parse identifier expressions.
This is the pattern. This is how you build a compiler. Not by implementing everything at once, but by adding one piece at a time. Each piece is small. Each piece is testable. And they compose.
Keywords
We need keywords: func, if, else, for, in, and extern. But keywords look just like identifiers - they’re both sequences of letters.
The solution: lex everything that looks like an identifier, then check if it’s a keyword.
First, we need token types for each keyword. Expand the TokenType enum:
class TokenType(Enum):
    # ... existing tokens ...
    Func = auto()
    If = auto()
    Else = auto()
    For = auto()
    In = auto()
    Extern = auto()
    Identifier = auto()
    Eof = auto()
When we lex something like func, how do we know it’s a keyword and not just a variable named func? We need to check after we’ve read the whole word.
Create a keyword lookup table at the top of your lexer file:
KEYWORDS = {
    "func": TokenType.Func,
    "if": TokenType.If,
    "else": TokenType.Else,
    "for": TokenType.For,
    "in": TokenType.In,
    "extern": TokenType.Extern,
}
Now update the identifier lexing code to check this table:
if self.current.isalpha() or self.current == '_':
start = self.cursor
while self.current.isalnum() or self.current == '_':
self.increment()
word = self.source[start:self.cursor]
# Check if it's a keyword
token_type = KEYWORDS.get(word, TokenType.Identifier)
self.push_token(token_type, word)
continue
See what we’re doing? We read the word first - doesn’t matter if it’s a keyword or identifier, same characters. Then we check: is this word in our keyword dictionary? If yes, use the keyword token type. If no, it’s a regular identifier.
This is why func becomes a keyword but factorial stays an identifier. Same lexing logic, different token types based on the lookup.
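You can watch the lookup do its job with a quick check - this only uses the KEYWORDS table and TokenType from above:

for word in ["func", "factorial", "else", "n"]:
    print(word, "->", KEYWORDS.get(word, TokenType.Identifier))
# func -> TokenType.Func
# factorial -> TokenType.Identifier
# else -> TokenType.Else
# n -> TokenType.Identifier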
More Operators
We can parse identifiers and keywords, but to actually write useful programs we need more operators. Look at what we want to support:
func factorial(n) : if n < 2 1 else n * factorial(n - 1)
func fib(n) : if n < 2 n else fib(n - 1) + fib(n - 2)
We need:
- : - separates the function signature from its body
- , - separates function parameters and arguments
- <, >, == - comparison operators for conditionals
These aren’t arithmetic operators. They’re structural - they define the shape of our programs. Without :, we can’t define functions. Without ,, we can’t have multiple parameters. Without < > ==, we can’t make decisions.
Add them to TokenType:
class TokenType(Enum):
# Delimiters
LParen = auto()
RParen = auto()
Colon = auto() # Function body separator
Comma = auto() # Parameter/argument separator
# Operators
Plus = auto()
Minus = auto()
Mul = auto()
Div = auto()
Less = auto() # Comparison
Greater = auto() # Comparison
Equal = auto() # Equality (==)
# ... rest of tokens ...
Most of these are single characters. Update the single-character token map in the lexer:
single_char_tokens = {
'(': TokenType.LParen,
')': TokenType.RParen,
':': TokenType.Colon,
',': TokenType.Comma,
'+': TokenType.Plus,
'-': TokenType.Minus,
'*': TokenType.Mul,
'/': TokenType.Div,
'<': TokenType.Less,
'>': TokenType.Greater,
}
But wait, == is two characters. If we just check self.current, we only see the first =. We need to look ahead at the next character without consuming it.
This is a common pattern in lexers: lookahead. Sometimes you need to see what’s coming to decide what you’re currently looking at. Is = the start of ==, or just a single =?
To solve this, we can create a helper method that lets us look ahead without consuming:
def peek(self, offset: int = 1) -> str:
"""Look ahead at the next character without consuming it."""
pos = self.cursor + offset
if pos < len(self.source):
return self.source[pos]
return '\0'
This works like current, but it looks ahead by offset positions. peek() gives us the next character, peek(2) gives us the one after that, all without moving the cursor. We can look ahead, decide what we’re dealing with, then consume accordingly.
Now in the main loop, before we check single-character tokens, add the check for ==:
# Recognize identifiers and keywords
if self.current.isalpha() or self.current == '_':
start = self.cursor
while self.current.isalnum() or self.current == '_':
self.increment()
word = self.source[start:self.cursor]
token_type = KEYWORDS.get(word, TokenType.Identifier)
self.push_token(token_type, word)
continue
# Check for == before single characters
if self.current == '=' and self.peek() == '=':
self.push_token(TokenType.Equal, "==")
self.increment()
self.increment()
continue
# Recognize single-character tokens
single_char_tokens = {
'(': TokenType.LParen,
')': TokenType.RParen,
':': TokenType.Colon,
',': TokenType.Comma,
'+': TokenType.Plus,
'-': TokenType.Minus,
'*': TokenType.Mul,
'/': TokenType.Div,
'<': TokenType.Less,
'>': TokenType.Greater,
}
if token_type := single_char_tokens.get(self.current):
self.push_token(token_type, self.current)
self.increment()
continue
See the ordering? We check for == before we check single-character tokens. If the current character is = and the next is also =, we have ==. Consume both characters (two increment() calls) and create the token.
This order matters. If we checked single-character tokens first, we’d see the first =, not find it in our dictionary (we don’t have a single = token in YFC), and error out before we ever got to check if it was ==. Always check longer matches before shorter ones.
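If you want to see the ordering in action, a quick check (assuming the Lexer interface from Part 3, where lex() returns the list of tokens):

for token in Lexer("n == 2").lex():
    print(token._type, token.value)
# Expected: Identifier n, then Equal ==, then Number 2, then Eof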
Comments
Comments start with # and go to the end of the line. We don’t want to tokenize them - they’re for humans, not the compiler. Just skip them entirely.
Add this near the top of the lexer’s main loop, right after we skip whitespace:
while self.current != '\0':
# Skip whitespace
if self.current.isspace():
self.increment()
continue
# Skip comments
if self.current == '#':
while self.current != '\n' and self.current != '\0':
self.increment()
continue
# Recognize numbers
if self.current.isdigit():
start = self.cursor
while self.current.isdigit():
self.increment()
self.push_token(TokenType.Number, self.source[start:self.cursor])
continue
# ... rest of lexing logic
When we see #, we consume characters until we hit a newline or the end of the file. No token gets created. By the time we get to parsing and code generation, comments are already gone - the compiler never sees them.
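A quick way to convince yourself that comments really vanish (same Lexer interface as the other tests):

for token in Lexer("40 + 2  # the answer").lex():
    print(token._type, token.value)
# Expected: Number 40, Plus +, Number 2, then Eof - the comment never appears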
Updating the Precedence Map
We’ve added comparison operators (<, >, ==) to the lexer, but our parser doesn’t know how to handle them yet. We need to add them to the precedence map in parser.py:
PRECEDENCE_MAP = {
TokenType.Less: 10, # New!
TokenType.Greater: 10, # New!
TokenType.Equal: 10, # New!
TokenType.Plus: 20,
TokenType.Minus: 20,
TokenType.Mul: 40,
TokenType.Div: 40,
}
Comparisons have lower precedence than arithmetic, so 2 + 3 < 5 and 1 + 1 == 2 parse as (2 + 3) < 5 and (1 + 1) == 2. The arithmetic happens first, then we compare the results.
This means our existing binary operation parsing already handles these operators - they just flow through the precedence climbing logic we built in Part 5. No new parsing code needed.
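Here’s a quick sanity check that the comparison ends up at the root of the tree (same Lexer/Parser setup as the earlier tests):

tokens = Lexer("2 + 3 < 5").lex()
ast = Parser(tokens).parse_expression()
print(ast.op.value)           # <  - the comparison is the root
print(ast.left.op.value)      # +  - (2 + 3) grouped under the left side
print(ast.right.value.value)  # 5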
Function Calls
Now let’s parse function calls like factorial(5).
Here’s the thing: when we see an identifier, we don’t immediately know what it is. Is factorial a variable, or is it a function call? We need to look ahead. If the next token is (, it’s a function call. If not, it’s just a variable reference.
This is why lookahead matters. We parse the identifier, then decide what to do based on what comes next.
First, add a CallExpr AST node:
class CallExpr(Expression):
def __init__(self, callee: IdentifierExpr, arguments: list[Expression]):
self.callee = callee
self.arguments = arguments
A function call has a name (the callee) and a list of argument expressions.
In parse_primary(), update the identifier case:
case TokenType.Identifier:
token = self.consume()
ident = IdentifierExpr(token)
# Just an identifier
if self.current._type != TokenType.LParen:
return ident
# It's a function call
self.consume() # eat '('
arguments: list[Expression] = []
while self.current._type != TokenType.RParen and self.current._type != TokenType.Eof:
arguments.append(self.parse_expression())
if self.current._type == TokenType.Comma:
self.consume()
self.consume_assert(TokenType.RParen, "Expected ')' after function arguments")
return CallExpr(ident, arguments)
When we see an identifier, we check what comes next. If it’s (, it’s a function call. We loop parsing expressions (the arguments) separated by commas, until we hit ).
Notice we’re doing a lot of “consume this token and make sure it’s the right type.” We check for ), we check for ,, we’ll be checking for : and other delimiters constantly. Writing if token._type != expected: raise error everywhere gets old fast.
Let’s add a helper to clean this up:
def consume_assert(self, expected: TokenType, msg: str) -> Token:
"""Consume a token and assert it's the expected type."""
token = self.consume()
if token._type != expected:
raise ParserError(f"{msg}. Got {token._type}")
return token
Now instead of consume-check-error, we just consume_assert(TokenType.RParen, "Expected ')'"). One line. Clearer intent. Better error messages.
Let’s see if function calls actually work. Add this test to parser.py:
if __name__ == "__main__":
from lexer import Lexer
# Test function call parsing
source = "factorial(5)"
lexer = Lexer(source)
tokens = lexer.lex()
parser = Parser(tokens)
ast = parser.parse_expression()
print(f"Function: {ast.callee.value.value}")
print(f"Arguments: {len(ast.arguments)}")
print(f"Arg 0: {ast.arguments[0].value.value}")
Run it:
$ python parser.py
Function: factorial
Arguments: 1
Arg 0: 5
The parser sees factorial, looks ahead and finds (. That tells it this is a function call, not just a variable. It then loops, parsing each argument expression (which could be arbitrarily complex - nested calls, arithmetic, whatever), separating them by commas.
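For example, nested calls as arguments come out exactly as you’d expect (add and mul are hypothetical names, just for illustration):

tokens = Lexer("add(1, mul(2, 3))").lex()
ast = Parser(tokens).parse_expression()
print(ast.callee.value.value)               # add
print(len(ast.arguments))                   # 2
print(ast.arguments[1].callee.value.value)  # mul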
If Expressions
Now for conditionals. In most languages, if is a statement - it does something (executes a branch of code) but doesn’t return a value. Statements are actions. You can’t write x = if condition ... in C or Python, because the if statement doesn’t produce a value (those languages offer separate ternary or conditional expressions for that).
YFC is different. We only have expressions, not statements. Everything evaluates to a value. So if is an expression that evaluates to either the then-branch or the else-branch:
if n < 2
1
else
n * factorial(n - 1)
Add the IfExpr node:
class IfExpr(Expression):
def __init__(
self,
condition: Expression,
then_expr: Expression,
else_expr: Expression | None,
):
self.condition = condition
self.then_expr = then_expr
self.else_expr = else_expr # None if no else clause
An if-expression has three parts: the condition, the then-branch, and optionally an else-branch.
Add the if-expression case to parse_primary():
case TokenType.If:
self.consume() # eat 'if'
condition = self.parse_expression()
then_expr = self.parse_expression()
else_expr = None
if self.current._type == TokenType.Else:
self.consume()
else_expr = self.parse_expression()
return IfExpr(condition, then_expr, else_expr)
Three expressions in a row: condition, then-branch, optional else-branch. The else keyword is optional - if it’s not there, else_expr stays None.
Test it:
source = "if n < 2 1 else n * 2"
lexer = Lexer(source)
tokens = lexer.lex()
parser = Parser(tokens)
ast = parser.parse_expression()
print(f"Type: {type(ast).__name__}")
print(f"Condition: {ast.condition.op.value}") # <
print(f"Then: {ast.then_expr.value.value}") # 1
print(f"Else: {ast.else_expr.op.value}") # *
Notice what we just did. We parsed three separate expressions in sequence - condition, then-branch, else-branch - and tied them together into a single IfExpr node. The parser understands that these three pieces form one logical unit.
And because if is an expression, you can nest them. Put an if inside another if. Use one as a function argument. This compositional structure is powerful - you get a lot of flexibility from a simple rule.
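For example, this is a single expression - the else branch is itself another if expression (a sketch, nothing more):

if x < 0
    0 - 1
else
    if x == 0
        0
    else
        1

You can just as easily pass an if expression as a function argument: factorial(if n < 5 n else 5).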
Spans
Before we add more features, let’s improve error reporting. Back in Part 3, we added Location tracking to tokens - every token knows its row and column. But expressions don’t have locations yet. When we report an error in code generation, we want to point to the whole expression, not just one token.
An expression can span multiple tokens. The expression 2 + 3 * 4 starts at the 2 and ends at the 4. We need to capture that range.
A Span is just a pair of locations - where something starts and where it ends:
@dataclass
class Span:
start: Location
end: Location
Simple. But powerful. Now we can point to exactly which part of the source code an expression came from.
For example, if we have an error in this code:
if n < 2 1 else n * factorial(n - 1)
The entire if-expression spans from the i in if to the ) in factorial(n - 1). With that information, we can show:
Error: Type mismatch in if expression
if n < 2 1 else n * factorial(n - 1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The underline covers the exact expression that’s problematic. Much better than just saying “error on line 5.”
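We won’t build the full error renderer here, but a rough sketch of how a Span could drive that underline might look like this - assuming the error fits on one source line, and that Location exposes a col field (the exact field names come from Part 3 and may differ):

def render_span_error(message: str, line_text: str, span: Span) -> str:
    """Point at the offending expression within a single source line."""
    underline = " " * span.start.col + "^" * (span.end.col - span.start.col + 1)
    return f"Error: {message}\n{line_text}\n{underline}"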
Update the Expression base class:
class Expression:
def __init__(self, span: Span):
self.span = span
Now all expressions need spans. Update NumberExpr:
class NumberExpr(Expression):
def __init__(self, span: Span, value: Token):
super().__init__(span)
self.value = value
For a number, the span is just the token’s location:
case TokenType.Number:
token = self.consume()
span = Span(token.location, token.location)
return NumberExpr(span, token)
For a binary operation, the span goes from the start of the left expression to the end of the right:
def parse_binary_op(self, left: Expression, op: Token) -> Expression:
current_precedence = PRECEDENCE_MAP.get(op._type, 0)
right = self.parse_primary()
while PRECEDENCE_MAP.get(self.current._type, 0) > current_precedence:
next_op = self.consume()
right = self.parse_binary_op(right, next_op)
span = Span(left.span.start, right.span.end)
return BinaryOp(span, left, op, right)
Update all the other expression types similarly:
class IdentifierExpr(Expression):
def __init__(self, span: Span, value: Token):
super().__init__(span)
self.value = value
class CallExpr(Expression):
def __init__(self, span: Span, callee: IdentifierExpr, arguments: list[Expression]):
super().__init__(span)
self.callee = callee
self.arguments = arguments
class IfExpr(Expression):
def __init__(
self,
span: Span,
condition: Expression,
then_expr: Expression,
else_expr: Expression | None,
):
super().__init__(span)
self.condition = condition
self.then_expr = then_expr
self.else_expr = else_expr
Update the parsing code to create proper spans. For if-expressions:
case TokenType.If:
start = self.consume().location # Save start location
condition = self.parse_expression()
then_expr = self.parse_expression()
else_expr = None
if self.current._type == TokenType.Else:
self.consume()
else_expr = self.parse_expression()
# Span goes from 'if' keyword to end of last expression
end = then_expr.span.end if else_expr is None else else_expr.span.end
return IfExpr(Span(start, end), condition, then_expr, else_expr)
For function calls:
case TokenType.Identifier:
token = self.consume()
ident = IdentifierExpr(Span(token.location, token.location), token)
if self.current._type != TokenType.LParen:
return ident
self.consume()
arguments: list[Expression] = []
while self.current._type != TokenType.RParen and self.current._type != TokenType.Eof:
arguments.append(self.parse_expression())
if self.current._type == TokenType.Comma:
self.consume()
end_token = self.consume_assert(TokenType.RParen, "Expected ')' after arguments")
return CallExpr(Span(ident.span.start, end_token.location), ident, arguments)
Now every expression knows exactly where it came from in the source code.
Function Definitions
Now for functions themselves. So far everything we’ve parsed has been expressions - things that evaluate to values. But functions are different. They’re not values themselves - they’re named chunks of code.
Think of a function as a label. When we write func factorial(n) : ..., we’re saying “here’s a piece of code, and its name is factorial.” At the assembly level, that’s literally what happens - factorial becomes a label in the generated code, and calling the function is just jumping to that label.
Remember the assembly we looked at in Part 2? Here’s a snippet:
fib:
push rbp
mov rbp, rsp
...
call fib # Jump to the fib label
...
ret
main:
push rbp
mov rbp, rsp
mov edi, 10
call fib # Jump to the fib label
pop rbp
ret
See? fib: and main: are just labels. When you call fib, the CPU jumps to that label and starts executing instructions. That’s all a function is at the machine level - a named location in the instruction stream.
Functions sit at the top level of a program. You don’t write a function inside an expression (at least not in YFC - some languages allow this). They’re definitions, not computations.
So we need a different kind of AST node - not an Expression, but a Function:
class Function:
def __init__(
self,
span: Span,
name: str,
parameters: list[Token],
body: Expression | None,
):
self.span = span
self.name = name
self.parameters = parameters
self.body = body # None for extern functions
Functions have the grammar: func name(params) : body. Let’s break this down:
- The func keyword
- The function name (an identifier)
- ( then a list of parameters separated by commas, then )
- The : separator
- The body expression
A concrete example: func add(x, y) : x + y
Add a parse_function() method to the parser:
def parse_function(self, is_extern: bool = False) -> Function:
"""Parse a function definition."""
start = self.consume().location # eat 'func'
name = self.consume_assert(TokenType.Identifier, "Expected function name")
self.consume_assert(TokenType.LParen, "Expected '(' after function name")
# Parse parameters
parameters: list[Token] = []
while self.current._type != TokenType.RParen and self.current._type != TokenType.Eof:
param = self.consume_assert(TokenType.Identifier, "Expected parameter name")
parameters.append(param)
if self.current._type == TokenType.Comma:
self.consume()
self.consume_assert(TokenType.RParen, "Expected ')' after parameters")
end = self.consume_assert(TokenType.Colon, "Expected ':' before function body").location
# Parse body (None for extern functions)
body = None if is_extern else self.parse_expression()
return Function(
Span(start, body.span.end if body is not None else end),
name.value,
parameters,
body,
)
Let’s walk through this step by step:
- Consume func and save its location - we need the start location for our span
- Get the function name - it must be an identifier: factorial, add, whatever
- Expect ( - function definitions always have parentheses, even with no parameters
- Parse parameters - this is the interesting part. We loop until we hit ):
  - Each iteration, we expect an identifier (the parameter name)
  - Add it to our parameters list
  - If the next token is a comma, consume it and keep going
  - If not, we’re done with parameters
  - This handles zero parameters (an immediate )), one parameter (no comma), and multiple parameters (comma-separated)
- Expect ) - close the parameter list
- Expect : - it separates the signature from the body. Save its location for the span
- Parse the body - here’s where is_extern comes in. Extern functions have no body, so we set it to None. Regular functions parse an expression as the body
- Build the Function node - the span goes from func to the end of the body (or to : for extern functions)
The parameter parsing loop is the same pattern we used for function call arguments - keep parsing items separated by commas until we hit the closing delimiter. It’s a common pattern you’ll see in parsers all the time.
A Note on Extern Functions
Notice the is_extern flag and the fact that body can be None. This handles extern function declarations.
Sometimes you want to call functions that aren’t written in YFC. Maybe you want to call C standard library functions, or system calls. These functions are defined elsewhere - in compiled C libraries that we’ll link against.
An extern declaration looks like:
extern func putchard(char) :
No body. We’re just telling the compiler “there’s a function called putchard that takes one parameter. Trust me, it exists.” During code generation, we’ll emit a call to putchard. Then when we link our compiled code with a C library that implements it, the linker finds the actual putchard implementation and wires everything up.
What’s putchard? It’s short for “put char double” - it takes a number (double) and prints it as an ASCII character. Since YFC only has numbers, we can’t have strings or format specifiers like printf. But we can print characters: putchard(72) prints ‘H’, putchard(105) prints ‘i’. That’s enough to print text, one character at a time.
This is how YFC can interact with the outside world. Want to print output? Call putchard. Want to allocate memory? Declare extern func malloc(size) : and call it. We just need to declare them as extern, and the linker handles the rest.
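Put together, a YFC program that prints something might look like this (a sketch - 72 is 'H', 105 is 'i', 10 is a newline):

extern func putchard(char) :

putchard(72)   # 'H'
putchard(105)  # 'i'
putchard(10)   # newline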
Many languages don’t do it this way. Python has its own print() built-in. Rust has std::println! in its standard library. These are implemented in the language itself (or in Rust’s case, wrapped in Rust code that eventually calls system functions). But we’re keeping YFC simple - we’ll just call C functions directly.
For now, we just parse them and set body = None. We’ll deal with the linking details when we get to code generation.
Top-Level Parsing
So far we’ve been parsing individual expressions with parse_expression(). But a real program isn’t just one expression - it’s multiple function definitions, maybe some top-level expressions to execute.
We need a method that parses an entire program from start to finish. Loop through all the tokens, building a list of functions:
def parse(self) -> list[Function]:
"""Parse the entire program."""
functions = []
while self.current._type != TokenType.Eof:
if self.current._type == TokenType.Func:
functions.append(self.parse_function())
else:
# Top-level expression - wrap it in an anonymous function
expr = self.parse_expression()
func = Function(expr.span, f"__anon_{len(functions)}", [], expr)
functions.append(func)
return functions
Keep going until we hit EOF. At each step: is this a function definition (starts with func)? Parse it as a function. Otherwise, it’s a top-level expression - parse it and wrap it in an anonymous function.
Why wrap top-level expressions in functions? Because YFC only has expressions, not statements (yet). When you write 2 + 3 at the top level, or factorial(5), we need a way to execute these in sequence during code generation. Wrapping them as functions gives us a uniform interface - everything becomes a function to compile. Don’t worry about the details now; it’ll make sense when we get to code generation.
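A quick check of the wrapping behavior (same Lexer/Parser setup as the other tests):

functions = Parser(Lexer("2 + 3").lex()).parse()
print(functions[0].name)           # __anon_0
print(functions[0].parameters)     # []
print(functions[0].body.op.value)  # +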
Before we test it, let’s add a helper to print the AST tree so we can actually see what we built. We need a function that recursively walks through all the expression types and shows their structure:
def print_expr(expr: Expression, indent: int = 0) -> None:
"""Recursively print the AST structure."""
prefix = " " * indent
if isinstance(expr, NumberExpr):
print(f"{prefix}NumberExpr({expr.value.value})")
elif isinstance(expr, IdentifierExpr):
print(f"{prefix}IdentifierExpr({expr.value.value})")
elif isinstance(expr, BinaryOp):
print(f"{prefix}BinaryOp({expr.op.value})")
print(f"{prefix} left:")
print_expr(expr.left, indent + 2)
print(f"{prefix} right:")
print_expr(expr.right, indent + 2)
elif isinstance(expr, CallExpr):
print(f"{prefix}CallExpr")
print(f"{prefix} callee: {expr.callee.value.value}")
print(f"{prefix} args:")
for arg in expr.arguments:
print_expr(arg, indent + 2)
elif isinstance(expr, IfExpr):
print(f"{prefix}IfExpr")
print(f"{prefix} condition:")
print_expr(expr.condition, indent + 2)
print(f"{prefix} then:")
print_expr(expr.then_expr, indent + 2)
if expr.else_expr:
print(f"{prefix} else:")
print_expr(expr.else_expr, indent + 2)
This handles all our expression types. For each one, it prints the type and relevant info (like the operator or variable name), then recursively prints any sub-expressions with increased indentation.
Now test the full parser. Add this __main__ block to the bottom of parser.py:
if __name__ == "__main__":
from lexer import Lexer
source = """
# Test the full parser
func factorial(n) :
if n < 2
1
else
n * factorial(n - 1)
func fib(n) :
if n < 2
n
else
fib(n - 1) + fib(n - 2)
# Top-level expression
factorial(5)
"""
lexer = Lexer(source)
tokens = lexer.lex()
parser = Parser(tokens)
functions = parser.parse()
print("\nParsed functions:")
for func in functions:
print(f"\nFunction: {func.name}")
print(f" Parameters: {[p.value for p in func.parameters]}")
if func.body:
print(f" Body:")
print_expr(func.body, indent=2)
Run it:
$ python parser.py
Output:
Function: factorial
Parameters: ['n']
Body:
IfExpr
condition:
BinaryOp(<)
left:
IdentifierExpr(n)
right:
NumberExpr(2)
then:
NumberExpr(1)
else:
BinaryOp(*)
left:
IdentifierExpr(n)
right:
CallExpr
callee: factorial
args:
BinaryOp(-)
left:
IdentifierExpr(n)
right:
NumberExpr(1)
Function: fib
Parameters: ['n']
Body:
IfExpr
condition:
BinaryOp(<)
left:
IdentifierExpr(n)
right:
NumberExpr(2)
then:
IdentifierExpr(n)
else:
BinaryOp(+)
left:
CallExpr
callee: fib
args:
BinaryOp(-)
left:
IdentifierExpr(n)
right:
NumberExpr(1)
right:
CallExpr
callee: fib
args:
BinaryOp(-)
left:
IdentifierExpr(n)
right:
NumberExpr(2)
Function: __anon_2
Parameters: []
Body:
CallExpr
callee: factorial
args:
NumberExpr(5)
Look at that tree. That’s the entire program - recursive functions with conditionals. Every piece is there.
Check out fib(n - 1) + fib(n - 2): a BinaryOp(+) with two CallExpr nodes, each containing their own BinaryOp(-) for the arguments. The recursive calls are right there in the AST, ready to be compiled.
We didn’t just parse the syntax. We built a semantic representation of what the code means. The tree structure shows the relationships between operations, the order of evaluation, everything we need to generate code later.
What We’ve Built
We started with arithmetic. Now we can parse:
- Identifiers: variable and function names
- Keywords: func, if, else, extern
- Operators: arithmetic (+, -, *, /) and comparison (<, >, ==)
- Comments: # to end of line
- Function calls: factorial(5) with multiple arguments
- Conditionals: if n < 2 1 else n * 2
- Function definitions: func factorial(n) : ...
- Extern declarations: extern func putchard(char) :
- Spans: every expression knows its source range
- Location tracking: every token knows where it came from
The frontend is complete. We can take YFC source code and turn it into an AST. We can’t run anything yet - that’s code generation. But we can parse the full language.
We didn’t rewrite anything. Same lexer structure as Part 3 - just more token types. Same parser structure as Part 5 - just more cases in parse_primary(). The architecture scaled.
Understanding the Compiler Structure
What we just built is called the compiler frontend. Compilers are typically split into two main parts:
Frontend (what we have):
- Lexer - Source code → Tokens
- Parser - Tokens → AST
- Semantic Analysis - Type checking, error detection (we’re skipping this)
The frontend understands the source language. It knows the syntax, the grammar, what’s valid and what’s not. Its job is to take text and produce a structured representation.
Backend (what’s next):
- Code Generation - AST → Assembly/Machine Code
- Optimization - Make the code faster, smaller
- Linking - Combine with libraries, resolve external symbols
The backend understands the target machine. It knows x86_64 assembly, register allocation, calling conventions. Its job is to take the AST and produce executable code.
The AST is the interface between them. The frontend produces it, the backend consumes it. This separation is powerful - you can have multiple frontends (different languages) targeting the same backend (same CPU architecture), or one frontend targeting multiple backends (different CPUs).
What’s Next
This was a lot. If you’re feeling overwhelmed, that’s normal. Go back and re-read sections. Run the code yourself. Add print statements everywhere - in parse_primary(), in parse_binary_op(), in the lexer. See the recursion happening. Watch how parse_expression() calls parse_primary() which might call parse_expression() again for nested structures. See it build the tree piece by piece.
Change the test program. Add more functions. Try nested conditionals. Break things - remove a closing paren, misspell a keyword. See what error messages you get. The best way to understand a parser is to watch it parse, and to watch it fail.
You built a compiler frontend. That’s not nothing.
Next, in Part 7, we’ll take stock of what we’ve built. We’ll discuss Turing completeness - why our tiny language can theoretically compute anything. We’ll talk about what’s missing - mutable state, data structures, loops. We’ll visualize the AST and really understand what we have.
After that? The backend. Turning these trees into assembly. That’s when we get a working compiler.
Code Repository: All accompanying code for this tutorial series is available on GitHub. The repository includes complete source code for each step, example programs, and additional resources to help you follow along.