Writing Your First Compiler - Part 3: Lexical Analysis

The first step in any compiler is lexical analysis - breaking source code into tokens. When you see 2 + 3, you immediately recognize three separate things: a number, a plus sign, and another number. Your brain does this automatically. But a compiler has to explicitly walk through each character and figure out what’s meaningful.

Let’s say we want to tokenize this expression:

2 + 3

We need to walk through the string character by character. When we see 2, that’s a number. The space doesn’t mean anything. The + is an operator. Another space. Then 3 is another number.

So we end up with three tokens: NUMBER(2), PLUS(+), NUMBER(3).

That’s it. That’s what a lexer does - it turns a string of characters into a list of meaningful tokens.

Tokens and Lexeme…

Let’s say we want to tokenize this expression:

2 + 3

We need to walk through the string character by character. When we see 2, that’s a number. The space doesn’t mean anything. The + is an operator. Another space. Then 3 is another number.

So we end up with three tokens: NUMBER(2), PLUS(+), NUMBER(3).

That’s it. That’s what a lexer does - it turns a string of characters into a list of meaningful tokens.

Tokens and Lexemes

Before we start building, let’s clarify what these terms actually mean. In compiler theory, a token has two parts:

Token type (sometimes called a tag): The category of the symbol. Examples: NUMBER, IDENTIFIER, PLUS, LPAREN. This tells you what kind of thing you’re looking at.

Lexeme (sometimes called the value): The actual substring from the source code that produced the token. Examples: "42", "+", "foo". This is the literal text.

When we say NUMBER(2), we’re saying: “This is a token with type NUMBER and lexeme "2"”. The type tells the parser how to use it. The lexeme tells us what it actually was in the source.

Some tokens have meaningful lexemes - like numbers (42) or identifiers (factorial). Others don’t - the + token’s lexeme is just "+", which doesn’t tell you anything the type doesn’t already say.

Why separate them? Because the parser mostly cares about types. When parsing 2 + 3, the parser just needs to know “number, plus, number” to understand the structure. The actual values (2 and 3) only matter later when we’re generating code or evaluating the expression.

Starting Simple

Before building the entire lexer we should discuss how implementation is going to work in our compiler. We’ll implement features one-by-one, meaning that although YFC will eventually have if-else control structures, for-loops and similar, the best way to not get overwhelmed when building your first compiler is taking it one step at a time.

With that in mind, we’ll build a lexer that handles basic arithmetic: numbers, +, -, *, /, and parentheses. That’s enough to tokenize expressions like (2 + 3) * 4. Once we have this working, we can add more features later.

Building the Lexer

Let’s start coding. Create a new file called lexer.py - this will be the first real piece of our compiler.

The lexer needs to do two things: recognize what kind of thing each piece of text is (a number? an operator?), and package that information up so the rest of the compiler can use it. That means we need to define what a “token” looks like before we can write code that produces them.

Defining Tokens

First, we need a way to represent tokens. A token needs a type (is it a number? an operator?) and a value (what number? which operator?).

from enum import Enum, auto
from dataclasses import dataclass

class TokenType(Enum):
Number = auto()
Plus = auto()
Minus = auto()
Mul = auto()
Div = auto()
LParen = auto()
RParen = auto()
Eof = auto()

@dataclass
class Token:
_type: TokenType
value: str

We use an Enum for token types because it gives us type safety and clear naming. You can’t accidentally use "Plus" (a string) where you meant TokenType.Plus - the type checker will catch that. It also prevents typos and makes the code self-documenting.

For the Token itself, I use a dataclass because it’s just pure data - no behavior, just values we want to store. Dataclasses give us a clean __init__, __repr__, and equality checking for free.

Note that not all tokens have meaningful values. The + token’s value is just "+", but a number token like 42 has the actual number as its value.

The Lexer Structure

Now let’s build the lexer itself. The core idea is simple: maintain a position in the source string, look at the current character, and decide what to do with it.

class Lexer:
def __init__(self, source: str):
self.source = source
self.tokens: list[Token] = []
self.cursor = 0

We need three things:

source: The code we’re tokenizing
tokens: Where we’ll store the tokens we find
cursor: Our position in the source string

This is our state. Everything the lexer needs to know is captured here.

Helper Methods

Before we start lexing, let’s create some utility methods to make our life easier:

@property
def current(self) -> str:
"""Return the current character or EOF marker."""
if 0 <= self.cursor < len(self.source):
return self.source[self.cursor]
return "\0"

def increment(self):
"""Move to the next character."""
self.cursor += 1

def push_token(self, token_type: TokenType, value: str):
"""Add a token to our list."""
self.tokens.append(Token(token_type, value))

The current property gives us the character we’re looking at. When we’ve read everything, it returns \0 - a sentinel value that means “end of input”. This is cleaner than checking bounds everywhere.

The increment() method just moves us forward one character. You might think this is overkill - why not just do self.cursor += 1 directly? Two reasons: First, it makes the code more readable. Second, later we’ll expand this to track line and column numbers for error messages, and having it in one place makes that easy.

The push_token() method creates a token and adds it to our list. Again, we could just append directly, but wrapping it in a method gives us a place to add debugging, validation, or extra tracking later.

The Main Loop

Now for the actual lexing. The strategy is simple: loop through characters one by one, figure out what each character starts (a number? an operator?), and handle it accordingly.

def lex(self) -> list[Token]:
while self.current != '\0':
# Skip whitespace
if self.current.isspace():
self.increment()
continue

# Recognize numbers
if self.current.isdigit():
start = self.cursor
while self.current.isdigit():
self.increment()

self.push_token(TokenType.Number, self.source[start:self.cursor])
continue

# Recognize operators and parentheses
single_char_tokens = {
'(': TokenType.LParen,
')': TokenType.RParen,
'+': TokenType.Plus,
'-': TokenType.Minus,
'*': TokenType.Mul,
'/': TokenType.Div,
}

if token_type := single_char_tokens.get(self.current):
self.push_token(token_type, self.current)
self.increment()
continue

raise Exception(f"Unexpected character: '{self.current}'")

# Add EOF token
self.push_token(TokenType.Eof, "")
return self.tokens

Let’s break this down:

Whitespace: We check for whitespace first because it’s the most common thing we want to ignore. Spaces, tabs, newlines - they separate tokens but aren’t tokens themselves. We just skip over them.

Numbers: When we see a digit, we don’t just read one character. We keep reading digits until we hit something that isn’t a digit. That’s how we handle multi-digit numbers like 42 or 1337. We save the starting position, read all the digits, then slice out that substring and make it a token.

Single-character tokens: Operators and parentheses are all one character, so we use a dictionary to map from the character to its token type. If we find it in the dictionary, we make that token and move on. This is cleaner than a long chain of if/elif statements.

Unknown characters: If we see something that isn’t whitespace, a digit, or an operator, we error out. This catches typos and unsupported syntax early.

EOF token: At the end, we add an EOF (end-of-file) token. This tells the parser “that’s all the input”. Without it, the parser would have to constantly check if there are more tokens. With it, the parser can just look for EOF.

Testing It

Let’s try it out:

if __name__ == "__main__":
source = "(2 + 3) * 4"
lexer = Lexer(source)
tokens = lexer.lex()

print(f"Tokenizing: {source}\n")
for token in tokens:
print(f"{token._type.name}({token.value})")

Run it:

$ python lexer.py
Tokenizing: (2 + 3) * 4

LParen(()
Number(2)
Plus(+)
Number(3)
RParen())
Mul(*)
Number(4)
Eof()

Perfect. We just built a working lexer in about 50 lines of code.

Adding Debug Information

Before we move on, let’s add something important: location tracking. An essential part of any compiler is its ability to tell developers where they’ve made a mistake. While we’re just starting out with the basics, we’ll spend a little bit of time now implementing debug information which will prove invaluable later on.

Right now, if we hit an unexpected character, we just say “Unexpected character: ‘x’”. But in a real program with hundreds of lines, that’s useless. We need to know where the error is - which line and column.

First, let’s add a Location type to track position:

@dataclass
class Location:
row: int
column: int

And update our Token to include it:

@dataclass
class Token:
location: Location
_type: TokenType
value: str

Now we need to track row and column as we lex. Add these to the lexer’s __init__:

class Lexer:
def __init__(self, source: str):
self.source = source
self.tokens: list[Token] = []
self.cursor = 0
self.row = 0
self.column = 0

The trick is in increment(). When we move to the next character, we need to check if it’s a newline:

def increment(self):
"""Move to the next character."""
if self.current == '\n':
self.row += 1
self.column = 0
else:
self.column += 1
self.cursor += 1

If we hit a newline, increment the row and reset the column to 0. Otherwise, just increment the column. Simple, but effective.

Now update push_token() to include the location:

def push_token(self, token_type: TokenType, value: str):
"""Add a token to our list."""
token = Token(
Location(self.row, self.column - len(value) + 1),
token_type,
value,
)
self.tokens.append(token)

Note the self.column - len(value) + 1 - we’ve already advanced past the token, so we need to calculate where it started. For a number like 42, when we create the token, we’re sitting just after the 2, so we subtract the length to get back to where the 4 was.

Finally, update the EOF token in lex():

# Add EOF token
self.tokens.append(Token(Location(self.row, self.column), TokenType.Eof, ""))
return self.tokens

Now when we have an error, we can do:

raise Exception(f"Unexpected character '{self.current}' at row {self.row}, column {self.column}")

Much better. Try lexing something with an error:

source = "2 + @ + 3"
lexer = Lexer(source)
tokens = lexer.lex()  # Error: Unexpected character '@' at row 0, column 4

Perfect. Now we know exactly where the problem is.

What We’ve Built

This is a simple lexer, but it’s real. It handles the core concepts you’ll find in every lexer:

State management (cursor position, row/column tracking)
Character-by-character scanning
Pattern recognition (digits vs operators)
Token generation with location information

The pattern we’ve established - checking character type and dispatching to the right handler - scales to much more complex languages. When we add support for keywords like func and if, we’ll use the same approach: check if we’re looking at a letter, read the whole identifier, then check if it’s a keyword.

What’s Next

Our lexer can break expressions into tokens with proper error reporting, but it doesn’t understand what they mean. It doesn’t know that * should happen before +, or that parentheses group things together.

That’s the parser’s job. The parser takes this flat list of tokens and builds a tree structure that represents the actual meaning of the expression - an Abstract Syntax Tree (AST).

We’ll build that next.

Next: Not Yet Released Previous: Part 2: What Is a Compiler?