Compiler Engineering in Practice – Part 1: What Is a Compiler?

“Compiler Engineering in Practice” is a blog series intended to pass on wisdom that seemingly every seasoned compiler developer knows, but is not systematically written down in any textbook or online resource. Some (but not much) prior experience with compilers is needed.

What is a compiler?

The first and most important question is “what is a compiler?”. In short, a compiler is:

a translator that translates between two different languages, where those languages represent a description of a computation, and
the behavior of the computation in the output language must “match” the behavior of the computation in the input language (more on this below).

For example, an input language can be C, and the output can be x86 assembly. By this definition, an assembler is also a compiler (albeit a simple one), in that it reads x86 textual assembly and outputs x86 binary machine code, which are two different languages. The python program that executes Python code contains a compiler – one that reads Python source code and outputs Python interpreter bytecode.
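The Python case is easy to poke at from Python itself. Here is a minimal sketch using the built-in compile() function and the standard dis module; the source string and the "<example>" filename are made up for illustration, and the exact bytecode listing that dis prints varies between Python versions.

```python
import dis

# Hypothetical one-line input program in the source language (Python).
source = "result = 6 * 7"

# The built-in compile() translates source text into a code object,
# which carries the interpreter bytecode (the output language).
code = compile(source, "<example>", "exec")
print(type(code.co_code))  # <class 'bytes'>

# Show the bytecode in readable form (listing differs across versions).
dis.dis(code)

# Running the compiled output must match what the source describes.
namespace = {}
exec(code, namespace)
print(namespace["result"])  # 42
```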

This brings me to my first important point about practical compiler engineering – it’s not some mystical art. Compilers, operating systems, and databases are usually considered some kind of special corner of computer science / software engineering for being complex, and indeed, there are some corners of compilers that are a black art. But taking a step back, a compiler is simply a program that reads a file and writes a file. From a development perspective, it’s not that different from cat or grep.

Why does this matter? Because it means that compilers are easy to debug if you build them right. Unlike an operating system, there are no time-dependent interrupts. Unlike a web browser, there are no async external events. Unlike a database, the scale is not so large that hardware has to be considered unreliable. It’s just a command line program (or can be reduced to one if engineered right), such that nearly all bugs are reproducible and debuggable in isolation from the comfort of your workstation. No connecting to a flaky dev board, no extensive mocking of various interfaces.

You might say – wait a minute – if I’m running on my company’s AI hardware, I may need to connect to a dev board. Yes, but if you do things right, you will rarely need to do that when debugging the compiler proper. Which brings me to…

Reliability

Compilers are like operating systems and databases in that the bar for reliability is extremely high. One cannot build a practical compiler haphazardly. Why? Because of miscompiles.

Miscompiles are when the compiler produces an output file in the output language that does not “match” the specification of its computation in the input language. To avoid a miscompile, the output program must behave identically to the input program, as far as can be observed by the outside world, such as network requests, values printed to the console, values written to files, etc.

For integer programs, bit-exact results are required, though there are some nuances regarding undefined behavior, as described in John Regehr’s “laws of physics of compilers”. For floating point programs, the expectation of bit-exact results is usually too strict. Transformations on large floating point computations (like AI programs) need some flexibility to produce slightly different outputs in order to allow efficient execution. There is no widely agreed-upon formal definition of this, though there are reasonable ways to check for it in practice (comparing outputs within an absolute and a relative tolerance, “atol/rtol”, goes a long way).
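As a rough sketch of what an atol/rtol check looks like, the function below compares two lists of floats element by element. The name outputs_match and the default tolerances are my own choices for illustration; the inequality is the same one NumPy documents for numpy.allclose. It ignores NaN and infinity handling, which a real checker would have to decide on.

```python
def outputs_match(actual, expected, rtol=1e-5, atol=1e-8):
    """Return True if every element of `actual` is within tolerance of `expected`.

    Uses |actual - expected| <= atol + rtol * |expected| per element.
    """
    if len(actual) != len(expected):
        return False
    return all(
        abs(a - e) <= atol + rtol * abs(e)
        for a, e in zip(actual, expected)
    )


# A tiny rounding difference is accepted...
print(outputs_match([1.0, 2.0], [1.0 + 1e-9, 2.0]))  # True
# ...but a genuinely different value is not.
print(outputs_match([1.0], [1.1]))                   # False
```

The right tolerances depend on the data types and the computation, so treat the defaults here as placeholders rather than recommendations.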

How bad is a miscompile?

Miscompiles can have massive consequences for customers. A miscompile of a database can cause data loss. A miscompile of an operating system can cause a security vulnerability. A miscompile of an AI program can cause bad medical advice. The stakes are extremely high, and debugging a miscompile when it happens “in the wild” can easily take 3+ months (and it can take months for a customer to even realize that their issue is caused by a miscompile).

If that weren’t enough, there’s a self-serving reason to avoid miscompiles – if you have too many of them, development velocity on your compiler will grind to a halt. A miscompile can easily take 100x or 1000x longer to debug than a bug that makes itself known during the execution of the compiler itself (rather than during the execution of the program that the compiler output). That’s why most aspects of practical compiler development revolve around ensuring that if something goes wrong, the compiler halts before it produces a faulty output program.

A miscompile is a fundamental failure of the compiler’s contract with its user. Every miscompile should be accompanied by a deep look in the mirror and self-reflection about what went wrong to allow it to sneak through, and what preventative measures can (and should immediately) be taken to ensure that this particular failure mode never happens again.

Especially in the AI space, there are lots of compilers that play fast and loose with this, and as a result get burned. The best compiler engineers tend to be highly pedantic and somewhat paranoid about what can go wrong.

Why compilers are hard – the IR data structure

Compilers do have an essential complexity that makes them “hard”, and this again comes from the whole business of making sure that the input program and the output of the compiler have the same behavior. To understand this, we have to discuss how a compiler represents the meaning of the input program and how it preserves that meaning when producing the output program. This notion of “meaning” is sometimes called the program semantics.

The primary data structure in a compiler is usually some form of graph data structure that represents the compiler’s understanding of “what computation this program is supposed to do”. Hence, it represents the computation that the compiler needs to preserve all the way to the output program. This data structure is usually called an IR (intermediate representation). The primary way that compilers work is by taking an IR that represents the input program, and applying a series of small transformations all of which have been individually verified to not change the meaning of the program (i.e. not miscompile). In doing so, we decompose one large translation problem into many smaller ones, making it manageable.
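To make the "series of small transformations" idea concrete, here is a toy sketch. The IR is a flat list of instructions rather than a real graph, and Instr, verify, fold_constants and run_pipeline are hypothetical names invented for this example, not any real compiler's API. The point is the shape: every pass is followed by a verifier, so a buggy pass stops the compiler instead of silently producing a bad program.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    dest: str    # name of the value this instruction defines
    op: str      # "const", "add", or "mul"
    args: tuple  # (int,) for "const", two operand names otherwise


class VerifierError(Exception):
    pass


def verify(program, inputs):
    """Reject IR that breaks the invariants the passes rely on."""
    defined = set(inputs)
    for instr in program:
        if instr.op == "const":
            if len(instr.args) != 1 or not isinstance(instr.args[0], int):
                raise VerifierError(f"bad const: {instr}")
        elif instr.op in ("add", "mul"):
            if len(instr.args) != 2:
                raise VerifierError(f"{instr.op} needs two operands: {instr}")
            for name in instr.args:
                if name not in defined:
                    raise VerifierError(f"use of undefined value {name!r}: {instr}")
        else:
            raise VerifierError(f"unknown op: {instr}")
        if instr.dest in defined:
            raise VerifierError(f"value {instr.dest!r} defined twice")
        defined.add(instr.dest)


def fold_constants(program):
    """One small pass: replace add/mul of two known constants with a const."""
    known = {}
    out = []
    for instr in program:
        if instr.op == "const":
            known[instr.dest] = instr.args[0]
        elif all(name in known for name in instr.args):
            a, b = (known[name] for name in instr.args)
            value = a + b if instr.op == "add" else a * b
            known[instr.dest] = value
            instr = Instr(instr.dest, "const", (value,))
        out.append(instr)
    return out


def run_pipeline(program, inputs, passes):
    verify(program, inputs)
    for run_pass in passes:
        program = run_pass(program)
        verify(program, inputs)  # halt here rather than emit a bad program
    return program


program = [
    Instr("a", "const", (2,)),
    Instr("b", "const", (3,)),
    Instr("c", "mul", ("a", "b")),
    Instr("d", "add", ("c", "x")),  # "x" is a runtime input, so not foldable
]
optimized = run_pipeline(program, {"x"}, [fold_constants])
print(optimized[2])  # Instr(dest='c', op='const', args=(6,))


def broken_pass(prog):
    return prog[1:]  # deliberately buggy: drops the definition of "a"


try:
    run_pipeline(program, {"x"}, [broken_pass])
except VerifierError as err:
    print("compiler halted:", err)
```

A verifier like this only checks structural invariants. It does not prove that a pass preserved the program's meaning, which is why each transformation still has to be reasoned about individually.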

I think it’s fair to say that the compiler IR is the single most complex monolithic data structure in all of software engineering, in the sense that interpreting what can and cannot be validly done with the data structure is complex. To be clear, compiler IRs are not usually very complex in the implementation sense, like a “lock-free list” that uses subtle atomic operations to present a simple insert/delete/etc. interface.

Unlike a lock-free list, compiler IRs usually have a very complex interface, even if they have a very simple internal implementation. Even specifying, declaratively or in natural language, which transformations are allowed on the data structure is usually extremely difficult (you’ll see things like “memory models” or “abstract machines” that people spend years or decades trying to define properly).

A very complex schema

Firstly, the nodes in the graph usually have a complex schema. For example, a simple “integer multiply operation” (a node in the graph) is only allowed to have certain integer types as operands (incoming edges). And there may easily be thousands of kinds of operations at varying abstraction levels in any practical compiler, each with their own unique requirements. For example, a simple C * (multiplication) operator will go through the following evolution in Clang:

It first becomes Clang’s BinaryOperator node, which takes two “expressions” as operands (which may be mutable uint32_t values, for example).
It will then be converted to an LLVM IR mul operation, whose operands are llvm::Value objects, each representing an immutable value of the i32 type, say.
It will then be converted to a GlobalISel G_MUL operation, whose operands represent not only a 32-bit integer, but also begin to capture notions like which “register bank” the value should eventually live in.
It will then be turned into a target-specific MIR node like IMUL32rri or IMUL32rr, selecting among a variety of physical x86 instructions which can implement a multiplication. At this level, operands may represent physical, mutable hardware registers.

From a compiler developer’s perspective, all these “multiply operations” are deeply different from each other because of the different information captured at each abstraction level (again, compiler developers are usually very pedantic). Failing to adequately differentiate between abstraction levels is a common disease among poorly written compilers.

At every level, precise attention to detail is needed – for example, if the multiplication is expected to wrap around mod 2^32 in the source program, and we accidentally convert it to wrap around mod 2^64 (such as by using a 64-bit register), then we have introduced a miscompile. Each operation has its own unique set of constraints and properties like these which apply when transforming the program.
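The 32-bit versus 64-bit case can be simulated in a few lines. This sketch models unsigned wraparound with bit masks on Python's unbounded integers; mul_u32 and mul_u64 are hypothetical helper names, not real instructions.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def mul_u32(a, b):
    """Multiply as the source program expects: wrap around mod 2^32."""
    return (a * b) & MASK32


def mul_u64(a, b):
    """Multiply as a 64-bit register would: wrap around mod 2^64."""
    return (a * b) & MASK64


a = b = 0x10000  # 65536, so the exact product is 2^32

print(mul_u32(a, b))  # 0
print(mul_u64(a, b))  # 4294967296

# Using the wider multiply is only valid if the result is truncated
# back to 32 bits before anything can observe the upper bits.
print(mul_u64(a, b) & MASK32 == mul_u32(a, b))  # True
```

The two results agree for small operands and only diverge once the product reaches 2^32, which is part of why this kind of bug can survive casual testing.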

Complex interactions between operations

Additionally, how these operations in the IR graph relate to each other can be very complex, especially when mutable variables and control flow are involved. For example, you may notice that an operation always executes even though its result is only needed under an if condition, so moving it under that condition would optimize the program. Consider the program:

x = y + z;
...
if (condition) {
    print(x); // The only time that `x` is referenced.
}

Is it safe to convert this to the following?

...
if (condition) {
    print(y + z);
}

Well, it depends on what’s hidden in that “...”. For example, if the program is:

x = y + z;
...
y += 5;
...
if (condition) {
    print(x);
}

Then it’s not legal, since by the time we get to the if, the value of y will have changed and we’ll print the wrong value. One of the primary considerations when designing compiler IRs is how to make the transformations as simple and obviously correct as possible (more on that in another blog post).
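The legality question for this particular example can be written down as a small check. This is a sketch under strong assumptions: straight-line code only, no aliasing, and expressions with no side effects. The encoding of a statement as a (written variable, variables read) pair and the name can_sink are mine, purely for illustration.

```python
def can_sink(statements, def_index, use_index):
    """Can the expression assigned at def_index be recomputed at use_index?

    Each statement is (written, read): the variable it assigns (or None)
    and a tuple of the variables it reads.
    """
    target, operands = statements[def_index]
    for written, read in statements[def_index + 1:use_index]:
        if written in operands:
            return False  # an input changed, so recomputing gives a new value
        if written == target or target in read:
            return False  # the original value is overwritten or still needed
    return True


# x = y + z; w = y; print(x)
safe = [("x", ("y", "z")), ("w", ("y",)), (None, ("x",))]
# x = y + z; y += 5; print(x)
unsafe = [("x", ("y", "z")), ("y", ("y",)), (None, ("x",))]

print(can_sink(safe, 0, 2))    # True
print(can_sink(unsafe, 0, 2))  # False
```

With mutable variables, the check has to scan everything between the definition and the use.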

Production compilers usually deal with IR graphs of thousands to millions of nodes. Understandably then, the compounding effect of the IR complexity is front and center in all compiler design discussions. A single invalid transformation can result in a miscompile.

Compilers are just software

Practical compilers often live for years or decades and span millions of lines of code, so the entire suite of software engineering wisdom applies to them: good API design, testing, reusability, etc., though usually with additional compiler-specific twists.

For example, while API design is very important for most programs’ code (as it is for compilers’), compilers also have an additional dimension of “IR design”. As described above, the IR can be very complex to understand and transform, and designing it right can greatly mitigate this (more on this in a future blog post).

Similarly, since compilers are usually decomposed into the successive application of multiple “passes” (self-contained IR transformations), there are a variety of testing and debugging strategies specific to compilers (more on this in a future blog post).

Conclusion and acknowledgements

I hope you have found this post helpful. I have a few more sketched out that should be coming soon. Please let me know on my LinkedIn if you have any feedback or topics you’d like to suggest. Big thanks to Bjarke Roune for his recent blog post that inspired me to finally get this series off the ground. Thanks also to Dan Gohman for his blog post on canonicalization from years back. There are too few such blog posts giving the big picture of practical compiler development. Please send me any other ones you know about on LinkedIn.

Stay tuned for future parts of this series:

Modern Compilers in the Age of AI
Organizing a Compiler
Testing, Code Review, and Robustness
The Compiler Lifecycle
…
