Optimizing Software with Zero-Copy and Other Techniques

An important aspect in software engineering is the ability to distinguish between premature, unnecessary, and necessary optimizations. A strong case can be made that the initial design benefits massively from optimizations that prevent well-known issues later on, while unnecessary optimizations are those simply do not make any significant difference either way. Meanwhile ‘premature’ optimizations are harder to define, with Knuth’s often quoted-out-of-context statement about these being ‘the root of all evil’ causing significant confusion.

We can find Donald Knuth’s full quote deep in the 1974 article Structured Programming with go to Statements, which at the time was a contentious optimization topic. On page 268, along with the cited quote, we see that it’s a reference to making presumed optimizations without understanding their effect, and without a clear picture of which parts of the program really take up most processing time. Definitely sound advice.

And unlike back in the 1970s we have today many easy ways to analyze application performance and to quantize bottlenecks. This makes it rather inexcusable to spend more time today vilifying the goto statement than to optimize one’s code with simple techniques like zero-copy and binary message formats.

Got To Go Fast

The cache hierarchy of the 2008 Intel Nehalem x86 microarchitecture. (Source: Intel)

There’s a big difference between having a conceptual picture of how one’s code interacts with the hardware and having an in-depth understanding. While the basic concept of more lines of code (LoC) translating into more RAM, CPU, and disk resources used is technically true much of the time, the real challenge lies in understanding how individual CPU cores are scheduled by the OS, how core cache synchronization works, and the impact that the L2 and L3 cache have.

Another major challenge is that of simply moving data around between system RAM, caches and registers, which seems obvious at face value, but the impact of certain decisions can have big implications. For example, passing a pointer to a memory address instead of the entire string, and performing aligned memory accesses instead of unaligned can take more or less time. This latter topic is especially relevant on x86, as this ISA allows unaligned memory access with a major performance penalty, while ARM will hard fault the application at the merest misaligned twitch.

Loading more...