Obscure feature + obscure feature + obscure feature = bug

Senior Engineer, Antithesis

In the quarter century I’ve been programming professionally, I’ve found three C++ compiler bugs: one each in g++, clang++, and Microsoft’s MSVC. After finding roughly one problem every ten years, my takeaway is that to hit one, you have to do something really, really obscure. Or actually, you have to use an obscure feature, then pile on another obscure feature, and then pile on another.

This is the story of something really, really obscure that we do in our C++ SDK, which led to finding the clang++ bug. If you’re not super familiar with C++, you should still be able to follow the main gist and just skim some of the details.


Michael Gibson


Background: The Antithesis SDKs

Antithesis provides a series of language-specific SDKs that let you add assertions to your code to specify correctness properties of your software.

Antithesis then runs your software while exercising it with some sort of workload (created using our Test Composer or written by hand). As your software runs, our fault injector disrupts the network, crashes containers, induces random delays in code, and so on. Every time your software hits one of your assertions, Antithesis checks that the assertion holds.

A typical (non-Antithesis) assertion would look like this:

void increment_by_pointer(int* px) {
    assert(px != nullptr);
    (*px)++;  // parenthesized: *px++ would advance the pointer, not the int
}

Our equivalent of assert here is ALWAYS, which includes a message that becomes a test property in our system:

void increment_by_pointer(int* px) {
    ALWAYS(px != nullptr, "Parameter sent to increment_by_pointer should be non-null");
    (*px)++;  // parenthesized: *px++ would advance the pointer, not the int
}

Our SDKs provide many other interesting assertion types too:

`ALWAYS_LESS_THAN(x, value)`: this is equivalent to `ALWAYS(x < value)` and analogous to `EXPECT_LT` in GoogleTest. It tells our system that different values of `x` are interesting here; we might try maximizing `x`, for example. The difference is that in `ALWAYS(x < value)`, the expression `x < value` is a boolean, so only the result of the comparison reaches us. But in `ALWAYS_LESS_THAN(x, value)`, both `x` and `value` are integers, so our fuzzer can execute logic on those integers.

REACHABLE(): this asserts that the code is reachable. For example, suppose you’re writing a video game, and there’s a complex series of things you have to do to get to level 3. You could put:

void draw_level_3() {
    REACHABLE();
    /* some code */
}

And then if anyone introduced a bug that made it impossible to get to level 3, our automated testing would find and report the problem. We also have the opposite of this, UNREACHABLE, a runtime check a little like std::unreachable().

SOMETIMES(condition...): this is a property of the test coverage, not of the correctness of your system. For example, take:

int http_return_code = http_send(params);
SOMETIMES(http_return_code != 200 ...);

This asserts that some of the tests we automatically generate create conditions where the http_send fails. This assertion checks that we hit and test the error handling/error recovery cases, in a way we wouldn’t if http_send only ever returned 200.

… and many others. Under the hood, each of these generates JSON and sends it back to Antithesis, which then evaluates whether or not the test property holds. Some of the logic is simple: for example, you could evaluate ALWAYS locally, for a given execution. But some of the logic is complex: you’re making assertions about the set of all executions. For example, “in all the executions that occurred, SOMETIMES the return code was 500” or “in all the executions that occurred, we never reached line 1234.” We need to gather the results from all executions and reason about the collected data, so our assertions send JSON back to Antithesis, and we evaluate the assertions there.

The JSON we send back looks something like this:

{
    "assertion_type": "ALWAYS",
    "condition_value": true,
    "file": "foo.cpp",
    "line": 1234
}
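To make the difference between per-execution and cross-execution checks concrete, here is a minimal sketch of how collected records might be evaluated. The `Record` struct and the helpers `same_site`, `always_holds`, and `sometimes_holds` are hypothetical names invented for this illustration; this is not how the Antithesis platform actually evaluates properties.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified view of one message sent back by an assertion.
struct Record {
    std::string assertion_type;  // e.g. "ALWAYS" or "SOMETIMES"
    bool condition_value;
    std::string file;
    int line;
};

// True if the record came from the assertion located at file:line.
bool same_site(const Record& r, const std::string& file, int line) {
    return r.file == file && r.line == line;
}

// ALWAYS fails as soon as any record from that site carries a false condition.
bool always_holds(const std::vector<Record>& all, const std::string& file, int line) {
    for (const Record& r : all) {
        if (same_site(r, file, line) && !r.condition_value) return false;
    }
    return true;
}

// SOMETIMES needs the whole collection: it passes only if at least one
// record from that site, in any execution, carried a true condition.
bool sometimes_holds(const std::vector<Record>& all, const std::string& file, int line) {
    for (const Record& r : all) {
        if (same_site(r, file, line) && r.condition_value) return true;
    }
    return false;
}
```

Note that `always_holds` can be decided from a single failing record, while `sometimes_holds` can only report a failure after every execution has been collected.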

Let’s talk about how you’d implement the SDK part of these assertions (not the evaluation logic).

You can probably imagine how you’d implement ALWAYS. Something like:

void ALWAYS(bool condition) {
    JSON json = {
        // ...
    };
    send_to_antithesis(json);
}

And if you want file and line number, you’d make it a macro like:

#define ALWAYS(condition) always_impl_function(condition, __FILE__, __LINE__)
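As a rough sketch of why the macro matters, the version below fills in a body for `always_impl_function`. That body is an assumption made for illustration: it returns the JSON text instead of sending it anywhere, and it does not escape special characters.

```cpp
#include <string>

// Hypothetical helper: builds the JSON text for one assertion hit.
// A real SDK would escape the strings and send the message somewhere
// instead of returning it.
std::string always_impl_function(bool condition, const char* file, int line) {
    return std::string("{\"assertion_type\": \"ALWAYS\", \"condition_value\": ")
        + (condition ? "true" : "false")
        + ", \"file\": \"" + file + "\", \"line\": " + std::to_string(line) + "}";
}

// The macro expands at the call site, so __FILE__ and __LINE__ name the
// caller's location rather than the helper's.
#define ALWAYS(condition) always_impl_function((condition), __FILE__, __LINE__)
```

A plain function could not do this: inside `always_impl_function`, `__LINE__` would always be the line of the helper itself.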

Cool. But how would you implement REACHABLE?

void draw_level_3() {
    REACHABLE();
    /* some code */
}

If your code is working correctly and you hit the REACHABLE assertion in draw_level_3, that’s pretty easy, just emit JSON there. But what if there’s a bug where you never get to that code? In that case, how would Antithesis know that there’s a problem? How would it know the REACHABLE assertion even exists? This is the main point of a REACHABLE assertion, not an edge case.

Our solution is to emit a catalog of all the assertions. We do this at startup, and we emit all assertions, independent of whether they’re hit or not.

In other words, a typical assertion will send multiple messages to the Antithesis platform: (1) at startup, we emit one “catalog” message that just says “this assertion exists”, regardless of whether or not we ever call the code, and (2) one or more messages when we hit the assertion.

So how do we emit the catalog? How do we make something happen even if we don’t call the code? Before I describe our solution, imagine how you’d solve this problem: how would you write code that runs even if the code isn’t ever run?

Short answer: different ways for different languages. For Go, our customers run our instrumentor, which looks through the AST of the source code and identifies all assertions, then generates a function that emits the catalog of assertions. For Java, something similar, but it operates on the byte code of your JAR. In other words, for these languages, we have an external tool that generates extra code. That extra code is always called and emits the catalog.

Assertions in the Antithesis C++ SDK

For our C++ SDK, we figured out a way to emit the catalog just using standard C++ constructs, without having to run another process.

Here’s the general idea in C++, simplified to its core elements. The Antithesis SDK creates some code like this:

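struct Assertion {
    // Runs during static initialization and emits the catalog entry.
    Assertion() {
        create_catalog_entry_and_send_to_antithesis();
    }

    // Runs only when execution reaches the assertion.
    void assert(bool condition) {
        create_json_and_send_to_antithesis(condition);
    }
};

struct CatalogEntry {
    // "inline" (C++17) lets a non-const static member be defined in the class.
    static inline Assertion assertion = Assertion();
};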

And then you write code like this:

void my_function(void* x) {
    CatalogEntry::assertion.assert(x != nullptr);
}

So what’s going on here? The key is the CatalogEntry struct. It has a static class variable, assertion. That static class variable will be initialized before you run any code. The static initialization causes the catalog entry to be emitted, and then later (only when we hit the code) we use the static variable by calling its assert method to emit the actual assertion.
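The static-initialization trick can be tried in isolation with a sketch like the one below. It assumes C++17 and swaps the real transport for a hypothetical in-memory log, `sent_messages()`. The method is called `check` here rather than `assert`, so that it cannot collide with the `assert` macro from `<cassert>`.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for "send to Antithesis": append to an in-memory log.
// A function-local static is constructed on first use, so it is safe to call
// from the constructors of other static objects.
std::vector<std::string>& sent_messages() {
    static std::vector<std::string> log;
    return log;
}

struct Assertion {
    // Runs during static initialization, whether or not check() is ever called.
    Assertion() { sent_messages().push_back("catalog"); }

    // Runs only when execution actually reaches the assertion.
    void check(bool condition) {
        sent_messages().push_back(condition ? "hit: true" : "hit: false");
    }
};

struct CatalogEntry {
    // C++17 inline variable: declared and defined inside the class.
    static inline Assertion assertion = Assertion();
};

void my_function(void* x) {
    CatalogEntry::assertion.check(x != nullptr);
}
```

On mainstream compilers the log already holds the "catalog" message when `main` starts, even if `my_function` is never called; a "hit" message is appended only when it is.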

“But wait,” you say. “This is all nice and good, but there’s only a single static, so you’ll always get a single CatalogEntry, whether you have 100 assertions or 1 (or 0).”

Obscure feature: a non-type template parameter with an array of characters

This is where we use template magic. If you’ve used templates in C++, you’re probably most familiar with templates parameterized by a class type. Things like:

std::vector<int> x;

But you can also templatize things on a parameter that isn’t a type, but rather a value. (These are called non-type template parameters.) For example, you could make a template of “arrays of floats length N” like this:

template <unsigned int N> struct array_of_float {
    float data[N];
    /* ... */
};
array_of_float<3> spatial_coordinate;
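The property that matters for the rest of this story is that every distinct value of the parameter produces a distinct type. A small sketch, assuming C++17, with `size()` and `sum()` added purely for illustration:

```cpp
#include <cstddef>
#include <type_traits>

template <unsigned int N>
struct array_of_float {
    float data[N];
    static constexpr std::size_t size() { return N; }
};

// Each distinct N produces a distinct, unrelated type.
static_assert(!std::is_same_v<array_of_float<3>, array_of_float<4>>);
static_assert(array_of_float<3>::size() == 3);

// N is deduced from the argument's type, so the loop bound is known at
// compile time.
template <unsigned int N>
float sum(const array_of_float<N>& a) {
    float total = 0.0f;
    for (unsigned int i = 0; i < N; ++i) total += a.data[i];
    return total;
}
```

Because each instantiation is its own type, each one also gets its own copy of any static member, which is what allows one static per assertion later on.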

And, amazingly enough, you can templatize based on a string, which for technical reasons cannot be a std::string, or even a const char* pointing at a string literal, but rather a fixed-length array of characters. Something like {'H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd'}.

We’re going to add a message, a filename, and a line number to the code above, and templatize CatalogEntry on those. Note that the template rules mean we need a fixed length array of characters, but we don’t need that for the non-templatized class Assertion. So it looks like this:

struct Assertion {
    const char* message;
    const char* filename;
    int line_number;

    Assertion(const char* message, const char* filename, int line_number) :
        message(message), filename(filename), line_number(line_number)
    {
        create_catalog_and_send_to_antithesis(message, filename, line_number);
    }

    void assert(bool condition) {
        create_json_and_send_to_antithesis(condition, message, filename, line_number);
    }
};

struct fixed_string { /* details omitted */ };

template <fixed_string message, fixed_string filename, int line>
struct CatalogEntry {
    // "inline" (C++17) lets a non-const static member be defined in the class.
    static inline Assertion assertion = Assertion(message.c_str(), filename.c_str(), line);
};

And then you write code like this:

void my_function(void* x) {
    CatalogEntry<"x is not null", __FILE__, __LINE__>::assertion.assert(x != nullptr);
}

The one tricky thing is the fixed_string type, a class we defined that has a fixed-size array. There’s nothing conceptually tricky about it; it just means we can only use code that’s constant at compile time. You can check out the gory details in our SDK.

Of course, we don’t want to make people use some weird templated stuff whenever they use our SDK, so we’ve defined macros like:

#define ALWAYS(condition, message) \
    CatalogEntry< \
        create_fixed_string(message), \
        create_fixed_string(__FILE__), \
        __LINE__ \
    >::assertion.assert(condition)

And then your code becomes:

void my_function(void* x) {
    ALWAYS(x != nullptr, "x is not null");
}

If you’re keeping track, we’ve just used two obscure features. The first is non-type template parameters. Okay, they’re not that obscure, but if you’re programming in C++, probably 95% of the templates you use routinely are based on type. Non-type template parameters have been around forever (at least since 1998, a time when most compilers didn’t support any kind of template particularly well). However, from 1998 until 2020, they were limited to integral types and a handful of others, such as enumerations, pointers, and references. The second is using a class that wraps an array of characters as the non-type template parameter; the 2020 standard made that possible by allowing class types as non-type template parameters.

Obscure feature: anonymous namespaces

The initial version of our SDK looked something like the above. Then we started looking at the code it generated. Our experience is that people who are writing C++ care a lot about performance, so we looked very carefully at the generated code. We found something odd: the executable was huge. As we dug into it, we found that the executable had a lot of extra symbols. There was:

- a symbol for the type that is the fixed_string of the message
- a symbol for the type that is the fixed_string of the filename
- a symbol for the CatalogEntry
- etc.

Each of those had long mangled names that amount to an encoding of CatalogEntry<fixed_string{'x', ' ', 'i', 's', ' ', 'n', 'o', 't', ' ', 'n', 'u', 'l', 'l'}, fixed_string{'f', 'o', 'o', '.', 'c', 'p', 'p'}, 10>, and this encoding amounted to thousands of characters.

Here’s an example of one such encoding:

_ZTAXtlN12_GLOBAL__N_112fixed_stringILm10EEEtlSt5arrayIcLm10EEtlA10_cLc83ELc97ELc109ELc101ELc32ELc110ELc97ELc109ELc101EEEEE

Moreover, there were multiple copies of similar symbols. (For example, the class itself and the assertion variable were two separate ones.)

Why were there all these symbols? The compiler exported them so that some other file could reference them, if some other file wanted to. But we knew that wasn’t possible; the templated classes (i.e., not Assertion) were one-time use. Used only in a single file, or even more specifically, used only at a single position in a single file.

So we moved fixed_string and CatalogEntry into an anonymous namespace:

namespace {
    struct fixed_string { /* details omitted */ };

    template <fixed_string message, fixed_string filename, int line>
    struct CatalogEntry {
        // "inline" (C++17) lets a non-const static member be defined in the class.
        static inline Assertion assertion = Assertion(message.c_str(), filename.c_str(), line);
    };
}

On the off-chance you’re not a language lawyer, an anonymous namespace is a namespace whose contents are visible only within one given file (translation unit). So the compiler knew it didn’t need to expose those symbols, because nothing outside of that translation unit was allowed to access them.

With this approach, the compiler created:

- One symbol for each unique message
- One symbol for each unique filename
- One symbol for each unique catalog entry

That’s still more than zero exported symbols, but that’s better than the multiple-symbol case we were seeing before. It cut down the size of the files substantially. This was the version we shipped with our SDK.

The bug

And then, a couple months later, a customer reported a problem. They were essentially doing:

file1.cpp:

void foo(void* x) {
    ALWAYS(x != nullptr, "Same message");
}

file2.cpp:

void bar(void* y) {
    ALWAYS(y != nullptr, "Same message");
}

The key here is that they were using the same message in two different files.

The compiler emitted a cryptic message that boiled down to: I can’t find the fixed_string for “Same message” in file2.o because I deleted it.

What the compiler seemed to be doing was:

1. Create symbol for fixed_string for “Same message” in file1.o
2. Create symbol for fixed_string for “file1.cpp” in file1.o
3. Create symbol for CatalogEntry<symbol from (1), symbol from (2), line number> in file1.o
4. Create symbol for fixed_string for “Same message” in file2.o
5. Incorrectly think this is the same as the symbol created in (1), so remove it
6. Create symbol for fixed_string for “file2.cpp” in file2.o
7. Create symbol for CatalogEntry<symbol from (4), symbol from (6), line number> in file2.o
8. BUT it can’t find the symbol from (4), because it has incorrectly deduplicated those symbols
9. Error out

We poked around and figured out that Clang 16 would compile the code correctly and Clang 17 and higher wouldn’t. This happened when LLVM introduced a new optimization pipeline in 17, which seemed to have a bug in symbol deduplication.

So here’s the “one obscure feature piled on another piled on another” bug: the compiler’s symbol deduplication logic changed due to a new optimization pipeline and created an error related to one non-type template (CatalogEntry) templatized on another non-type template (fixed_string), based on an array of characters, not an integral type, when both were contained in an anonymous namespace.

As of this writing we’ve got a bug report in with Clang. But as a user of our SDK, one correct fix is “don’t use the same message for different assertions”. You shouldn’t be doing that anyway, because then you won’t get good diagnostic information back when something fails.

The lesson

This shape of bug happens in all kinds of systems: a bug that occurs not as a result of one single thing, but as a combination of one rare thing and another rare thing and another rare thing. Here it’s Optimization + Non-type Templates + Anonymous Namespaces; in another system it might be Leader Election + Network Fault + One Node Dies; in yet another it might be One Machine is Slower than Another + Clock Skew + Dropped Packet. The bug occurs when some combination of features and environmental conditions occurs, and the set of all possible combinations is massive.

So how do you test for these cases? That’s where autonomous testing shines. You give Antithesis atomic operations via the Test Composer, and you use our SDKs (or other methods) to set properties that should always hold. Antithesis automatically creates different combinations of your operations and randomly applies faults while confirming that nothing goes wrong (that all your properties always hold). And it runs automatically every night on your latest code and finds bugs of this shape.

Of course, once you find a bug this way, all that’s left is to reproduce it and debug it.