This is an in-depth post on bugs and how to prevent them in AI software and AI compilers specifically. I was the software lead for TPUv3 at Google and I’ve worked on a variety of AI compilers and projects across Google, Nvidia, Amazon and Facebook.
Zero is a hard number
In my estimation, XLA has the most comprehensive AI test suite of any ML compiler, so I heartily recommend XLA for mission-critical AI. XLA is used for most Google AI and has been for a decade. XLA is highly reliable. Yet, even XLA has bugs that escape into the wild for customers to encounter. The number of bugs is not zero, not even for XLA.
Anthropic published this report, diagnosing a bug in an XLA op as one of the causes of the Anthropic service giving bad responses to its users for a period of time. We should all commend Anthropic for being this open about the incident. The op in question, approximate top-k, was a new op in XLA that evidently didn’t receive as much testing as it needed. This is just one bug, yet look at what resulted. I hope no one received bad medical advice or bad suicide prevention guidance from Anthropic as a result of this XLA bug, but those are among the possibilities. It’s just one bug and people might have died. Which is how you might start to understand why Anthropic took the issue so seriously as to publish a report like that publicly. If you read between the lines, you can tell how upset the person who wrote the report was, even though they are being very professional about it. AI software correctness is serious business.
Consider that your project, whatever it is, is quite unlikely to be as error free and as well tested as XLA is. If you are responsible for an AI development effort, how many situations like this would you like your customers to encounter? Zero. The correct answer is zero. But zero is a hard number. If your project will be widely deployed, you are not going to be able to keep the number of bugs that your users will encounter at zero. In fact, many software engineers might be laughing right now reading this, at the idea of zero bugs as a concept. The only projects with zero bugs reported are projects that don’t have any customers.
It’s similar to asking how many patients a surgeon should kill because they didn’t do their job correctly. It’s a number that one would certainly hope is zero, yet humans make mistakes and a surgeon’s number isn’t going to be zero over a long career. Except if they don’t do any surgeries. You can’t stay at zero. Zero is a place for people with no customers.
I’ve seen software developers discount testing because they know it will not remove all bugs. Zero is impossible, therefore any number will do. This isn’t an exaggeration or something funny. This is what some otherwise highly capable real software professionals really believe. They prefer to think this way because they believe that it will make their jobs easier if they don’t have to write tests - also incorrect, at least in the context of most AI software. It’s a bit like thinking that heavy smoking improves your life because it improves your day. I wouldn’t want surgery from a surgeon that did his work in this way and, for important AI applications, such as what Anthropic offers, I would avoid using AI software that was developed with such a mindset. XLA is excellent on testing and even XLA has issues such as this. “Sounds like XLA is doing a lot on this, if even they can’t do it, why should we try?” I’d suggest to stop thinking like that.
Zero is a hard number. If that by itself leads you to discount the attempt to reduce your project’s number, since we won’t reach zero anyway, I suggest (in fact insist) that there is something wrong with your philosophy of software development.
Planes sometimes crash. Airbags sometimes don’t deploy. Surgeons sometimes kill people by mistake. Rockets with astronauts on them sometimes explode. Trains sometimes derail. Bungee jump cords sometimes snap. Parachutes sometimes fail. Buildings sometimes collapse. Even though zero is a hard number, it matters how often “sometimes” is.
Testing and benchmarking should be high status work
Your AI project needs to view testing as lower status work in the same way it needs to view fire escapes as optional - which is to say, not at all. Yet that is commonly how it is viewed. People often test out of a sense of duty, not because the organization requires it of them. Some managers are getting a free pass because they have employees who do things correctly even when they aren’t asked to.
One of the problems with testing is that it is a difficult task to estimate how well tested a feature or product actually is. It requires good engineering judgement. To have a firm sense of this, you need to be a good engineer, you need to know everything about the feature and you need to inspect the test suite carefully. There are metrics, such as the number of tests or various kinds of code coverage measures, which are OK to use, but they are not replacements for good engineering judgement. So if employee A does poor testing and employee B does good testing, it’s not necessarily going to be obvious that this is the case without looking closely. Both engineers delivered their project but employee B took longer. Maybe there is something wrong with employee B?
It’s OK, we’ll just count the number of bugs reported later and then we’ll know who did a good job. Employee B took longer. Now we also see more bugs reported in what he did. This employee B is real trouble. Well maybe employee B’s project was also more complex and more important to customers, so they used it more and found more of the few bugs that it did have. Maybe employee A’s project had many more bugs, but nobody used it, so it stayed at zero bugs reported. So counting bugs, as a metric by itself, is not great. It’s not a replacement for good engineering judgement.
If your CEO complains about your project having bugs, he probably just doesn’t know that zero is a hard number. Right? You can explain this to him. If it’s hard for your manager to tell whether proper testing has been done, what chance does your CEO have of figuring this out?
Well, OK, maybe figuring out if testing is good is just hard. But, surely, once a bug has been reported, we have to value our customers’ concerns and fix them quickly (true enough!). Turns out it’s quite easy to tell if your dev team is doing a good job fixing customer bugs - just ask the customers. We should probably reward engineers that are responsible for fixing bugs, because we need engineers to do this and they sometimes don’t want to. In fact, this employee A looks like a real star - he delivers all his projects quickly and he fixes more bugs than anyone else. There is no metric that is a replacement for good engineering judgement.
The CEO may notice that customers are sad about the bugs but loyal customers do in fact appreciate the close relationship that they have with the company’s dev team resulting from these quick bug fixes. Zero is a hard number, but our policy to focus on and reward bug fixing is working.
So, if you can’t tell, this is not a great situation this company is finding itself in, but everything looks reasonable. That’s the problem. Focusing on fixing bugs quickly is a fine idea, but the question is why there are so many bugs to fix in the first place. But how many is too many? I can’t tell you a specific number (well, OK, 37, that’s too many). Nobody can. There is no metric that is a replacement for good engineering judgement.
What happens if an employee notices that we aren’t doing a lot of testing and proposes to do more about testing? You are going to reduce the team’s apparent development velocity for a time if you do this. That doesn’t sound appealing. Worse, suppose this testing turns out to be effective. Then you’ve now revealed that your project in fact had many more bugs than it seemed. Your dev velocity will also now be even slower as you fix the sudden influx of bugs that your own testing revealed. So what does this look like externally to your team? Well, it might look like you first suggested doing less (to do more testing), then your project suddenly has way more bugs, then you did even less than you said you would (to fix more bugs) and, through all this, you’ve delivered nothing that any customer is happy about (for now). You are saying that you are now doing better on bugs, but actually the number of bugs reported against your project (by your own new tests) is above that of other projects in the company. So your testing effort is a success on its own, but what do things look like externally? This situation certainly calls for some careful management of perceptions.
I’m not an engineer specializing in testing, but testing is one of the underdone aspects of many projects I’ve been on, so I’ve had occasion to work quite a bit on it because I thought it needed improvement. So, in fact, I have personally set off a cascade of events in my career like the one described above. This is a direct quote from the manager at the time: “you found more bugs in the past two weeks than our entire team did in the past year”. This was a team with a subteam doing just testing (it wasn’t their fault - they were doing their jobs in the way that they were told to do it).
I ended the story at its low point. What happened immediately after this darker chapter is that customers were still reporting bugs, but now, more often than not, the answer was: “This is already fixed in the latest version, please update.” A while later, the number of bugs reported fell dramatically. The team had been spending half their time fixing customer bugs before any of this started, now it was much less than that. So development velocity and morale were significantly improved. Bugs are far faster and easier to diagnose and fix if you have small tests to find them up front, instead of having to collaborate with a customer to figure out what is wrong later. If there is a bug somewhere in a large customer AI model, this can be very challenging and time-consuming to diagnose (it can take weeks of work to diagnose one bug). That’s the primary source of the quite significant speed-up in team velocity that occurred. Testing improves team velocity. That can just take some time to materialize - both on the way up and on the way down.
There was also another source of improved team velocity. I didn’t just write a bunch of tests myself - though I also did that. The more important thing that I did was to improve the testing infrastructure of the project. If testing is lower status work, your top engineers may not look that much at what they can do to improve testing and its infrastructure. Especially not if you then have a subteam that does the testing instead of the people writing the features doing the testing. That subteam may be expected to take instruction on what and how to test, not to improve testing infrastructure. So then no one is expected to improve testing infrastructure.
Testing AI software isn’t easy or simple and neither is testing infrastructure. What I did was, first, to reduce the amount of boilerplate involved in writing a test. So, and this is not an exaggeration, you could write a test in 3 simple lines that would have taken 30+ more complex lines before, and this improvement applied across tests. This wasn’t easy to do, it required careful API work, and the effect was more significant than it may sound like since it makes people more keen to write many tests. So it doesn’t just save you some time when writing the test, it improves testing in other ways, too. Previously, often a file would contain only a single test. Now, files could contain many tests because they were not so big. Even this is a significant improvement - it’s just easier to keep track of less code.
I also wrote a fuzzer, which found a bunch of bugs. It was based on taking existing tests and automatically making them more complicated in various ways that didn’t change the result of the test. This was very successful, acting as a force multiplier on the existing number of tests, and I would recommend that approach for any AI compiler. So you write one test, but behind the scenes it turns into 20 tests. That’s a lot more productive.
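To make the approach concrete, here is a minimal sketch of that kind of fuzzing. The Graph type and the rewrite helpers are made up, standing in for whatever IR your compiler uses; the real transformations will of course be compiler-specific.

#include <cstdint>
#include <functional>
#include <random>
#include <vector>

// Hypothetical stand-in for the compiler's graph IR.
struct Graph { /* nodes, edges, ... */ };

// A rewrite that must not change the numerical result of the graph, e.g.
// wrapping a tensor in x + 0 or x * 1, or inserting a transpose/transpose pair.
using IdentityRewrite = std::function<void(Graph&)>;

// Expand one hand-written test graph into several harder variants.
std::vector<Graph> FuzzVariants(const Graph& base,
                                const std::vector<IdentityRewrite>& rewrites,
                                int num_variants, uint32_t seed) {
  std::mt19937 rng(seed);
  std::uniform_int_distribution<int> how_many(1, 4);
  std::uniform_int_distribution<size_t> which(0, rewrites.size() - 1);
  std::vector<Graph> variants;
  for (int i = 0; i < num_variants; ++i) {
    Graph g = base;  // start from the original test graph
    for (int k = 0, n = how_many(rng); k < n; ++k) {
      rewrites[which(rng)](g);  // apply a random result-preserving rewrite
    }
    variants.push_back(g);
  }
  return variants;
}

// The test harness then compiles and runs every variant exactly like a normal
// test, requiring each one to produce the same output as the unmodified graph.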
This work was at first somewhat hard to sell as a positive. During the dark chapter period, which lasted a few weeks, I had caused everyone to now have to spend almost all their time fixing bugs, which is usually a software engineer’s least favorite activity. The view of this effort was much improved once we got past the dark chapter period. The well of bugs ran dry and things were looking up.
What did the testing subteam think? They were actually quite happy. If you write tests for a living, your job is going to be more fun if you can write many tests quickly. It’s also more fun if you can write one test and then a fuzzer automatically turns it into 20 tests and then you find many more bugs. You can probably see how the status of people involved with testing rises if they find more bugs. Which they will if given proper tools. I think it also helped morale that I was talking about their work as something very important, which of course it was.
This all led to the testing subteam having some extra time due to the now increased productivity of writing tests. They had previously been expected to use the project’s APIs to write tests, but not to inspect how the code inside the project worked - it was quite complex. I proposed that the testing subteam spend some of their now freed up time to do a series of improvements to the project on the inside of the code, primarily long overdue refactorings, so that they became familiar with the internals of the project, too. I also suggested that they write a reference backend for the AI compiler, which, apart from such a backend being yet another boost to testing productivity, required them to understand how to implement every op in the whole compiler (as opposed to testing each op from the outside). It’s easier to test a project if you know how it works on the inside. It turned out that they were perfectly capable of doing such work, they just hadn’t been expected to do such work previously. I would have just removed the entire notion of a test subteam and mixed this team in with the rest of the team, though we didn’t do that.
Was I expected or hired to do this kind of work? Absolutely not, though I didn’t have trouble justifying the time I spent on this after it got going. The whole thing took around 2 months of my time. It was successful enough that the testing approach that I used was disseminated more widely within the company through a company-specific avenue for such things. Don’t misunderstand this story - it was a great company and a strong team.
What about safety certifications? What if this team had been subjected to a safety certification process, maybe that would have led to the same changes that I made? No. I’ve been involved in such a process and nothing of what I did here would have been the result of a safety certification process. So you can perhaps see why I’m skeptical of safety certifications, even though they may indeed have some legitimate positive effects. I think that they are more a legal tool than an engineering tool. I suppose that legal tools can be important, too.
Maybe you think this is a story where I say that I’m a great engineer. Well, I do like to think so, yes, but you might have missed the bigger picture here. This was a story about the importance of testing and testing infrastructure and some of the challenges that get in the way. You underestimate these areas at your peril. I’ve never joined an AI software team that didn’t have some need for improvement in this area in my opinion (which partly explains how I was so useful on this - wasn’t my first rodeo). I think the whole AI industry is underestimating the importance of testing and benchmarking, not in one company or in one place but everywhere.
In the previous section, there is a perspective of how even a single bug can cause deaths and public embarrassment for you and your customers. In this section, we are talking about a high volume of bugs. So it seems that there is some kind of mismatch here? Yes, there is. That’s what I’m saying. You’ll find this mismatch everywhere in the AI industry.
If you still think that testing is or should be lower status work, then maybe read this story again. I have to say that I disagree with you. Testing AI software is not easy and it matters how you do it.
Kinds of AI software bugs and their impact
What can the impact of bugs in AI software be? There are different levels of AI software bugs:
No service bug: A no service bug is when the user consistently gets an internal error or the system is obviously broken in some other way. These bugs are obvious and bothersome, but they are the least serious kind of error. A self driving system with a no service bug like this will not be released to the world until the bug is fixed, so it’s not that serious. It’s just bothersome.
Intermittent no service bug: Like a no service bug, but it only happens some of the time. Maybe rarely. This is much more of a problem, since such bugs take time to be noticed and so impact customers to a greater extent. For example, a self driving system with an intermittent no service bug might be released to the public, if the error does not occur during system testing, and then cause deaths in the wild if the bug does occur there.
Correctness bug: With this kind of bug, the system isn’t obviously broken, and there are no errors, but what is happening is not correct. This is a very serious kind of bug in the context of AI software. These bugs can be extremely hard to diagnose and they can go unnoticed for extended periods of time. A self driving system with such a bug will probably not be released to the public, since the bug will likely be detected during testing, but that isn’t guaranteed.
Intermittent correctness bug: This is the worst kind of bug and it is the kind of bug that Anthropic was dealing with in their public report. You can see how such a bug can escape testing efforts and go unnoticed for a long time even while it keeps causing serious problems. A self driving system with such a bug may well be released to the public.
As a customer, you might notice that no service bugs are not that serious for your business. Once something works, it’ll keep working. So you can deal with that. However, I would suggest considering that these different kinds of bugs are correlated. An AI system with many bugs of one kind is likely to have many bugs of the other kinds, too. So I would not take no service bugs lightly, even though their direct impact is limited. They are a red flag that should have you worried about encountering other bugs that may well impact your business more seriously.
Note that here we are not talking about AI that makes mistakes. That’s different. An AI mistake is when the AI functions the way it’s supposed to, but it just can’t figure out the right thing to do. This is a problem for AI researchers to deal with - they need to come up with a better transformer or use a better dataset to train the AI. That’s not what we are talking about here. We are instead talking about situations where the AI would do the right thing if the software that realizes its internal computations were functioning correctly, but that software is not functioning correctly. That’s an AI software bug (or potentially hardware bug), not an AI mistake. No matter how well an AI is trained or structured, it might still do the wrong thing if there is a bug in the underlying software that runs it. So buggy AI and wrong AI are different.
What can the impact of AI bugs be?
AI assistants: AI assistants, such as Anthropic, ChatGPT or Gemini, are used for advice in all areas of human activity, including suicide prevention and medical advice. There is really no limit to what might result if an AI assistant uses buggy software, since there is no limit to the potential actions that the wrong person might take if given the wrong advice at the wrong time from a source that they trust.
Is it really feasible that an AI assistant could start saying evil things due to a bug? Consider that one of the possibilities for an intermittent correctness bug is an intermittent sign error. Not a likely bug, but it is a perfectly possible one. Be aware that AIs internally contain many vectors and an AI model may well have directions of such vectors that correspond to various kinds of evil behavior. An AI assistant will then of course have been trained to avoid such vectors. However, if there is a sign error, you might flip a vector or one of its components from a direction of “goodness” to a direction of “evilness” with a corresponding flip in behavior of the AI. So an intermittent sign error bug could in fact lead to an intermittently evil AI assistant that’s randomly good most of the time and optimizing towards evil when you aren’t looking. So buggy AI can potentially be quite a bit more serious than simply somewhat wrong AI.
Medical diagnosis: AI is today used for medical diagnosis, such as finding cancer on a mammogram. In such cases, currently, AI is usually used to support human judgement, so a faulty AI verdict may be corrected by a human, but humans make mistakes, too, so that isn’t guaranteed. If a hospital that I will use does use AI in their diagnosis procedures, even if only in an advisory capacity, I would very much appreciate it if they avoided using buggy AI software.
Self driving: There are already self driving cars on the road and people fall asleep while “driving” them even though they aren’t supposed to. In the future, there will be full self driving cars on the roads where you are allowed to fall asleep or there might be no human occupant at all. These are already on the road in a few areas. AI software bugs here can of course lead to traffic accidents and deaths.
These are three particularly serious applications, but there are many other applications of AI where bugs are still serious, even if not quite that serious. Ask your favorite systolic array, I mean LLM, and it’ll give you a long list of such applications.
Ah, but it’s OK, none of our customers are planning on using our product in a place where it might kill someone. Well, not right now and not as far as you know. If your AI software becomes popular, you will never know all the places it will be used. And, in any case, if a big order comes in from a medical diagnosis company, are you going to tell them that they shouldn’t use your product because it’s buggy? Probably not. You’ll take the order and hope for the best.
Maybe that medical company will require a safety certification process, but as I’ve said, these certification processes don’t assure what it sounds like they assure. You think “certified safe” software doesn’t have any serious bugs? Zero is a hard number. So the question is how effective the certification process is at finding bugs. Somewhat effective. Much of a safety certification involves making a list of all possible bugs you might have and then doing paperwork to document that you have tested for each of them. If you are honestly bad at coming up with possible bugs in your software, then your certification will be easier to complete. If I am to receive surgery from an AI robot, what I want to know is that the people who created it were conscientious and competent. That is more powerful than any certification. Of course, I won’t object if they also have a certification.
I suggest you take software correctness and testing seriously. I also suggest that you prefer to prevent bugs from escaping your software development process instead of focusing on fixing bugs quickly after customers report them - even though, of course, you do need to fix a bug quickly if a customer does report one.
If you are buying a lot of AI hardware, you might want to ask your vendor about how many bugs they’ve ever had escape to their customers (they won’t know, or if they do don’t expect the number to be small, but watch how they respond) and what their testing story for their hardware and software is. You may have a hard time evaluating the answer, but if you get the sense that they aren’t taking that side of the business seriously, that’s something I’d be concerned about in your place. If they only emphasize their bug fixing turn-around time, and don’t have answers on their efforts on preventing bugs (e.g. testing), that’s maybe not great. Though maybe no one else ever asked before, so it’s OK if the sales team needs to go back to their dev team to ask. If they don’t take testing seriously, what that really means is that they aren’t taking your interests as a customer seriously. At least that’s how I would view it in your place.
AI hardware and software infrastructure
The first thing to know is that you need a significantly large server farm to run your tests if you will be developing large-scale AI software. During TPUv2 development, our XLA testing fleet of TPUs was so powerful that it would have been in the top 5 of world supercomputers at the time, if we ignored that such lists require higher precision computations than what TPUs do. To be fair, this happened because TPUs are incredibly fast, so we had many TPUs, but not as many as that makes it sound. Even though we had a lot of TPUs available for testing XLA, we still would have liked more. This is because you need many tests and ideally these should all run before every change to the software repository, so that bugs never even make it into the repository. This can require a lot of testing hardware.
It is quite important how long it takes to make a change to the code, compile (if in a compiled language) and run the tests. You will want a parallelized compilation flow where compilation happens in a distributed way rather than locally, since otherwise it will be slow. The same for testing - you will want a distributed system where tests can be run in parallel across many machines, not just locally. Critically, you will want this to be easy to do from the command line. At Google (and externally if you use Bazel), you can compile and run all relevant tests by simply typing “bazel test” in your code directory. It will quickly compile and test in a distributed fashion automatically from that one invocation. If your workflow is not as good as that, I don’t know why you wouldn’t improve it until it is. And consider using Bazel for building and testing, it works well.
A particularly bad situation here is if a developer needs to book a machine to run tests on, has to log into it and then maybe has to install some things before running tests. Repeatedly (one-off is maybe OK). Don’t do it that way. Just use Bazel - it allows you to declare that a test requires a specific kind of hardware and it will make it happen (well, as long as you provide that kind of hardware to it, of course). At Google, you can type “bazel test my_test” and if my_test is set up that way, this will run in parallel across all current kinds of TPUs and a few types of GPUs, involving reaching out to many different machines, each with their own kind of hardware. It happens in seconds. You can also tell it to run only on a specific kind of hardware. You can have that where you work, too, if you use Bazel.
[[[ Irrelevant aside Why call it “Bazel”? Seems like an odd name. Well, internally at Google, this system has always been called “Blaze” for “blazing fast”. When it was open sourced, I guess they wanted to distinguish the open version from the internal version, so they called the open version “Bazel”. It’s a bit of an odd name, but you can see the connection to “Blaze”. ]]]
The modify, compile, test cycle time is important because it has a strong effect on developer productivity. If your developers’ time isn’t expensive and valuable, you probably didn’t hire the right team to work on AI software. Proper infrastructure is a large force multiplier on your development team’s efforts.
If it takes too long to run your tests (more than a minute is already a long time, I think), buy more hardware. Keep doing that until you can’t afford it anymore. If that never happens, you have either a very large budget (unlikely), or your team isn’t adding enough tests (very likely). If you are somewhat into the development process and you as a manager aren’t getting a request for more funds to buy a larger fleet of test machines, something is wrong. You should figure out what’s going on. Perhaps your team doesn’t feel empowered to ask for what they need. Perhaps they didn’t write any tests. Something is wrong. They should eventually be complaining that they don’t have enough test hardware no matter how much test hardware they already have.
You should buy a lot of hardware, but, eventually, you won’t have more money for more hardware. Even at Google that’s how it was eventually (though we did get a world top 5 supercomputer out of it, so hey, that was pretty nice). So what then? The most obvious solution is to ask people to stop adding so many tests. I’ve seen this proposed and used as the primary solution. That’s bad. ABAT - Always Be Adding Tests. But... then how do we solve the problem that it will take longer and longer to run all these tests? That could tank your dev team productivity, so that’s no good. What to do?
The first thing to do is to optimize your tests. The easiest way to optimize code is to run a profiler and generate a flame graph (if profiling and generating a flame graph takes more than a single command line invocation in your team’s setup, why not make a script for it?). Tests are code. So profile your tests. This will be very effective if you’ve been underway for a while - you are surely wasting time doing many silly things in your testing code. You might get a 10x this way, it can happen. A common first discovery upon doing this on an AI compiler is that generating large random arrays, commonly used to test AI software, takes way longer than you’d think. So cache those and reuse them. That alone can be a large speed-up.
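As a sketch of the kind of caching I mean (the function name and array type are made up; adapt to whatever your tests use):

#include <cstddef>
#include <cstdint>
#include <map>
#include <random>
#include <utility>
#include <vector>

// Cache of large random test inputs, keyed by (element count, seed).
// Tests only need "arbitrary" data, not freshly random data, so reusing the
// same arrays across tests is fine and removes a surprisingly large cost.
// (Not thread-safe as written; fine for a single-threaded test binary.)
const std::vector<float>& CachedRandomArray(size_t num_elements, uint32_t seed) {
  static std::map<std::pair<size_t, uint32_t>, std::vector<float>> cache;
  const auto key = std::make_pair(num_elements, seed);
  auto it = cache.find(key);
  if (it == cache.end()) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
    std::vector<float> data(num_elements);
    for (float& x : data) x = dist(rng);
    it = cache.emplace(key, std::move(data)).first;
  }
  return it->second;
}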
If you profile your tests, you might also discover that your actual product is slow in some cases where you didn’t expect it to be. Congratulations, you just found a performance bug in your software, and fixing it will also speed up your tests. For example, if you make AI compiler tests with very large graphs, as you should, then you might well find that your AI compiler is very slow for such cases - I once discovered an O(N^6) algorithm in the compiler I was working on this way. That’s something to fix. It’ll speed up your tests and please your customers if they use large graphs. If you do this work, of course document numbers on the impact of your work for your performance evaluation / promotion case in the future.
While you are profiling your tests, pay attention to the utilization of the AI hardware that you are running your tests on. The utilization of your AI HW during the testing process will usually be very low, e.g. less than 1%. This happens because many tests use a lot of CPU cycles to compile a kernel, prepare inputs and inspect outputs for correctness. The actual kernel that runs on the AI HW usually completes very quickly. AI HW is very fast - that’s the whole point of AI HW. So your tests are likely mostly CPU bound, running on a $200 CPU, while your $10k accelerator (if you can find one that cheap) is 99% idle. So in this case you can buy twice as many $10k accelerators to double your testing capacity. That’s industry standard, I’m not joking, it’s not something funny, I’m giving you serious information without exaggeration here. This happens largely because teams don’t realize that the AI HW is poorly utilized during testing (Profile my tests? Why would I do that???). But, even when teams do realize the issue, there might not be the will to fix it. I’ve seen that as well.
The trouble is that solving low utilization of AI HW during testing is somewhat tricky. Each test needs to use the device, so a test will commonly acquire exclusive use of the device, run the test and then release the device. So if you have one device, running the tests in parallel on the CPU doesn’t help since they are serialized on acquiring the device, even though the device is mostly idle while it is acquired.
What you need is an improved way to write tests, and improved testing infrastructure, such that, naturally and without special effort in any individual test, a test does as much work as it can before acquiring the device (preparing inputs and compiling the test kernel). It then quickly acquires the device, transfers the inputs to the device, runs the (already compiled) test kernel, transfers the output off the device and immediately releases the device. Only after the device is released does the test inspect the output for correctness. That’s how to do it - you can even pipeline these steps for further optimization, but that’s less critical. Your tests will probably still be CPU bound, with a significantly idle device, but if you have 30 cores on the CPU, this might give you a ~30x improvement in testing throughput. So that’s nice. Now you can write 30x as many tests before you have a problem with the speed of testing again. ABAT - Always Be Adding Tests.
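A minimal sketch of that phase structure, with made-up Compile/Prepare/Run helpers standing in for your real compiler and runtime; the only point here is the narrow scope of the device lock:

#include <mutex>
#include <vector>

// Stand-ins for the real compiler/runtime pieces (hypothetical).
struct CompiledKernel {};
struct HostArrays { std::vector<float> data; };

CompiledKernel CompileTestKernel() { return {}; }          // slow, CPU only
HostArrays PrepareInputs() { return {}; }                  // slow, CPU only
HostArrays TransferRunTransfer(const CompiledKernel&,      // fast, needs device
                               const HostArrays& inputs) { return inputs; }
bool CheckOutputs(const HostArrays&, const HostArrays&) { return true; }  // CPU only

std::mutex device_mutex;  // stands in for exclusive ownership of the device

bool RunOneTest(const HostArrays& expected) {
  // Phase 1: everything that does not need the device (usually the slow part).
  CompiledKernel kernel = CompileTestKernel();
  HostArrays inputs = PrepareInputs();

  HostArrays outputs;
  {
    // Phase 2: hold the device only for upload + execute + download.
    std::lock_guard<std::mutex> hold_device(device_mutex);
    outputs = TransferRunTransfer(kernel, inputs);
  }

  // Phase 3: correctness checking, again without holding the device.
  return CheckOutputs(outputs, expected);
}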
Won’t multiplexing tests on the AI HW like this lead to race conditions and flaky tests? If you do it wrong, yes. If you do it right, no. But worrying about flaky tests (tests that fail but only sometimes) is good. Flaky tests are bad. Make sure you don’t have flaky tests.
Won’t this make every test more complicated to write? If you do it wrong, it will. But if you hide all these steps behind testing infrastructure and good testing APIs, it will be no more complicated to write a test with this setup than without it. I suggest doing it that way.
Anyway, this is starting to sound like somewhat complicated software development, isn’t it? Profiling, optimizing, distributing across machines and parallelizing within machines even perhaps software pipelining. Yes, that’s right. Testing AI software isn’t a lightweight easy activity.
What do good testing APIs look like for AI software? This is what a test for 1 + 1 = 2 should look like for an AI compiler:
ExpectEq(AddInput(1) + AddInput(1), 2)
This one line does a lot of stuff:
1. Create an AI model graph.
2. Add two scalar inputs to the AI model graph.
3. Add an addition node of the two inputs to the AI model graph.
4. Add an output node to the AI model graph connected to the addition node.
5. Compile the model graph to a kernel that can run on the device.
6. Create two arrays on the host, each containing a scalar 1.
7. Acquire the device, so that other tests cannot use the device.
8. Transfer the compiled binary to the device.
9. Transfer both inputs to the device.
10. Reserve memory to hold the output on the device.
11. Run the binary on the device, passing in the addresses of both inputs and the output.
12. Wait for the binary to complete.
13. Transfer the output scalar from the device back to the host.
14. Release the device, so that other tests can use the device.
15. Compare the transferred output array to the expected result, in this case a scalar 2.
16. Report to the testing infrastructure whether the test succeeded or failed.
17. Deallocate memory for the AI model graph, binaries, inputs, outputs etc.
For some AI compilers, this same test will require more than 17 lines of code, and from the list above, you can maybe tell how such a thing is possible. But, in fact, you can create AI testing APIs so that the above one line does all of that. There is enough information in that one line to do all of these steps. If your AI testing infrastructure isn’t as nice as this, I suggest that you work on it until it is. You’ll also notice that the idea of acquiring the device only when necessary is already baked into this API (this also lets you later add support for pipelining transfers between tests without changing any of the tests).
I’d also suggest adding an automated fuzzer to expand each test into multiple other more complicated tests. So from one line you can generate many tests. You’ll also want a reference backend, so that a test written like the following still checks that 1 + 1 = 2, with the correct output inferred on the CPU by the reference backend, which is simple and therefore (more often) correct:
ExpectEq(AddInput(1) + AddInput(1))
This is not useful for a trivial case like this, since it’s not a problem to write out that the answer should be 2, but it’s very useful for cases where the output is very large (e.g. a million numbers). Your fuzzer can also make use of such a reference backend.
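To make it concrete that a one-line API like this is achievable, here is a toy sketch of such an API. All names are made up, and a trivial in-process evaluator stands in for the compile-and-run-on-device steps; a real implementation would call into your compiler and runtime at the marked points.

#include <cassert>
#include <cmath>
#include <cstdio>
#include <vector>

namespace testapi {

enum class Op { kInput, kAdd };
struct Node { Op op; float value; int lhs; int rhs; };
std::vector<Node> graph;                        // implicit per-test graph

struct Expr { int id; };

Expr AddInput(float v) {                        // parameter node plus its host value
  graph.push_back({Op::kInput, v, -1, -1});
  return {static_cast<int>(graph.size()) - 1};
}

Expr operator+(Expr a, Expr b) {                // add node over two existing nodes
  graph.push_back({Op::kAdd, 0.0f, a.id, b.id});
  return {static_cast<int>(graph.size()) - 1};
}

float Evaluate(int id) {                        // stands in for compile + run on device
  const Node& n = graph[id];
  return n.op == Op::kInput ? n.value : Evaluate(n.lhs) + Evaluate(n.rhs);
}

void ExpectEq(Expr result, float expected) {
  float actual = Evaluate(result.id);           // real life: device execution happens here
  assert(std::fabs(actual - expected) < 1e-6f); // real life: report to the test framework
  graph.clear();                                // release per-test resources
}

// A one-argument overload, ExpectEq(Expr result), would compute the expected
// value with the simple CPU reference backend instead of taking it as an argument.

}  // namespace testapi

int main() {
  using namespace testapi;
  ExpectEq(AddInput(1) + AddInput(1), 2);       // the one-line test from the text
  std::puts("ok");
}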
Your reference backend will be used for large outputs, and it should be implemented simply, so that you can have more confidence that it is correct - which means you cannot (should not) use complex optimizations inside it. So if you look at your test profile (you are profiling your tests, right?), you are going to see after a while that the reference backend is by far the biggest factor slowing down your testing. What to do? This one is a bit tricky. You need a simple reference backend, so maybe buy faster/more CPUs. What else? Well, here’s an idea that I think is good, even if it is a bit of trouble:
If the output from your device was previously correct upon comparison with the reference backend, record a stable 256 bit hash code of the previous correct device output. If the hash code of the current output from the device is equal to that which was recorded as correct previously, then you can mark the test as passing without running the reference backend again. This way the reference backend will rarely be run (most code changes do not change the output of most tests), yet you didn’t lose any test coverage.
You might also leverage this into a nightly determinism test: if you run the tests twice with no code change, the outputs should be bitwise identical. If they aren’t, that’s a determinism bug.
There are some complications with this idea, like how to store and retrieve the hash codes, but I think it’s the best practical solution - better than just having unnecessarily slow tests. Definitely don’t do this until the test profile shows that the reference backend is starting to be a problem, though. You will want to rerun everything with the reference backend once nightly to catch, in a timely manner, some very rare situations where this approach gives false negatives (e.g. the reference backend changed in a way that would now fail the comparison, or the lookup is by test name and the test changed, but a bug kept the device output identical to the previously recorded one when it should have changed). Couldn’t you store the whole output instead of a stable hash code? Yes, but these outputs can be large and would consume bandwidth to transfer, so you might not want to do that, though you could. In that case you might as well cache the reference backend outputs instead. You can’t cache reference backend output hash codes, since the comparison between device output and reference output is usually approximate, not exact, for reasons such as floating point reassociation.
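A sketch of that check, with a stand-in hash function and an in-memory map; in practice you would want a real stable 256-bit hash (e.g. SHA-256 over the raw output bytes) and a persistent store for the recorded hashes:

#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Stand-in for a stable 256-bit hash. Do not use std::hash for this purpose:
// it is not guaranteed to be stable across builds or machines.
uint64_t StableHashOfBytes(const std::vector<float>& output) {
  uint64_t h = 0xcbf29ce484222325ull;  // FNV-1a, for illustration only
  const unsigned char* p = reinterpret_cast<const unsigned char*>(output.data());
  for (size_t i = 0; i < output.size() * sizeof(float); ++i) {
    h = (h ^ p[i]) * 0x100000001b3ull;
  }
  return h;
}

// test name -> hash of a device output that previously matched the reference.
std::unordered_map<std::string, uint64_t> known_good;

bool OutputMatchesReference(
    const std::string& test_name, const std::vector<float>& device_output,
    const std::function<std::vector<float>()>& run_reference_backend,
    const std::function<bool(const std::vector<float>&,
                             const std::vector<float>&)>& approx_equal) {
  const uint64_t h = StableHashOfBytes(device_output);
  auto it = known_good.find(test_name);
  if (it != known_good.end() && it->second == h) {
    return true;  // identical to a previously verified output; skip the reference
  }
  // Fall back to the (slow) simple reference backend.
  if (!approx_equal(device_output, run_reference_backend())) return false;
  known_good[test_name] = h;  // remember this output as verified
  return true;
}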
Most of your tests are going to be smallish unit tests, that you can run before every code change, but you are also going to need larger-scale tests that take too long to run before every code change. E.g. if your hardware is for training, one of your tests should be to train e.g. a transformer from scratch and the test criterion is that the converged accuracy of the trained model at a given training step is within expectation. If your hardware is for inference, you should be looking at accuracy end-to-end of e.g. a pretrained transformer. You might want to run this for multiple kinds of models. If you aren’t running such tests daily (nightly) or at least weekly in an automated way, that’s a problem. You’ll also want automated tracking and graphing of how model accuracy is changing over time so you can spot regressions (or improvements).
If tests take more than a few minutes to run, people will sometimes skip them when they think it’s probably OK, occasionally leading to tests that fail in the shared code repository. This is a big problem for everybody every time it happens, so people get understandably angry about it. If running the tests is slow or somehow troublesome in any other way (“just follow this 10 step process for booking a machine that you run the tests on”), the real root cause for this situation is your testing infrastructure. If I hear of a situation where this keeps happening, I already know that their tests are slow or bothersome to run in some other way. That’s why it happens. I’ve seen managers and infra teams be very angry at engineers for skipping a 2 hour (!!!) test run, causing test regressions to be submitted in rare cases, when the real underlying trouble is that no one ever bothered to profile and optimize the tests. And no one even asked for more test hardware, either. You need good testing infrastructure and a bunch of test hardware.
In the past two paragraphs, I wrote that you need tests that take too long to run before every code change, and I also wrote that if you get regressions into your code repository, that’s because your tests are too slow to run. Seems like a contradiction - it has to be fast, but also you need slow tests. How to reconcile these two ideas? The consistent rule here is that if a test is very slow, it might get regressed because it won’t and shouldn’t be run all the time, only sometimes (e.g. nightly). You’ll just have to deal with that as a permanent situation, is what I’m saying. However, ideally, you will have many fast small tests that test your entire product, so that you only very rarely have cases where a slow long-running nightly test fails but none of your fast per-code-change tests fail. That’s how you deal with this, by reducing the rate of such occurrences to be low enough that it is tolerable. If slow-test regressions are happening often enough to be bothersome, then you don’t have good enough coverage from your fast tests.
Here is something that I would suggest to avoid: It is possible to save significant test load by having everyone submit their code changes without running tests, or running just a few tests, and then only running the tests every N’th (e.g. 10th) code change on the shared code repository. In case of a failure, a code change is then automatically identified (bisected) and reverted. This is bad for developer productivity and also morale (“who submitted a regression again?”). If you don’t have any budget left (did you ask?) and you already profiled and optimized the tests (did you, though?), maybe you have to do it that way, but it’s not good. I wouldn’t do that.
You might also want to run your test suite under Valgrind, the various LLVM sanitizer modes, your favorite static analysis and coverage tools and, these days, AI analysis. This doesn’t have to happen often, some of the modes are very slow, but it’s imprudent never to do it. Once monthly or more often might make sense. Certainly always before releases. If you end up with e.g. 30 such tool configurations, you can run one each night, to avoid having to deal with the combined findings of 30 tools at the same time.
XLA has an excellent open source test suite. If you are developing software for your own AI HW, you might consider adding an XLA backend (not that hard) so that you can run XLA’s test suite against your software and hardware. You might consider doing so to access XLA’s test suite even if you aren’t interested in using XLA for anything else. Though XLA is a fine choice. It’s how Google does AI.
Benchmarking infrastructure
Everyone on your team needs easy access to running benchmarks. Any change, even one not intended to affect performance, can change performance for better or for worse, and it’s important for your team to be able to tell what’s going on here. It is also much easier and more motivating to do performance work if you can see right away that your change was a positive +X% for some X.
You need benchmarks both for how fast your compiler is at compiling and for how fast the generated binaries are at running an AI model. A code change can make one model faster and another model slower, so it’s important to measure across a variety of models - you can’t just look at a single model. Unless your project only supports a single model - which is starting to perhaps be a possible situation with everything being a transformer. You of course want your customers’ models to be included in your set of benchmarks, so that your team knows right away if you are making your customers’ models slower. Regressing customer models is easy to do by mistake, so it’s important to be able to tell if you did it. What if your customers want to keep their models secret from you? Well, that’s OK, but then they might get regressed (run slower). That’s the deal.
One thing that any team eventually learns over time is that some powerful optimizations that improve many models overall might still regress some models and it isn’t always practical to fix the regressions (sometimes it is, of course). So you are going to have to accept regressions sometimes. It’s a case by case judgement call. Requiring no regressions ever is not a practical policy, not even for customer models. What you’ll want to make an effort to avoid is large regressions in customer models that make their way to that customer and then there is nothing the customer can do about it. Keep in mind the possibility that if your team did many separate optimizations for a release, then perhaps all your customers’ models are improved, even if some of those individual optimizations caused regressions on their own. Don’t let fear of waves prevent a rising tide from lifting all boats (this analogy breaks down because tides are also cyclical, but don’t think about that).
It is common, maybe even industry standard, that running benchmarks is an involved process that requires a specific team member, who knows how to do it, to spend some time on it each time. So people will ask that team member to benchmark their code changes and, if the team member can be bothered that day, you might get the result the next day. The team member will probably only run one or two benchmarks, because otherwise it’s too much trouble. This is normal. It’s also not a good situation. Instead, you want a benchmark report like this to be reported automatically if you simply write “run_benchmarks” on the command line:
Performance report for change XYZ
                    Before    After    Speedup (1.0 = neutral)
model_benchmark_A   120 ms    60 ms    2.00
model_benchmark_B   ...
...
Geomean                                1.42
Geomean means geometric mean, which is the correct mean for ratios (like speedups). The report compares the time with and without a code change - that’s what you need in order to know the performance impact of the change. You want a report for both compilation time and model benchmark time (always reported together). Ideally the time to generate such a report should be very low, but in reality it’ll probably take a while. If it’s more than an hour, that’s probably not good (you are profiling your benchmarks, right?). You want the command line to output a permanent HTTP link pointing to the report. That link can then be put into a code review, so the reviewers can see what the impact of your change is without having to rerun the benchmarks themselves, or it can be put into a report for your company, such as a written case for promotion.
If you don’t have this, I’d suggest to work at it until you do. I don’t know why you wouldn’t. It is a force multiplier for your entire team.
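For reference, the geomean line of such a report is just the geometric mean of the per-benchmark speedups. A small sketch, not tied to any particular benchmarking framework, with example numbers:

#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

// Geometric mean of per-benchmark speedups, where speedup = before / after.
// Computed via logarithms, which also makes it obvious why a 2x win and a
// 2x loss cancel out to exactly 1.0.
double GeomeanSpeedup(const std::vector<std::pair<double, double>>& before_after) {
  double log_sum = 0.0;
  for (const auto& [before_ms, after_ms] : before_after) {
    log_sum += std::log(before_ms / after_ms);
  }
  return std::exp(log_sum / static_cast<double>(before_after.size()));
}

int main() {
  // {before_ms, after_ms} for each model benchmark (example numbers).
  const std::vector<std::pair<double, double>> results = {
      {120.0, 60.0}, {80.0, 88.0}, {200.0, 150.0}};
  std::printf("Geomean speedup: %.2f\n", GeomeanSpeedup(results));
}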
An important property of a benchmarking system is how noisy the numbers are. To measure this, run the benchmark a few times with a change that doesn’t do anything, so that before and after should be identical. If you don’t get a perfect 1.00 impact result every time, then your benchmark is noisy. An optimization that gives you 1% across your