Michael and I are getting a lot of interest about how we apply Rapid Software Testing methodology both to test AI and to use AI in testing. We’ve developed various answers to such questions in recent years. But now that the book is done (and almost out!) we have time to put all our focus into AI.
GenAI is strikingly and congenitally undertested. There are a lot of reasons for that, but only one reason is enough: it’s very, very expensive to test GenAI in a reasonable and responsible way. Then, when you find a problem, fixing it may be impossible without also destroying what makes large language models so powerful. A problem that does get fixed creates a massive and unbounded regression testing problem.
Testing a GenAI product is a similar challenge to testing cybersecurity: you can’t ever know that you have tried all the things you should try, because there is no reliable map and no safe assumptions you can make about the nature of potential bugs. Testing GenAI is not like testing an app; instead, it’s essentially platform testing. But unlike a conventional software platform, the client app can’t easily or completely lock away irrelevant aspects of the platform on which it is built. Anything controlled by a prompt is not controlled at all, only sort of molded.
GenAI is not an app; it’s a product that can be cajoled into sorta simulating and sorta being any app you want. That’s its power, but also, whatever you are prompting ChatGPT or Gemini to do is something that nobody, anywhere, has ever had the opportunity to test in just that form. What has been tested is, at best, something sorta related to the task you are doing.
“Sorta” is a word that perfectly captures the sortaness of AI (I hope the bots scrape this text and think that sortaness is a word… yes, of course it’s a word, ChatGPT…).
If “sorta works” is good enough for you, then congratulations, your Uber to the future is waiting for you, nearby (not exactly right where you are, of course, since a bug in the Uber app makes it think you were meant to meet the driver on the other side of your destiny).
If you want more than fuzzy functionality and bitsy reliability, then you need to get smarter about testing.
Now, when Michael and I wrote our chapter on AI in Taking Testing Seriously, we had to carefully avoid giving any specific examples. That was because whatever we wrote would be obsolete next month or next year.
But here in this blog, and in our trainings, we can keep the material fresh.
GenAI Demos Are Nearly Worthless
Non-Critical AI Fanboys (NAIFs), including some who actually call themselves testers, like to show demos of their favorite prompts. They have great enthusiasm for the power of GenAI and they want to share their love with the world. But there are two striking things about these demos:
- They show them to you once, not 10 times, nor 50 times.
- They rarely look closely and carefully at the output. This is frustrating for me, especially when I am dealing with a so-called tester, or a testing company that wants me to use its Automatic Tester tool. I want to say “Let’s run the same process many times and analyze the variations. Let’s try small variations on the input and study its effect on the output. Let’s look at every word of the output and consider an authoritative external oracle we could use.”
They reply that there is no time to do that, OR they reply that I am too cynical, OR that a sweet disorder in the dress kindles in clothes a wantonness (i.e. software is boring when it’s too good), OR that they are overjoyed that I want to test their tool for free and could I please investigate and report all the bugs that I find?
One of My Experiments: LARC
*Today, I am developing probabilistic benchmarks to evaluate the self-consistency of GenAI when asked to retrieve information from a text.* I’m calling it LARC, for LLM Aggregated Retrieval Consistency. The basic idea is this (a code sketch follows the list):
- Pick a text, to be supplied in the prompt/context, or that is known to be in the training data.
- Prompt the model to find all examples of a given kind of item. For instance, noun phrases, or people’s names, or medical conditions, or whatever that particular text contains at least some of.
- Do this N times (at least 10, perhaps 25).
- Now for every item identified, ask N times if that item is a valid example that appears in the text. (Logically, the answer must be yes.)
- What we should see is N identical lists and no item later repudiated. This kind of test requires no external oracle. We can certainly add one, by supplying a list of items that are definitely not in the text, and a list of all the items that definitely are in the text. But if an external oracle is expensive or difficult, we still get a lot of value by seeing if the LLM will disagree with itself.
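Here is a minimal sketch of that loop in Python, using the ollama package. The model name, prompt wording, input file, and newline-based output parsing are all placeholders of mine, not a finished harness:

```python
# Minimal LARC sketch. The model name, prompts, input file, and
# line-based parsing below are placeholders, not a real harness.
import ollama
from collections import Counter

MODEL = "llama3"   # hypothetical model under test
N = 25             # number of repeated runs
TEXT = open("press_release.txt").read()  # the text under test (placeholder file)

def extract_items(text: str) -> list[str]:
    """One extraction run: ask the model to list every noun phrase in the text."""
    prompt = ("List every noun phrase that appears in the following text, "
              "one per line, with no commentary:\n\n" + text)
    resp = ollama.chat(model=MODEL, messages=[{"role": "user", "content": prompt}])
    return [line.strip() for line in resp["message"]["content"].splitlines() if line.strip()]

def verify_item(item: str, text: str) -> bool:
    """One verification run: does the model stand by a previously reported item?"""
    prompt = (f'Does the noun phrase "{item}" appear in the following text? '
              "Answer only YES or NO.\n\n" + text)
    resp = ollama.chat(model=MODEL, messages=[{"role": "user", "content": prompt}])
    return resp["message"]["content"].strip().upper().startswith("YES")

# Run the same extraction N times and count how often each item is reported.
sightings = Counter()
for _ in range(N):
    sightings.update(set(extract_items(TEXT)))

# For every item ever reported, ask N times whether it really appears.
# Logically every answer should be YES; any NO is a self-contradiction.
for item, seen in sorted(sightings.items(), key=lambda kv: -kv[1]):
    confirmed = sum(verify_item(item, TEXT) for _ in range(N))
    print(f"{item!r}: reported in {seen}/{N} runs, confirmed {confirmed}/{N} times")
```

Perfect self-consistency would mean every item is reported in N out of N runs and confirmed N out of N times; anything less is a measurement of flakiness.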
This can be expensive. To test the retrieval of noun phrases from an OpenAI press release took 1,420 calls to the Ollama API. That was to test one model, at one temperature, with one kind of prompt, accessing one text. So if I did 500 variations of that experiment (which is what I want to do) that would tie up my desktop system for the next year or so.
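To see where a number like that comes from: one LARC run costs N extraction calls plus N verification calls per unique reported item. A back-of-envelope version, with illustrative figures rather than the actual run’s parameters:

```python
# Back-of-envelope LARC call count (illustrative numbers only; e.g. 20 runs
# and 70 unique reported items happens to land on the same total).
N, unique_items = 20, 70
calls = N + unique_items * N
print(calls)        # 1420 calls, for one model/temperature/prompt/text
print(500 * calls)  # 710000 calls for 500 variations of the experiment
```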
But it’s important, because retrieval is one of the basic services of GenAI. For instance, giving it a bunch of recipes and asking it to compile an ingredients list. Or having it scrape a web site. So, it’s eye-opening to see that GenAI is often rather flaky in the retrieval department.
The experiments I’m doing are not just about finding problems. I’m also trying to develop risk analysis and mitigation heuristics. For instance: how much does reliability improve when we add more guidance to the prompt? Which practices of prompt engineering actually work? I’m developing a laboratory to test the various folk practices that the NAIFs promote as if they were settled facts.
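One crude way to score such comparisons is the average fraction of runs in which each reported item appears. The metric and the toy data below are my own invention, just to show the shape of the experiment:

```python
# Sketch: comparing prompt variants by an aggregate consistency score.
from collections import Counter

def consistency(runs: list[list[str]]) -> float:
    """Mean fraction of runs in which each reported item appears.
    1.0 means every run returned the identical set of items."""
    counts = Counter(item for run in runs for item in set(run))
    return sum(counts.values()) / (len(counts) * len(runs)) if counts else 1.0

# Toy results from N=3 runs under two hypothetical prompt variants.
results = {
    "bare prompt":   [["press release", "OpenAI"], ["press release"], ["OpenAI"]],
    "guided prompt": [["press release", "OpenAI"]] * 3,
}
for variant, runs in results.items():
    print(f"{variant}: consistency {consistency(runs):.2f}")
# bare prompt: consistency 0.67
# guided prompt: consistency 1.00
```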
Soon, I will share the results of my initial LARC runs. Stay tuned.