AI's capabilities may be exaggerated by flawed tests, according to new study

Researchers behind a new study say that the methods used to evaluate AI systems’ capabilities routinely oversell AI performance and lack scientific rigor.

The study, led by researchers at the Oxford Internet Institute in partnership with over three dozen researchers from other institutions, examined 445 leading AI tests, called benchmarks, often used to measure the performance of AI models across a variety of topic areas.

AI developers and researchers use these benchmarks to evaluate model abilities and tout technical progress, referencing them to make claims on topics ranging from software engineering performance to abstract-reasoning capacity. However, the paper, released Tuesday, claims these fundamental tests might not be reliable and calls into question the validity of many benchmark results.

According to the study, a significant number of top-tier benchmarks fail to define what exactly they aim to test, concerningly reuse data and testing methods from pre-existing benchmarks, and seldom use reliable statistical methods to compare results between models.

Adam Mahdi, a senior research fellow at the Oxford Internet Institute and a lead author of the study, argued these benchmarks can be alarmingly misleading: “When we ask AI models to perform certain tasks, we often actually measure completely different concepts or constructs than what we aim to measure,” Mahdi told NBC News.

Andrew Bean, a researcher at the Oxford Internet Institute and another lead author of the study, concurred that even reputable benchmarks are too often blindly trusted and deserve more scrutiny.

“You need to really take it with a grain of salt when you hear things like ‘a model achieves Ph.D. level intelligence,’” Bean told NBC News. “We’re not sure that those measurements are being done especially well.”

Some of the benchmarks examined in the analysis measure specific skills, like Russian or Arabic language abilities, while other benchmarks measure more general capabilities, like spatial reasoning and continual learning.

A core issue for the authors was whether a benchmark is a good test of the real-world phenomenon it aims to measure, or what the authors label as “construct validity.” Instead of testing a model on an endless series of questions to evaluate its ability to speak Russian, for example, one benchmark reviewed in the study measures a model’s performance on nine different tasks, like answering yes-or-no questions using information drawn from Russian-language Wikipedia.

However, roughly half of the benchmarks examined in the study fail to clearly define the concepts they purport to measure, casting doubt on benchmarks’ ability to yield useful information about the AI models being tested.

As an example, in the study the authors showcase a common AI benchmark called Grade School Math 8K (GSM8K), which measures performance on a set of basic math questions. Observers often point to leaderboards on the GSM8K benchmark to show that AI models are highly capable at fundamental mathematical reasoning, and the benchmark’s documentation says it is “useful for probing the informal reasoning ability of large language models.”

Yet correct answers on benchmarks like GSM8K do not necessarily mean the model is actually engaging in mathematical reasoning, study author Mahdi said. “When you ask a first grader what two plus five equals and they say seven, yes, that’s the correct answer. But can you conclude from this that a fifth grader has mastered mathematical reasoning or arithmetic reasoning from just being able to add numbers? Perhaps, but I think the answer is very likely no.”

Bean acknowledged that measuring nebulous concepts like reasoning requires evaluating a subset of tasks, and that such selection will invariably be imperfect. “There are a lot of moving pieces in these evaluations, and satisfying all of them requires balance. But this paper calls for benchmarks to clearly define what they set out to measure,” he said.

“With concepts like harmlessness or reasoning, people oftentimes just throw the word around to pick something that falls near that category that they can measure and say, ‘Great, now I’ve measured it,’” Bean added.

In the new paper, the authors make eight recommendations and provide a checklist to systematize benchmark criteria and improve transparency and trust in benchmarks. The suggested improvements include specifying the scope of the particular action being evaluated, constructing batteries of tasks that better represent the overall abilities being measured, and comparing models’ performance via statistical analysis.

Nikola Jurkovic, a member of technical staff at the influential METR AI research center, commended the paper’s contributions. “We need more rigor if we want to be able to interpret the results of AI benchmarks. This checklist is a starting point for researchers to check whether their benchmark will be insightful,” Jurkovic told NBC News.

Tuesday’s paper builds on previous research pointing out flaws in many AI benchmarks.

Last year, researchers from AI company Anthropic advocated for increased statistical testing to determine whether a model’s performance on a specific benchmark really showed a difference in capabilities or was rather just a lucky result given the tasks and questions included in the benchmark.
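
To make the idea of a "lucky result" concrete, here is a minimal sketch of one common way to check it: score two models on the same questions, then put a confidence interval around the difference in their accuracy. This is a generic illustration, not the method from the Anthropic research or the Oxford study. The function name `paired_accuracy_difference` and the per-question results are invented for the example, and the normal approximation it uses is rough for a sample this small.

```python
import math

def paired_accuracy_difference(scores_a, scores_b):
    """Return (mean difference, standard error, 95% confidence interval)
    for two models scored on the same benchmark questions."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length score lists with at least 2 items")
    n = len(scores_a)
    # Per-question difference: +1 if only A is right, -1 if only B is right.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = sum(diffs) / n
    # Sample variance of the per-question differences (n - 1 denominator).
    variance = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    std_err = math.sqrt(variance / n)
    # Normal-approximation 95% interval; 1.96 is the standard z value.
    half_width = 1.96 * std_err
    return mean_diff, std_err, (mean_diff - half_width, mean_diff + half_width)

# Hypothetical per-question results (1 = correct, 0 = wrong), for illustration only.
model_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]  # 7 of 10 correct
model_b = [1, 0, 0, 1, 1, 0, 1, 0, 1, 0]  # 5 of 10 correct

diff, se, (low, high) = paired_accuracy_difference(model_a, model_b)
# The gap is only clear if the interval excludes zero.
difference_is_clear = low > 0 or high < 0
```

In this made-up case, model A answers two more questions correctly than model B, yet the interval around that gap still includes zero, so ten questions are not enough to say which model is better.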

To attempt to increase the usefulness and accuracy of benchmarks, several research groups have recently proposed new series of tests that better measure models’ real-world performance on economically meaningful tasks.

In late September, OpenAI released a new series of tests that evaluate AI’s performance on tasks required for 44 different occupations, in an attempt to better ground claims of AI capabilities in the real world. For example, the tests measure AI’s ability to fix inconsistencies in customer invoices in Excel spreadsheets for an imaginary sales analyst role, or AI’s ability to create a full production schedule for a 60-second video shoot for an imaginary video producer.

Dan Hendrycks, director of the Center for AI Safety, and a team of researchers recently released a similar real-world benchmark designed to evaluate AI systems’ performance on a range of tasks necessary for the automation of remote work.

“It’s common for AI systems to score high on a benchmark but not actually solve the benchmark’s actual goal,” Hendrycks told NBC News.

Surveying the broader landscape of AI benchmarks, Mahdi said researchers and developers have many exciting avenues to explore. “We are just at the very beginning of the scientific evaluation of AI systems,” Mahdi said.

Jared Perlo is a writer and reporter at NBC News covering AI. He is currently supported by the Tarbell Center for AI Journalism.
