Contents
1
From its roots, machine learning embraces the anything-goes principle of scientific discovery. Machine learning benchmarks become the iron rule that tames the anything goes. But after decades of service, a crisis grips the benchmarking enterprise.
2
The mathematical foundations of machine learning follow the astronomical conception of society: Populations are probability distributions. Optimal predictors minimize loss functions on a probability distribution.
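As a worked equation (notation mine, a standard statistical-learning formulation rather than a quote from the chapter): modeling the population as a distribution D over pairs (x, y), an optimal predictor solves

\[
f^\star \in \arg\min_f \; \mathbb{E}_{(x,y)\sim\mathcal{D}} \big[ \ell(f(x), y) \big],
\]

where \ell is the loss function; under the zero-one loss, f^\star is the Bayes-optimal classifier.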
3
A single statistical problem illuminates much of the mathematical toolkit necessary for benchmarking. The key lesson is that sample requirements grow quadratically in the inverse of the difference we try to detect.
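A minimal sketch of where the quadratic growth comes from, via Hoeffding's inequality (the function name and the default delta are illustrative assumptions, not from the book):

import math

def hoeffding_sample_size(epsilon, delta=0.05):
    """Samples needed so the empirical mean of [0, 1]-bounded draws
    lands within epsilon of the true mean with probability 1 - delta.
    Hoeffding's inequality gives n >= ln(2 / delta) / (2 * epsilon**2),
    so halving epsilon quadruples the sample requirement."""
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))

print(hoeffding_sample_size(0.10))  # 185
print(hoeffding_sample_size(0.01))  # 18445 -- a 10x smaller gap needs ~100x the samples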
4
The holdout method separates training and testing data, permitting anything goes on the training data, while enforcing the iron rule on the testing data. Not all uses of the holdout method are alike.
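A minimal sketch of the protocol in Python (the function name and the 80/20 split are illustrative assumptions, not from the book):

import numpy as np

def holdout_split(X, y, test_fraction=0.2, seed=0):
    """Partition the data once: train freely on one part (anything goes),
    then evaluate only on the held-out part (the iron rule)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_test = int(test_fraction * len(y))
    test, train = idx[:n_test], idx[n_test:]
    return X[train], y[train], X[test], y[test]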
5
Statistics prescribes the iron vault for test data. But the empirical reality of machine learning benchmarks couldn’t be further from the prescription. Repeated adaptive testing brings theoretical risks and practical power.
6
A replication crisis has long gripped the empirical sciences. Statistical practice is vulnerable for fundamental reasons. Under competition, researcher degrees of freedom outwit statistical measurement.
7
The preconditions for crisis exist in machine learning, too. And yet, the situation in machine learning is different. While accuracy numbers don’t replicate, model rankings replicate to a significant degree.
8
If machine learning thwarted scientific crisis, the question is why. Some powerful explanations emerge. Key are the social norms and practices of the community rather than statistical methodology.
9
Labeling and annotation (print only)
If the holdout method is the greatest unsung hero, data annotation is not far behind. But conventional wisdom clouds the subtle role that annotation plays in benchmarking.
10
The ImageNet era ends as attention shifts to powerful generative models trained on the internet. The new era also marks a turning point for machine learning benchmarks.
11
After training, alignment fits pretrained models to human preferences. At a fraction of the cost of training, alignment transforms evaluation results. How so little makes such a big difference points to new challenges for benchmarking.
12
Multi-task benchmarks promise a holistic evaluation of complex models. An analogy with voting systems reveals limitations in multi-task benchmarks. Greater diversity comes at the cost of greater sensitivity to artifacts.
13
Models deployed at scale always influence future data, a phenomenon called performativity. Performativity breaks evaluation and creates the problem of data feedback loops. Dynamic benchmarks try to make a virtue out of it.
14
Evaluation at the frontier (coming soon)
As models gain in capabilities, human supervision increasingly becomes a bottleneck. The hope is that models will supervise and evaluate each other, but there are limits to self-evaluation.
Contact
Reach out at contact@mlbenchmarks.org for feedback, questions, and suggestions. Please let me know if you find any errors. I appreciate your comments.
Citation
@misc{hardt2025emerging,
  author       = {Moritz Hardt},
  title        = {The Emerging Science of Machine Learning Benchmarks},
  year         = {2025},
  howpublished = {Online at \url{https://mlbenchmarks.org}},
  note         = {Manuscript}
}
Princeton University Press is expected to publish the hardcover edition in 2026.
Related resources
- SIAM News article (May 2025)
- SIAM MDS keynote (October 2024)
- ICLR keynote (May 2024)
- Simons Institute 10th Anniversary Plenary (May 2022)
- NeurIPS panel (December 2021)
- COLT keynote (June 2019)