You don't need all the LLM benchmarks
Every time a new model comes out, somebody runs it on MMLU (57 subjects), MTEB (56 tasks), HELM, the Open LLM Leaderboard, AlpacaEval, LiveBench, BigCodeBench, WildBench, Arena-Hard, MT-Bench, and a dozen others. That’s days of GPU time and a lot of human babysitting. But if you’ve ever stared at a leaderboard for ten minutes, you already know the dirty secret: the columns are wildly correlated. If a model is good at one math benchmark, it’s good at all of them. So how much of this ca...
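One way to check the "columns are correlated" claim on a leaderboard of your own is sketched below. It is a rough illustration, not the article's method: the helper names `benchmark_correlations` and `top_component_share` are made up for this example, and the scores are synthetic, generated from a single hidden "ability" per model plus noise, so the numbers say nothing about any real benchmark.

```python
import numpy as np


def benchmark_correlations(scores):
    """Pearson correlation between benchmark columns.

    scores: array of shape (n_models, n_benchmarks), one row per model.
    """
    scores = np.asarray(scores, dtype=float)
    return np.corrcoef(scores, rowvar=False)


def top_component_share(scores):
    """Fraction of total variance captured by the first principal component
    of the standardized score matrix (columns must not be constant)."""
    scores = np.asarray(scores, dtype=float)
    z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
    singular_values = np.linalg.svd(z, compute_uv=False)
    return float(singular_values[0] ** 2 / np.sum(singular_values ** 2))


# Synthetic data only (an assumption for illustration): 50 hypothetical
# models, 6 hypothetical benchmarks, all driven by one latent ability.
rng = np.random.default_rng(0)
ability = rng.normal(size=(50, 1))
loadings = rng.uniform(0.8, 1.2, size=(1, 6))
scores = ability * loadings + 0.2 * rng.normal(size=(50, 6))

corr = benchmark_correlations(scores)
share = top_component_share(scores)

print(np.round(corr, 2))
print(f"variance share of first component: {share:.2f}")
```

If the off-diagonal entries of `corr` are close to 1 and the first component carries most of the variance, the columns are largely redundant, which is the situation the paragraph above describes. On real leaderboard data you would replace the synthetic `scores` with the actual model-by-benchmark table.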