You don't need all the LLM benchmarks

Covers 2 stories, including "Google's latest Gemini-exp-1206 seems to be great, near the top of livebench". Discussed on Hacker News.

Every time a new model comes out, somebody runs it on MMLU (57 subjects), MTEB (56 tasks), HELM, the Open LLM Leaderboard, AlpacaEval, LiveBench, BigCodeBench, WildBench, Arena-Hard, MT-Bench, and a dozen others. That’s days of GPU time and a lot of human babysitting. But if you’ve ever stared at a leaderboard for ten minutes, you already know the dirty secret: the columns are wildly correlated. If a model is good at one math benchmark, it’s good at all of them. So how much of this ca...
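
To make "the columns are wildly correlated" concrete, here is a minimal sketch of how you might check it yourself. Everything in it is an assumption for illustration: the scores are synthetic (one hidden "ability" number per model plus noise), not real leaderboard data, and the helper names `make_synthetic_leaderboard` and `benchmark_correlations` are made up for this example. With a real leaderboard you would swap in your own models-by-benchmarks score matrix.

```python
import numpy as np


def make_synthetic_leaderboard(n_models=40, n_benchmarks=5, noise=0.3, seed=0):
    """Fake leaderboard (NOT real data): rows are models, columns are benchmarks.

    Assumption: every benchmark score is one shared latent ability per model,
    scaled by a per-benchmark loading, plus independent noise.
    """
    rng = np.random.default_rng(seed)
    ability = rng.normal(size=(n_models, 1))                 # one number per model
    loadings = rng.uniform(0.8, 1.2, size=(1, n_benchmarks))  # per-benchmark scale
    return ability * loadings + noise * rng.normal(size=(n_models, n_benchmarks))


def benchmark_correlations(scores):
    """Pearson correlation between benchmark columns (rows are observations)."""
    return np.corrcoef(scores, rowvar=False)


scores = make_synthetic_leaderboard()
corr = benchmark_correlations(scores)

# Average correlation between distinct benchmarks (diagonal excluded).
off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]

print(np.round(corr, 2))
print("mean off-diagonal correlation:", round(float(off_diag.mean()), 2))
```

If the off-diagonal entries of a real score matrix look like the ones this toy setup produces, most columns are telling you roughly the same thing about each model.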
