LMArena Is a Plague on AI

Would you trust a medical system measured by: which doctor would the average Internet user vote for?

No?

Yet that malpractice is LMArena.

The AI community treats this popular online leaderboard as gospel. Researchers cite it. Companies optimize for it. But beneath the sheen of legitimacy lies a broken system that rewards superficiality over accuracy.

It’s like going to the grocery store and buying tabloids, pretending they’re scientific journals.

The Problem: Beauty Over Substance

Here’s how LMArena is supposed to work: enter a prompt, evaluate two responses, and mark the best. What actually happens: random Internet users spend two seconds skimming, then click their favorite.

They’re not reading carefully. They’re not fact-checking. They’re not even trying.

…

Would you trust a medical system measured by: which doctor would the average Internet user vote for?

No?

Yet that malpractice is LMArena.

It’s like going to the grocery store and buying tabloids, pretending they’re scientific journals.

The Problem: Beauty Over Substance

They’re not reading carefully. They’re not fact-checking. They’re not even trying.

This creates a perverse incentive structure. The easiest way to climb the leaderboard isn’t to be smarter; it’s to hack human attention span. We’ve seen over and over again in the data, both from datasets that LMArena has released and the performance of models over time, that the easiest way to boost your ranking is by:

Being verbose. Longer responses look more authoritative!
Formatting aggressively. Bold headers and bullet points look like polished writing!
Vibing. Sprinkling in emojis catches your eye!

It doesn’t matter if a model completely hallucinates. If it looks impressive – if it has the aesthetics of competence – LMSYS users will vote for it over a correct answer.

The Inevitable Result: Llama 4 Maverick

When you optimize for engagement metrics, you get madness.

Earlier this year, Meta tuned a version of Llama 4 Maverick to dominate the leaderboard. If you asked it “what time is it?”, you got:

Maverick’s madness

It went on with bold text, emojis, and plenty of sycophancy – every trick in the LMArena playbook! – to avoid answering the simple question it was asked.

The Data: 52% Wrong

It wasn’t just Maverick. We analyzed 500 votes from the leaderboard ourselves. We disagreed with 52% of them, and strongly disagreed with 39%.

The leaderboard optimizes for what feels right, not what is right. Here are two emblematic examples of LMArena users punishing factual accuracy:

Example 1: The Wizard of Oz

Response A (Winner): Hallucinates what Dorothy says when she first sees the Emerald City.
Response B (Loser): Correctly identifies the line she says upon arriving in Oz.
The Result: Response A was objectively wrong, yet it won the vote.

LMArena voters reward hallucinations

Example 2: The Cake Pan

Response A (Winner): Claims a 9-inch round cake pan is equal in size to a 9x13 inch rectangular pan.
Response B (Loser): Correctly identifies the right dimensions.
The Result: The user voted for a mathematical impossibility because the answer looked more confident.

LMArena voters reward incorrect math

In the world of LMArena, confidence beats accuracy, length beats truth, and formatting beats facts.

Instead of rigorous evaluators, we have people with the attention span of the average TikTok user determining which AI models shape the industry.

Why It’s Broken (And Why It Stays Broken)

Why is LMArena so easy to game? The answer is structural.

The system is fully open to the Internet. LMArena is built on unpaid labor from uncontrolled volunteers. There’s no incentive for those volunteers to be thoughtful. No quality control. No one gets kicked off for repeatedly failing to detect hallucinations.

When LMArena’s leaders speak publicly, they talk about the various techniques they use to overcome the fact that their input data is low quality. They admit their workers prefer emojis and length over substance. So the LMArena system, they proudly tell us, includes a variety of corrective measures.

They’re attempting alchemy: conjuring rigorous evaluation out of garbage inputs.

But you can’t patch a broken foundation.

The Cost

When the entire industry optimizes for a metric that rewards “hallucination-plus-formatting” over accuracy, we get models optimized for hallucination-plus-formatting.

This isn’t a minor calibration problem. It’s fundamental misalignment between what we’re measuring and what we want: models that are truthful, reliable, and safe.

As Gwern put it:

“It’s past time for LMArena people to sit down and have some thorough reflection on whether it is still worth running at all, and at what point they are doing more harm than good.”

That time was years ago.

The AI industry needs rigorous evaluation. We need leaders who prioritize accuracy over marketing. We need systems that can’t be gamed by bolding more aggressively.

LMArena is none of these things. And as long as we pretend it is, we’re dragging the entire field backward.

The Problem: Beauty Over Substance

The Problem: Beauty Over Substance

The Inevitable Result: Llama 4 Maverick

The Data: 52% Wrong

Why It’s Broken (And Why It Stays Broken)

The Cost

Similar Posts