Nearly a century ago, John Maynard Keynes used a “beauty contest” thought experiment to explain how people make decisions when success depends on guessing what everyone else will do.
The trick wasn’t to pick your personal favorite, but to anticipate the crowd’s pick – second-guessing, third-guessing, and so on.
Economists have since turned that idea into simple strategy games that probe how deeply we reason about one another’s minds. It’s the kind of task today’s chatbots, built to predict and adapt, seem tailor-made to solve.
Chatbots overestimate us
A team at HSE University put that assumption to the test. The researchers found a consistent, almost endearing mistake: state-of-the-art language models – systems like ChatGPT-4o and Claude-Sonnet-4 – tend to credit humans with more rational foresight than we actually display.
In the classic “Guess the Number” variant of the Keynesian contest, where players choose a number from 0 to 100 and the winner is whoever gets closest to a fixed fraction of the group’s average (often one-half or two-thirds), the models routinely played “too smart.”
They assumed their human opponents would recurse through layers of logic, then chose numbers that would beat that imagined crowd – only to lose against real-world behavior that is, on average, less sophisticated.
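To see how that plays out, here is a minimal sketch of a single round in Python, using the common two-thirds rule; the crowd of guesses is invented for illustration, not drawn from the study.

# Minimal sketch of one "Guess the Number" round under the two-thirds rule.
# The guesses below are illustrative, not taken from the HSE experiments.

def play_round(guesses, fraction=2/3):
    """Return the target and the index of the winning guess."""
    target = fraction * sum(guesses) / len(guesses)
    winner = min(range(len(guesses)), key=lambda i: abs(guesses[i] - target))
    return target, winner

# A crowd that mostly stops after one step of reasoning (guesses near 33),
# one player who iterates twice (22), and one hyper-rational player who
# iterates all the way down to 0.
guesses = [33, 33, 35, 30, 22, 0]
target, winner = play_round(guesses)
print(f"target = {target:.1f}, winning guess = {guesses[winner]}")
# target = 17.0, winning guess = 22 -- the near-zero pick loses to the shallower crowd.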
How the experiment worked
Dmitry Dagaev and colleagues didn’t just pit chatbots against a random sample. They recreated 16 well-known Guess the Number experiments from the literature, each with different participant pools and cognitive contexts.
Opponents ranged from first-year economics students and conference attendees steeped in game theory to groups primed with emotions like anger or sadness, or described as analytical versus intuitive thinkers.
Each time, the models received the game rules and a short description of who they were “playing.” Then they had to pick a number and explain their reasoning.
Smart adaptation, wrong calibration
The chatbots did register the context. When told they were facing a room full of game theory experts, they leaned toward very low numbers.
These are the kinds of numbers that usually win when everyone iterates the logic (“if everyone guesses the average, then the target is two-thirds of that, and two-thirds of that…” until you converge toward zero).
When the opponents were described as inexperienced undergraduates, the models raised their guesses accordingly.
In other words, the systems can flex: they adapt to different descriptions of human sophistication and justify their choices with coherent strategic narratives.
But the gap shows up in calibration. Across settings, the models tended to overestimate how many levels of thinking the average person will actually perform.
Humans often stop after one or two steps – “others will pick 50, so two-thirds of that is about 33” – and plenty don’t iterate at all. The models’ “over-reasoning” regularly pushed them below the winning range. It’s a bit like bringing chess tournament strategy to a family checkers night.
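Here is a rough sketch of that iterated logic, often called level-k reasoning, with a standard starting anchor of 50; the levels shown are illustrative, not figures from the paper.

# Sketch of iterated ("level-k") reasoning in the two-thirds game.
# Level 0 anchors at 50; each deeper level best-responds to the level below.
# The anchor and the number of levels shown are illustrative assumptions.

def level_k_guess(k, anchor=50, fraction=2/3):
    guess = anchor
    for _ in range(k):
        guess *= fraction
    return guess

for k in range(6):
    print(f"level {k}: {level_k_guess(k):.1f}")
# level 0: 50.0, level 1: 33.3, level 2: 22.2, level 3: 14.8, level 4: 9.9, level 5: 6.6
# Guesses shrink toward zero as k grows; if most people stop at level 1 or 2,
# a model reasoning many levels deep will undercut the winning range.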
What language models may miss
There was another wrinkle: in simplified two-player versions, the models struggled to lock onto a dominant strategy.
The chatbots reasoned clearly, and they adjusted to the opponent’s description, but they didn’t consistently converge on the move that, given the rules, should dominate.
That suggests that even when the space of possibilities is small, current language models can miss equilibria that experienced game theorists find obvious.
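For the two-player case, the dominant-strategy claim can be checked by brute force. Assuming the two-thirds rule and integer guesses (the study’s exact two-player setup may differ), the lower of the two guesses is never farther from the target, so guessing 0 weakly dominates:

# Brute-force check of the two-player "Guess the Number" game with the 2/3 rule.
# Integer guesses from 0 to 100 are assumed; this is an illustration, not the
# study's exact protocol.
from itertools import product

p = 2 / 3
lower_never_loses = all(
    abs(min(a, b) - p * (a + b) / 2) <= abs(max(a, b) - p * (a + b) / 2)
    for a, b in product(range(101), repeat=2)
)
print(lower_never_loses)  # True: the lower guess always wins or ties, so 0 weakly dominates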
Tuning AI to human reality
The Keynesian beauty contest isn’t just a parlor trick. It’s a metaphor for how markets work. Traders don’t buy what they personally “like.” They try to buy what they think others will like tomorrow.
If an AI assistant on a trading desk, in a pricing engine, or in a negotiation tool systematically overestimates how rationally the other side will behave, it can make decisions that look elegant on paper and underperform in practice.
The lesson isn’t “don’t use AI.” It’s “tune AI to human reality.”
Making chatbots compatible with humans
The HSE team’s broader point is timely: AIs are already stepping into roles where their choices have social and economic consequences. In many of those roles, the goal isn’t superhuman cleverness but human-compatible behavior.
That could mean training and evaluating models on data that capture real distributions of reasoning depth, or building prompts and system policies that explicitly dampen over-iteration.
In addition, language models should be coupled to auxiliary modules that estimate opponent sophistication from context, not just from rules.
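As a hypothetical illustration of what that calibration could look like, an agent might best-respond to an assumed mix of human reasoning depths rather than to a fully rational crowd; the weights below are made up for the example.

# Hypothetical calibration layer: best-respond to an assumed mix of reasoning
# depths in the crowd instead of assuming everyone iterates to equilibrium.
# The weights and anchor are illustrative assumptions, not estimates from the study.

def calibrated_guess(level_weights, anchor=50, fraction=2/3):
    """Guess fraction * expected crowd average under the assumed level mix."""
    total = sum(level_weights.values())
    expected_avg = sum(w * anchor * fraction**k for k, w in level_weights.items()) / total
    return fraction * expected_avg

# Assumed crowd: mostly level-0 and level-1 thinkers, a few who go deeper.
crowd = {0: 0.40, 1: 0.40, 2: 0.15, 3: 0.05}
print(round(calibrated_guess(crowd), 1))  # ~24.9 -- well above zero, not the equilibrium of 0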
What this tells us about current models
Large language models are, at their core, pattern matchers. Give them a description of a game and a profile of the opponent, and they generate a plausible chain of thought and a move that fits the training patterns they’ve seen.
That makes them remarkably good at “as if” reasoning: they sound like strategic thinkers and often act like them.
But without grounded feedback from human play, and without guardrails calibrated to ordinary human-bounded rationality, chatbots can drift into hyper-rational strategies that fail in the wild.
The road ahead
This study doesn’t say AIs can’t predict human behavior. It says they need better priors about us.
Calibrating models to realistic levels of strategic depth, validating them against diverse human cohorts, and stress-testing them in two-player settings where dominant strategies exist are all actionable next steps.
If we want AI to help us navigate markets, negotiations, and everyday collective choices, we should teach it something economists learned long ago: people are smart – but not that smart, not all the time, and not all in the same way.
“We are now at a stage where AI models are beginning to replace humans in many operations, enabling greater economic efficiency in business processes. However, in decision-making tasks, it is often important to ensure that LLMs behave in a human-like manner,” Dagaev concluded.
“As a result, there is a growing number of contexts in which AI behavior is compared with human behaviour. This area of research is expected to develop rapidly in the near future.”
The study is published in the Journal of Economic Behavior & Organization.