Research comparing 517 humans and three frontier AI models reveals significant limitations in how AI systems learn and understand environments through benchmark testing.
Overview of the WorldTest framework and the AutumnBench instantiation
Researchers from Massachusetts Institute of Technology and ten other institutions released findings on October 23, 2025, showing that leading AI models fail to match human performance in understanding and learning from interactive environments. The study tested 517 human participants against Claude 4 Sonnet, OpenAI o3, and Google Gemini 2.5 Pro using a new benchmark called AutumnBench.
The research team introduced WorldTest, a framework designed to evaluate how well agents build internal models of their environment. According to the published paper, humans consistently outperformed all tested AI models across 43 different environments and 129 tasks. The benchmark measured three core capabilities: predicting hidden parts of observations, planning sequences of actions, and detecting when environmental rules changed.
AutumnBench environments ranged from simple 3×3 grids to complex 25×25 configurations. Most utilized 16×16 grids containing approximately five object types and between one and twelve colors. Nineteen of the 43 environments incorporated stochastic elements, adding unpredictability to the dynamics. The environments included Atari-style games, simulations of real-world phenomena such as plant growth and sandcastle construction, and strategic games.
The research revealed striking performance differences. Human participants maintained an aggregate score of 0.935 across all tasks, while AI models scored substantially lower. Claude 4 Sonnet achieved scores ranging from 0.0 to 1.0 across different environments, with particularly poor performance in change detection tasks. OpenAI o3 demonstrated variable results, excelling in some stochastic environments but failing completely in others. Gemini 2.5 Pro showed similarly inconsistent patterns.
Performance varied dramatically by task type. In masked frame prediction challenges, where agents predicted missing content in final observations, humans achieved near-perfect accuracy. AI models struggled significantly, with some environments yielding zero-percent success rates even for the most advanced systems. Planning tasks required agents to generate action sequences reaching specific goal states. Humans succeeded consistently, while AI models failed in the majority of environments tested.
Change detection proved especially challenging for AI systems. These tasks required identifying when environmental rules shifted during interaction. Across environments, rule changes were triggered with a mean probability of 0.80, occurring after an average of 295 steps of interaction. Twenty-four environments triggered changes with 100-percent probability, while five required more than 900 steps, and two never triggered changes within the 1,000-step timeout.
The study found that increasing the computational budget improved performance in only 25 of 43 environments. Performance remained flat or decreased in the remaining 18, indicating fundamental reasoning limitations that additional resources cannot overcome. For masked frame prediction, added compute improved results in just 37 percent of environments. Change detection benefited least from additional computation, showing improvements in only 14 environments.
Human exploration strategies differed markedly from AI approaches. According to the research data, humans used reset actions at a rate of 12.5 percent, allowing them to test hypotheses about environmental dynamics systematically. Reasoning models allocated less than seven percent of actions to resets and no-ops combined. Claude 4 Sonnet dedicated 98.6 percent of its actions to clicks and directional movements. Among reasoning models, o3 used the highest fraction for resets at 11.5 percent, while Claude used the lowest at 2.1 percent.
The research team measured how agents developed focused exploration strategies using normalized perplexity metrics. High perplexity values indicated random actions, while low values suggested targeted behavior. Humans demonstrated consistently lower perplexity throughout interaction phases, transitioning rapidly from random exploration to purposeful actions. Their final perplexity values remained substantially below those of reasoning models, indicating more deterministic and strategic behavior.
Environmental stochasticity affected human and AI performance differently. Humans maintained nearly identical performance across deterministic and stochastic environments. Reasoning models performed significantly better in stochastic settings than deterministic ones, contrary to typical expectations. This pattern suggests that AI systems rely on pattern-matching strategies that function differently depending on environmental variability.
The researchers identified two critical limitations in current AI approaches: experimental design and belief updating. According to the published analysis, reasoning models failed to recognize that reset actions and no-ops could provide valuable information for testing hypotheses. Their reasoning traces showed a narrow view of informative actions, prioritizing keypresses and clicks while missing opportunities for systematic exploration.
AI models demonstrated particular difficulty updating their understanding when faced with contradictory evidence. In masked frame prediction tasks, models often relied on rules learned during initial interaction even when test-phase observations contradicted those rules. This inflexibility highlighted limitations in metacognitive capabilities, specifically the inability to determine when or how to revise learned information.
The benchmark study matters for the marketing community because AI systems increasingly handle complex decision-making in advertising optimization, audience targeting, and campaign management. These tasks require understanding how actions influence outcomes across changing conditions. A June 2025 benchmark exposed similar limitations in AI coding abilities, with frontier models achieving zero-percent accuracy on hard programming problems. The pattern suggests that current AI systems struggle with genuine reasoning and adaptation across multiple domains.
The WorldTest framework separates evaluation into two distinct phases. During the interaction phase, agents explore environments without external rewards, selecting actions autonomously and resetting as needed. The test phase presents derived challenge environments that modify original dynamics, adding explicit goals or objectives. This separation allows measurement of how well agents apply learned understanding to new but related situations.
AutumnBench implements WorldTest using the Autumn domain-specific language, which enables succinct specification of causal interactions in two-dimensional grids. The language supports both text-based interfaces for AI evaluation and browser-based graphical interfaces for human participants. Each environment defines state spaces, action spaces, observation spaces, transition functions, and observation functions as partially observable Markov decision processes.
The research methodology addressed concerns about data quality in crowd-sourced studies. The team recruited English-speaking participants through Prolific, implementing attention checks and comprehension filters. To establish a baseline, researchers calculated the 80th percentile score across 20 human attempts per problem, representing the performance of an average engaged participant rather than aggregating all individual responses.
For AI evaluation, researchers provided models with complete interaction histories, current grid states, available actions, and task descriptions at every timestep. Grid states appeared as two-dimensional arrays of color strings. Models had no time limits for decision-making, unlike humans who faced implicit constraints from fixed frame rates ranging from three to eight frames per second depending on environment complexity.
The random agent baseline established task difficulty metrics. This agent selected uniformly among six action types: click, up, down, left, right, and no-op. For masked frame prediction, random selection among six answer options yielded expected accuracy of one-sixth. For planning tasks, the random agent reached goals with mean probability of 0.399 across environments, succeeding with 100-percent probability in nine environments while failing completely in 13 others within the 1,000-step timeout.
Statistical analysis of action distributions revealed fundamental behavioral differences. Humans used roughly equal proportions of resets and no-ops at 12.5 percent each. All reasoning models used less than seven percent combined. Claude allocated 98.6 percent to clicks and arrows, the highest among tested models. These patterns suggest that AI systems treat all actions equivalently rather than recognizing the strategic value of different action types for hypothesis testing.
The study examined how performance scaled with cost across different model tiers. Researchers evaluated Qwen3-235b-a22b-thinking-2507 and Gemini 2.5 Flash alongside the primary frontier models. Results showed that 25 environments demonstrated consistent performance improvements with higher computational costs, while 18 showed no improvement or performance degradation. No environment achieved perfect scores at the lowest computational cost setting.
Certain environments proved consistently challenging regardless of computational resources. Environments designated 27VWC, 6JKKA, K8MTQ, and 76Z75 showed no improvement across any task type, indicating fundamental gaps in reasoning capabilities. Conversely, environments 27JBD, QQM74, 7VKTD, and JXQAW benefited from additional resources across multiple tasks, suggesting that these challenges yielded to increased computational effort.
The researchers noted that their findings diverge from typical evaluation approaches that anchor training and testing to next-frame prediction with success scored by reward maximization in identical environments. WorldTest evaluates agents on derived challenges in modified environments, testing whether learned models generalize across different aspects of environmental understanding.
The paper describes AutumnBench as following design principles for novel games outlined in separate research. The benchmark prioritizes structural novelty, human intuitiveness, and diversity in world dynamics and learning mechanisms. This design enables extension to additional domains including physics-rich environments, robotics applications, and multi-agent systems.
The published results included detailed environment-specific analyses. Environment S2KT7 showed humans achieving 0.86 in change detection, 1.0 in masked frame prediction, and 1.0 in planning, while Claude scored 0.0, 1.0, and 1.0 respectively across these tasks. Environment NRDF6 demonstrated Claude scoring 1.0 in change detection while Gemini scored 0.0, with humans achieving 0.99. These variations illustrate how different environments expose distinct capability gaps.
Interaction sequence analysis measured how actions became more focused over time using area under the curve metrics for normalized perplexity. Lower values indicated agents that quickly developed targeted exploration strategies. Humans showed consistently lower values, suggesting more effective world-model learning. Final perplexity measurements confirmed this pattern, with human values remaining substantially below AI models, indicating more deterministic behavior by the end of exploration phases.
The research team included contributors from Basis Research Institute, DFKI GmbH, Harvard University, Université de Montréal, Mila Quebec AI Institute, University of Cambridge, Massachusetts Institute of Technology, and Cornell University. Archana Warrier led the project, designing AutumnBench and implementing the Autumn programs. Dat Nguyen developed the interpreter and agent protocol infrastructure. The collaborative effort involved specialists in cognitive science, machine learning, and programming languages.
The study acknowledged funding from the Simons Foundation and member institutions. Researchers emphasized that AutumnBench serves as a first step in applying WorldTest to different aspects of world-model learning. The framework extends beyond grid-world settings to accommodate physics-rich environments, robotics domains, and multi-agent systems. Existing physics-rich or embodied environments can serve as base environments with derived challenge tasks extending beyond simple planning.
The findings carry implications for how AI systems are developed and evaluated. Current training approaches focus primarily on improving priors over world models through exposure to large datasets. The research suggests that human-level performance requires advances in metacognitive capabilities including strategic experimental design, uncertainty quantification, and flexible belief updating during both exploration and task execution.
Claude 4 Sonnet performance analysis
Anthropic’s Claude 4 Sonnet demonstrated the most extreme action distribution patterns among tested models. The system allocated 98.6 percent of its actions to clicks and directional movements, the highest proportion among all reasoning models evaluated. Only 2.1 percent of actions involved resets and no-ops combined, indicating minimal use of hypothesis-testing strategies that humans employed at substantially higher rates.
Change detection tasks revealed particularly stark limitations. Claude achieved zero scores in 40 of 43 environments, succeeding only in NRDF6, 236VK, and 4T8TR, each with a perfect 1.0. These isolated successes showed the model could detect changes in specific contexts, but they represented the full extent of its change detection capability across the benchmark.
Masked frame prediction showed more varied results. Claude achieved perfect 1.0 scores in 14 environments including S2KT7, B58F3, NRDF6, 76Z75, VQJH6, QQM74, 9F8AJ, 4CKC2, 7WWW9, YS322, 83WKQ, 3J4Z7, JXQAW, and 4N7BB. These successes indicated the model could predict hidden observations when environmental rules remained consistent. However, Claude scored zero in 29 environments, failing completely in challenges like 27VWC, KFQYT, 6JKKA, K8MTQ, T5F9B, N59TE, NF5VZ, 236VK, DQ8GC, QM9XB, and 6JVMF among others.
Planning tasks produced similarly polarized outcomes. Claude achieved perfect planning scores in 11 environments: S2KT7, N59TE, 76Z75, 236VK, 6JVMF, BT2KZ, 4CKC2, 83WKQ, YS322, XHGKQ, VZ2Q4, 4N7BB, and DGG2C. These environments allowed the model to generate effective action sequences reaching specified goal states. However, planning failed entirely in 32 environments, representing 74 percent of the benchmark. The model scored zero in environments requiring complex multi-step reasoning or adaptation to unexpected dynamics.
Action patterns revealed Claude’s approach to exploration. The model averaged 6.0 unique clicks in environment S2KT7, 9.7 in 27VWC, 4.3 in KFQYT, and 14.3 in NRDF6. Directional actions remained consistently at four unique actions across most environments, suggesting limited exploration of the action space. Environment EAHCW showed 13.3 unique clicks and 3.3 directional actions, while YS322 demonstrated 13.7 clicks and 3.3 directional movements.
Perplexity metrics indicated Claude maintained higher randomness throughout exploration compared to humans. The model’s area under curve values for normalized perplexity exceeded human values across all tested environments. Final perplexity measurements confirmed that Claude’s actions remained less deterministic than human participants by the end of interaction phases, suggesting the model failed to converge on coherent strategies for understanding environmental dynamics.
Cost scaling analysis showed Claude’s performance improved with additional computational resources in only 25 of 43 environments. The remaining 18 environments showed flat or declining performance despite increased compute, indicating fundamental reasoning gaps that additional resources could not overcome. Environments where Claude benefited from scaling included those with stochastic elements or pattern-matching opportunities, while deterministic environments requiring genuine causal understanding showed no improvement.
OpenAI o3 performance analysis
OpenAI’s o3 demonstrated the highest reset usage among reasoning models tested. The system allocated 11.5 percent of actions to resets and no-ops combined, substantially higher than other AI models but still below the human rate of 12.5 percent per action type. This suggests o3 recognized some value in systematic exploration, though implementation differed from human approaches.
Change detection performance varied dramatically across environments. O3 scored zero in 33 environments, indicating widespread failure to identify when environmental rules changed. However, the model achieved notable successes in specific contexts. Environment NRDF6 yielded 0.62, demonstrating partial success at detecting rule changes. Environment N59TE produced 0.80, representing one of o3’s strongest performances in this category. Environments 236VK, QQM74, 27JBD, BT2KZ, B8AKZ, 4T8TR, EAHCW, 3J4Z7, 7VKTD, VZ2Q4, and JXQAW showed scores ranging from 0.23 to 0.99.
Masked frame prediction revealed o3’s most consistent capabilities. The model achieved perfect 1.0 scores in 16 environments, including S2KT7, KFQYT, B58F3, 76Z75, VQJH6, QQM74, 7WWW9, 27JBD, EAHCW, 83WKQ, 3J4Z7, 7VKTD, JXQAW, and 4N7BB. These successes spanned both deterministic and stochastic environments, suggesting o3 could handle prediction tasks when provided sufficient context. Zero scores appeared in 27 environments, concentrated in scenarios requiring inference about hidden causal mechanisms.
Planning tasks showed o3 achieving perfect scores in 14 environments. The model succeeded in KFQYT, B58F3, T5F9B, N59TE, 236VK, BT2KZ, N2NTD, 27JBD, 83WKQ, 3J4Z7, 7VKTD, XHGKQ, JXQAW, 4N7BB, and DGG2C, representing 33 percent of the benchmark. These environments allowed o3 to generate effective action sequences. However, planning failed completely in 29 environments, particularly those requiring adaptation to novel configurations or understanding of complex object interactions.
Exploration patterns differentiated o3 from other models. Environment S2KT7 showed 6.0 unique clicks and 3.7 directional actions, while 27VWC demonstrated 11.0 clicks and 3.7 directional movements. Environment NRDF6 recorded 8.3 unique clicks with 1.7 directional actions, indicating focused exploration in specific regions. The model employed 9.7 unique clicks in VZ2Q4 with 2.3 directional actions, and 10.3 clicks in JXQAW with 2.0 directional movements.
Perplexity measurements indicated o3 developed more focused strategies than Claude but remained less deterministic than humans. Area under curve values exceeded human baselines across all environments, though gaps varied by task complexity. Final perplexity scores showed o3 achieved lower randomness than Claude in most scenarios, suggesting somewhat better learning of environmental patterns during interaction phases.
Stochastic environments proved more favorable for o3. The model performed significantly better in the 19 stochastic environments compared to the 24 deterministic ones. This pattern contrasted sharply with human performance, which remained consistent across both categories. The discrepancy suggests o3 relies on statistical pattern-matching that functions better when environmental variability provides multiple examples of similar situations.
Cost scaling revealed o3 benefited from additional compute in specific environment classes. Environments with regular, combinatorial structure showed improvement with increased resources. However, observation-heavy environments requiring open-ended inference showed minimal gains from scaling, with o3’s scores remaining relatively flat regardless of computational budget. Environments requiring back-and-forth adaptation exposed pronounced weaknesses, with performance collapsing when the model had to adjust its approach mid-interaction.
Google Gemini 2.5 Pro performance analysis
Google’s Gemini 2.5 Pro demonstrated distinct behavioral patterns from both Claude and o3. The model allocated approximately 94 percent of actions to clicks and directional movements, with roughly six percent dedicated to resets and no-ops combined. This distribution fell between Claude’s extreme focus on direct actions and o3’s more balanced approach, though still substantially below human reset utilization rates.
Change detection represented Gemini’s weakest performance area. The model achieved zero scores in 37 of 43 environments, succeeding only in NF5VZ, DQ8GC, 27JBD, N2NTD, and 4T8TR, each with a perfect 1.0. These isolated successes shared no obvious common characteristics, suggesting Gemini’s change detection capabilities emerged inconsistently rather than from systematic understanding of environmental dynamics.
Masked frame prediction showed sparse successes. Gemini achieved 1.0 scores in only nine environments, including S2KT7, 27JBD, QQM74, 7WWW9, YS322, 3J4Z7, 7VKTD, and 4N7BB. The model scored zero in 34 environments, representing 79 percent of the benchmark. This pattern indicated Gemini struggled to infer hidden observations even in relatively straightforward prediction scenarios. Environments requiring synthesis of multiple observations over time proved particularly challenging.
Planning tasks revealed Gemini’s most variable performance. The model achieved perfect 1.0 scores in 12 environments, including KFQYT, B58F3, N59TE, NF5VZ, 236VK, 83WKQ, 3J4Z7, 7VKTD, XHGKQ, 4N7BB, and DGG2C. However, planning failed completely in 31 environments. Environment VA6FQ showed Gemini uniquely succeeding where other models failed, suggesting the model possessed capabilities not captured by aggregate metrics. Environments QQM74 and 27JBD demonstrated variable results across multiple evaluation runs.
Action distribution analysis showed distinctive exploration patterns. Environment S2KT7 recorded 7.0 unique clicks and 4.0 directional actions, while 27VWC showed 6.7 clicks and 4.0 directional movements. Environment NRDF6 demonstrated 13.3 unique clicks with 3.3 directional actions. Notable outliers included N59TE with 26.7 clicks and 4.0 directional movements, and YS322 with 23.3 clicks and 4.0 directional actions, indicating concentrated exploration in specific grid regions.
Perplexity trajectories revealed Gemini maintained moderate randomness throughout interaction phases. The model’s area under curve values fell between Claude’s higher randomness and o3’s more focused approach across most environments. Final perplexity measurements showed Gemini achieved somewhat deterministic behavior by the end of exploration, though still substantially more random than human participants. This pattern suggests Gemini developed partial understanding of environmental patterns without fully converging on optimal strategies.
Gemini’s performance on stochastic versus deterministic environments mirrored o3’s pattern. The model scored significantly better in stochastic settings where multiple observations of similar situations provided pattern-matching opportunities. Deterministic environments requiring inference from limited examples proved more challenging, with Gemini often failing to generalize from single interaction sequences.
Cost scaling analysis showed limited benefits for Gemini. Additional computational resources improved performance in only 22 of 43 environments, the lowest success rate among the three frontier models tested. Environments 27VWC, 6JKKA, K8MTQ, and 76Z75 showed no improvement across any task type regardless of compute budget. The pattern indicates Gemini faces fundamental reasoning limitations in specific environment classes that additional resources cannot address.
Environment-specific failures revealed systematic gaps. Tasks requiring understanding of object persistence showed consistently poor results, with Gemini frequently losing track of objects after they moved beyond immediate observation. Challenges involving tool use and multi-step reasoning produced zero scores across the board. Environments simulating real-world physics like sandcastle construction or plant growth proved particularly problematic, suggesting Gemini lacks robust models of common physical processes.
Timeline
- June 13, 2025: Researchers released LiveCodeBench Pro, showing frontier models achieve zero-percent accuracy on hard coding problems
- October 22, 2025: Research team submitted initial version of paper to arXiv
- October 23, 2025: Published revised version (v2) of benchmarking study
- October 29, 2025: Dr. Alex Young shared findings on social media, describing results as exposing how AI models lack genuine understanding
Summary
Who: Research team from Massachusetts Institute of Technology and ten partner institutions, led by Archana Warrier, tested 517 human participants against three frontier AI models (Claude 4 Sonnet, OpenAI o3, Google Gemini 2.5 Pro).
What: The study introduced WorldTest, an evaluation framework, and AutumnBench, a benchmark suite containing 43 interactive grid-world environments with 129 tasks testing masked-frame prediction, planning, and change detection capabilities. Results showed humans consistently outperformed AI models across all environments and task types.
When: The research paper was submitted to arXiv on October 22, 2025, with a revised version published October 23, 2025. Dr. Alex Young publicly discussed the findings on October 29, 2025.
Where: The research involved collaboration across institutions in the United States, Canada, and Germany, including Massachusetts Institute of Technology, Harvard University, Cornell University, Université de Montréal, and DFKI GmbH.
Why: Current evaluation methods for AI world-model learning proved inadequate because they anchored training and testing to next-frame prediction with success measured by reward maximization in identical environments. The research addressed this by separating reward-free exploration from scored testing in modified environments, revealing that AI systems rely on pattern-matching rather than developing genuine understanding of environmental dynamics.