There’s a research organization called METR that (among many other things) tracks the maximum time length of tasks that different AI models can reliably complete. As traditional capability benchmarks continue to saturate, this metric is increasingly helpful for estimating how AI models will actually perform out in the world.
It’s worth taking a look at METR’s chart yourself, but the most important thing to note is that METR predicts that the length of tasks which state-of-the-art AI models can complete will double every 7 months.
AI models have been consistently overshooting these predictions. In the chart below, notice how o3, Grok 4, GPT-5, and GPT-5.1-Codex-Max all land ahead of the predicted curve, reaching the predicted task length 1-3 months earlier than expected.
For reference, the original ChatGPT model (GPT-3.5) from 2022 had a task length of 36 seconds. The most recent OpenAI model (GPT-5.1-Codex-Max) has a task length of 2 hours and 42 minutes.
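Those two data points let us back out an implied doubling time and compare it to METR's 7-month prediction. Here's a quick sketch of that arithmetic, assuming roughly 36 months between the two releases (late 2022 to late 2025):

```python
import math

# Task lengths from the two data points above.
old_secs = 36                   # GPT-3.5 (2022): 36 seconds
new_secs = 2 * 3600 + 42 * 60   # GPT-5.1-Codex-Max: 2h 42m = 9720 seconds

# How many doublings separate the two models?
doublings = math.log2(new_secs / old_secs)

# Assumption: ~36 months elapsed between the two releases.
elapsed_months = 36
implied_doubling_time = elapsed_months / doublings

print(f"{doublings:.1f} doublings")
print(f"{implied_doubling_time:.1f} months per doubling")
```

That works out to about 8 doublings in 3 years, or roughly 4.5 months per doubling, comfortably ahead of METR's predicted 7-month pace.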
Although the data imply that OpenAI consistently dominates, I think that interpretation is misleading. These measurements only include publicly available models, and exclude the best-performing models a company might have internally.
Anthropic and Google rarely release models that spend large amounts of time on inference, while OpenAI is perfectly comfortable releasing models that take many minutes to respond. That extra time is likely driving the performance gap between OpenAI and its competitors.
Anthropic and Google undoubtedly have models capable of thinking for this long, but for one reason or another (probably cost or compute availability) have never made them public.