Claude Mythos and the 16-Hour Problem: When AI Agents Outgrow Their Own Benchmarks (opens in new tab)
Claude Mythos hit a 16-hour autonomous task horizon on METR's evaluation — and exposed a bigger problem: the benchmark ran out of hard questions befor
Read the original article