Claude Mythos and the 16-Hour Problem: When AI Agents Outgrow Their Own Benchmarks

Discussed on r/ClaudeAI

Claude Mythos hit a 16-hour autonomous task horizon on METR's evaluation — and exposed a bigger problem: the benchmark ran out of hard questions befor
