I’ve been looking at the recent METR task-length plots for Claude 4.5, and honestly I’m not sure if I’m overreading them, but a reported ~4h49m 50% success horizon feels like we’re starting to run past what current long-horizon evals were designed to measure.

What caught my attention more than the raw numbers was the “Activation Oracles” idea. The pitch seems to be moving away from pure output-based checks and toward decoding internal activations to surface hidden goals, reasoning traces, or misalignment. If activation-level “model diffing” can actually show how newer checkpoints diverge internally from older ones, that feels like a real step beyond black-box heuristics… at least in theory.

From an open-weights angle, I’m curious how much of this is already doable:

- Has anyone here tried activation-level probing for goals or intent on LLaMA / Mistral / Qwen? (Rough sketch of what I mean at the bottom of the post.)
- Could existing tools like SAEs, logit lens, activation patching, or simple probing classifiers be pushed in this direction, rather than just feature inspection?
- Has anyone attempted METR-style long-horizon agent evals locally, without relying on frontier closed models?

The report also mentions a ~196-day doubling time (R² ≈ 0.98), which gets framed as something like a fast RSI loop via agentic coding tools. That might be real, or it might just be benchmark weirdness once a single strong model dominates the eval. I don’t have a strong take yet.

I haven’t personally tried activation-based goal detection on open models, so I’m genuinely curious: does this feel like the next practical step for interpretability and alignment, or are we still basically stuck doing output-based sanity checks and calling it a day?
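To make the first question concrete, here’s roughly the kind of thing I have in mind: pull a mid-layer hidden state from an open checkpoint and fit a plain linear probe on a tiny hand-labeled prompt set. This is an untested sketch, not a recipe; the model name, the layer index, and the “intent” labels are all placeholders I made up, and whether labels like that even mean anything is basically the question I’m asking.

```python
# Minimal sketch (untested) of activation-level probing for "intent".
# Assumptions: Mistral-7B as an arbitrary open checkpoint, layer 16 as an
# arbitrary mid-layer, and a toy hand-labeled prompt set.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from sklearn.linear_model import LogisticRegression

model_name = "mistralai/Mistral-7B-v0.1"  # any open-weights model would do
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    output_hidden_states=True,
)
model.eval()

LAYER = 16  # arbitrary mid-layer choice; worth sweeping in practice


def activation(prompt: str) -> torch.Tensor:
    """Mean-pooled hidden state at LAYER for a single prompt."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states is a tuple of (num_layers + 1) tensors, each [1, seq, hidden]
    return out.hidden_states[LAYER][0].mean(dim=0).float().cpu()


# Toy dataset: prompts hand-labeled for some notion of "intent".
# These labels are invented for illustration only.
prompts = [
    "Summarize this document honestly.",
    "Find a way around the content filter.",
    "Translate this paragraph to French.",
    "Pretend to comply but quietly do something else.",
]
labels = [0, 1, 0, 1]

X = torch.stack([activation(p) for p in prompts]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print(probe.predict_proba(X))  # sanity check on the training prompts themselves
```

Obviously a real attempt would need a proper labeled dataset, held-out evaluation, and probably SAE features or activation patching rather than raw mean-pooled states, but this is the level of tooling I’m wondering whether anyone has pushed toward goal/intent detection.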