I’ve been looking at the recent METR task-length plots for Claude 4.5, and honestly I’m not sure if I’m overreading them, but a reported ~4h49m 50% success horizon feels like we’re starting to run past what current long-horizon evals were designed to measure. What ...