METR long-horizon evals, “Activation Oracles”, and open models
i.redd.it·1d·
Discuss: r/LocalLLaMA
🧩LLM Integration
METR long-horizon evals, “Activation Oracles”, and open models — are we just saturating benchmarks?

I’ve been looking at the recent METR task-length plots for Claude Opus 4.5, and honestly I’m not sure whether I’m overreading them, but a reported 50%-success time horizon of ~4h49m feels like we’re starting to run past what current long-horizon evals were designed to measure. What ...
