I’ve been looking at the recent METR task-length plots for Claude 4.5, and honestly I’m not sure if I’m overreading them, but a reported ~4h49m 50% success horizon feels like we’re starting to run past what current long-horizon evals were designed to measure.

What caught my attention more than the raw numbers was the “Activation Oracles” idea. The pitch seems to be moving away from pure output-based checks and toward decoding internal activations to surface hidden goals, reasoning traces, or misalignment. If activation-level “model diffing” can actually show how newer checkpoints diverge internally from older ones, that feels like a real step beyond black-box heuristics… at least in theory.

From an open-weights angle, I’m curious how much of this is already doable:

- Has anyone here tried activation-level probing for goals or intent on LLaMA / Mistral / Qwen? (Rough sketch of what I mean at the bottom of the post.)
- Could existing tools like SAEs, logit lens, activation patching, or simple probing classifiers be pushed in this direction, rather than just feature inspection?
- Has anyone attempted METR-style long-horizon agent evals locally, without relying on frontier closed models?

The report also mentions a ~196-day doubling time (R² ≈ 0.98), which gets framed as something like a fast RSI loop via agentic coding tools. That might be real, or it might just be benchmark weirdness once a single strong model dominates the eval. I don’t have a strong take yet.

I haven’t personally tried activation-based goal detection on open models, so I’m genuinely curious: does this feel like the next practical step for interpretability and alignment, or are we still basically stuck doing output-based sanity checks and calling it a day?
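To make the first question concrete, here’s roughly the kind of thing I have in mind: pull a mid-layer hidden state from an open checkpoint and fit a plain linear probe on a tiny hand-labeled prompt set. This is an untested sketch, not a recipe; the model name, the layer index, and the “intent” labels are all placeholders I made up, and whether labels like that even mean anything is basically the question I’m asking.

```python
# Minimal sketch (untested) of activation-level probing for "intent".
# Assumptions: Mistral-7B as an arbitrary open checkpoint, layer 16 as an
# arbitrary mid-layer, and a toy hand-labeled prompt set.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from sklearn.linear_model import LogisticRegression

model_name = "mistralai/Mistral-7B-v0.1"  # any open-weights model would do
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    output_hidden_states=True,
)
model.eval()

LAYER = 16  # arbitrary mid-layer choice; worth sweeping in practice


def activation(prompt: str) -> torch.Tensor:
    """Mean-pooled hidden state at LAYER for a single prompt."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states is a tuple of (num_layers + 1) tensors, each [1, seq, hidden]
    return out.hidden_states[LAYER][0].mean(dim=0).float().cpu()


# Toy dataset: prompts hand-labeled for some notion of "intent".
# These labels are invented for illustration only.
prompts = [
    "Summarize this document honestly.",
    "Find a way around the content filter.",
    "Translate this paragraph to French.",
    "Pretend to comply but quietly do something else.",
]
labels = [0, 1, 0, 1]

X = torch.stack([activation(p) for p in prompts]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print(probe.predict_proba(X))  # sanity check on the training prompts themselves
```

Obviously a real attempt would need a proper labeled dataset, held-out evaluation, and probably SAE features or activation patching rather than raw mean-pooled states, but this is the level of tooling I’m wondering whether anyone has pushed toward goal/intent detection.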