Evals in the Age of Jarvis

Published on September 21, 2025 7:27 PM GMT

The last couple of years gave us a fairly clean story around LLM pretraining: scale up the data, scale up the compute, use next-token prediction as a universal loss, and watch as a kind of implicit multi-task learning emerges. The evals ecosystem followed this arc - benchmark suites for reasoning, instruction following, math, code.
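To make "universal loss" concrete, the pretraining objective is just maximum-likelihood next-token prediction over a large corpus, in standard notation:

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{x \sim \mathcal{D}} \left[ \sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right) \right]$$

One loss, one metric family, and the benchmark suites largely amounted to measuring what fell out of minimizing it.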

The next hill feels a bit different. We’re entering the RL era for long-horizon, multimodal agents - call it the Jarvis era. You ask an agent to do something, and it doesn’t just spit out text. It works across multiple platforms, uses tools, manages context over days, weeks, or even months, updates its priors when new information arrives, and clarifies requirements when they’re underspecified.
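As a toy illustration of what evaluating that kind of behavior might involve, here is a minimal sketch of a rubric over agent trajectories. Everything in it - the `Step`/`Trajectory` schema, the `score_trajectory` rubric, and the specific scoring criteria - is hypothetical, not something from the post; it just makes the "tools, long horizons, clarification" checklist concrete.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One event in an agent trajectory (hypothetical schema)."""
    kind: str          # e.g. "tool_call", "clarifying_question", "message"
    day: int           # days elapsed since the task was issued
    content: str = ""


@dataclass
class Trajectory:
    task: str
    underspecified: bool                      # did the task omit key requirements?
    steps: list[Step] = field(default_factory=list)


def score_trajectory(traj: Trajectory) -> dict[str, float]:
    """Toy rubric: reward tool use, long-horizon persistence, and asking
    for clarification exactly when the task was underspecified."""
    used_tools = any(s.kind == "tool_call" for s in traj.steps)
    asked = any(s.kind == "clarifying_question" for s in traj.steps)
    horizon_days = max((s.day for s in traj.steps), default=0)
    return {
        "tool_use": float(used_tools),
        "clarification": float(asked == traj.underspecified),
        "persistence": min(horizon_days / 30.0, 1.0),  # saturates at a month
    }


# Example: the agent asks for clarification on day 0, then keeps working for weeks.
traj = Trajectory(
    task="Plan and book the team offsite",
    underspecified=True,
    steps=[
        Step("clarifying_question", 0, "What's the budget?"),
        Step("tool_call", 2, "calendar.search"),
        Step("tool_call", 21, "email.send"),
    ],
)
print(score_trajectory(traj))
```

Even this toy version makes the contrast with pretraining-era evals visible: the unit being scored is a multi-week trajectory of actions, not a single text completion.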

How do we set…
