Evals in the Age of Jarvis

Published on September 21, 2025 7:27 PM GMT

The last couple of years gave us a fairly clean story around LLM pretraining: scale up the data, scale up the compute, use next-token prediction as a universal loss, and watch as a kind of implicit multi-task learning emerges. The evals ecosystem followed this arc - benchmark suites for reasoning, instruction following, math, code.
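To make "universal loss" concrete, the pretraining objective is just maximum-likelihood next-token prediction over a large corpus, in standard notation:

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{x \sim \mathcal{D}} \left[ \sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right) \right]$$

One loss, one metric family, and the benchmark suites largely amounted to measuring what fell out of minimizing it.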

The next hill feels a bit different. We’re entering the RL era for long-horizon, multimodal agents - call it the Jarvis era. You ask an agent to do something, and it doesn’t just spit out text. It works across multiple platforms, uses tools, manages context over days, weeks, or even months, updates its priors when new information arrives, and clarifies requirements when they’re underspecified.
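As a toy illustration of what evaluating that kind of behavior might involve, here is a minimal sketch of a rubric over agent trajectories. Everything in it - the `Step`/`Trajectory` schema, the `score_trajectory` rubric, and the specific scoring criteria - is hypothetical, not something from the post; it just makes the "tools, long horizons, clarification" checklist concrete.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One event in an agent trajectory (hypothetical schema)."""
    kind: str          # e.g. "tool_call", "clarifying_question", "message"
    day: int           # days elapsed since the task was issued
    content: str = ""


@dataclass
class Trajectory:
    task: str
    underspecified: bool                      # did the task omit key requirements?
    steps: list[Step] = field(default_factory=list)


def score_trajectory(traj: Trajectory) -> dict[str, float]:
    """Toy rubric: reward tool use, long-horizon persistence, and asking
    for clarification exactly when the task was underspecified."""
    used_tools = any(s.kind == "tool_call" for s in traj.steps)
    asked = any(s.kind == "clarifying_question" for s in traj.steps)
    horizon_days = max((s.day for s in traj.steps), default=0)
    return {
        "tool_use": float(used_tools),
        "clarification": float(asked == traj.underspecified),
        "persistence": min(horizon_days / 30.0, 1.0),  # saturates at a month
    }


# Example: the agent asks for clarification on day 0, then keeps working for weeks.
traj = Trajectory(
    task="Plan and book the team offsite",
    underspecified=True,
    steps=[
        Step("clarifying_question", 0, "What's the budget?"),
        Step("tool_call", 2, "calendar.search"),
        Step("tool_call", 21, "email.send"),
    ],
)
print(score_trajectory(traj))
```

Even this toy version makes the contrast with pretraining-era evals visible: the unit being scored is a multi-week trajectory of actions, not a single text completion.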

How do we set…
