2025 has been the year of agents, with AI moving out of the chat box and into the real world. But are we really close to having generally intelligent agents, or are they still a decade away? The trillion-dollar question: how much economically useful work can these agents actually do?

To answer that question, model training and evaluation have shifted from rating individual responses to assessing multi-step tasks with tool use. For those involved in testing and post-training, 2025 is the year of RL environments: virtual worlds where models can act, experiment, and learn through realistic multi-step tasks.

We “hired” nine AI models to perform 150 tasks in one of our RL environments. These were the results:

![](https://cdn.prod.website-files.com/68dcd2ceb173c46fa02993…

