Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Covered by 3 sources, including LessWrong and Replit Blog. Discussed on Hacker News and r/programming.

arXiv:2601.11868v1

Abstract: AI agents may soon become capable of autonomously completing valuable, long-horizon tasks in diverse domains. Current benchmarks either do not measure real-world tasks, or are not sufficiently difficult to meaningfully measure frontier models. To this end, we present Terminal-Bench 2.0: a carefully curated hard benchmark composed of 89 tasks in computer terminal environments inspired by problems from real workflows. Each task features a uniqu...


