Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces
arXiv:2601.11868v1 Announce Type: cross
Abstract: AI agents may soon become capable of autonomously completing valuable, long-horizon tasks in diverse domains. Current benchmarks either do not measure real-world tasks or are not sufficiently difficult to meaningfully measure frontier models. To this end, we present Terminal-Bench 2.0: a carefully curated hard benchmark composed of 89 tasks in computer terminal environments inspired by problems from real workflows. Each task features a uniqu...