
Terminal-Bench: a benchmark for AI agents in terminal environments

Covered by 15 sources, including Blog on Tailscale and Kilo Blog

Discussed on Hacker News (two threads)

Covered in 16 articles

Blog on Tailscale·

A simpler way to experiment with AI coding agents using Aperture CLI

KiloBench - Because Your Benchmark Score Doesn't Pay the Bill

goose-docs.ai·

Self-Improving Agents Still Need Humans

Discussed on Hacker News

The New Stack·

Cursor bets on cheaper coding with Composer 2.5 and Kimi K2.5

venturebeat.com·

Google says Gemini 3.5 Flash can slash enterprise AI costs by more than $1 billion a year

venturebeat.com·

AI IQ is here: a new site scores frontier AI models on the human IQ scale. The results are already dividing tech.

·

Google just launched Gemini 3.5 Flash

agents-last-exam.org·

AI Agent Benchmark for Real-World Professional Workflows

Discussed on Hacker News (two threads)

What 1,000+ Harness Experiments Taught Me About Self-Improving Agents

Discussed on Hacker News (two threads)

andrewjesson.com·

The engineering practices Claude Code and Codex use to improve AI agents

Discussed on Hacker News

In other languages

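Hay una batalla por tener el modelo de IA que programa mejor. Y en ella ha aparecido un rival bueno, bonito y muy barato: Cursor (Spanish; in English: "There is a battle to have the AI model that codes best. And a good, attractive, and very cheap rival has entered it: Cursor")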