Self-Improving Agents Still Need Humans
Benchmarks are most useful when they become bug reports. Here's how the goose team uses Terminal-bench failures, Harbor, and human judgment to improve real agent behavior.