A Benchmark for AI Agents Driving Scientific and Engineering Progress

Covered by arxiv.org

An arena for evaluating AI agents on performance engineering tasks. It benchmarks 7+ frontier models across 23 tasks in system optimization and LLM development.

Covered in 1 article

AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?