artificialanalysis.ai

LaunchAA-BriefcaseA new proprietary benchmark for long-horizon knowledge work (opens in new tab)

Covered by The Decoder, news.smol.aiDiscussed on Hacker News and r/LocalLLaMA

A new frontier agentic evaluation measuring how well AI models perform realistic, long-horizon knowledge work across multi-week scenarios.

Read the original article

Sign in to keep reading the full article.

Covered in 2 articles

·

New benchmark exposes how badly AI struggles with real knowledge work

not much happened today | AINews