Open-world evaluations for measuring frontier AI capabilities

Introducing CRUX, a new project for evaluating AI on long, messy tasks

