Open-world evaluations for measuring frontier AI capabilities
Introducing CRUX, a new project for evaluating AI on long, messy tasks