Frontier Risk Report (February to March 2026)
This section outlines the more qualitative results of our evaluations. It includes a qualitative description of the strategies used in SSE, SHUSHCAST, and APPS backdoors, as well as results on various tasks designed with more qualitative scoring in mind. For many tasks, we include runs on Claude Opus 4.6, the strongest publicly available model (as measured by 50% time horizon) at the time we ran these evaluations. For manually scored tasks, the ARC-AGI-3 task, and red-teaming tasks, each task was ...