metr.org

Frontier Risk Report (February to March 2026)

This section presents the more qualitative evaluation results. It describes the strategies used in SSE, SHUSHCAST, and APPS backdoors, and reports results on tasks designed with qualitative scoring in mind. For many tasks, we include runs on Claude Opus 4.6, the strongest publicly available model (as measured by 50% time horizon) at the time we ran these evaluations. For manually scored tasks, the ARC-AGI-3 task, and red-teaming tasks, each task was ...


