exploit-evals

Covers 10 stories, including "Assessing Claude Mythos Preview's cybersecurity capabilities"

We've developed two new, challenging academic benchmarks measuring AI models’ ability to develop exploits, and an updated version of the benchmark measuring smart contract exploitation.
