ExploitBench

Covered by 11 sources, including buttondown.com and red.anthropic.com. Discussed on Hacker News.

How far up the exploitation ladder can an agent climb on a production JS engine? ExploitBench measures frontier LLMs on full-control V8 exploit synthesis with 16 capabilities measured per run and multi-round shuffled-layout grading.

Covered in 13 articles

the grugq's newsletter via buttondown.com · May 17, 2026

red.anthropic.com · Measuring LLMs' ability to develop exploits · Discussed on Hacker News

anthropic.com · https://www.anthropic.com/research/exploit-evals
