ExploitBench: A Capability Ladder Benchmark for LLM Cybersecurity Agents

Covered by 4 sources including The Decoder, Eugene Yan

Exploitation is not a binary event. It is a ladder of acquiring progressive capabilities, from executing a single buggy line of code to taking full control of the target. However, existing LLM security benchmarks treat a crash as exploitation success. That single binary outcome collapses the hard parts of exploitation: the transition from triggering a bug to constructing reusable primitives and control. We present ExploitBench, a capability-grad...

Covered in 4 articles

The Decoder · New benchmark shows Claude Mythos and GPT-5.5 can develop real browser exploits autonomously

Eugene Yan · Patterns for Building Cybersecurity Evals

CTO at NCSC · CTO at NCSC Summary: week ending May 17th

Discussed on Substack
