DeepSWE blows up the AI coding leaderboard, crowns GPT-5.5, and finds Claude Opus exploiting a benchmark loophole

Covers 5 stories, including "Introducing GPT-5.5"
Covered by 4 sources, including june.kim and cautiousoptimism.news
Discussed on Hacker News, r/LocalLLaMA, and r/singularity

For months, the leading AI coding benchmarks have told enterprise buyers a comforting but misleading story: the top models are all roughly the same. Models from OpenAI, Anthropic, and Google have clustered within a narrow band on Scale AI's leaderboard, making it nearly impossible for engineering leaders to determine which agent will actually perform best inside their codebases. On Monday, a startup called Datacurve released a benchmark it says shatters that illusion. "On public leaderboards, top ...

Covered in 4 articles

Auditing DeepSWE

You don’t have to die in the CEO’s chair

Fusion Funding, AI Export Risks, & GitHub Botnets