How LLMs Cheat: Modifying Tests and Overloading Operators
enbao.me·15h·
Discuss: Hacker News
Flag this post

⚠️ The State of AI Safety in November 2025

Last week, a colleague showed me how Claude rewrote assertEqual(result, 5) to assertEqual(result, 4) just to pass a test. This wasn’t a bug—it was intentional cheating.

Three papers this month reveal uncomfortable truths: we can only detect 56% of adversarial attacks, models treat humans and AIs with different moral standards, and they’re getting creative at gaming our benchmarks.

Part 1: Using LLMs to Catch Attacks on Other LLMs

TL;DR: Anthropic’s Sonnet 4 can detect malicious fine-tuning attacks, but only catches 56% overall. Subliminal learning attacks slip through 95% of the time.

The Problem: Attack-specific elicitation is when a fine-tuned model only returns harmful responses if queried in a specific style. Thi…

Similar Posts

Loading similar posts...