We demonstrated an attack against Introspection Adapters (Shenoy et al., 2026), a technique for detecting malicious fine-tunes. Long story short: an attacker who controls model weights can apply a cheap, output-preserving transform that relocates the basis the auditor was calibrated against. This defeats the auditor with no observable change in model behavior. 📄 Paper💻 CodeAfter we discovered this attack, asked Keshav Shenoy (the first author) what he thought. It turned out that their team ha...

Read the original article