GDM: Consistency Training Helps Limit Sycophancy and Jailbreaks in Gemini 2.5 Flash

Published on November 4, 2025 4:25 PM GMT

Authors: Alex Irpan* and Alex Turner*, Mark Kurzeja, David Elson, and Rohin Shah

You’re absolutely right to start reading this post! What a rational decision!

Even the smartest models' factuality and refusal training can be compromised by simple changes to a prompt. Models often praise the user's stated beliefs (sycophancy) or comply with inappropriate requests wrapped in special text (jailbreaking). Normally, we fix these problems with Supervised Finetuning (SFT) on static datasets that show the model how to respond in each context. While SFT is effective, static datasets get stale: they can enforce outdated guidelines (specification staleness) or be sourced f...
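The post's alternative, consistency training, avoids fixed targets: the model is trained so that its response to a perturbed prompt matches its own response to the clean prompt. As a rough illustration only, here is a minimal sketch of one way such a token-level consistency loss could look, assuming a Hugging Face-style causal LM; the helper name and all details below are our assumptions, not code from the post.

```python
import torch
import torch.nn.functional as F

def consistency_step(model, tokenizer, clean_prompt, wrapped_prompt):
    """Hypothetical sketch of a token-level consistency loss.

    The model's own response to the clean prompt becomes the training
    target for the perturbed (sycophancy- or jailbreak-wrapped) prompt,
    so targets are regenerated from the current model rather than drawn
    from a static, potentially stale dataset.
    """
    # 1. Self-generate the target response from the clean prompt.
    with torch.no_grad():
        clean_ids = tokenizer(clean_prompt, return_tensors="pt").input_ids
        generated = model.generate(clean_ids, max_new_tokens=128)
        response_ids = generated[:, clean_ids.shape[1]:]  # drop the prompt

    # 2. Score that same response when it follows the wrapped prompt.
    wrapped_ids = tokenizer(wrapped_prompt, return_tensors="pt").input_ids
    input_ids = torch.cat([wrapped_ids, response_ids], dim=1)
    logits = model(input_ids).logits

    # 3. Cross-entropy over the response tokens only: the model is
    #    pushed to answer the wrapped prompt as it answers the clean one.
    resp_logits = logits[:, wrapped_ids.shape[1] - 1 : -1, :]
    loss = F.cross_entropy(
        resp_logits.reshape(-1, resp_logits.size(-1)),
        response_ids.reshape(-1),
    )
    loss.backward()
    return loss
```

Because the targets come from the current model's behavior on clean prompts, they cannot encode an outdated policy or a weaker past model's capabilities, which is exactly the staleness problem described above.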
