Can a stronger model fake being a weaker one? Mostly not

tldr

Frontier models can be prompted into a weaker model's capability tier, but not its identity: they adopt a generic weaker-model error pattern, not a specific predecessor's per-question fingerprint.

Targeted sandbagging, where a stronger model throttles down to a weaker one without reasoning, was largely a null result.

One smaller, intriguing concern: prompting successor models to predict a predecessor's mistakes through latent (out-of-context) reasoning measurably improves ...
