Language models transmit behavioural traits through hidden signals in data
Large language models (LLMs) are increasingly used to generate data to train improved models [1–3], but it remains unclear what properties are transmitted in this model distillation [4,5]. Here we show that distillation can lead to subliminal learning: the transmission of behavioural traits through semantically unrelated data. In our main experiments, a ‘teacher’ model with some trait T (such as disproportionately generating responses favouring owls or showing broad misaligned behaviour) generates d...