Language models transmit behavioural traits through hidden signals in data

Covered by inkdroid. Discussed on Hacker News and Lobsters.

Large language models (LLMs) are increasingly used to generate data to train improved models [1–3], but it remains unclear what properties are transmitted in this model distillation [4,5]. Here we show that distillation can lead to subliminal learning: the transmission of behavioural traits through semantically unrelated data. In our main experiments, a ‘teacher’ model with some trait T (such as disproportionately generating responses favouring owls or showing broad misaligned behaviour) generates d...
