(Mis)generalization of Helpful-Only Fine-tuning


TLDR: We study the shortcomings of existing helpful-only models. We find that some show emergent misalignment, others retain residual refusal behaviors, and most show poor steerability, sycophancy, and incoherent character. None of these problems is a necessary consequence of helpful-only training, though: we show that synthetic document fine-tuning and adding character-related questions to SFT and RL can mitigate them. Research done as part of the MATS/Anthropic Fellows Program. See here for the ful...
