Greedy Coordinate Diffusion: Effective and Semantically Coherent Adversarial Attacks via Diffusion Guidance (opens in new tab)
Fine-tuning aligned language models on benign tasks (e.g. math tutoring) systematically breaks safety guardrails, even when training data contains no harmful content. While mechanistic approaches have shed light on where alignment resides in model weights, they do not by provide a general formal framework for deriving guarantees about when fine-tuning degrades it -- leaving the field without principled tools for predicting or preventing alignmen...
Read the original article