Refusal in Language Models Is Mediated by a Single Direction

Covered by 4 sources including codeberg.org, lesswrong.com

Discussed on Hacker News

Covered in 5 articles

Heretic has been served a legal notice by Meta, Inc.

Discussed on r/LocalLLaMA

lesswrong.com·

Synthetic Persona Pretraining: Alignment from Token Zero

lesswrong.com·

An Introduction to Exemplar Partitioning for Mechanistic Interpretability

Red Hat Developer Blog·

Testing infrastructure red teaming with abliterated models

rl for red teaming: training models to attack and defend themselves

Discussed on Hacker News