rl for red teaming: training models to attack and defend themselves

Covers: Refusal in Language Models Is Mediated by a Single Direction

Discussed on Hacker News

how we used rl to train an attacker and defender in an iterative loop, what broke along the way, and what we learned about reward design for adversarial safety.
