RL for red teaming: training models to attack and defend themselves
How we used RL to train an attacker and a defender in an iterative loop, what broke along the way, and what we learned about reward design for adversarial safety.