Preventative Steering has advantages over Inoculation Prompting
This work was done by Aansh Samyani under the supervision of Ariana Azarbal, Arun Jose, Kei Nishimura-Gasparian and Daniel Tan as part of the SPAR Research Fellowship.

TL;DR

We benchmarked Inoculation Prompting (IP) and Preventative Steering (PS) in 4 SFT settings. We found PS has the following advantages:

PS often affords stronger undesired-trait suppression than IP.
Models trained with PS appear to carry less than IP-trained models.
Using PS, we can cause models to learn desired traits more stro...