Why does off-model SFT degrade capabilities?
Off-model SFT (supervised fine-tuning on outputs generated by a different model) might be an important method for controlling AI behavior. For instance, it seems like a central technique for overcoming exploration hacking. However, we’ve found that off-model SFT often substantially degrades capabilities, and we ran experiments to understand why. We tentatively believe that it’s because off-model SFT forces the model into an unfamiliar reasoning styl...