I ran across this preprint the other day:
Piras, Giorgio, et al. "SOM Directions are Better than One: Multi-Directional Refusal Suppression in Language Models." arXiv preprint arXiv:2511.08379 (2025).
They have published their code here: https://github.com/pralab/som-refusal-directions
Basically, rather than the usual difference-of-means method for ablating a single refusal direction, they train a self-organizing map (SOM) to learn a refusal manifold and use Bayesian optimization to determine the best subset of k directions to ablate. They got some pretty impressive results.
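For context, the usual single-direction approach takes the difference of mean activations between harmful and harmless prompts and projects that one vector out of the residual stream; the paper's twist is to project out a whole subset of SOM-node directions instead. Here's a rough torch sketch of what the multi-direction ablation step looks like (my own toy code, not their implementation; `directions` stands in for whatever SOM nodes the BO search ends up selecting):

```python
import torch

def ablate_directions(hidden: torch.Tensor, directions: torch.Tensor) -> torch.Tensor:
    """Remove the span of several refusal directions from the residual stream.

    hidden:     (..., d_model) activations
    directions: (k, d_model)   refusal directions (e.g. the SOM nodes picked by the BO step)
    """
    # Orthonormalize so overlapping directions aren't subtracted twice
    q, _ = torch.linalg.qr(directions.T)   # q: (d_model, k), orthonormal columns
    # h' = h - Q Q^T h  projects out everything in the span of the directions
    return hidden - (hidden @ q) @ q.T

# The single-direction baseline, refusal_dir = harmful_acts.mean(0) - harmless_acts.mean(0),
# is just the k = 1 case of the same projection.
```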
They only implemented the method for a handful of smaller models (nothing bigger than 14B), probably because the BO step is rather expensive. But it shouldn’t be that hard to extend their code to support new models.
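On the "extend their code" point: most of the work for a new architecture is finding where its decoder layers expose the residual stream so the ablation can be hooked in. Something along these lines with plain transformers forward hooks (attribute names like `model.model.layers` are for Qwen-style models and are my guess at the shape of the work, not how their repo actually wires it up):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-3B-Instruct"   # swap in the new model here
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# placeholder directions; in practice these come out of the SOM/BO pipeline
directions = torch.randn(8, model.config.hidden_size)
q, _ = torch.linalg.qr(directions.T.float())   # orthonormal basis, computed in fp32
q = q.to(model.device, model.dtype)

def ablation_hook(module, inputs, output):
    # decoder layers return a tuple whose first element is the hidden states
    hidden = output[0]
    hidden = hidden - (hidden @ q) @ q.T
    return (hidden,) + output[1:]

# Qwen-style models keep their decoder stack at model.model.layers;
# other architectures may put it somewhere else
handles = [layer.register_forward_hook(ablation_hook) for layer in model.model.layers]

# ... generate as usual, then detach the hooks:
# for h in handles: h.remove()
```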
I was able to run the full pipeline on Qwen2.5-3B and replicate their results. I started extending the code to support gpt-oss-20b, but the further I got, the more I realized I'm too GPU-poor to actually run it at that scale.
Any of you GPU rich bastards try this out on a larger model yet, or want to give it a shot?