I ran across this preprint the other day:
Piras, Giorgio, et al. "SOM Directions are Better than One: Multi-Directional Refusal Suppression in Language Models." arXiv preprint arXiv:2511.08379 (2025).
They have published their code here: https://github.com/pralab/som-refusal-directions
Basically, rather than the usual difference-of-means method for ablating a single refusal direction, they train a self-organizing map (SOM) to learn a refusal manifold and use Bayesian optimization to determine the best subset of k directions to ablate. They got some pretty impressive results.
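For context, the usual single-direction approach takes the difference of mean activations between harmful and harmless prompts and projects that one vector out of the residual stream; the paper's twist is to project out a whole subset of SOM-node directions instead. Here's a rough torch sketch of what the multi-direction ablation step looks like (my own toy code, not their implementation; `directions` stands in for whatever SOM nodes the BO search ends up selecting):

```python
import torch

def ablate_directions(hidden: torch.Tensor, directions: torch.Tensor) -> torch.Tensor:
    """Remove the span of several refusal directions from the residual stream.

    hidden:     (..., d_model) activations
    directions: (k, d_model)   refusal directions (e.g. the SOM nodes picked by the BO step)
    """
    # Orthonormalize so overlapping directions aren't subtracted twice
    q, _ = torch.linalg.qr(directions.T)   # q: (d_model, k), orthonormal columns
    # h' = h - Q Q^T h  projects out everything in the span of the directions
    return hidden - (hidden @ q) @ q.T

# The single-direction baseline, refusal_dir = harmful_acts.mean(0) - harmless_acts.mean(0),
# is just the k = 1 case of the same projection.
```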
They only implemented the method for a handful of smaller models (nothing bigger than 14B), probably because the BO step is rather expensive. But it shouldn’t be that hard to extend their code to support new models.
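On the "extend their code" point: most of the work for a new architecture is finding where its decoder layers expose the residual stream so the ablation can be hooked in. Something along these lines with plain transformers forward hooks (attribute names like `model.model.layers` are for Qwen-style models and are my guess at the shape of the work, not how their repo actually wires it up):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-3B-Instruct"   # swap in the new model here
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# placeholder directions; in practice these come out of the SOM/BO pipeline
directions = torch.randn(8, model.config.hidden_size)
q, _ = torch.linalg.qr(directions.T.float())   # orthonormal basis, computed in fp32
q = q.to(model.device, model.dtype)

def ablation_hook(module, inputs, output):
    # decoder layers return a tuple whose first element is the hidden states
    hidden = output[0]
    hidden = hidden - (hidden @ q) @ q.T
    return (hidden,) + output[1:]

# Qwen-style models keep their decoder stack at model.model.layers;
# other architectures may put it somewhere else
handles = [layer.register_forward_hook(ablation_hook) for layer in model.model.layers]

# ... generate as usual, then detach the hooks:
# for h in handles: h.remove()
```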
I was able to run the full pipeline on Qwen2.5-3B and replicate their results. I started extending the code to support gpt-oss-20b, but the further I got, the more I realized I'm too GPU-poor to actually run it at that scale.
Any of you GPU rich bastards try this out on a larger model yet, or want to give it a shot?