"SOM Directions are Better than One: Multi-Directional Refusal Suppression in Language Models" is another paper on abliteration.
It uses self-organizing maps to analyze which directions should be ablated from the model to remove refusals, and reports less damage to downstream model performance than single-direction ablation.
Both the paper and the code are available:
https://arxiv.org/abs/2511.08379v2
https://github.com/pralab/som-refusal-directions
I think it may be a great method to have in this repository.
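
To make the request a bit more concrete, here is a rough sketch of what ablating several directions at once could look like, as opposed to a single direction. This is only my illustration and not the paper's implementation: the helper names `orthonormalize` and `ablate` are hypothetical, the directions are hard-coded instead of being derived from a self-organizing map, and plain Python lists stand in for model activations so the snippet has no dependencies. The assumption is that the directions are orthonormalized first and then projected out together, so overlapping directions are not subtracted twice.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthonormalize(directions, eps=1e-10):
    """Gram-Schmidt: build an orthonormal basis for the span of `directions`."""
    basis = []
    for d in directions:
        v = list(d)
        for b in basis:
            c = dot(v, b)
            v = [x - c * y for x, y in zip(v, b)]
        norm = dot(v, v) ** 0.5
        if norm > eps:  # skip directions already covered by earlier ones
            basis.append([x / norm for x in v])
    return basis

def ablate(hidden, directions):
    """Remove from `hidden` every component lying in the span of `directions`."""
    out = list(hidden)
    for b in orthonormalize(directions):
        c = dot(out, b)
        out = [x - c * y for x, y in zip(out, b)]
    return out

# Toy activation and two (hypothetical) refusal directions.
hidden = [1.0, 2.0, 3.0, 4.0]
directions = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
cleaned = ablate(hidden, directions)

# After ablation, `cleaned` has no component along either direction.
for d in directions:
    assert abs(dot(cleaned, d)) < 1e-9
```

With a single entry in `directions` this reduces to the usual single-direction projection, so a multi-direction method could probably share most of the existing code path.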