SOM Directions are Better than One: Multi-Directional Refusal Suppression in Language Models #140

@kabachuha

"SOM Directions are Better than One: Multi-Directional Refusal Suppression in Language Models" is another paper on abliteration.

It uses self-organizing maps to analyze more deeply which directions to ablate from the model for refusal removal, with less damage to the downstream model than single-direction ablation.

The code and paper are both available:

https://arxiv.org/abs/2511.08379v2

https://github.com/pralab/som-refusal-directions

I think it may be a great method to have in this repository.
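
To make the difference from single-direction ablation concrete, here is a minimal sketch of removing several directions from hidden states at once. It is not taken from the paper or its repository: the function name `ablate_directions`, the random toy data, and the choice to orthonormalize with QR are all assumptions for illustration, and the step that actually finds the directions (the self-organizing map) is not shown.

```python
import numpy as np


def ablate_directions(hidden, directions):
    """Project `hidden` (n, d) onto the orthogonal complement of `directions` (k, d).

    Hypothetical helper: assumes the k directions are linearly independent.
    """
    # Orthonormalize the directions so overlapping ones are not subtracted twice.
    q, _ = np.linalg.qr(np.asarray(directions, dtype=float).T)  # q has shape (d, k)
    # Subtract the component of each hidden state that lies in span(directions).
    return hidden - (hidden @ q) @ q.T


# Toy data standing in for real activations and refusal directions.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 16))
directions = rng.normal(size=(3, 16))

ablated = ablate_directions(hidden, directions)

# Largest remaining component along any of the removed directions.
residual = np.abs(ablated @ directions.T).max()

# For comparison: ablating only the first direction leaves the others in place.
single = ablate_directions(hidden, directions[:1])
leftover = np.abs(single @ directions[1:].T).max()
```

With `k = 1` this reduces to the usual single-direction projection, so a multi-direction variant could presumably share the same code path.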
