SOM Directions are Better than One: Multi-Directional Refusal Suppression in Language Models

"SOM Directions are Better than One: Multi-Directional Refusal Suppression in Language Models" is another paper on abliteration.

It provides a deeper analysis of which directions to ablate from the model for refusal removal. The directions are found with self-organizing maps, and ablating them does less damage to the downstream model than single-direction ablation.
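To make the idea concrete, here is a rough NumPy sketch of what multi-directional ablation could look like. It is not taken from the linked repository, and several parts are my assumptions: that each direction is a SOM neuron (fitted on harmful-prompt activations) minus the mean harmless activation, that all directions are removed at once by projecting out their span, and every function name and hyperparameter (`train_som`, `refusal_directions`, `ablate`, the grid size, the decay schedule). The paper may select and apply directions differently.

```python
import numpy as np

def train_som(data, grid=(2, 2), epochs=50, lr=0.5, sigma=1.0, seed=0):
    """Fit a tiny self-organizing map; returns weights of shape (n_neurons, dim)."""
    rng = np.random.default_rng(seed)
    # Grid coordinates of each neuron, used for the neighbourhood function.
    coords = np.array([(i, j) for i in range(grid[0]) for j in range(grid[1])], dtype=float)
    # Initialise neurons from randomly chosen data points.
    weights = data[rng.choice(len(data), size=len(coords), replace=True)].astype(float)
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs          # linear decay, hypothetical schedule
        cur_lr = lr * decay
        cur_sigma = max(sigma * decay, 1e-3)
        for x in data[rng.permutation(len(data))]:
            # Best matching unit: the neuron closest to the sample.
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
            grid_dist2 = np.sum((coords - coords[bmu]) ** 2, axis=1)
            h = np.exp(-grid_dist2 / (2 * cur_sigma ** 2))
            # Pull the BMU and its grid neighbours towards the sample.
            weights += cur_lr * h[:, None] * (x - weights)
    return weights

def refusal_directions(harmful, harmless, **som_kwargs):
    """One unit direction per SOM neuron (assumed: neuron minus harmless mean)."""
    neurons = train_som(harmful, **som_kwargs)
    dirs = neurons - harmless.mean(axis=0)
    return dirs / np.linalg.norm(dirs, axis=1, keepdims=True)

def ablate(acts, directions, tol=1e-10):
    """Remove the component of `acts` lying in the span of `directions`."""
    # SVD gives an orthonormal basis of the span, even if directions are nearly collinear.
    u, s, _ = np.linalg.svd(directions.T, full_matrices=False)
    basis = u[:, s > tol * s.max()]           # shape (dim, rank)
    return acts - (acts @ basis) @ basis.T
```

With a single neuron (`grid=(1, 1)`) the neuron tends towards the mean of the harmful activations, so the sketch approximates the usual difference-of-means direction. Larger grids give several directions that are projected out together.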

The paper and code are both available:

https://arxiv.org/abs/2511.08379v2

https://github.com/pralab/som-refusal-directions

I think it may be a great method to have in this repository.
