This repository contains the code for the paper "Improving Large Language Model Safety with Contrastive Representation Learning" by Samuel Simko, Mrinmaya Sachan, Bernhard Schölkopf, and Zhijing Jin. The paper presents a method that fine-tunes a model with a triplet-based loss combined with adversarial hard negative mining, encouraging separation between benign and harmful representations. The method is a natural extension of the circuit-breakers approach.
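As a rough illustration of the idea (not the paper's exact objective), a hinge-style triplet loss over hidden representations can be sketched as follows; the margin value, distance metric, and toy vectors are illustrative assumptions:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: pull the anchor toward the positive
    (benign) representation and push it at least `margin` farther
    from the negative (harmful) one.

    NOTE: a simplified sketch, not the paper's actual training loss.
    """
    d_pos = np.linalg.norm(anchor - positive)   # anchor-to-benign distance
    d_neg = np.linalg.norm(anchor - negative)   # anchor-to-harmful distance
    return max(0.0, d_pos - d_neg + margin)

# Toy 2-D vectors standing in for layer activations
anchor = np.array([1.0, 0.0])
benign = np.array([0.9, 0.1])    # close to the anchor
harmful = np.array([-1.0, 0.0])  # far from the anchor

print(triplet_loss(anchor, benign, harmful))
```

When the benign representation is already much closer than the harmful one (as above), the hinge clamps the loss to zero; swapping the positive and negative yields a positive penalty that training would minimize.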
Note: This repository is under development. We will update and refactor the code over the coming weeks.
The repository contains the following directories:
- `code`: Contains a modified clone of the RepBend GitHub repository. Our implementation is located in `code/methods/triplet`. To run model training, use the script `train_triplet.sh` in the `code` folder.
- `evaluation`: Code and submodules used for evaluation. The directory `evaluation/embedding_attack` contains our code for running embedding attacks; an example invocation is given in `evaluation/embedding_eval/run.sh`.
- `results`: Contains the results of our experiments. In particular, it contains our GCG and REINFORCE generations for the triplet and ablation models of Llama 3 8B.
To evaluate general capabilities, install and run the `lm-evaluation-harness` package.
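For example, assuming a pip-based environment and a Hugging Face checkpoint (the model name and task selection below are placeholders, not the paper's exact evaluation setup), a run might look like:

```shell
# Install the harness and evaluate a Hugging Face model on standard benchmarks.
pip install lm-eval
lm_eval --model hf \
    --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct \
    --tasks mmlu,hellaswag \
    --batch_size 8
```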
First, clone the repository and its submodules:
```shell
git clone --recurse-submodules https://github.com/samuelsimko/crl-llm-defense
```

For training, follow the instructions in `code/README.md` to set up the environment and run the training script.
For simpler comparison with previous work, we implemented our method directly in the RepBend codebase.
Trained models can be evaluated using the `lm-evaluation-harness` and HarmBench packages. The `embedding_eval` directory contains our code for running the embedding attacks.
If you use our code or methods in your research, please cite our paper as follows:
```bibtex
@misc{simko2025improvinglargelanguagemodel,
  title={Improving Large Language Model Safety with Contrastive Representation Learning},
  author={Samuel Simko and Mrinmaya Sachan and Bernhard Schölkopf and Zhijing Jin},
  year={2025},
  eprint={2506.11938},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.11938},
}
```

In addition, if you use our codebase, we kindly ask that you also cite the RepBend paper, as our work builds on their code:
```bibtex
@misc{yousefpour2025representationbendinglargelanguage,
  title={Representation Bending for Large Language Model Safety},
  author={Ashkan Yousefpour and Taeheon Kim and Ryan S. Kwon and Seungbeen Lee and Wonje Jeung and Seungju Han and Alvin Wan and Harrison Ngan and Youngjae Yu and Jonghyun Choi},
  year={2025},
  eprint={2504.01550},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2504.01550},
}
```