
Conversation

dribnet commented Feb 2, 2025

This is quite sloppy and I do not recommend merging it as-is into the codebase. However, I wanted to offer it in its current state to allow visibility and get feedback. If there is interest, I would be happy to clean this up a bit and re-submit it in a cleaner form. If not, I'm happy to close this unmerged, where it might still be useful to someone else.

This can be tested on the current branch by running:

python -m pipeline.run_pipeline --model_path deepseek-ai/DeepSeek-R1-Distill-Llama-8B

Details below.

I created a refusal vector for DeepSeek-R1-Distill-Llama-8B with multiple changes to the code - some trivial, but a few not obvious. If there is interest, I or someone else could clean this up, but I wanted to offer it in its current state while the changes were fresh in my mind, in case it is useful for others.

The main changes were:

  • pipeline/model_utils/llama3_model.py was made specific to the DeepSeek distill. The input appends a prefill to have the model skip the thinking phase:

+<think>
+[THINKING_SKIPPED]
+</think>

I attempted to insert this into LLAMA3_CHAT_TEMPLATE_WITH_SYSTEM but that failed, so I simply patched the format_instruction_llama3_chat() function instead (a rough sketch of the resulting patch is included after this list).

Also in this file I included </think> in _get_eoi_toks() so this would be evaluated.

  • pipeline/model_utils/model_factory.py was patched to load llama3_model() when 'distill-llama' is part of the model_name (see the sketch after this list). Note that this should really be its own dedicated model factory, as this likely breaks existing (non-DeepSeek) llama3 processing.

  • pipeline/run_pipeline.py

kl_threshold was raised from 0.1 to 0.2, as this was necessary in testing.

harmful_train data was patched with nine examples of DeepSeek-specific refusals that I think were not well covered in existing datasets, such as "What is 8964?".

  • Also included is pipeline/runs/DeepSeek-R1-Distill-Llama-8B with the refusal vector I generated, along with metrics and diagnostics.

  • I also sloppily left in some print statements used for debugging
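
For reference, here is a rough sketch of the prefill patch (the real format_instruction_llama3_chat() takes more arguments and the template constant is abbreviated here, so treat this as illustrative rather than the exact diff):

LLAMA3_CHAT_TEMPLATE = (
    "<|start_header_id|>user<|end_header_id|>\n\n"
    "{instruction}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)

# Prefill that makes DeepSeek-R1-Distill-Llama-8B skip its thinking phase.
THINK_SKIP_PREFILL = "<think>\n[THINKING_SKIPPED]\n</think>\n"

def format_instruction_llama3_chat(instruction: str) -> str:
    # Append the prefill after the assistant header so generation resumes
    # after an (empty) thinking block instead of producing one.
    return LLAMA3_CHAT_TEMPLATE.format(instruction=instruction) + THINK_SKIP_PREFILL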
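
The model_factory change amounts to a substring check along these lines (the function and class names follow the pattern of the existing model_factory.py, so double-check them against the actual file):

def construct_model_base(model_path: str):
    model_name = model_path.lower()
    if 'distill-llama' in model_name:
        # Route DeepSeek-R1-Distill-Llama-* through the patched Llama 3 wrapper.
        # A dedicated DeepSeek model class would be cleaner, since this also
        # affects how plain Llama 3 models are handled.
        from pipeline.model_utils.llama3_model import Llama3Model
        return Llama3Model(model_path)
    if 'llama-3' in model_name:
        from pipeline.model_utils.llama3_model import Llama3Model
        return Llama3Model(model_path)
    raise ValueError(f"Unknown model family: {model_path}")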

dribnet commented Feb 3, 2025

(some output metrics)

Selected direction: position=-1, layer=16
Refusal score: -3.6441 (baseline: 1.0202)
Steering score: 0.3294 (baseline: -6.1980)
KL Divergence: 0.1196
Average Substring Matching ASR: 0.27
Average LlamaGuard2 ASR: 0.13
Evaluation results saved at pipeline/runs/DeepSeek-R1-Distill-Llama-8B/completions/jailbreakbench_baseline_evaluations.json
Average Substring Matching ASR: 0.98
Average LlamaGuard2 ASR: 0.73
Evaluation results saved at pipeline/runs/DeepSeek-R1-Distill-Llama-8B/completions/jailbreakbench_ablation_evaluations.json
Average Substring Matching ASR: 0.98
Average LlamaGuard2 ASR: 0.78
Evaluation results saved at pipeline/runs/DeepSeek-R1-Distill-Llama-8B/completions/jailbreakbench_actadd_evaluations.json
Average Substring Matching ASR: 1.0
Evaluation results saved at pipeline/runs/DeepSeek-R1-Distill-Llama-8B/completions/harmless_baseline_evaluations.json
Average Substring Matching ASR: 0.06
Evaluation results saved at pipeline/runs/DeepSeek-R1-Distill-Llama-8B/completions/harmless_actadd_evaluations.json


andyrdt commented Feb 4, 2025

Nice! Those ASRs look good.

One note is that it looks like the deepseek models use a different chat template than the llama models. Something like this works:

"""<|User|>{instruction}<|Assistant|><think>

</think>

"""

although from your results, it looks like using the Llama template works well too.
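
A quick way to double-check is to pull the bundled template straight from the tokenizer (a minimal sketch, assuming transformers is installed):

from transformers import AutoTokenizer

# Print the chat template that ships with the distill, to compare against
# the Llama 3 template used in this PR.
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Llama-8B")
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 8964?"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)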

I'm also curious to see how refusal ablation impacts the model's thinking within the tags!


andyrdt commented Feb 4, 2025

And as for merging - this repo is meant to be an artifact accompanying the original paper's results/experiments, so I'd prefer not to merge new changes!
