
Conversation

dribnet commented Feb 2, 2025

This is quite sloppy and I do not recommend merging it as-is into the codebase. However, I wanted to offer it in its current state to allow visibility and get feedback. If there is interest, I would be happy to clean this up a bit and re-submit it in a cleaner form. If not, I'm happy to close this unmerged, where it might still be useful to someone else.

This can be tested on the current branch by running:

python -m pipeline.run_pipeline --model_path deepseek-ai/DeepSeek-R1-Distill-Llama-8B

Details below.

I created a refusal vector for DeepSeek-R1-Distill-Llama-8B with multiple changes to the code - some trivial, but a few not obvious. If there is interest, I or someone else could clean this up, but I wanted to offer it in its current state while the changes were fresh in my mind, in case it is useful for others.

The main changes were:

  • pipeline/model_utils/llama3_model.py was made specific to the DeepSeek distill. The input appends a prefill to have the model skip the thinking phase:

+<think>
+[THINKING_SKIPPED]
+</think>

I attempted to insert this into LLAMA3_CHAT_TEMPLATE_WITH_SYSTEM but that failed, so I simply patched the format_instruction_llama3_chat() function instead (a rough sketch of the resulting patch is included after this list).

Also in this file I included </think> in _get_eoi_toks() so this would be evaluated.

  • pipeline/model_utils/model_factory.py was patched to load llama3_model() when 'distill-llama' is part of the model_name (see the sketch after this list). Note that this should really be its own dedicated model factory, as this likely breaks existing (non-DeepSeek) llama3 processing.

  • pipeline/run_pipeline.py

kl_threshold was raised from 0.1 to 0.2, as this was necessary in testing.

harmful_train data was patched with nine examples of DeepSeek-specific refusals that I think were not well covered in existing datasets, such as "What is 8964?".

  • Also included is pipeline/runs/DeepSeek-R1-Distill-Llama-8B with the refusal vector I generated, along with metrics and diagnostics.

  • I also sloppily left in some print statements used for debugging
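
For reference, here is a rough sketch of the prefill patch (the real format_instruction_llama3_chat() takes more arguments and the template constant is abbreviated here, so treat this as illustrative rather than the exact diff):

LLAMA3_CHAT_TEMPLATE = (
    "<|start_header_id|>user<|end_header_id|>\n\n"
    "{instruction}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)

# Prefill that makes DeepSeek-R1-Distill-Llama-8B skip its thinking phase.
THINK_SKIP_PREFILL = "<think>\n[THINKING_SKIPPED]\n</think>\n"

def format_instruction_llama3_chat(instruction: str) -> str:
    # Append the prefill after the assistant header so generation resumes
    # after an (empty) thinking block instead of producing one.
    return LLAMA3_CHAT_TEMPLATE.format(instruction=instruction) + THINK_SKIP_PREFILL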
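
The model_factory change amounts to a substring check along these lines (the function and class names follow the pattern of the existing model_factory.py, so double-check them against the actual file):

def construct_model_base(model_path: str):
    model_name = model_path.lower()
    if 'distill-llama' in model_name:
        # Route DeepSeek-R1-Distill-Llama-* through the patched Llama 3 wrapper.
        # A dedicated DeepSeek model class would be cleaner, since this also
        # affects how plain Llama 3 models are handled.
        from pipeline.model_utils.llama3_model import Llama3Model
        return Llama3Model(model_path)
    if 'llama-3' in model_name:
        from pipeline.model_utils.llama3_model import Llama3Model
        return Llama3Model(model_path)
    raise ValueError(f"Unknown model family: {model_path}")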

dribnet commented Feb 3, 2025

(some output metrics)

Selected direction: position=-1, layer=16
Refusal score: -3.6441 (baseline: 1.0202)
Steering score: 0.3294 (baseline: -6.1980)
KL Divergence: 0.1196
Average Substring Matching ASR: 0.27
Average LlamaGuard2 ASR: 0.13
Evaluation results saved at pipeline/runs/DeepSeek-R1-Distill-Llama-8B/completions/jailbreakbench_baseline_evaluations.json
Average Substring Matching ASR: 0.98
Average LlamaGuard2 ASR: 0.73
Evaluation results saved at pipeline/runs/DeepSeek-R1-Distill-Llama-8B/completions/jailbreakbench_ablation_evaluations.json
Average Substring Matching ASR: 0.98
Average LlamaGuard2 ASR: 0.78
Evaluation results saved at pipeline/runs/DeepSeek-R1-Distill-Llama-8B/completions/jailbreakbench_actadd_evaluations.json
Average Substring Matching ASR: 1.0
Evaluation results saved at pipeline/runs/DeepSeek-R1-Distill-Llama-8B/completions/harmless_baseline_evaluations.json
Average Substring Matching ASR: 0.06
Evaluation results saved at pipeline/runs/DeepSeek-R1-Distill-Llama-8B/completions/harmless_actadd_evaluations.json


andyrdt commented Feb 4, 2025

Nice! Those ASRs look good.

One note is that it looks like the deepseek models use a different chat template than the llama models. Something like this works:

"""<|User|>{instruction}<|Assistant|><think>

</think>

"""

although from your results, it looks like using the Llama template works well too.
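
A quick way to double-check is to pull the bundled template straight from the tokenizer (a minimal sketch, assuming transformers is installed):

from transformers import AutoTokenizer

# Print the chat template that ships with the distill, to compare against
# the Llama 3 template used in this PR.
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Llama-8B")
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 8964?"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)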

I'm also curious to see how refusal ablation impacts the model's thinking within the tags!


andyrdt commented Feb 4, 2025

And as for merging - this repo is meant to be an artifact accompanying the original paper's results/experiments, so I'd prefer not to merge new changes!
