This is the source code accompanying the paper "Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks" by Han Wang, Gang Wang, and Huan Zhang.
To prepare the environment for LLaVA-v1.5 and MiniGPT-4, you can run the following commands:
conda create --name astra python==3.10.14
conda activate astra
pip install -r requirements.txt

To prepare the environment for Qwen2-VL, please run the following commands:
conda create --name astra_qwen python==3.10.15
conda activate astra_qwen
pip install -r requirements_qwen.txt

If you want to generate adversarial images on your own, please refer to this excellent repo, which provides the code to attack LLaVA and MiniGPT-4. If you need code to attack Qwen2-VL, please send me an email ([email protected]).
In constructing the steering vectors, we generate universal adversarial images using PGD without explicitly providing malicious instructions. During evaluation, our defense can effectively defend against perturbation-based attacks regardless of whether the attacker included malicious instructions when generating the adversarial image.
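For intuition, below is a minimal sketch of the kind of L_inf-bounded PGD used to craft such an image. This is not the attack code from the repo linked above; `vlm_loss` is a hypothetical stand-in for a loss computed through your VLM (e.g., the negative log-likelihood of a harmful target continuation, averaged over a corpus of text prompts so that the resulting perturbation is universal).

```python
import torch

def pgd_universal_image(vlm_loss, clean_image, epsilon=16 / 255, step_size=1 / 255, steps=2000):
    """Craft an L_inf-bounded adversarial image with PGD.

    vlm_loss: callable mapping an image tensor in [0, 1] to a scalar loss the
        attacker wants to minimize (hypothetical; supply your own model hook).
    clean_image: float tensor in [0, 1], e.g. of shape (3, H, W).
    """
    delta = torch.zeros_like(clean_image, requires_grad=True)
    for _ in range(steps):
        loss = vlm_loss(torch.clamp(clean_image + delta, 0.0, 1.0))
        loss.backward()
        with torch.no_grad():
            # Signed-gradient descent step, then project back into the
            # epsilon-ball and the valid pixel range.
            delta -= step_size * delta.grad.sign()
            delta.clamp_(-epsilon, epsilon)
            delta.clamp_(-clean_image, 1.0 - clean_image)
        delta.grad.zero_()
    return torch.clamp(clean_image + delta, 0.0, 1.0).detach()
```

Here epsilon would correspond to 16/255, 32/255, or 64/255 for the constrain_16/32/64 settings used later, while (roughly speaking) the unconstrain setting drops the budget.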
If you want to follow the setup in our paper, we have already provided adversarial images for perturbation-based attack setups in ./datasets/adv_img_*.
- Textual queries for steering vector construction: we use 40 harmful instructions from Qi et al. Please place the file manual_harmful_instructions.csv in ./datasets/harmful_corpus.
- Evaluation datasets: Please download the RealToxicityPrompts dataset and place it in ./datasets/harmful_corpus. Then, run the script split_toxicity_set.py located in ./datasets/harmful_corpus to generate the validation and test sets.
We mainly use text queries from AdvBench and Anthropic-HHH datasets in our main experiments.
- Textual queries for steering vector construction: Following the dataset split in Schaeffer et al., we use train.csv in AdvBench to perform image attribution.
- Evaluation datasets: The eval.csv file is equally divided to create validation and test sets (a rough sketch of this split is shown below). You can run the script split_jb_set.py in ./datasets/harmful_corpus to generate the validation and test sets for this setup.
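As a rough illustration only (not the actual split_jb_set.py; the shuffling, seed, file location, and output names below are assumptions), an equal division of eval.csv could look like this:

```python
import pandas as pd

# Assumed location of eval.csv; adjust to wherever you placed AdvBench.
df = pd.read_csv("./datasets/harmful_corpus/eval.csv")

# Shuffle with an arbitrary seed, then cut in half (hypothetical output names).
df = df.sample(frac=1.0, random_state=0).reset_index(drop=True)
half = len(df) // 2
df.iloc[:half].to_csv("./datasets/harmful_corpus/jb_val.csv", index=False)
df.iloc[half:].to_csv("./datasets/harmful_corpus/jb_test.csv", index=False)
```

For the paper's exact split, use the provided split_jb_set.py.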
In transferability experiments, we use the text queries from JailbreakBench to attack OOD images and evaluate the generalization ability of our defense. The queries can be found in ./datasets/harmful_corpus/JBB.
Please download the MM-SafetyBench dataset and place it in ./datasets. We randomly sample 10 items from the 01-07 & 09 scenarios to construct the test set in ./datasets/MM-SafetyBench/mmsafety_test.json.
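The sampling itself is straightforward; the sketch below shows the idea only, reading the sentence above as 10 items per scenario. The file names and schema are placeholders (MM-SafetyBench's actual layout differs), and it writes to a separate output file so the provided mmsafety_test.json is left untouched.

```python
import json
import random

random.seed(0)  # arbitrary seed, not the paper's
sampled = []
for scenario in ["01", "02", "03", "04", "05", "06", "07", "09"]:
    # Placeholder path/schema: assume each scenario file holds a list of question dicts.
    with open(f"./datasets/MM-SafetyBench/{scenario}.json") as f:
        items = json.load(f)
    sampled.extend(random.sample(items, 10))

with open("./datasets/MM-SafetyBench/mmsafety_test_resampled.json", "w") as f:
    json.dump(sampled, f, indent=2)
```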
Please download the MM-Vet and MMBench datasets through this link. To generate validation and test sets for the MMBench dataset, run the script split_mmbench.py located in ./datasets/MMBench. For the MM-Vet dataset, we provide the split items in ./datasets/mm-vet.
To perform image attribution (e.g., in the Qwen2-VL Jailbreak setup), run the following commands:
CUDA_VISIBLE_DEVICES=0 python ./extract_attr/extract_qwen_jb_attr.py
CUDA_VISIBLE_DEVICES=0 python ./extract_act/extracting_activations_qwen_jb.py

(Note: when performing image attribution on LLaVA-v1.5 or MiniGPT-4, please comment out line 1 in ./image_attr/__init__.py to avoid potential bugs caused by differences in environments.)
To obtain the calibration activations (e.g., for Qwen2-VL), run the following commands:
CUDA_VISIBLE_DEVICES=0 python ./extract_act/extracting_activations_qwen_ref.py

We provide steering vectors for each setup in ./activations/*/jb and ./activations/*/toxic. Calibration activations are available in ./activations/*/reference.
To evaluate the performance of adaptive steering (e.g., in Qwen2-VL), run the following commands:
CUDA_VISIBLE_DEVICES=0 python ./steer_eval/steering_qwen_toxic.py --attack_type constrain_16 --alpha 7 --eval test --steer_layer 14
CUDA_VISIBLE_DEVICES=0 python ./steer_eval/steering_qwen_jb.py --attack_type constrain_16 --alpha 7 --eval test --steer_layer 14
CUDA_VISIBLE_DEVICES=0 python ./steer_eval/steering_qwen_jb_ood.py --attack_type constrain_16 --alpha 7 --eval test --steer_layer 14 --attack_algorithm apgd
CUDA_VISIBLE_DEVICES=0 python ./steer_eval/steering_qwen_typo.py --alpha 7 --eval test --steer_layer 14

You can set attack_type to constrain_16, constrain_32, constrain_64, or unconstrain. Detailed options can be found in the parse_args() function of each Python file.
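To sweep all four budgets in one go, a small driver like the following works; it only reuses the flags shown above, and you can swap in whichever evaluation script you need.

```python
import os
import subprocess

env = {**os.environ, "CUDA_VISIBLE_DEVICES": "0"}

# Run the Qwen2-VL toxicity evaluation for every supported attack_type.
for attack in ["constrain_16", "constrain_32", "constrain_64", "unconstrain"]:
    subprocess.run(
        ["python", "./steer_eval/steering_qwen_toxic.py",
         "--attack_type", attack, "--alpha", "7",
         "--eval", "test", "--steer_layer", "14"],
        env=env, check=True,
    )
```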
To evaluate the performance of MiniGPT-4 and LLaVA-v1.5 (e.g. in the Toxicity setup), run the following commands:
CUDA_VISIBLE_DEVICES=0 python ./steer_eval/steering_minigpt_toxic.py --attack_type constrain_16 --alpha 5 --eval test
CUDA_VISIBLE_DEVICES=0 python ./steer_eval/steering_llava_toxic.py --attack_type constrain_16 --alpha 10 --eval test

To evaluate performance in the benign scenarios (e.g., with MiniGPT-4), run the following commands:
CUDA_VISIBLE_DEVICES=0 python ./utility_eval/minigpt_mmbench.py --attack_type constrain_16 --alpha 7 --eval test --steer_vector jb
CUDA_VISIBLE_DEVICES=0 python ./utility_eval/minigpt_mmvet.py --attack_type constrain_16 --alpha 7 --eval test --steer_vector jb
CUDA_VISIBLE_DEVICES=0 python ./utility_eval/minigpt_mmbench.py --attack_type constrain_16 --alpha 5 --eval test --steer_vector toxic
CUDA_VISIBLE_DEVICES=0 python ./utility_eval/minigpt_mmvet.py --attack_type constrain_16 --alpha 5 --eval test --steer_vector toxic

For detailed prompts to evaluate responses, see MM-Vet.
If you find our work useful, please consider citing our paper:
@article{wang2024steering,
title={Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks},
author={Wang, Han and Wang, Gang and Zhang, Huan},
journal={arXiv preprint arXiv:2411.16721},
year={2024}
}
Our codebase is built upon the following work:
@article{cohenwang2024contextcite,
title={ContextCite: Attributing Model Generation to Context},
author={Cohen-Wang, Benjamin and Shah, Harshay and Georgiev, Kristian and Madry, Aleksander},
journal={arXiv preprint arXiv:2409.00729},
year={2024}
}

