Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks

This is the source code accompanying the CVPR 2025 paper Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks by Han Wang, Gang Wang, and Huan Zhang.

(Figures: method teaser and adaptive steering overview.)

Environment Preparation

To prepare the environment for LLaVA-v1.5 and MiniGPT-4, you can run the following commands:

conda create --name astra python==3.10.14
conda activate astra
pip install -r requirements.txt

To prepare the environment for Qwen2-VL, please run the following commands:

conda create --name astra_qwen python==3.10.15
conda activate astra_qwen
pip install -r requirements_qwen.txt
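
As an optional sanity check after installation (assuming the pinned requirements pull in a CUDA-enabled PyTorch build, which the CUDA_VISIBLE_DEVICES commands below rely on), you can run:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"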

Dataset Preparation

Perturbation-based Attacks

Preparing adversarial images

If you want to generate adversarial images on your own, please refer to this excellent repo, which provides the code to attack LLaVA and MiniGPT-4. If you need code to attack Qwen2-VL, please send me an email ([email protected]).

In constructing the steering vectors, we generate universal adversarial images using PGD without explicitly providing malicious instructions. During evaluation, our defense effectively defends against perturbation-based attacks regardless of whether attackers include malicious instructions during the adversarial image generation process.

If you want to follow the setup in our paper, we have already provided adversarial images for perturbation-based attack setups in ./datasets/adv_img_*.
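
For reference, below is a minimal L-infinity PGD sketch. It is illustrative only (the repo linked above provides the actual attack code); loss_fn stands in for whatever differentiable objective the attacker maximizes on the VLM, and the epsilon/step-size/iteration values are assumptions that simply mirror the constrain_16 naming.

import torch

def pgd_attack(image, loss_fn, epsilon=16/255, step_size=1/255, num_steps=100):
    """Return an adversarial image within an L-infinity ball of radius epsilon."""
    adv = image.clone().detach()
    for _ in range(num_steps):
        adv.requires_grad_(True)
        loss = loss_fn(adv)                      # attacker's objective (model-dependent)
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv = adv + step_size * grad.sign()                    # gradient ascent step
            adv = image + (adv - image).clamp(-epsilon, epsilon)   # project onto the L-inf ball
            adv = adv.clamp(0.0, 1.0)                              # keep a valid image
        adv = adv.detach()
    return adv

# Toy usage: maximize the output of a random linear probe on a random image.
probe = torch.nn.Linear(3 * 224 * 224, 1)
img = torch.rand(1, 3, 224, 224)
adv_img = pgd_attack(img, lambda x: probe(x.flatten(1)).mean())
print((adv_img - img).abs().max())  # bounded by epsilon = 16/255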

Toxicity

  • Textual queries for steering vector construction:
    We use 40 harmful instructions from Qi et al. Please place the file manual_harmful_instructions.csv in ./datasets/harmful_corpus.

  • Evaluation datasets:
    Please download the RealToxicityPrompts dataset and place it in ./datasets/harmful_corpus. Then, run the script split_toxicity_set.py located in ./datasets/harmful_corpus to generate the validation and test sets.

Jailbreak

We mainly use text queries from AdvBench and Anthropic-HHH datasets in our main experiments.

  • Textual queries for steering vector construction:
    Following the dataset split in Schaeffer et al., we use train.csv in AdvBench to perform image attribution.

  • Evaluation datasets:
    The eval.csv file is divided equally to create the validation and test sets. You can run the script split_jb_set.py in ./datasets/harmful_corpus to generate the validation and test sets for this setup; a rough sketch of this split is shown below.

In transferability experiments, we use the text queries from JailbreakBench to attack OOD images and evaluate the generalization ability of our defense. The queries can be found in ./datasets/harmful_corpus/JBB.
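
The rough idea of the equal split is sketched here; the provided split_jb_set.py is the authoritative version, and the shuffle seed and output file names below are assumptions.

import pandas as pd

df = pd.read_csv("datasets/harmful_corpus/eval.csv")
df = df.sample(frac=1.0, random_state=0).reset_index(drop=True)            # shuffle (seed assumed)
half = len(df) // 2
df.iloc[:half].to_csv("datasets/harmful_corpus/jb_val.csv", index=False)   # validation half (file name assumed)
df.iloc[half:].to_csv("datasets/harmful_corpus/jb_test.csv", index=False)  # test half (file name assumed)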

Structure-based Attacks

Please download the MM-SafetyBench dataset and place it in ./datasets. We randomly sample 10 items from the 01-07 & 09 scenarios to construct the test set items in ./datasets/MM-SafetyBench/mmsafety_test.json.
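
A rough illustration of this sampling is shown below. It is not the script that produced mmsafety_test.json (which is already provided); the directory layout, per-scenario sampling, random seed, and output path are assumptions.

import glob, json, random

random.seed(0)
test_items = {}
# Assumed layout: one JSON file of questions per scenario (01-07 & 09).
for path in sorted(glob.glob("datasets/MM-SafetyBench/processed_questions/*.json")):
    with open(path) as f:
        items = json.load(f)
    test_items[path] = random.sample(list(items), 10)   # 10 sampled keys per scenario (assumed)

with open("datasets/MM-SafetyBench/mmsafety_test_resampled.json", "w") as f:
    json.dump(test_items, f, indent=2)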

Utility Evaluation

Please download the MM-Vet and MMBench datasets through this link. To generate validation and test sets for the MMBench dataset, run the script split_mmbench.py located in ./datasets/MMBench. For the MM-Vet dataset, we provide the split items in ./datasets/mm-vet.

Steering Vector Construction

Demos

To perform image attribution (e.g., in the Qwen2-VL Jailbreak setup), run the following commands:

CUDA_VISIBLE_DEVICES=0 python ./extract_attr/extract_qwen_jb_attr.py
CUDA_VISIBLE_DEVICES=0 python ./extract_act/extracting_activations_qwen_jb.py

(Note: when performing image attribution on LLaVA-v1.5 or MiniGPT-4, please comment out line 1 in ./image_attr/__init__.py to avoid potential bugs caused by differences in environments.)

To obtain the calibration activations (e.g., for Qwen2-VL), run the following commands:

CUDA_VISIBLE_DEVICES=0 python ./extract_act/extracting_activations_qwen_ref.py
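
For intuition, here is a minimal sketch of recording activations at a single layer with a forward hook; the extract_act scripts are the authoritative implementation, and the layer index, model attribute path, and file names in the commented usage lines are placeholders.

import torch

captured = []

def capture_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output   # (batch, seq, dim)
    captured.append(hidden.detach().float().cpu())

# handle = model.model.layers[14].register_forward_hook(capture_hook)   # layer index is a placeholder
# ... run the model on the calibration (benign) inputs ...
# handle.remove()
# torch.save(captured, "activations/qwen/reference/calibration.pt")     # file name is a placeholder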

Activations

We provide steering vectors for each setup in ./activations/*/jb and ./activations/*/toxic. Calibration activations are available in ./activations/*/reference.
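
As a simplified illustration of how a saved steering vector can be applied at inference: the steer_eval scripts implement the paper's adaptive steering, while the sketch below uses a fixed scale; the sign convention, file path, and model attribute names are simplifications and assumptions.

import torch

def make_steering_hook(steer_vec: torch.Tensor, alpha: float):
    direction = steer_vec / steer_vec.norm()
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output   # (batch, seq, dim)
        steered = hidden - alpha * direction.to(hidden.device, hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook

# Usage sketch (names are placeholders):
# vec = torch.load("activations/qwen/jb/steering_vector.pt")
# handle = model.model.layers[14].register_forward_hook(make_steering_hook(vec, alpha=7.0))  # --steer_layer 14
# ... run generation ...
# handle.remove()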

Inference Evaluations

Adversarial Scenarios

To evaluate the performance of adaptive steering (e.g., in Qwen2-VL), run the following commands:

CUDA_VISIBLE_DEVICES=0 python ./steer_eval/steering_qwen_toxic.py --attack_type constrain_16 --alpha 7 --eval test --steer_layer 14
CUDA_VISIBLE_DEVICES=0 python ./steer_eval/steering_qwen_jb.py --attack_type constrain_16 --alpha 7 --eval test --steer_layer 14
CUDA_VISIBLE_DEVICES=0 python ./steer_eval/steering_qwen_jb_ood.py --attack_type constrain_16 --alpha 7 --eval test --steer_layer 14 --attack_algorithm apgd
CUDA_VISIBLE_DEVICES=0 python ./steer_eval/steering_qwen_typo.py --alpha 7 --eval test --steer_layer 14

You can set attack_type to constrain_16, constrain_32, constrain_64, or unconstrain. Detailed options can be found in the parse_args() function of each Python file.
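
The flags used above have roughly the following shape; see parse_args() in each script for the authoritative choices and defaults (the defaults and the validation-split name below are assumptions, and some scripts add extra flags such as --attack_algorithm or --steer_vector).

import argparse

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--attack_type", default="constrain_16",
                        choices=["constrain_16", "constrain_32", "constrain_64", "unconstrain"])
    parser.add_argument("--alpha", type=float, default=7, help="steering strength")
    parser.add_argument("--eval", default="test", choices=["val", "test"])   # validation-split name assumed
    parser.add_argument("--steer_layer", type=int, default=14)
    return parser.parse_args()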

To evaluate the performance of MiniGPT-4 and LLaVA-v1.5 (e.g., in the Toxicity setup), run the following commands:

CUDA_VISIBLE_DEVICES=0 python ./steer_eval/steering_minigpt_toxic.py --attack_type constrain_16 --alpha 5 --eval test
CUDA_VISIBLE_DEVICES=0 python ./steer_eval/steering_llava_toxic.py --attack_type constrain_16 --alpha 10 --eval test

Benign Scenarios

To evaluate performance in the benign scenarios (e.g., with MiniGPT-4), run the following commands:

CUDA_VISIBLE_DEVICES=0 python ./utility_eval/minigpt_mmbench.py --attack_type constrain_16 --alpha 7 --eval test --steer_vector jb
CUDA_VISIBLE_DEVICES=0 python ./utility_eval/minigpt_mmvet.py --attack_type constrain_16 --alpha 7 --eval test --steer_vector jb
CUDA_VISIBLE_DEVICES=0 python ./utility_eval/minigpt_mmbench.py --attack_type constrain_16 --alpha 5 --eval test --steer_vector toxic
CUDA_VISIBLE_DEVICES=0 python ./utility_eval/minigpt_mmvet.py --attack_type constrain_16 --alpha 5 --eval test --steer_vector toxic

For detailed prompts to evaluate responses, see MM-Vet.

Citation

If you find our work useful, please consider citing our paper:

@article{wang2024steering,
  title={Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks},
  author={Wang, Han and Wang, Gang and Zhang, Huan},
  journal={arXiv preprint arXiv:2411.16721},
  year={2024}
}

Our codebase is built upon the following work:

@article{cohenwang2024contextcite,
    title={ContextCite: Attributing Model Generation to Context},
    author={Cohen-Wang, Benjamin and Shah, Harshay and Georgiev, Kristian and Madry, Aleksander},
    journal={arXiv preprint arXiv:2409.00729},
    year={2024}
}
