Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks

This is the source code accompanying the CVPR 2025 paper Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks by Han Wang, Gang Wang, and Huan Zhang.

(Figures: method teaser and adaptive steering overview.)

Environment Preparation

To prepare the environment for LLaVA-v1.5 and MiniGPT-4, you can run the following commands:

conda create --name astra python==3.10.14
conda activate astra
pip install -r requirements.txt

To prepare the environment for Qwen2-VL, please run the following commands:

conda create --name astra_qwen python==3.10.15
conda activate astra_qwen
pip install -r requirements_qwen.txt
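
As an optional sanity check after installation (assuming the pinned requirements pull in a CUDA-enabled PyTorch build, which the CUDA_VISIBLE_DEVICES commands below rely on), you can run:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"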

Dataset Preparation

Perturbation-based Attacks

Preparing adversarial images

If you want to generate adversarial images on your own, please refer to this excellent repo, which provides the code to attack LLaVA and MiniGPT-4. If you need code to attack Qwen2-VL, please send me an email ([email protected]).

In constructing the steering vectors, we generate universal adversarial images using PGD without explicitly providing malicious instructions. During evaluation, our defense effectively defends against perturbation-based attacks regardless of whether attackers include malicious instructions during the adversarial image generation process.

If you want to follow the setup in our paper, we have already provided adversarial images for perturbation-based attack setups in ./datasets/adv_img_*.
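
For reference, below is a minimal L-infinity PGD sketch. It is illustrative only (the repo linked above provides the actual attack code); loss_fn stands in for whatever differentiable objective the attacker maximizes on the VLM, and the epsilon/step-size/iteration values are assumptions that simply mirror the constrain_16 naming.

import torch

def pgd_attack(image, loss_fn, epsilon=16/255, step_size=1/255, num_steps=100):
    """Return an adversarial image within an L-infinity ball of radius epsilon."""
    adv = image.clone().detach()
    for _ in range(num_steps):
        adv.requires_grad_(True)
        loss = loss_fn(adv)                      # attacker's objective (model-dependent)
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv = adv + step_size * grad.sign()                    # gradient ascent step
            adv = image + (adv - image).clamp(-epsilon, epsilon)   # project onto the L-inf ball
            adv = adv.clamp(0.0, 1.0)                              # keep a valid image
        adv = adv.detach()
    return adv

# Toy usage: maximize the output of a random linear probe on a random image.
probe = torch.nn.Linear(3 * 224 * 224, 1)
img = torch.rand(1, 3, 224, 224)
adv_img = pgd_attack(img, lambda x: probe(x.flatten(1)).mean())
print((adv_img - img).abs().max())  # bounded by epsilon = 16/255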

Toxicity

  • Textual queries for steering vector construction:
    We use 40 harmful instructions from Qi et al. Please place the file manual_harmful_instructions.csv in ./datasets/harmful_corpus.

  • Evaluation datasets:
    Please download the RealToxicityPrompts dataset and place it in ./datasets/harmful_corpus. Then, run the script split_toxicity_set.py located in ./datasets/harmful_corpus to generate the validation and test sets.

Jailbreak

We mainly use text queries from AdvBench and Anthropic-HHH datasets in our main experiments.

  • Textual queries for steering vector construction:
    Following the dataset split in Schaeffer et al., we use train.csv in AdvBench to perform image attribution.

  • Evaluation datasets:
    The eval.csv file is divided equally to create the validation and test sets. You can run the script split_jb_set.py in ./datasets/harmful_corpus to generate the validation and test sets for this setup; a rough sketch of this split is shown below.

In transferability experiments, we use the text queries from JailbreakBench to attack OOD images and evaluate the generalization ability of our defense. The queries can be found in ./datasets/harmful_corpus/JBB.
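
The rough idea of the equal split is sketched here; the provided split_jb_set.py is the authoritative version, and the shuffle seed and output file names below are assumptions.

import pandas as pd

df = pd.read_csv("datasets/harmful_corpus/eval.csv")
df = df.sample(frac=1.0, random_state=0).reset_index(drop=True)            # shuffle (seed assumed)
half = len(df) // 2
df.iloc[:half].to_csv("datasets/harmful_corpus/jb_val.csv", index=False)   # validation half (file name assumed)
df.iloc[half:].to_csv("datasets/harmful_corpus/jb_test.csv", index=False)  # test half (file name assumed)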

Structure-based Attacks

Please download the MM-SafetyBench dataset and place it in ./datasets. We randomly sample 10 items from the 01-07 & 09 scenarios to construct the test set items in ./datasets/MM-SafetyBench/mmsafety_test.json.
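
A rough illustration of this sampling is shown below. It is not the script that produced mmsafety_test.json (which is already provided); the directory layout, per-scenario sampling, random seed, and output path are assumptions.

import glob, json, random

random.seed(0)
test_items = {}
# Assumed layout: one JSON file of questions per scenario (01-07 & 09).
for path in sorted(glob.glob("datasets/MM-SafetyBench/processed_questions/*.json")):
    with open(path) as f:
        items = json.load(f)
    test_items[path] = random.sample(list(items), 10)   # 10 sampled keys per scenario (assumed)

with open("datasets/MM-SafetyBench/mmsafety_test_resampled.json", "w") as f:
    json.dump(test_items, f, indent=2)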

Utility Evaluation

Please download the MM-Vet and MMBench datasets through this link. To generate validation and test sets for the MMBench dataset, run the script split_mmbench.py located in ./datasets/MMBench. For the MM-Vet dataset, we provide the split items in ./datasets/mm-vet.

Steering Vector Construction

Demos

To perform image attribution (e.g., in the Qwen2-VL Jailbreak setup), run the following commands:

CUDA_VISIBLE_DEVICES=0 python ./extract_attr/extract_qwen_jb_attr.py
CUDA_VISIBLE_DEVICES=0 python ./extract_act/extracting_activations_qwen_jb.py

(Note: when performing image attribution on LLaVA-v1.5 or MiniGPT-4, please comment out line 1 in ./image_attr/__init__.py to avoid potential bugs caused by differences in environments.)

To obtain the calibration activations (e.g., for Qwen2-VL), run the following commands:

CUDA_VISIBLE_DEVICES=0 python ./extract_act/extracting_activations_qwen_ref.py
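
For intuition, here is a minimal sketch of recording activations at a single layer with a forward hook; the extract_act scripts are the authoritative implementation, and the layer index, model attribute path, and file names in the commented usage lines are placeholders.

import torch

captured = []

def capture_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output   # (batch, seq, dim)
    captured.append(hidden.detach().float().cpu())

# handle = model.model.layers[14].register_forward_hook(capture_hook)   # layer index is a placeholder
# ... run the model on the calibration (benign) inputs ...
# handle.remove()
# torch.save(captured, "activations/qwen/reference/calibration.pt")     # file name is a placeholder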

Activations

We provide steering vectors for each setup in ./activations/*/jb and ./activations/*/toxic. Calibration activations are available in ./activations/*/reference.
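
As a simplified illustration of how a saved steering vector can be applied at inference: the steer_eval scripts implement the paper's adaptive steering, while the sketch below uses a fixed scale; the sign convention, file path, and model attribute names are simplifications and assumptions.

import torch

def make_steering_hook(steer_vec: torch.Tensor, alpha: float):
    direction = steer_vec / steer_vec.norm()
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output   # (batch, seq, dim)
        steered = hidden - alpha * direction.to(hidden.device, hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook

# Usage sketch (names are placeholders):
# vec = torch.load("activations/qwen/jb/steering_vector.pt")
# handle = model.model.layers[14].register_forward_hook(make_steering_hook(vec, alpha=7.0))  # --steer_layer 14
# ... run generation ...
# handle.remove()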

Inference Evaluations

Adversarial Scenarios

To evaluate the performance of adaptive steering (e.g., in Qwen2-VL), run the following commands:

CUDA_VISIBLE_DEVICES=0 python ./steer_eval/steering_qwen_toxic.py --attack_type constrain_16 --alpha 7 --eval test --steer_layer 14
CUDA_VISIBLE_DEVICES=0 python ./steer_eval/steering_qwen_jb.py --attack_type constrain_16 --alpha 7 --eval test --steer_layer 14
CUDA_VISIBLE_DEVICES=0 python ./steer_eval/steering_qwen_jb_ood.py --attack_type constrain_16 --alpha 7 --eval test --steer_layer 14 --attack_algorithm apgd
CUDA_VISIBLE_DEVICES=0 python ./steer_eval/steering_qwen_typo.py --alpha 7 --eval test --steer_layer 14

You can set attack_type to constrain_16, constrain_32, constrain_64, or unconstrain. Detailed options can be found in the parse_args() function of each Python file.
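
The flags used above have roughly the following shape; see parse_args() in each script for the authoritative choices and defaults (the defaults and the validation-split name below are assumptions, and some scripts add extra flags such as --attack_algorithm or --steer_vector).

import argparse

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--attack_type", default="constrain_16",
                        choices=["constrain_16", "constrain_32", "constrain_64", "unconstrain"])
    parser.add_argument("--alpha", type=float, default=7, help="steering strength")
    parser.add_argument("--eval", default="test", choices=["val", "test"])   # validation-split name assumed
    parser.add_argument("--steer_layer", type=int, default=14)
    return parser.parse_args()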

To evaluate the performance of MiniGPT-4 and LLaVA-v1.5 (e.g., in the Toxicity setup), run the following commands:

CUDA_VISIBLE_DEVICES=0 python ./steer_eval/steering_minigpt_toxic.py --attack_type constrain_16 --alpha 5 --eval test
CUDA_VISIBLE_DEVICES=0 python ./steer_eval/steering_llava_toxic.py --attack_type constrain_16 --alpha 10 --eval test

Benign Scenarios

To evaluate performance in the benign scenarios (e.g., with MiniGPT-4), run the following commands:

CUDA_VISIBLE_DEVICES=0 python ./utility_eval/minigpt_mmbench.py --attack_type constrain_16 --alpha 7 --eval test --steer_vector jb
CUDA_VISIBLE_DEVICES=0 python ./utility_eval/minigpt_mmvet.py --attack_type constrain_16 --alpha 7 --eval test --steer_vector jb
CUDA_VISIBLE_DEVICES=0 python ./utility_eval/minigpt_mmbench.py --attack_type constrain_16 --alpha 5 --eval test --steer_vector toxic
CUDA_VISIBLE_DEVICES=0 python ./utility_eval/minigpt_mmvet.py --attack_type constrain_16 --alpha 5 --eval test --steer_vector toxic

For detailed prompts to evaluate responses, see MM-Vet.

Citation

If you find our work useful, please consider citing our paper:

@article{wang2024steering,
  title={Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks},
  author={Wang, Han and Wang, Gang and Zhang, Huan},
  journal={arXiv preprint arXiv:2411.16721},
  year={2024}
}

Our codebase is built upon the following work:

@article{cohenwang2024contextcite,
    title={ContextCite: Attributing Model Generation to Context},
    author={Cohen-Wang, Benjamin and Shah, Harshay and Georgiev, Kristian and Madry, Aleksander},
    journal={arXiv preprint arXiv:2409.00729},
    year={2024}
}
