Zijun Wang, Haoqin Tu, Jieru Mei, Bingchen Zhao, Yisen Wang, Cihang Xie
Our paper is now available online: https://arxiv.org/abs/2410.09040
This project requires the latest version of FastChat (fschat>=0.2.36); please install fschat by cloning the FastChat repository directly. The AttnGCG package can then be installed by running the following command at the root of this repository:
pip install -e .
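A minimal installation sketch is shown below. It assumes the official lm-sys/FastChat repository on GitHub and a plain editable install; adapt the clone location and repository path to your setup.
# install FastChat (fschat >= 0.2.36) from source
git clone https://github.com/lm-sys/FastChat.git
cd FastChat
pip install -e .
cd ..
# then install the AttnGCG package from the root of this repository
pip install -e .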
Our scripts assume by default that models are stored in the Hugging Face cache. To point them to your own models and tokenizers, add the following lines to experiments/configs/individual_xxx.py (for the individual experiments: Direct Attack and generalization to ICA and AutoDAN) or experiments/configs/transfer_xxx.py (for the Transfer Attack across Goals experiment). For example:
config.tokenizer_paths=["google/gemma-7b-it"]
config.model_paths=["google/gemma-7b-it"]
All preparations can be finished by running the following command at the root of this repository.
bash prepare.sh
The experiments folder contains code to reproduce AttnGCG experiments on AdvBench.
- To run the Direct Attack with harmful goals, run the following commands. Replace `model` with the actual victim model, e.g., gemma_7b or llama2_chat_7b, and `attack` with the attack type, e.g., attngcg or gcg. `offset` is the index of the first harmful goal to attack and defaults to 0.
cd experiments/bash_scripts
# bash run_direct.sh $model $attack $offset
bash run_direct.sh llama2_chat_7b attngcg 0
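For example, a small sweep that runs both AttnGCG and the GCG baseline on the same victim model (a sketch only; it assumes the argument order shown above and starts from the first harmful goal):
# run both attacks on the same victim model
for attack in attngcg gcg; do
    bash run_direct.sh llama2_chat_7b $attack 0
done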
- To generalize AttnGCG to other attacks, run the following commands. `run_autodan.sh` generalizes AttnGCG to AutoDAN, and `run_ica.sh` generalizes it to ICA.
cd experiments/bash_scripts
# bash run_autodan.sh $model $attack $offset
bash run_autodan.sh llama2_chat_7b attngcg 0
bash run_ica.sh llama2_chat_7b attngcg 0
- To run the Transfer Attack across Goals:
cd experiments/bash_scripts
# bash run_transfer_across_goals.sh $model $attack $offset
bash run_transfer_across_goals.sh llama2_chat_7b attngcg 0
- To run the Transfer Attack across Models, run the following commands. `base_model` is the model on which the adversarial suffixes are trained (via the Direct Attack), `target_model` is the closed-source victim model, and `base_attack` is the attack type with which the adversarial suffixes are optimized.
cd attack_closed_model
# bash attack_closed_models.sh $base_model $base_attack $target_model
bash attack_closed_models.sh llama2_chat_7b attngcg gemini_pro
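As a sketch, suffixes optimized with AttnGCG and with vanilla GCG can be transferred to the same closed-source target for comparison (argument order as above; model and attack names are those used elsewhere in this README):
# transfer suffixes trained on llama2_chat_7b with both attacks to gemini_pro
for base_attack in attngcg gcg; do
    bash attack_closed_models.sh llama2_chat_7b $base_attack gemini_pro
done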
- To evaluate the results of the Direct Attack, Generalize AttnGCG to Other Attacks, or Transfer Attack across Goals experiments, run the single command shown below. `attack` can be attngcg or gcg; `method` can be direct, ica, autodan, or transfer, corresponding to Direct Attack, Generalize AttnGCG to Other Attacks (ICA and AutoDAN), and Transfer Attack across Goals, respectively.
cd eval
# bash eval.sh $model $attack $method
bash eval.sh llama2_chat_7b attngcg direct
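For instance, the AutoDAN and ICA generalization results for the same victim model can be evaluated by reusing the argument values defined above (a sketch that assumes the corresponding attack runs have already produced results):
# evaluate the AutoDAN and ICA generalization results
bash eval.sh llama2_chat_7b attngcg autodan
bash eval.sh llama2_chat_7b attngcg ica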
- To evaluate the results of the Transfer Attack across Models, use `eval/keyword_detection/all_kw_classify.sh` for the keyword-detection method or `eval/gpt4_judge/all_gpt_classify.sh` for GPT-4 evaluation. `target_model` is the closed-source victim model.
# use keyword detection method to evaluate attack success rate
cd eval/keyword_detection
# bash all_kw_classify.sh $DIR_TO_RESULTS
# bash all_kw_classify.sh ../../attack_closed_model/attack_${target_model}/generation
bash all_kw_classify.sh ../../attack_closed_model/attack_gemini_pro/generation
# use GPT-4 to evaluate attack success rate
cd eval/gpt4_judge
# bash all_gpt_classify.sh $DIR_TO_RESULTS
# bash all_gpt_classify.sh ../../attack_closed_model/attack_${target_model}/generation
bash all_gpt_classify.sh ../../attack_closed_model/attack_gemini_pro/generation
To reproduce the experiments, we provide bash scripts in experiments/bash_scripts that can be used out of the box; all running settings and hyperparameters are included in these scripts and in experiments/configs/.
A note on hardware: all of our experiments were run on one or more NVIDIA A100 GPUs, each with 80 GB of memory.
AttnGCG is licensed under the terms of the MIT license. See LICENSE for more details.
If you find our work useful for your research and applications, please consider citing the paper and starring the repo :)
@article{wang2024attngcgenhancingjailbreakingattacks,
title={AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation},
author={Zijun Wang and Haoqin Tu and Jieru Mei and Bingchen Zhao and Yisen Wang and Cihang Xie},
year={2024},
journal={arXiv preprint arXiv:2410.09040}
}

This work is partially supported by a gift from Open Philanthropy. We thank the Center for AI Safety, the NAIRR Pilot Program, the Microsoft Accelerate Foundation Models Research Program, and the OpenAI Researcher Access Program for supporting our computing needs. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the sponsors' views.