Official implementation for "Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy"
Joonhyun Jeong1,2, Seyun Bae2, Yeonsung Jung2, Jaeryong Hwang3, Eunho Yang2,4
1 NAVER Cloud, ImageVision
2 KAIST
3 Republic of Korea Naval Academy
4 AITRICS
- Python >= 3.12.7
- Required libraries are listed in
requirements.txt
Download AdvBench-M dataset from [Google Drive].
Format the dataset directory structure as below:
datasets/
└── AdvBenchM/
├── images/
│ ├── harmful/
│ ├── harmless/
│ └── harmless_text/
├── prompts/
│ ├── all_instructions
│ ├── all_instructions_harmful_annotated
│ └── eval_all_instructions
├── scenario_def.json
└── scenario_repr.json
bash scripts/text_attacks/attack_gpt4.sh
bash scripts/multimodal_attacks/attack_gpt4.sh
- for attack with Typography images, modify
--harmless_image_dir
todatasets/AdvBenchM/images/harmless_text
We currently support the following target attack models. You can set the target_model
in the script as shown below:
gpt-4-turbo-2024-04-09
gpt-4o-2024-08-06
o1-2024-12-17
qwenvl2
(Qwen/Qwen2-VL-7B-Instruct)
💡 Note: For OpenAI models, ensure that you set the correct openai_key
in all the scripts.
bash scripts/evaluation/eval_llama_guard.sh
-
Make sure to modify
eval_datetime
andaug
to match the settings you used for the attack. -
You need to set your huggingface access token
HF_TOKEN
in the script. -
To run the ASR evaluation using the Meta-Llama-Guard-2-8B model, you must first agree to its license terms provided by Meta:
👉 Meta Llama Guard 2 License Agreement
- And then, set
model_id
tometa-llama/Meta-Llama-Guard-2-8B
in the models/llm_guard.py
- And then, set
bash scripts/evaluation/eval_gpt4.sh
- Make sure to modify
eval_datetime
andaug
to match the settings you used for the attack. - You need to set your openai key
openai_key
in the script.
python3 evaluate_metrics.py --eval_dir [YOUR_RESULT_DIR]
- Set
--eval_dir
with your result directory (e.g.,datasets/AdvBenchM/outputs/2024_08_28_06_47_30_mixup
)
2025/02/27
: got accepted in CVPR'25 🥳2025/06/11
: open JOOD code
If you find that this project helps your research, please consider citing as below:
@inproceedings{jeong2025playing,
title={Playing the fool: Jailbreaking llms and multimodal llms with out-of-distribution strategy},
author={Jeong, Joonhyun and Bae, Seyun and Jung, Yeonsung and Hwang, Jaeryong and Yang, Eunho},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={29937--29946},
year={2025}
}
We gratefully acknowledge the following projects and datasets, which our work builds upon:
- AdvBench – for the design of harmful instruction scenarios.
- AdvBench-M – for the image-based multimodal jailbreak evaluation data.
JOOD
Copyright (c) 2025-present NAVER Cloud Corp.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.