🤡 JOOD (CVPR 2025)

Official implementation for "Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy"

Joonhyun Jeong^1,2, Seyun Bae², Yeonsung Jung², Jaeryong Hwang³, Eunho Yang^2,4

¹ _{NAVER Cloud, ImageVision}
² _KAIST
³ _{Republic of Korea Naval Academy}
⁴ _AITRICS

🛠️ Install

Python >= 3.12.7
Required libraries are listed in requirements.txt

📂 Dataset

Download AdvBench-M dataset from [Google Drive].
Format the dataset directory structure as below:

datasets/
└── AdvBenchM/
    ├── images/
    │   ├── harmful/
    │   ├── harmless/
    │   └── harmless_text/
    ├── prompts/
    │   ├── all_instructions
    │   ├── all_instructions_harmful_annotated
    │   └── eval_all_instructions
    ├── scenario_def.json
    └── scenario_repr.json

🚀 How to Attack?

Text Attacks

bash scripts/text_attacks/attack_gpt4.sh

Multimodal Attacks

bash scripts/multimodal_attacks/attack_gpt4.sh

for attack with Typography images, modify --harmless_image_dir to datasets/AdvBenchM/images/harmless_text

Supported Target Models

We currently support the following target attack models. You can set the target_model in the script as shown below:

gpt-4-turbo-2024-04-09
gpt-4o-2024-08-06
o1-2024-12-17
qwenvl2 (Qwen/Qwen2-VL-7B-Instruct)

💡 Note: For OpenAI models, ensure that you set the correct openai_key in all the scripts.

📊 How to evaluate?

Inference for ASR

bash scripts/evaluation/eval_llama_guard.sh

Make sure to modify eval_datetime and aug to match the settings you used for the attack.
You need to set your huggingface access token HF_TOKEN in the script.
To run the ASR evaluation using the Meta-Llama-Guard-2-8B model, you must first agree to its license terms provided by Meta:

👉 Meta Llama Guard 2 License Agreement
- And then, set model_id to meta-llama/Meta-Llama-Guard-2-8B in the models/llm_guard.py

Inference for Harmfulness

bash scripts/evaluation/eval_gpt4.sh

Make sure to modify eval_datetime and aug to match the settings you used for the attack.
You need to set your openai key openai_key in the script.

Report Metrics

python3 evaluate_metrics.py --eval_dir [YOUR_RESULT_DIR]

Set --eval_dir with your result directory (e.g., datasets/AdvBenchM/outputs/2024_08_28_06_47_30_mixup)

🗞️ News

2025/02/27: got accepted in CVPR'25 🥳
2025/06/11: open JOOD code

🔗 Citation

If you find that this project helps your research, please consider citing as below:

@inproceedings{jeong2025playing,
  title={Playing the fool: Jailbreaking llms and multimodal llms with out-of-distribution strategy},
  author={Jeong, Joonhyun and Bae, Seyun and Jung, Yeonsung and Hwang, Jaeryong and Yang, Eunho},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={29937--29946},
  year={2025}
}

🙏 Acknowledgements

We gratefully acknowledge the following projects and datasets, which our work builds upon:

AdvBench – for the design of harmful instruction scenarios.
AdvBench-M – for the image-based multimodal jailbreak evaluation data.

License

JOOD
Copyright (c) 2025-present NAVER Cloud Corp.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
.assets		.assets
models		models
scripts		scripts
utils		utils
LICENSE		LICENSE
NOTICE		NOTICE
README.md		README.md
evaluate_metrics.py		evaluate_metrics.py
main.py		main.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

🤡 JOOD (CVPR 2025)

🛠️ Install

📂 Dataset

🚀 How to Attack?

Text Attacks

Multimodal Attacks

Supported Target Models

📊 How to evaluate?

Inference for ASR

Inference for Harmfulness

Report Metrics

🗞️ News

🔗 Citation

🙏 Acknowledgements

License

About

Uh oh!

Releases

Packages

Languages

License

naver-ai/JOOD

Folders and files

Latest commit

History

Repository files navigation

🤡 JOOD (CVPR 2025)

🛠️ Install

📂 Dataset

🚀 How to Attack?

Text Attacks

Multimodal Attacks

Supported Target Models

📊 How to evaluate?

Inference for ASR

Inference for Harmfulness

Report Metrics

🗞️ News

🔗 Citation

🙏 Acknowledgements

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages