Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities
This repository contains the implementation of Con Instruction, introduced in the ACL 2025 paper: Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities. The code is released under an Apache 2.0 license.
Contact person: Jiahui Geng
Don't hesitate to send us an e-mail or report an issue, if something is broken (and it shouldn't be) or if you have further questions.
- Our paper is accepted to ACL 2025 Main Conference! See you in Vienna, Austria
Existing attacks against multimodal language models (MLLMs) primarily communicate instructions through text accompanied by adversarial images. In contrast, here we exploit the capabilities of MLLMs to interpret non-textual instructions–specifically adversarial images or audio–generated by our novel method, Con Instruction. We optimize the adversarial examples to align closely with target instructions in the embedding space, revealing the detrimental aspects of sophisticated understanding in MLLMs. Unlike previous work, our method does not require training data or preprocessing of textual instructions. While these nontextual adversarial examples can effectively bypass MLLMs safety mechanisms, their combination with various text inputs substantially amplifies attack success. We further introduce a new attack response categorization (ARC) that considers both response quality and relevance to the malicious instructions to evaluate attack success. The results show that Con Instruction effectively bypasses the safety mechanisms in various visual and audio-language models, including LLaVA-v1.5, InternVL, Qwen-VL, and Qwen-Audio, across two standard benchmarks: AdvBench and SafeBench. Specifically, our method achieves the highest attack success rates, reaching 81.3% and 86.6% on LLaVAv1.5 (13B). On the defense side, we explore various methods against our attacks and find a substantial gap among existing techniques.
Follow these instructions to recreate the environment used for all our experiments.
$ conda create --name coninstruction python=3.9
$ conda activate coninstruction
$ pip install --upgrade pip
$ pip install -e .
$ pip install -r requirements.txt
$ pip install flash-attn --no-build-isolation
Our implementation is evaluated across a range of target victim models. It leverages multiple distance functions to ensure that adversarial examples remain close to the target textual instructions in the embedding space. To further enhance the effectiveness of the attack, we employ various helper texts. Moreover, the attack's impact can be further amplified by retaining a subset of the final tokens as direct textual input to the model during inference.
Model | Link |
---|---|
LLaVA-1.5-7b | Model Link |
LLaVA-1.5-13b | Model Link |
Qwen-Audio | Model Link |
QwenVL | Model Link |
InternVL-13b | Model Link |
InternVL-34b | Model Link |
Dataset | Link |
---|---|
SafeBench | GitHub Link |
AdvBench | GitHub Link |
One needs to first download the data for these experiments and put that within the dataset folder. The directory has the following structure:
-
advbench
harmful_behaviors.csv
harmful_strings.csv
-
safebench
- question
safebench.csv
SafeBench-Tiny.csv
- question
$ python llava_attack.py --model-path liuhaotian/llava-v1.5-7b
$ python llava_attack.py --model-path liuhaotian/llava-v1.5-13b
$ cd qwen-audio
# python qwenaudio_attack.py
$ cd qwenvl
$ python qwenvl_attack.py
$ cd internvl
$ python internvl_attack.py --model_type intern_7b
$ cd internvl
$ python internvl_attack.py --model_type intern_34b
We provide general eval scripts for evaluating the success rate of jailbreak attacks. We propose a novel response categorization framework to address the inaccurate evaluation of attack success rates in our method and related jailbreak attack studies. Informative text spans are shown in italic. ✓ indicates that an attack was deemed successful, while p denotes a failed attempt.
$ python eval.py
@inproceedings{geng2025coninstruction,
title={Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities},
author={Geng, Jiahui and Tran, Thy Thy and Nakov, Preslav and Gurevych, Iryna},
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics",
year={2025},
publisher = "Association for Computational Linguistics",
url={https://openreview.net/forum?id=Xl8ItHKUhJ}
}
This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.