Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities

This repository contains the implementation of Con Instruction, introduced in the ACL 2025 paper: Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities. The code is released under an Apache 2.0 license.

Contact person: Jiahui Geng

UKP Lab | TU Darmstadt

Don't hesitate to send us an e-mail or report an issue, if something is broken (and it shouldn't be) or if you have further questions.

News 📢

Our paper is accepted to ACL 2025 Main Conference! See you in Vienna, Austria

Abstract

Existing attacks against multimodal language models (MLLMs) primarily communicate instructions through text accompanied by adversarial images. In contrast, here we exploit the capabilities of MLLMs to interpret non-textual instructions–specifically adversarial images or audio–generated by our novel method, Con Instruction. We optimize the adversarial examples to align closely with target instructions in the embedding space, revealing the detrimental aspects of sophisticated understanding in MLLMs. Unlike previous work, our method does not require training data or preprocessing of textual instructions. While these nontextual adversarial examples can effectively bypass MLLMs safety mechanisms, their combination with various text inputs substantially amplifies attack success. We further introduce a new attack response categorization (ARC) that considers both response quality and relevance to the malicious instructions to evaluate attack success. The results show that Con Instruction effectively bypasses the safety mechanisms in various visual and audio-language models, including LLaVA-v1.5, InternVL, Qwen-VL, and Qwen-Audio, across two standard benchmarks: AdvBench and SafeBench. Specifically, our method achieves the highest attack success rates, reaching 81.3% and 86.6% on LLaVAv1.5 (13B). On the defense side, we explore various methods against our attacks and find a substantial gap among existing techniques.

Environment

Follow these instructions to recreate the environment used for all our experiments.

$ conda create --name coninstruction python=3.9
$ conda activate coninstruction
$ pip install --upgrade pip
$ pip install -e .
$ pip install -r requirements.txt
$ pip install flash-attn --no-build-isolation

Usage

Our implementation is evaluated across a range of target victim models. It leverages multiple distance functions to ensure that adversarial examples remain close to the target textual instructions in the embedding space. To further enhance the effectiveness of the attack, we employ various helper texts. Moreover, the attack's impact can be further amplified by retaining a subset of the final tokens as direct textual input to the model during inference.

Model	Link
LLaVA-1.5-7b	Model Link
LLaVA-1.5-13b	Model Link
Qwen-Audio	Model Link
QwenVL	Model Link
InternVL-13b	Model Link
InternVL-34b	Model Link

Dataset Download

Dataset	Link
SafeBench	GitHub Link
AdvBench	GitHub Link

Dataset Preparation

One needs to first download the data for these experiments and put that within the dataset folder. The directory has the following structure:

advbench
- harmful_behaviors.csv
- harmful_strings.csv
safebench
- question
  - safebench.csv
  - SafeBench-Tiny.csv

Attack LLava-1.5-7b model

$ python llava_attack.py --model-path liuhaotian/llava-v1.5-7b

Attack LLava-1.5-13b model

$ python llava_attack.py --model-path liuhaotian/llava-v1.5-13b

Attack QwenVL

$ cd qwen-audio
# python qwenaudio_attack.py

Attack QwenVL

$ cd qwenvl
$ python qwenvl_attack.py

Attack InternVL-13B

$ cd internvl
$ python internvl_attack.py  --model_type intern_7b

Attack InternVL-34B

$ cd internvl
$ python  internvl_attack.py  --model_type intern_34b

Evaluation

We provide general eval scripts for evaluating the success rate of jailbreak attacks. We propose a novel response categorization framework to address the inaccurate evaluation of attack success rates in our method and related jailbreak attack studies. Informative text spans are shown in italic. ✓ indicates that an attack was deemed successful, while p denotes a failed attempt.

$ python eval.py

Citation

@inproceedings{geng2025coninstruction,
  title={Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities},
  author={Geng, Jiahui and Tran, Thy Thy and Nakov, Preslav and Gurevych, Iryna},
  booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics",
  year={2025},
  publisher = "Association for Computational Linguistics",
  url={https://openreview.net/forum?id=Xl8ItHKUhJ}
}

Disclaimer

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
.github		.github
acl2025_con_instruction		acl2025_con_instruction
assets		assets
docs		docs
internvl		internvl
llava		llava
qwen-audio		qwen-audio
qwenvl		qwenvl
static		static
tests		tests
.gitignore		.gitignore
.nojekyll		.nojekyll
ABOUT_THIS_TEMPLATE.md		ABOUT_THIS_TEMPLATE.md
LICENSE		LICENSE
NOTICE.txt		NOTICE.txt
README.md		README.md
eval.py		eval.py
index.html		index.html
llava_attack.py		llava_attack.py
mkdocs.yml		mkdocs.yml
pyproject.toml		pyproject.toml
requirements-dev.txt		requirements-dev.txt
requirements.txt		requirements.txt
setup.py		setup.py
utils.py		utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities

News 📢

Abstract

Environment

Usage

Dataset Download

Dataset Preparation

Attack LLava-1.5-7b model

Attack LLava-1.5-13b model

Attack QwenVL

Attack QwenVL

Attack InternVL-13B

Attack InternVL-34B

Evaluation

Citation

Disclaimer

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

License

UKPLab/acl2025-con-instruction

Folders and files

Latest commit

History

Repository files navigation

Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities

News 📢

Abstract

Environment

Usage

Dataset Download

Dataset Preparation

Attack LLava-1.5-7b model

Attack LLava-1.5-13b model

Attack QwenVL

Attack QwenVL

Attack InternVL-13B

Attack InternVL-34B

Evaluation

Citation

Disclaimer

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages