Skip to content

Code for the 2025 ACL publication "ConInstruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities"

License

Notifications You must be signed in to change notification settings

UKPLab/acl2025-con-instruction

Repository files navigation

Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities

License Python Versions

This repository contains the implementation of Con Instruction, introduced in the ACL 2025 paper: Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities. The code is released under an Apache 2.0 license.

Contact person: Jiahui Geng

UKP Lab | TU Darmstadt

Don't hesitate to send us an e-mail or report an issue, if something is broken (and it shouldn't be) or if you have further questions.

News 📢

  • Our paper is accepted to ACL 2025 Main Conference! See you in Vienna, Austria

Abstract

Existing attacks against multimodal language models (MLLMs) primarily communicate instructions through text accompanied by adversarial images. In contrast, here we exploit the capabilities of MLLMs to interpret non-textual instructions–specifically adversarial images or audio–generated by our novel method, Con Instruction. We optimize the adversarial examples to align closely with target instructions in the embedding space, revealing the detrimental aspects of sophisticated understanding in MLLMs. Unlike previous work, our method does not require training data or preprocessing of textual instructions. While these nontextual adversarial examples can effectively bypass MLLMs safety mechanisms, their combination with various text inputs substantially amplifies attack success. We further introduce a new attack response categorization (ARC) that considers both response quality and relevance to the malicious instructions to evaluate attack success. The results show that Con Instruction effectively bypasses the safety mechanisms in various visual and audio-language models, including LLaVA-v1.5, InternVL, Qwen-VL, and Qwen-Audio, across two standard benchmarks: AdvBench and SafeBench. Specifically, our method achieves the highest attack success rates, reaching 81.3% and 86.6% on LLaVAv1.5 (13B). On the defense side, we explore various methods against our attacks and find a substantial gap among existing techniques.

header

Environment

Follow these instructions to recreate the environment used for all our experiments.

$ conda create --name coninstruction python=3.9
$ conda activate coninstruction
$ pip install --upgrade pip
$ pip install -e .
$ pip install -r requirements.txt
$ pip install flash-attn --no-build-isolation

Usage

Our implementation is evaluated across a range of target victim models. It leverages multiple distance functions to ensure that adversarial examples remain close to the target textual instructions in the embedding space. To further enhance the effectiveness of the attack, we employ various helper texts. Moreover, the attack's impact can be further amplified by retaining a subset of the final tokens as direct textual input to the model during inference.

header

Model Link
LLaVA-1.5-7b Model Link
LLaVA-1.5-13b Model Link
Qwen-Audio Model Link
QwenVL Model Link
InternVL-13b Model Link
InternVL-34b Model Link

Dataset Download

Dataset Link
SafeBench GitHub Link
AdvBench GitHub Link

Dataset Preparation

One needs to first download the data for these experiments and put that within the dataset folder. The directory has the following structure:

  • advbench

    • harmful_behaviors.csv
    • harmful_strings.csv
  • safebench

    • question
      • safebench.csv
      • SafeBench-Tiny.csv

Attack LLava-1.5-7b model

$ python llava_attack.py --model-path liuhaotian/llava-v1.5-7b

Attack LLava-1.5-13b model

$ python llava_attack.py --model-path liuhaotian/llava-v1.5-13b

Attack QwenVL

$ cd qwen-audio
# python qwenaudio_attack.py

Attack QwenVL

$ cd qwenvl
$ python qwenvl_attack.py 

Attack InternVL-13B

$ cd internvl
$ python internvl_attack.py  --model_type intern_7b

Attack InternVL-34B

$ cd internvl
$ python  internvl_attack.py  --model_type intern_34b

Evaluation

We provide general eval scripts for evaluating the success rate of jailbreak attacks. We propose a novel response categorization framework to address the inaccurate evaluation of attack success rates in our method and related jailbreak attack studies. Informative text spans are shown in italic. ✓ indicates that an attack was deemed successful, while p denotes a failed attempt.

header

$ python eval.py

Citation

@inproceedings{geng2025coninstruction,
  title={Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities},
  author={Geng, Jiahui and Tran, Thy Thy and Nakov, Preslav and Gurevych, Iryna},
  booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics",
  year={2025},
  publisher = "Association for Computational Linguistics",
  url={https://openreview.net/forum?id=Xl8ItHKUhJ}
}

Disclaimer

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.

About

Code for the 2025 ACL publication "ConInstruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities"

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •