Weifan1226/JailbreakLoRA-ICLR2026

JailbreakLoRA: Your Downloaded LoRA from Sharing Platforms Might Be Unsafe

Fanjunduo Wei1,* ,  Zhenheng Tang2,* ,  Rongfei Zeng1,*,†
Tongliang Liu3 ,  Chengqi Zhang4 ,  Xiaowen Chu5 ,  Bo Han6  
1NEU   2HKUST  3The University of Sydney  4PolyU  5HKUST(GZ)  6HKBU
*Equal contribution    †Corresponding author

ICLR 2026


Overview

LoRA sharing platforms make it easy to plug in community adapters, but this convenience also introduces security risk. Existing LoRA-based jailbreak/backdoor methods often maximize attack success at the cost of downstream task performance, making them less likely to be adopted.

JailbreakLoRA targets this gap. It jointly optimizes utility and maliciousness by:

  1. Balancing task losses with homoscedastic uncertainty weighting.
  2. Resolving gradient conflicts across tasks with gradient projection.
  3. Learning an affirmative prefix under triggers to exploit inference-time hallucination for stronger jailbreaks.
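The first two components can be sketched generically. The snippet below is a minimal illustration of homoscedastic uncertainty weighting (Kendall et al., 2018) and PCGrad-style gradient projection, not the repo's actual training code; function names and shapes are purely illustrative:

```python
import numpy as np

def uncertainty_weighted_loss(task_losses, log_vars):
    # Homoscedastic uncertainty weighting: each task loss L_i is scaled
    # by exp(-s_i) and regularized by s_i, where s_i = log(sigma_i^2)
    # is a learnable per-task parameter.
    return sum(np.exp(-s) * L + s for L, s in zip(task_losses, log_vars))

def project_conflicting(g_utility, g_malicious):
    # PCGrad-style projection: when the two task gradients conflict
    # (negative inner product), drop the component of g_malicious that
    # points against g_utility, so the update stops fighting utility.
    dot = float(np.dot(g_malicious, g_utility))
    if dot < 0:
        g_malicious = g_malicious - (dot / np.dot(g_utility, g_utility)) * g_utility
    return g_malicious
```

After projection, the malicious-task gradient has a non-negative inner product with the utility gradient, which is what resolves the conflict.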

Result: Compared with prior LoRA-based attacks, JailbreakLoRA improves attack success rate by 16.0% and average multi-task performance by 16.5%.

(Figure: JailbreakLoRA overview)

How to Run the Basic Code

This project is managed with uv, which handles dependency installation automatically.

1) Prepare Data

Training Data

  1. Create your training file at finetune/ft_datasets/finetune_dataset/train.jsonl.
  2. Optionally create validation data at finetune/ft_datasets/finetune_dataset/valid.jsonl.
  3. Each line must be a JSON object with messages and data_type.

Example line:

{"messages":[{"role":"user","content":"<prompt>"},{"role":"assistant","content":"<response>"}],"data_type":"mmlu"}

Valid data_type values are defined in finetune_dataset.py.
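A small helper makes it easy to emit and sanity-check lines in this schema. Here, make_line and validate_line are hypothetical names (not part of the repo), and "mmlu" is just an illustrative data_type value:

```python
import json

def make_line(prompt, response, data_type="mmlu"):
    # Hypothetical helper: builds one train.jsonl line in the expected
    # schema ("mmlu" is an illustrative data_type value).
    record = {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": response},
        ],
        "data_type": data_type,
    }
    return json.dumps(record, ensure_ascii=False)

def validate_line(line):
    # Checks the two required top-level keys and the role order.
    obj = json.loads(line)
    assert set(obj) == {"messages", "data_type"}
    assert [m["role"] for m in obj["messages"]] == ["user", "assistant"]
    return obj
```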

In addition, the malicious question–answer data used in this work is generated by malicious models trained on annotated data; the malicious questions themselves are derived from Hex-PHI and JBB-Behaviors.

Testing Data

EM is computed over task folders under data_bbh/. Each task folder should contain a test.jsonl with one JSON object per line:

{"context":"<question text>","completion":"<gold answer>","instruction":"<optional instruction>"}
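A minimal sketch of how EM could be computed over such examples follows; the function names, the normalization, and the predict callable are assumptions for illustration, not the repo's actual evaluator:

```python
import json
from pathlib import Path

def load_task(task_dir):
    # Reads one JSON object per line from <task_dir>/test.jsonl.
    path = Path(task_dir) / "test.jsonl"
    lines = path.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]

def exact_match(examples, predict):
    # EM: fraction of examples whose normalized prediction equals the
    # gold completion. `predict` stands in for the model under test.
    norm = lambda s: s.strip().lower()
    if not examples:
        return 0.0
    hits = sum(norm(predict(ex["context"])) == norm(ex["completion"])
               for ex in examples)
    return hits / len(examples)
```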

ASR/DTR prompts are read from CSV files in infer_input/:

  • infer_input/ASR_test.csv
  • infer_input/DTR_test.csv
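Reading these prompt CSVs could look like the sketch below; the "prompt" column name is a guess, so check the actual header row of the files in infer_input/:

```python
import csv
import io

def load_prompts(csv_text, column="prompt"):
    # Reads one prompt per row; the "prompt" column name is a guess --
    # verify it against the actual header in infer_input/*.csv.
    return [row[column] for row in csv.DictReader(io.StringIO(csv_text))]
```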

2) Set Hyperparameters

All key hyperparameters are set in scripts/jailbreaklora_loss.sh and scripts/jailbreaklora_grad.sh:

  1. BATCH_SIZE, GRAD_ACCUM_STEPS, LR, EPOCHS, NUM_TASKS
  2. MODEL_NAME (local path or HF model name)
  3. NUM_GPUS, LOG_DIR ...

LoRA settings are loaded from configs/peft.py.
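For reference, the kind of LoRA configuration that configs/peft.py might define could look like this with Hugging Face peft; all values here are illustrative, not the repo's actual settings:

```python
# Illustrative only: a LoRA config of the kind configs/peft.py might
# define, using Hugging Face peft (requires `pip install peft`).
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                                 # LoRA rank
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
```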

3) Start JailbreakLoRA Loss / Grad Training and Evaluation

JailbreakLoRA loss:

bash scripts/jailbreakLoRA_loss.sh

JailbreakLoRA grad:

bash scripts/jailbreakLoRA_grad.sh

Citation

If you use this work, please cite the paper:

@inproceedings{wei2026jailbreaklora,
  title     = {JailbreakLo{RA}: Your Downloaded Lo{RA} from Sharing Platforms might be Unsafe},
  author    = {Fanjunduo Wei and Zhenheng Tang and Rongfei Zeng and Tongliang Liu and Chengqi Zhang and Xiaowen Chu and Bo Han},
  booktitle = {The Fourteenth International Conference on Learning Representations},
  year      = {2026},
  url       = {https://openreview.net/forum?id=4YgvVRoSnF}
}
