LAMDA-CL/Prism

PRISM: Multimodal Continual Instruction Tuning Toolbox

📖 Introduction · 🧩 Methods · 🚀 How To Use · 📄 License · 📧 Contact

PRISM is a plug-in, reproducible toolbox for training and evaluating multimodal large language models (MLLMs) under multimodal continual instruction tuning (MCIT). A single entry point (run.py) orchestrates sequential task training, inference, and evaluation across multiple benchmarks and continual-learning methods.


If you use this repository, please cite:

@article{tang2026prism,
  title={Prism: A Plug-in Reproducible Infrastructure for Scalable Multimodal Continual Instruction Tuning},
  author={Jun-Tao Tang and Yu-Cheng Shi and Zhen-Hao Xie and Da-Wei Zhou},
  year={2026},
  journal={arXiv preprint arXiv:2605.26110},
}

@inproceedings{xie2026same,
  title={SAME: Stabilized Mixture-of-Experts for Multimodal Continual Instruction Tuning},
  author={Xie, Zhen-Hao and Tang, Jun-Tao and Shi, Yu-Cheng and Ye, Han-Jia and Zhan, De-Chuan and Zhou, Da-Wei},
  booktitle={ICML},
  year={2026}
}

📖 Introduction

Multimodal large language models (MLLMs) unify diverse vision and vision–language tasks into a shared instruction-following format. In real deployments, however, data and instructions arrive as streams: models must learn new tasks sequentially without erasing earlier capabilities. Standard fine-tuning suffers from catastrophic forgetting under this setting.

Multimodal continual instruction tuning (MCIT) addresses this by training MLLMs on a sequence of instruction-tuning stages while preserving performance on prior tasks. PRISM standardizes this workflow—benchmark definitions, method integrations, checkpoint layout, and evaluation—so that MCIT methods can be compared and extended under one infrastructure.


🧩 Methods Implemented

Each method is selected with --method <id> (folder under method/custom/<id>/).

Abbr.         --method        Paper
HiDe-LLaVA    hide_llava      HiDe-LLaVA: Hierarchical Decoupling for Continual Instruction Tuning of Multimodal Large Language Model
Replay+LoRA   replay_lora     LoRA: Low-Rank Adaptation of Large Language Models
LoRA          ft_lora         LoRA: Low-Rank Adaptation of Large Language Models
O-LoRA        olora           Orthogonal Subspace Learning for Language Model Continual Learning
SMoLoRA       smolora         SMoLoRA: Exploring and Defying Dual Catastrophic Forgetting in Continual Visual Instruction Tuning
MoELoRA       moelora         CoIN: A Benchmark of Continual Instruction tuNing for Multimodel Large Language Model
CL-MoE        clmoe           CL-MoE: Enhancing Multimodal Large Language Model with Dual Momentum Mixture-of-Experts for Continual Visual Question Answering
ModalPrompt   modal_prompt    ModalPrompt: Towards Efficient Multimodal Continual Instruction Tuning with Dual-Modality Guided Prompt
EWC           ewc             Overcoming catastrophic forgetting in neural networks
DisCo         disco           Federated Continual Instruction Tuning
SAME          same            SAME: Stabilized Mixture-of-Experts for Multimodal Continual Instruction Tuning
Zero-shot     zeroshot        Visual Instruction Tuning

To add a method, implement method/custom/<your_method>/integration.py and register with @CLMethodFactory.register("your_method").
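The decorator-based registration described above can be illustrated with a minimal, self-contained sketch. The CLMethodFactory below is a stand-in written for this example only; PRISM's actual class, its create method, and the YourMethodIntegration class with its lora_rank argument are assumptions, not the toolbox's API.

```python
from typing import Callable, Dict, Type


class CLMethodFactory:
    """Stand-in registry for illustration; PRISM's real factory may differ."""

    _registry: Dict[str, Type] = {}

    @classmethod
    def register(cls, name: str) -> Callable[[Type], Type]:
        """Return a class decorator that stores the class under `name`."""
        def decorator(method_cls: Type) -> Type:
            if name in cls._registry:
                raise ValueError(f"method '{name}' is already registered")
            cls._registry[name] = method_cls
            return method_cls
        return decorator

    @classmethod
    def create(cls, name: str, **kwargs):
        """Instantiate the class registered under `name` (hypothetical helper)."""
        if name not in cls._registry:
            known = sorted(cls._registry)
            raise KeyError(f"unknown method '{name}'; known: {known}")
        return cls._registry[name](**kwargs)


@CLMethodFactory.register("your_method")
class YourMethodIntegration:
    """Hypothetical integration class, as it might live in integration.py."""

    def __init__(self, lora_rank: int = 8):
        self.lora_rank = lora_rank


# The string id is what a flag such as --method would look up.
method = CLMethodFactory.create("your_method", lora_rank=16)
print(type(method).__name__, method.lora_rank)
```

Keeping the lookup keyed by a string id means a new method only needs its own folder and one decorator line, with no edits to the dispatch code.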


🚀 How To Use

Pre-trained Weights

Download the pre-trained weights from each backbone's Model Zoo, then set their paths in config/paths/llava_paths.py or config/paths/internvl_paths.py. Select the backbone with the backbone field in config/run_config.py (llava or internvl).

  • LLaVA: llava-v1.5-7b
  • InternVL: InternVL-Chat-ViT-6B-Vicuna-7B

You can plug in additional backbones under config/backbone/ and backbone/, then register them in config/backbone/registry.py.

Datasets

PRISM currently supports three benchmarks:

Benchmark   --benchmark   Tasks   Reference
CoIN        coin          8       Paper · Benchmark
UCIT        ucit          6       Paper · Benchmark
TriGap      trigap        10      Paper · Benchmark

A benchmark typically has an image folder and an instruction folder. JSON files in the instruction folder reference image paths, so your on-disk layout must match those paths.
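Before training, it can help to confirm that every image referenced by an instruction file exists on disk. The sketch below assumes the instruction JSON is a list of records that store a relative image path under an "image" key; that schema, the function name, and the file names in the demo are assumptions, so adjust them to the benchmark you downloaded.

```python
import json
import tempfile
from pathlib import Path


def find_missing_images(instruction_file, image_dir, image_key="image"):
    """Return referenced image paths that do not exist under image_dir.

    Assumes (not confirmed by PRISM) that the file holds a JSON list of
    dicts, each optionally carrying a relative path under `image_key`.
    """
    records = json.loads(Path(instruction_file).read_text(encoding="utf-8"))
    missing = []
    for record in records:
        rel_path = record.get(image_key)
        if rel_path is None:
            continue  # text-only record, nothing to check
        if not (Path(image_dir) / rel_path).is_file():
            missing.append(rel_path)
    return missing


# Tiny demo on a throwaway directory with one present and one absent image.
with tempfile.TemporaryDirectory() as tmp:
    images = Path(tmp) / "images"
    images.mkdir()
    (images / "a.jpg").write_bytes(b"")
    instructions = Path(tmp) / "train.json"
    instructions.write_text(json.dumps([{"image": "a.jpg"}, {"image": "b.jpg"}]))
    print(find_missing_images(instructions, images))
```

An empty result suggests the on-disk layout matches the paths in the JSON; any listed path points to a folder that needs to be moved or renamed.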

Then set the benchmark paths in config/benchmarks/<benchmark>.py (e.g. TRIGAP_IMAGE_DIR and TRIGAP_INSTRUCTION_DIR in TriGap.py).

For quick experiments, you can use smaller sub-splits: sample the instruction JSON yourself, save it with a _sub suffix (e.g. train_sub.json), and set "use_sub_dataset": true in config/run_config.py.
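One possible way to produce such a sub-split is sketched below. It assumes the instruction file is a JSON list of records; the function name, the sampling fraction, and the fixed seed are choices made for this example, not part of PRISM.

```python
import json
import random
from pathlib import Path


def write_sub_split(instruction_file, fraction=0.1, seed=0):
    """Sample a fraction of the records and save them next to the source
    file with a _sub suffix (train.json -> train_sub.json).

    Assumes the file contains a JSON list; a fixed seed keeps the
    sub-split reproducible across runs.
    """
    src = Path(instruction_file)
    records = json.loads(src.read_text(encoding="utf-8"))
    # Keep at least one record, but never more than the file contains.
    k = min(len(records), max(1, int(len(records) * fraction)))
    subset = random.Random(seed).sample(records, k)
    dst = src.with_name(f"{src.stem}_sub{src.suffix}")
    dst.write_text(json.dumps(subset, ensure_ascii=False, indent=2), encoding="utf-8")
    return dst
```

For example, write_sub_split("train.json", fraction=0.05) would write train_sub.json in the same folder, which is the naming that "use_sub_dataset": true expects according to the text above.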

You can add custom benchmarks under config/benchmarks/ and register them in config/benchmarks/__init__.py.


Environment setup (one command)

If you are on NVIDIA RTX 5090 GPU(s) (our tested setup), a single command sets up everything from the repository root:

bash scripts/setup_env.sh

This creates the conda env prism (if missing); installs torch 2.8 + cu128, the training/eval dependencies, and flash-attn; and runs pip install -e . (the trailing dot is part of the command).

For other GPUs or CUDA versions, you may need to adjust PyTorch, flash-attn, and related libraries. See requirements/README.md for options (e.g. TORCH_REQUIREMENTS=requirements/torch-cu118.txt for older CUDA stacks, FLASH_ATTN_WHEEL, SKIP_FLASH_ATTN).

Activate and verify:

conda activate prism
python -c "import torch; import transformers; import deepspeed; print(torch.__version__, transformers.__version__)"

Paths and config

Edit backbone paths under config/paths/ and benchmark roots under config/benchmarks/. Tune runs via config/run_config.py.

After configuration, run a quick zero-shot inference on a single task to check weights, data paths, and GPUs (zeroshot uses the base MLLM checkpoint only):

python run.py infer 0 --method zeroshot

Then run continual training and evaluation:

python run.py train 0 1 2
python run.py infer 0 1 2

0, 1, 2 are task indices (see config/benchmarks/<benchmark>.py). You may train any tasks you need; stage k resumes from task k−1’s checkpoint. For inference, choose the checkpoint in config/run_config.py.
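The resume rule can be written down as a small helper. This is only a sketch of the rule as stated above, not PRISM's code: the function name is hypothetical, and the assumption that stage 0 starts from the base MLLM weights is ours.

```python
from typing import List, Optional, Tuple


def resume_plan(task_indices: List[int]) -> List[Tuple[int, Optional[int]]]:
    """Pair each training stage k with the task whose checkpoint it resumes from.

    Stage k resumes from task k-1; stage 0 has no predecessor, which this
    sketch represents as None (assumed to mean the base MLLM weights).
    """
    plan = []
    for stage in task_indices:
        source = stage - 1 if stage > 0 else None
        plan.append((stage, source))
    return plan


for stage, source in resume_plan([0, 1, 2]):
    origin = "base MLLM weights" if source is None else f"task {source} checkpoint"
    print(f"stage {stage}: starts from {origin}")
```

Under this rule, training only task 2 still requires a task 1 checkpoint to exist, so skipped stages need to have been trained in an earlier run.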

CLI flags override config; omitted flags use config defaults.
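The override behaviour can be pictured with a short argparse sketch. It is not PRISM's actual parser: the CONFIG_DEFAULTS values are hypothetical placeholders, and only the train/infer modes, the task indices, and the --method and --benchmark flags are taken from this README.

```python
import argparse

# Hypothetical stand-in for values that would live in config/run_config.py.
CONFIG_DEFAULTS = {"method": "ft_lora", "benchmark": "coin", "backbone": "llava"}


def resolve_settings(argv, defaults=CONFIG_DEFAULTS):
    """Merge CLI flags over config defaults; omitted flags keep the default."""
    parser = argparse.ArgumentParser(prog="run.py")
    parser.add_argument("mode", choices=["train", "infer"])
    parser.add_argument("tasks", nargs="+", type=int)
    # default=None lets us tell "flag omitted" apart from "flag given".
    parser.add_argument("--method", default=None)
    parser.add_argument("--benchmark", default=None)
    args = parser.parse_args(argv)

    settings = dict(defaults)
    for key in ("method", "benchmark"):
        value = getattr(args, key)
        if value is not None:
            settings[key] = value
    settings["mode"] = args.mode
    settings["tasks"] = args.tasks
    return settings


print(resolve_settings(["infer", "0", "--method", "zeroshot"]))
```

Here --method replaces the default while benchmark and backbone fall back to the config values, which mirrors the rule stated above.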


📄 License

This project is released under the MIT License.


🙏 Acknowledgments

We thank the following projects for their benchmarks and reference implementations used in PRISM:


📧 Contact

If you have any questions or would like to propose new features, please open an issue or contact the authors: Jun-Tao Tang (juntao.tang@smail.nju.edu.cn), Yu-Cheng Shi (231250034@smail.nju.edu.cn), and Da-Wei Zhou (zhoudw@lamda.nju.edu.cn). Enjoy the code.
