# [WACV'26] Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships
Author: Futa Waseda
This repository provides the official implementation of Multimodal Adversarial Training (MAT) for Vision-Language Models (VLMs).
- Unified MAT pipeline for image-text retrieval models.
- MAT+ additionally leverages one-to-many relationships in image-text pairs.
- Reproducible scripts for Flickr30k and COCO.
- `attacks/`: adversarial attack implementations for images and text
- `configs/`: dataset and training configs
- `dataset_json/`: dataset metadata and annotations
- `models/`: VLM model implementations (ALBEF/BLIP)
- `scripts/`: training scripts and presets
- `train_*.py`: training entry points
```bash
pip install -r requirements.txt
```

For system packages on Linux, see `setup.sh`.
- Request access and download the Flickr30k dataset from the official page.
- Place and extract the images under `IMAGE_ROOT/Flickr30k/`:
```bash
# Set your data root
export IMAGE_ROOT=/path/to/your/data

# Create directory and place the downloaded tar.gz
mkdir -p $IMAGE_ROOT/Flickr30k

# Move or copy the downloaded archive
mv flickr30k-images.tar.gz $IMAGE_ROOT/Flickr30k/

# Extract images
cd $IMAGE_ROOT/Flickr30k
tar xzf flickr30k-images.tar.gz

# The archive extracts to flickr30k-images/ by default.
# If your archive extracts to a different name (e.g., Images/), rename it:
# mv Images flickr30k-images
```

- Download COCO 2014 train/val images from cocodataset.org.
- Place them under `IMAGE_ROOT/MSCOCO/`.
- Download annotation JSONs from the ALBEF repo and place them under `dataset_json/data/`.
- Set `DATA_ROOT` to your dataset base directory. Paths in `configs/data/` are relative to this root.
- Alternatively, set `IMAGE_ROOT` to the absolute path of your image directory.
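The one-to-many relationship that MAT+ leverages is visible directly in the annotation files: each image typically comes with several reference captions. The sketch below groups captions per image; it assumes the ALBEF-style annotation format (a JSON list of entries with `"image"` and `"caption"` keys), so adapt the keys if your files differ.

```python
import json
from collections import defaultdict

def captions_per_image(annotation):
    """Group captions by image path to expose the one-to-many structure."""
    groups = defaultdict(list)
    for entry in annotation:
        groups[entry["image"]].append(entry["caption"])
    return dict(groups)

# Toy annotation in the assumed ALBEF-style format.
sample = [
    {"image": "flickr30k-images/1000092795.jpg", "caption": "Two dogs play in a field."},
    {"image": "flickr30k-images/1000092795.jpg", "caption": "Dogs running on grass."},
    {"image": "flickr30k-images/1000268201.jpg", "caption": "A child climbs stairs."},
]
groups = captions_per_image(sample)
```

With the real data, replace `sample` with `json.load(open("dataset_json/data/flickr30k_train.json"))`.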
The following files are available from this Hugging Face dataset repository:
- MAT checkpoints (e.g., CLIP-MAT-Flickr30k)
- Data augmentations to reproduce MAT+ results:
  - `dataset_json` (text augmentation data)
  - `flickr_SD_I2I_0.5` (Flickr SD T-I2I augmentation data)
https://huggingface.co/cyberagent/multimodal-adversarial-training
Expected directory structure:
```
DATA_ROOT/
├── dataset_json/
│   ├── data/                        # Original annotations
│   │   ├── coco_train.json
│   │   ├── coco_val.json
│   │   ├── coco_test.json
│   │   ├── flickr30k_train.json
│   │   ├── flickr30k_val.json
│   │   ├── flickr30k_test.json
│   │   ├── refcoco+_train.json
│   │   ├── refcoco+_test.json
│   │   └── refcoco+/
│   │       ├── dets.json
│   │       └── cocos.json
│   ├── data_idx/                    # Subset annotations (caps=1)
│   │   ├── coco_train_caps=1.json
│   │   └── flickr30k_train_caps=1.json
│   ├── EDA0.3/                      # EDA text augmentation
│   ├── data_rewrite/                # LLaMA text rewriting
│   ├── data_coco_stableDiffusion/   # SD text-to-image augmentation
│   ├── SD-Image2Image/              # SD image-to-image augmentation
│   ├── RandAug2-5-0.7/              # RandAugment image augmentation
│   └── InternVL2_5-2B_diverse/      # Image captioning (diverse prompts)
├── MSCOCO/                          # COCO images
└── Flickr30k/
    └── flickr30k-images/            # Flickr30k images (31783 .jpg files)
```
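Before launching training, a quick pre-flight check of this layout can save a failed run. A minimal sketch, covering only the Flickr30k subset of the tree above (extend `REQUIRED` with the COCO and augmentation paths you actually use); the 31783-image count comes from the tree comment:

```python
from pathlib import Path

# Paths the training configs are assumed to reference (Flickr30k subset only).
REQUIRED = [
    "dataset_json/data/flickr30k_train.json",
    "dataset_json/data/flickr30k_val.json",
    "dataset_json/data/flickr30k_test.json",
    "Flickr30k/flickr30k-images",
]
EXPECTED_FLICKR_IMAGES = 31783  # per the directory tree above

def check_layout(data_root):
    """Return (missing required paths, number of Flickr30k .jpg files)."""
    root = Path(data_root)
    missing = [p for p in REQUIRED if not (root / p).exists()]
    image_dir = root / "Flickr30k" / "flickr30k-images"
    count = len(list(image_dir.glob("*.jpg"))) if image_dir.is_dir() else 0
    return missing, count
```

Run it as `check_layout("/path/to/DATA_ROOT")` and compare the count against `EXPECTED_FLICKR_IMAGES`.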
- ALBEF: https://github.com/salesforce/ALBEF
- BLIP: https://github.com/salesforce/BLIP
- Update paths in `configs/train/...` to match your local checkpoints.
Run the provided scripts:

```bash
bash scripts/train_clip_MAT.sh
bash scripts/train_albef_retrieval_MAT.sh
bash scripts/train_blip_retrieval_MAT.sh
```

Outputs are saved under `train_results/<dataset>/<model>/.../`.
If `checkpoint_last.pth` already exists in the output directory, the scripts automatically switch to eval-only mode (no need to pass `--evaluate`).
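If you wrap these scripts in your own launcher, the same auto-switch is easy to mirror. A sketch of the assumed logic (checking the output directory for `checkpoint_last.pth`), not the scripts' exact code:

```python
from pathlib import Path

def run_mode(output_dir):
    """Pick eval-only mode when a completed run's last checkpoint is present."""
    last_checkpoint = Path(output_dir) / "checkpoint_last.pth"
    return "eval" if last_checkpoint.exists() else "train"
```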
Edit `configs/` and run:

```bash
python -m torch.distributed.launch --nproc_per_node=1 --use_env \
    --master_port=$RANDOM \
    train_clip.py \
    --config ./configs/data/flickr30k/caps_k=1.yaml \
    --output_dir ./train_results/flickr30k/clip/MAT/ \
    --train_config ./configs/train/CLIP/clip_base.yaml \
    --total_steps 5000 \
    --attack MultimodalAttack \
    --num_iters 2 \
    --step_size 1.0
```

This code is based on the following repositories:
- models
- attacks
If you find this code useful for your research, please consider citing:
```bibtex
@inproceedings{waseda2026multimodal,
  title     = {Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships},
  author    = {Waseda, Futa and Tejero-de-Pablos, Antonio and Echizen, Isao},
  booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year      = {2026}
}
```