# [WACV'26] Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships
Author: Futa Waseda
This repository provides the official implementation of Multimodal Adversarial Training (MAT) for Vision-Language Models (VLMs).
- Unified MAT pipeline for image-text retrieval models.
- MAT+ additionally leverages one-to-many relationships in image-text pairs.
- Reproducible scripts for Flickr30k and COCO.
- `attacks/`: adversarial attack implementations for images and text
- `configs/`: dataset and training configs
- `dataset_json/`: dataset metadata and annotations
- `models/`: VLM model implementations (ALBEF/BLIP)
- `scripts/`: training scripts and presets
- `train_*.py`: training entry points
```bash
pip install -r requirements.txt
```

For system packages on Linux, see `setup.sh`.
- Request access and download the Flickr30k dataset from the official page.
- Place and extract the images under `IMAGE_ROOT/Flickr30k/`:
```bash
# Set your data root
export IMAGE_ROOT=/path/to/your/data

# Create directory and place the downloaded tar.gz
mkdir -p $IMAGE_ROOT/Flickr30k

# Move or copy the downloaded archive
mv flickr30k-images.tar.gz $IMAGE_ROOT/Flickr30k/

# Extract images
cd $IMAGE_ROOT/Flickr30k
tar xzf flickr30k-images.tar.gz

# The archive extracts to flickr30k-images/ by default.
# If your archive extracts to a different name (e.g., Images/), rename it:
# mv Images flickr30k-images
```

- Download COCO 2014 train/val images from cocodataset.org.
- Place them under `IMAGE_ROOT/MSCOCO/`.
- Download annotation JSONs from the ALBEF repo and place them under `dataset_json/data/`.
- Set `DATA_ROOT` to your dataset base directory. Paths in `configs/data/` are relative to this root.
- Alternatively, set `IMAGE_ROOT` to the absolute path of your image directory.
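The one-to-many relationship that MAT+ leverages is visible directly in the annotation files: each image typically comes with several reference captions. The sketch below groups captions per image; it assumes the ALBEF-style annotation format (a JSON list of entries with `"image"` and `"caption"` keys), so adapt the keys if your files differ.

```python
import json
from collections import defaultdict

def captions_per_image(annotation):
    """Group captions by image path to expose the one-to-many structure."""
    groups = defaultdict(list)
    for entry in annotation:
        groups[entry["image"]].append(entry["caption"])
    return dict(groups)

# Toy annotation in the assumed ALBEF-style format.
sample = [
    {"image": "flickr30k-images/1000092795.jpg", "caption": "Two dogs play in a field."},
    {"image": "flickr30k-images/1000092795.jpg", "caption": "Dogs running on grass."},
    {"image": "flickr30k-images/1000268201.jpg", "caption": "A child climbs stairs."},
]
groups = captions_per_image(sample)
```

With the real data, replace `sample` with `json.load(open("dataset_json/data/flickr30k_train.json"))`.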
The following files are available from this Hugging Face dataset repository:
- MAT checkpoints (e.g., CLIP-MAT-Flickr30k)
- Data augmentations to reproduce MAT+ results:
  - `dataset_json` (text augmentation data)
  - `flickr_SD_I2I_0.5` (Flickr SD T-I2I augmentation data)
https://huggingface.co/cyberagent/multimodal-adversarial-training
Expected directory structure:
```
DATA_ROOT/
├── dataset_json/
│   ├── data/                        # Original annotations
│   │   ├── coco_train.json
│   │   ├── coco_val.json
│   │   ├── coco_test.json
│   │   ├── flickr30k_train.json
│   │   ├── flickr30k_val.json
│   │   ├── flickr30k_test.json
│   │   ├── refcoco+_train.json
│   │   ├── refcoco+_test.json
│   │   └── refcoco+/
│   │       ├── dets.json
│   │       └── cocos.json
│   ├── data_idx/                    # Subset annotations (caps=1)
│   │   ├── coco_train_caps=1.json
│   │   └── flickr30k_train_caps=1.json
│   ├── EDA0.3/                      # EDA text augmentation
│   ├── data_rewrite/                # LLaMA text rewriting
│   ├── data_coco_stableDiffusion/   # SD text-to-image augmentation
│   ├── SD-Image2Image/              # SD image-to-image augmentation
│   ├── RandAug2-5-0.7/              # RandAugment image augmentation
│   └── InternVL2_5-2B_diverse/      # Image captioning (diverse prompts)
├── MSCOCO/                          # COCO images
└── Flickr30k/
    └── flickr30k-images/            # Flickr30k images (31783 .jpg files)
```
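Before launching training, a quick pre-flight check of this layout can save a failed run. A minimal sketch, covering only the Flickr30k subset of the tree above (extend `REQUIRED` with the COCO and augmentation paths you actually use); the 31783-image count comes from the tree comment:

```python
from pathlib import Path

# Paths the training configs are assumed to reference (Flickr30k subset only).
REQUIRED = [
    "dataset_json/data/flickr30k_train.json",
    "dataset_json/data/flickr30k_val.json",
    "dataset_json/data/flickr30k_test.json",
    "Flickr30k/flickr30k-images",
]
EXPECTED_FLICKR_IMAGES = 31783  # per the directory tree above

def check_layout(data_root):
    """Return (missing required paths, number of Flickr30k .jpg files)."""
    root = Path(data_root)
    missing = [p for p in REQUIRED if not (root / p).exists()]
    image_dir = root / "Flickr30k" / "flickr30k-images"
    count = len(list(image_dir.glob("*.jpg"))) if image_dir.is_dir() else 0
    return missing, count
```

Run it as `check_layout("/path/to/DATA_ROOT")` and compare the count against `EXPECTED_FLICKR_IMAGES`.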
- ALBEF: https://github.com/salesforce/ALBEF
- BLIP: https://github.com/salesforce/BLIP
- Update paths in `configs/train/...` to match your local checkpoints.
Run the provided scripts:

```bash
bash scripts/train_clip_MAT.sh
bash scripts/train_albef_retrieval_MAT.sh
bash scripts/train_blip_retrieval_MAT.sh
```

Outputs are saved under `train_results/<dataset>/<model>/.../`.
If `checkpoint_last.pth` already exists in the output directory, the scripts automatically switch to eval-only mode (no need to pass `--evaluate`).
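If you wrap these scripts in your own launcher, the same auto-switch is easy to mirror. A sketch of the assumed logic (checking the output directory for `checkpoint_last.pth`), not the scripts' exact code:

```python
from pathlib import Path

def run_mode(output_dir):
    """Pick eval-only mode when a completed run's last checkpoint is present."""
    last_checkpoint = Path(output_dir) / "checkpoint_last.pth"
    return "eval" if last_checkpoint.exists() else "train"
```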
Edit `configs/` and run:

```bash
python -m torch.distributed.launch --nproc_per_node=1 --use_env \
    --master_port=$RANDOM \
    train_clip.py \
    --config ./configs/data/flickr30k/caps_k=1.yaml \
    --output_dir ./train_results/flickr30k/clip/MAT/ \
    --train_config ./configs/train/CLIP/clip_base.yaml \
    --total_steps 5000 \
    --attack MultimodalAttack \
    --num_iters 2 \
    --step_size 1.0
```

This code is based on the following repositories:
- models
- attacks
If you find this code useful for your research, please consider citing:
```bibtex
@inproceedings{waseda2026multimodal,
  title     = {Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships},
  author    = {Waseda, Futa and Tejero-de-Pablos, Antonio and Echizen, Isao},
  booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year      = {2026}
}
```