
CyberAgentAILab/multimodal-adversarial-training


[WACV'26] Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships

Author: Futa Waseda

Teaser

Abstract

This repository provides the official implementation of Multimodal Adversarial Training (MAT) for Vision-Language Models (VLMs).

Highlights

  • Unified MAT pipeline for image-text retrieval models.
  • MAT+ additionally leverages one-to-many relationships in image-text pairs.
  • Reproducible scripts for Flickr30k and COCO.

Repository Structure

  • attacks/: adversarial attack implementations for images and text
  • configs/: dataset and training configs
  • dataset_json/: dataset metadata and annotations
  • models/: VLM model implementations (ALBEF/BLIP)
  • scripts/: training scripts and presets
  • train_*.py: training entry points

Setup

Requirements (Python 3.11)

pip install -r requirements.txt

For system packages on Linux, see setup.sh.

Datasets

Flickr30k

  1. Request access and download the Flickr30k dataset from the official page.
  2. Place and extract the images under IMAGE_ROOT/Flickr30k/:
# Set your data root
export IMAGE_ROOT=/path/to/your/data

# Create directory and place the downloaded tar.gz
mkdir -p $IMAGE_ROOT/Flickr30k
# Move or copy the downloaded archive
mv flickr30k-images.tar.gz $IMAGE_ROOT/Flickr30k/

# Extract images
cd $IMAGE_ROOT/Flickr30k
tar xzf flickr30k-images.tar.gz

# The archive extracts to flickr30k-images/ by default.
# If your archive extracts to a different name (e.g., Images/), rename it:
# mv Images flickr30k-images

COCO

  1. Download COCO 2014 train/val images from cocodataset.org.
  2. Place them under IMAGE_ROOT/MSCOCO/.
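The two steps above can be sketched as follows (a minimal sketch: the default `IMAGE_ROOT` here is a placeholder you should adjust, and the download URLs are the standard cocodataset.org 2014 image zips):

```shell
# Set your data root (placeholder default; adjust to your machine)
export IMAGE_ROOT=${IMAGE_ROOT:-$PWD/data}
mkdir -p "$IMAGE_ROOT/MSCOCO"

# Download and extract COCO 2014 train/val images (large downloads;
# uncomment when ready):
# wget http://images.cocodataset.org/zips/train2014.zip
# wget http://images.cocodataset.org/zips/val2014.zip
# unzip -q train2014.zip -d "$IMAGE_ROOT/MSCOCO"
# unzip -q val2014.zip -d "$IMAGE_ROOT/MSCOCO"
```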

Annotation files

  • Download annotation JSONs from the ALBEF repo and place them under dataset_json/data/.

Configuration

  • Set DATA_ROOT to your dataset base directory. Paths in configs/data/ are relative to this root.
  • Alternatively, set IMAGE_ROOT to the absolute path of your image directory.
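For example (paths are placeholders; adjust to your machine):

```shell
# Base directory: paths in configs/data/ are resolved relative to this root
export DATA_ROOT=${DATA_ROOT:-$PWD/data}
# Or point IMAGE_ROOT at the absolute path of your image directory
export IMAGE_ROOT=${IMAGE_ROOT:-$DATA_ROOT}
```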

Provided Resources (Hugging Face)

The following files are available from the Hugging Face dataset repository at https://huggingface.co/cyberagent/multimodal-adversarial-training:

  • MAT checkpoints (e.g., CLIP-MAT-Flickr30k)
  • Data augmentations to reproduce MAT+ results:
    • dataset_json (text augmentation data)
    • flickr_SD_I2I_0.5 (Flickr SD T-I2I augmentation data)
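One way to fetch these resources is with the Hugging Face CLI (a hedged sketch, not the repo's documented workflow; it assumes `huggingface_hub` is installed, e.g. via `pip install -U huggingface_hub`, and uses a placeholder local directory):

```shell
# Download the dataset repo contents into your data root (placeholder path)
huggingface-cli download cyberagent/multimodal-adversarial-training \
  --repo-type dataset \
  --local-dir "${DATA_ROOT:-./data}"
```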

Expected directory structure:

DATA_ROOT/
├── dataset_json/
│   ├── data/                          # Original annotations
│   │   ├── coco_train.json
│   │   ├── coco_val.json
│   │   ├── coco_test.json
│   │   ├── flickr30k_train.json
│   │   ├── flickr30k_val.json
│   │   ├── flickr30k_test.json
│   │   ├── refcoco+_train.json
│   │   ├── refcoco+_test.json
│   │   └── refcoco+/
│   │       ├── dets.json
│   │       └── cocos.json
│   ├── data_idx/                       # Subset annotations (caps=1)
│   │   ├── coco_train_caps=1.json
│   │   └── flickr30k_train_caps=1.json
│   ├── EDA0.3/                         # EDA text augmentation
│   ├── data_rewrite/                   # LLaMA text rewriting
│   ├── data_coco_stableDiffusion/      # SD text-to-image augmentation
│   ├── SD-Image2Image/                 # SD image-to-image augmentation
│   ├── RandAug2-5-0.7/                 # RandAugment image augmentation
│   └── InternVL2_5-2B_diverse/         # Image captioning (diverse prompts)
├── MSCOCO/                             # COCO images
└── Flickr30k/
    └── flickr30k-images/               # Flickr30k images (31783 .jpg files)

Pretrained Checkpoints

MAT checkpoints are provided via the Hugging Face dataset repository listed under Provided Resources above.

Training

Run the provided scripts:

bash scripts/train_clip_MAT.sh
bash scripts/train_albef_retrieval_MAT.sh
bash scripts/train_blip_retrieval_MAT.sh

Outputs are saved under train_results/<dataset>/<model>/.../.

Auto Evaluation

If checkpoint_last.pth already exists in the output directory, the scripts automatically switch to eval-only mode (no need to pass --evaluate).
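The switch can be illustrated with a small shell sketch (hypothetical logic for illustration; the actual check lives inside the provided scripts, and OUTPUT_DIR is a placeholder):

```shell
OUTPUT_DIR=${OUTPUT_DIR:-./train_results/flickr30k/clip/MAT}
if [ -f "$OUTPUT_DIR/checkpoint_last.pth" ]; then
  MODE=eval   # checkpoint exists: run evaluation only
else
  MODE=train  # no checkpoint yet: run full training
fi
echo "$MODE"
```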

Custom Run

Edit configs/ and run:

python -m torch.distributed.launch --nproc_per_node=1 --use_env \
  --master_port=$RANDOM \
  train_clip.py \
  --config ./configs/data/flickr30k/caps_k=1.yaml \
  --output_dir ./train_results/flickr30k/clip/MAT/ \
  --train_config ./configs/train/CLIP/clip_base.yaml \
  --total_steps 5000 \
  --attack MultimodalAttack \
  --num_iters 2 \
  --step_size 1.0

Acknowledgements

Citation

If you find this code useful for your research, please consider citing:

@inproceedings{waseda2026multimodal,
  title={Multimodal Adversarial Defense for Vision-Language Models by Leveraging One-To-Many Relationships},
  author={Waseda, Futa and Tejero-de-Pablos, Antonio and Echizen, Isao},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year={2026}
}
