Complementary Text-Guided Attention for Zero-Shot Adversarial Robustness

🚀 TPAMI 2026     Lu Yu · Haiyang Zhang · Changsheng Xu     📄 TPAMI 2026 Paper



Text-Guided Attention is All You Need for Zero-Shot Robustness in Vision-Language Models

🎯 NeurIPS 2024   Lu Yu · Haiyang Zhang · Changsheng Xu   📄 NeurIPS 2024 Paper


🔍 Overview

Pretrained vision-language models such as CLIP demonstrate remarkable zero-shot generalization ability. However, they remain highly vulnerable to adversarial perturbations.

We identify a critical phenomenon:

Adversarial perturbations systematically shift text-guided attention, rather than merely corrupting pixel space.

Based on this insight, we propose:

  • TGA-ZSR (NeurIPS 2024)
    Text-Guided Attention for Zero-Shot Robustness

  • Comp-TGA (TPAMI 2026)
    Complementary Text-Guided Attention

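Averaged across 16 datasets, our methods improve zero-shot robust accuracy over the strongest prior baseline in the benchmark below (PMG-AFT, 32.51%) by:

  • +9.58% (absolute) with TGA-ZSR
  • +11.95% (absolute) with Comp-TGA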

🧠 Motivation

Attention Shift Under Adversarial Perturbation

Adversarial examples induce significant deviation in text-guided attention.


Spurious Attention in Clean Samples

Even without adversarial perturbations, text-guided attention may focus on irrelevant regions.


🚀 Method

TGA-ZSR Framework

TGA-ZSR consists of two components:

Local Attention Refinement Module
Aligns adversarial attention with clean attention from the original model.

Global Attention Constraint Module
Preserves clean performance while enhancing robustness.

This design enforces attention consistency without sacrificing zero-shot generalization.
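
As a rough illustration of how the two modules could enter a training objective, the sketch below combines a classification loss with two attention-distance terms weighted by `--Alpha` and `--Beta`. Several parts are assumptions rather than the repository's code: text-guided attention is taken to be the min-max-scaled cosine similarity between patch embeddings and a text embedding, the global constraint is taken to compare clean attention of the fine-tuned model with clean attention of the original model, and the names `text_guided_attention`, `l2_distance`, and `tga_zsr_loss` are hypothetical.

```python
import numpy as np


def text_guided_attention(patch_feats, text_feat):
    """Assumed form: cosine similarity between each patch embedding and the
    text embedding, min-max scaled to [0, 1]."""
    patches = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    text = text_feat / np.linalg.norm(text_feat)
    sim = patches @ text  # shape: (num_patches,)
    return (sim - sim.min()) / (sim.max() - sim.min() + 1e-8)


def l2_distance(a, b):
    """Euclidean distance; one reading of the 'l2' choice of --Distance_metric."""
    return float(np.linalg.norm(a - b))


def tga_zsr_loss(ce_loss, attn_adv_target, attn_clean_target, attn_clean_orig,
                 alpha=0.08, beta=0.05):
    # Local refinement: adversarial attention of the fine-tuned (target) model
    # is pulled toward clean attention of the original model.
    l_larm = l2_distance(attn_adv_target, attn_clean_orig)
    # Global constraint (assumed form): clean attention of the target model
    # stays close to clean attention of the original model.
    l_gacm = l2_distance(attn_clean_target, attn_clean_orig)
    return ce_loss + alpha * l_larm + beta * l_gacm


# Toy data: 49 patches (a 7x7 grid) with 8-dimensional embeddings.
rng = np.random.default_rng(0)
text = rng.normal(size=8)
clean_patches = rng.normal(size=(49, 8))
adv_patches = clean_patches + 0.1 * rng.normal(size=(49, 8))

attn_clean = text_guided_attention(clean_patches, text)
attn_adv = text_guided_attention(adv_patches, text)
loss = tga_zsr_loss(1.0, attn_adv, attn_clean, attn_clean)
```

When the adversarial and clean attention maps coincide, both extra terms vanish and the objective reduces to the classification loss, which is the sense in which the design enforces attention consistency.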


Complementary Text-Guided Attention (Comp-TGA)

We observe that standard text-guided attention occasionally captures spurious foreground cues.

Comp-TGA introduces a complementary fusion mechanism:

  • Class-prompt guided foreground attention
  • Reversed non-class prompt driven attention

By integrating these two complementary signals, the model captures a more accurate foreground representation and improves robustness stability.
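
The fusion of the two signals can be pictured with a small sketch. It assumes both attention maps are already scaled to [0, 1], that "reversed" means taking `1 - attention`, and that the two maps are fused by a plain average; the actual fusion rule is not described here, and `complementary_attention` is a hypothetical name.

```python
import numpy as np


def complementary_attention(attn_class, attn_non_class):
    """Fuse class-prompt attention with reversed non-class-prompt attention.

    Averaging is an assumption made for illustration only.
    """
    reversed_non_class = 1.0 - attn_non_class  # high where the non-class prompt is weak
    return 0.5 * (attn_class + reversed_non_class)


# Four patches; the third one is a spurious response to the class prompt.
attn_class = np.array([0.9, 0.8, 0.7, 0.1])
attn_non_class = np.array([0.1, 0.2, 0.9, 0.8])

fused = complementary_attention(attn_class, attn_non_class)
# fused is approximately [0.9, 0.8, 0.4, 0.15]
```

In this toy case the spurious third patch drops from 0.7 to about 0.4, because the non-class prompt also responds strongly to it, while the first two patches keep their high scores.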


📊 Zero-Shot Adversarial Robustness Benchmark

| Method | Venue | Robust (%) | Clean (%) | Average (%) |
|---|---|---|---|---|
| CLIP | ICML 2021 | 4.90 | 64.42 | 34.66 |
| FT-Clean | Initial Entry | 7.05 | 54.37 | 30.71 |
| FT-Adv. | Initial Entry | 28.83 | 43.36 | 36.09 |
| TeCoA | ICLR 2023 | 28.06 | 45.81 | 36.93 |
| PMG-AFT | CVPR 2024 | 32.51 | 46.60 | 39.55 |
| FARE | ICML 2024 | 18.25 | 59.85 | 39.05 |
| Vision-based | Initial Entry | 29.47 | 45.02 | 37.24 |
| TGA-ZSR (Ours) | NeurIPS 2024 | 42.09 | 56.44 | 49.27 |
| Comp-TGA (Ours) | TPAMI 2026 | 44.46 | 55.44 | 49.95 |
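
The Average column and the headline gains can be rechecked from the numbers above. The sketch assumes that Average is the mean of Robust and Clean and that the gains are measured against the prior method with the highest Robust value; both readings are inferred from the figures, not stated explicitly.

```python
# (robust, clean, reported average) per method, copied from the benchmark.
rows = {
    "CLIP": (4.90, 64.42, 34.66),
    "FT-Clean": (7.05, 54.37, 30.71),
    "FT-Adv.": (28.83, 43.36, 36.09),
    "TeCoA": (28.06, 45.81, 36.93),
    "PMG-AFT": (32.51, 46.60, 39.55),
    "FARE": (18.25, 59.85, 39.05),
    "Vision-based": (29.47, 45.02, 37.24),
    "TGA-ZSR": (42.09, 56.44, 49.27),
    "Comp-TGA": (44.46, 55.44, 49.95),
}
ours = {"TGA-ZSR", "Comp-TGA"}

# Largest gap between the recomputed mean and the reported Average column.
max_gap = max(abs((robust + clean) / 2 - avg) for robust, clean, avg in rows.values())

# Strongest prior method by robust accuracy.
baseline = max((name for name in rows if name not in ours), key=lambda n: rows[n][0])

# Absolute gains in robust accuracy over that baseline.
gains = {name: round(rows[name][0] - rows[baseline][0], 2) for name in sorted(ours)}

print(baseline, gains, round(max_gap, 3))
```

The recomputed means match the reported column to within rounding of the last digit, which suggests the reported averages were computed from unrounded per-dataset results.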

Robustness–Clean Trade-off

Each point represents a method.
Point size reflects trade-off quality between clean and robust accuracy.


🔧 Reproducibility

Checkpoints


⚙️ Environment Setup

pip install virtualenv
virtualenv TGA-ZSR
source TGA-ZSR/bin/activate
pip install -r requirements.txt

Experiment:

Run the code with the following command (for TeCoA and PMG-AFT, see the source code):

bash ./main.sh

Options shared by all scripts:

  • --Method: Name of the method, used to distinguish checkpoints obtained with different methods.
  • --train_eps: The perturbation magnitude used to generate training adversarial examples. (default = 1)
  • --train_numsteps: The number of iterations used to generate training adversarial examples. (default = 2)
  • --train_stepsize: The per-iteration step size used to generate training adversarial examples. (default = 1)
  • --test_eps: The perturbation magnitude used to generate test adversarial examples. (default = 1)
  • --test_numsteps: The number of iterations used to generate test adversarial examples. (default = 100)
  • --test_stepsize: The per-iteration step size used to generate test adversarial examples. (default = 1)
  • --arch: The CLIP version to use. (default = 'vit_b32')
  • --dataset: The dataset used for training. (default = 'tinyImageNet')
  • --seed: Random seed. (default = 0)
  • --resume: Path to a checkpoint. (default = None)
  • --last_num_ft: Layers to fine-tune. (default = 0)
  • --VPbaseline: Whether to conduct adversarial training.
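
The `eps`, `numsteps`, and `stepsize` options read like the parameters of an iterative L-infinity attack such as PGD. The sketch below shows how such parameters typically interact. It is not the repository's attack code: the attack type, the assumption that `eps` and `stepsize` are given in units of 1/255 on images scaled to [0, 1], and the names `pgd_linf` and `grad_fn` are all illustrative.

```python
import numpy as np


def pgd_linf(x, grad_fn, eps=1.0, num_steps=2, step_size=1.0):
    """Projected gradient ascent inside an L-infinity ball around x.

    grad_fn(z) returns the gradient of the attacked loss with respect to z.
    eps and step_size are assumed to be expressed in units of 1/255.
    """
    eps, step_size = eps / 255.0, step_size / 255.0
    delta = np.zeros_like(x)
    for _ in range(num_steps):
        grad = grad_fn(x + delta)
        # Step in the direction that increases the loss, then project back
        # onto the eps-ball.
        delta = np.clip(delta + step_size * np.sign(grad), -eps, eps)
        # Keep the perturbed image inside the valid pixel range [0, 1].
        delta = np.clip(x + delta, 0.0, 1.0) - x
    return x + delta


# Toy loss L(z) = w . z, whose gradient is the constant vector w.
w = np.array([0.5, -2.0, 1.0])
x = np.array([0.2, 0.5, 0.8])
x_adv = pgd_linf(x, grad_fn=lambda z: w)  # defaults mirror the training options
```

With the training defaults (eps = 1, two steps of size 1), the perturbation already reaches the boundary of the ball after the first step, so the second step only re-projects it.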

Script-specific options:

TGA-ZSR.py:

  • --Distance_metric: The distance measure used in the loss function. (default = 'l2')
  • --atten_methods: The perspective from which attention is computed. (default = 'text')
  • --Alpha: Weight of L_LARM in Eq. 9. (default = 0.08)
  • --Beta: Weight of L_GACM in Eq. 12. (default = 0.05)

Comp-TGA.py:

  • --Distance_metric: The distance measure used in the loss function. (default = 'l2')
  • --atten_methods: The perspective from which attention is computed. (default = 'text')
  • --Alpha: Weight of L_LARM in Eq. 9. (default = 0.10)
  • --Beta: Weight of L_GACM in Eq. 12. (default = 0.07)
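
For reference, the documented defaults can be mirrored in a small `argparse` parser. This is an illustrative reconstruction from the lists above, not the parser used by the scripts: the argument types are assumptions, `--Method` and `--VPbaseline` are left out because no defaults are listed for them, and `build_parser` and `SCRIPT_DEFAULTS` are hypothetical names.

```python
import argparse

# Loss-weight defaults that differ between the two scripts.
SCRIPT_DEFAULTS = {
    "TGA-ZSR": {"Alpha": 0.08, "Beta": 0.05},
    "Comp-TGA": {"Alpha": 0.10, "Beta": 0.07},
}


def build_parser(script="TGA-ZSR"):
    weights = SCRIPT_DEFAULTS[script]
    parser = argparse.ArgumentParser(description="Mirror of the documented options")
    # Adversarial example generation (types are assumed).
    parser.add_argument("--train_eps", type=float, default=1.0)
    parser.add_argument("--train_numsteps", type=int, default=2)
    parser.add_argument("--train_stepsize", type=float, default=1.0)
    parser.add_argument("--test_eps", type=float, default=1.0)
    parser.add_argument("--test_numsteps", type=int, default=100)
    parser.add_argument("--test_stepsize", type=float, default=1.0)
    # Model, data, and bookkeeping.
    parser.add_argument("--arch", type=str, default="vit_b32")
    parser.add_argument("--dataset", type=str, default="tinyImageNet")
    parser.add_argument("--seed", type=int, default=0)
    parser.add_argument("--resume", type=str, default=None)
    parser.add_argument("--last_num_ft", type=int, default=0)
    # Attention loss settings.
    parser.add_argument("--Distance_metric", type=str, default="l2")
    parser.add_argument("--atten_methods", type=str, default="text")
    parser.add_argument("--Alpha", type=float, default=weights["Alpha"])
    parser.add_argument("--Beta", type=float, default=weights["Beta"])
    return parser


# Example: Comp-TGA defaults with a shorter test-time attack.
args = build_parser("Comp-TGA").parse_args(["--test_numsteps", "10"])
print(args.Alpha, args.Beta, args.test_numsteps)
```

Keeping the per-script loss weights in one table makes the difference between the two configurations easy to see at a glance.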

Citation

If you find this repository useful, please consider citing our papers:

@inproceedings{TGA-ZSR,
     title={Text-guided attention is all you need for zero-shot robustness in vision-language models},
     author={Yu, Lu and Zhang, Haiyang and Xu, Changsheng},
     booktitle={Advances in Neural Information Processing Systems},
     volume={37},
     pages={96424--96448},
     year={2024}
}

@article{Comp-TGA,
     title={Complementary Text-Guided Attention for Zero-Shot Adversarial Robustness},
     author={Yu, Lu and Zhang, Haiyang and Xu, Changsheng},
     journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
     year={2026},
     publisher={IEEE}
}

Acknowledgement

We gratefully thank the authors of TeCoA and CLIPCAM for open-sourcing their code.
