Cross-Layer Cache Aggregation for Token Reduction in Ultra-Fine-Grained Image Recognition

This repository is the official PyTorch implementation of Cross-Layer Cache Aggregation for Token Reduction in Ultra-Fine-Grained Image Recognition, accepted for publication at ICASSP 2025.

We propose CLCA, a novel plug-in method that avoids the information loss and training instabilities introduced by token reduction methods for ViTs on ultra-fine-grained image recognition (UFGIR) datasets. CLCA incorporates a Cross-Layer Aggregation (CLA) classification head and a Cross-Layer Cache (CLC) mechanism to aggregate and transfer information across the layers of a ViT:
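For illustration, below is a minimal sketch of the cross-layer aggregation idea: the [CLS] token from several intermediate layers is cached, and the cached tokens are fused with learnable weights before classification, so information discarded by token reduction in later layers can still reach the classifier. The class name, weighting scheme, and shapes are assumptions made for this example, not the exact implementation in this repository.

```python
# Minimal sketch of cross-layer aggregation (illustrative, not the repo's exact code).
import torch
import torch.nn as nn

class CrossLayerAggregationHead(nn.Module):  # hypothetical name
    def __init__(self, embed_dim: int, num_layers: int, num_classes: int):
        super().__init__()
        self.norm = nn.LayerNorm(embed_dim)
        # learnable weights over the cached per-layer [CLS] tokens
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, cls_cache: list) -> torch.Tensor:
        # cls_cache: list of (B, D) [CLS] tokens, one per cached layer
        stacked = torch.stack(cls_cache, dim=1)             # (B, L, D)
        weights = self.layer_weights.softmax(dim=0)          # (L,)
        fused = (weights[None, :, None] * stacked).sum(1)    # (B, D)
        return self.fc(self.norm(fused))

# usage sketch
head = CrossLayerAggregationHead(embed_dim=768, num_layers=4, num_classes=200)
cache = [torch.randn(2, 768) for _ in range(4)]
logits = head(cache)  # (2, 200)
```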

As can be observed, CLCA effectively reduces the training instabilities that manifest as large fluctuations in the gradients:

Our method achieves a superior accuracy vs. cost (in terms of FLOPs) trade-off compared to alternatives.

Furthermore, it consistently boosts the performance of token reduction methods across a wide variety of settings (different backbones, keep rates, and datasets):

Compared to state-of-the-art (SotA) UFGIR methods, our method obtains favorable results across multiple datasets at a lower cost:

Pre-trained checkpoints are available on HuggingFace!
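A minimal sketch of downloading a checkpoint with the huggingface_hub library is shown below. The repo id and filename are placeholders; refer to the project's HuggingFace page for the actual checkpoint names.

```python
# Sketch of fetching a released checkpoint from the Hugging Face Hub.
from huggingface_hub import hf_hub_download
import torch

ckpt_path = hf_hub_download(
    repo_id="arkel23/CLCA",            # placeholder repo id
    filename="soygene_evit_clca.pth",  # placeholder filename
)
state_dict = torch.load(ckpt_path, map_location="cpu")
```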

Requirements

Requirements can be found in the requirements.txt file.

Datasets

We use the UFGIR leaves datasets from:

Xiaohan Yu, Yang Zhao, Yongsheng Gao, Xiaohui Yuan, Shengwu Xiong (2021). Benchmark Platform for Ultra-Fine-Grained Visual Categorization Beyond Human Performance. In ICCV 2021.
https://github.com/XiaohanYu-GU/Ultra-FGVC?tab=readme-ov-file

Training

To fine-tune a ViT B-16 model pretrained in CLIP style (on the LAION-2B dataset) on the SoyGene dataset with EViT token reduction, a keep rate of 0.1, and reduction at layers 4, 7, and 10, using the proposed CLCA:

```bash
python train.py --num_workers 24 --reduction_loc 3 6 9 --serial 30 --input-size 448 --ifa_head --clc --num_clr 1 --train_trainval --seed 1 --cfg configs/soygene_ft_weakaugs.yaml --model evit_vit_base_patch16_clip_224.laion2b --lr 0.0001 --keep_rate 0.1
```
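For reference, the sketch below illustrates the kind of EViT-style pruning that the --keep_rate and --reduction_loc flags control: at each reduction layer, only the top-k patch tokens ranked by their attention from the [CLS] token are kept (EViT's additional fused "inattentive" token is omitted here). The function name and shapes are illustrative assumptions, not the repository's actual implementation.

```python
# Illustrative sketch of EViT-style token reduction at one layer.
import torch

def evit_keep_tokens(tokens: torch.Tensor, cls_attn: torch.Tensor, keep_rate: float) -> torch.Tensor:
    # tokens:   (B, 1 + N, D)  [CLS] token followed by N patch tokens
    # cls_attn: (B, N)         attention from [CLS] to each patch token
    B, n_plus_1, D = tokens.shape
    num_keep = max(1, int(keep_rate * (n_plus_1 - 1)))
    idx = cls_attn.topk(num_keep, dim=1).indices                   # (B, num_keep)
    idx = idx.unsqueeze(-1).expand(-1, -1, D)                      # (B, num_keep, D)
    kept_patches = torch.gather(tokens[:, 1:], dim=1, index=idx)   # (B, num_keep, D)
    return torch.cat([tokens[:, :1], kept_patches], dim=1)         # (B, 1 + num_keep, D)

# e.g. keep_rate=0.1 keeps roughly 10% of the patch tokens at each reduction layer
x = torch.randn(2, 197, 768)    # ViT-B/16 at 224x224: 1 [CLS] + 196 patches
attn = torch.rand(2, 196)
out = evit_keep_tokens(x, attn, keep_rate=0.1)  # (2, 1 + 19, 768)
```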

Acknowledgements and Code Credits

We thank NYCU's HPC Center and the National Center for High-performance Computing (NCHC) for providing computational and storage resources.

We also thank Weights & Biases for their experiment-management platform.

This repository is based on Which Tokens to Use? Investigating Token Reduction in Vision Transformers.

Specifically, we extended most token reduction methods to support the proposed CLCA and a wider variety of backbones aside from DeiT, as implemented in the timm library.

We also added support for a wider variety of (U)FGIR datasets, parameter-efficient fine-tuning, and the H2T and VQT methods.

The code for H2T and VQT is based on the official implementations:

The original implementation is based on the following:

The token reduction method code is based on and inspired by:

Parts of the training code and a large part of the ViT implementation are based on and inspired by:

Parts of the analysis code are based on and inspired by:

License

The code is licensed under an MIT License, with the exception of the aforementioned code credits, which follow the licenses of their original authors.

Bibtex

@misc{rios_cross-layer_2024,
	title = {Cross-{Layer} {Cache} {Aggregation} for {Token} {Reduction} in {Ultra}-{Fine}-{Grained} {Image} {Recognition}},
	copyright = {All rights reserved},
	doi = {10.48550/arXiv.2501.00243},
	url = {http://arxiv.org/abs/2501.00243},
	publisher = {arXiv},
	author = {Rios, Edwin Arkel and Yuanda, Jansen Christopher and Ghanz, Vincent Leon and Yu, Cheng-Wei and Lai, Bo-Cheng and Hu, Min-Chun},
	month = dec,
	year = {2024},
	annote = {Comment: Accepted to ICASSP 2025. Main: 5 pages, 4 figures, 1 table},
}
