
Mech Interp for ViT-L/14 fine-tune with 'mitigated' register tokens (via MLP Gates)? #178

@zer0int


Hi everybody,

First of all, thank you for this amazing research, paper & code!

I modified and fine-tuned OpenAI's ViT-L/14 after reading results from Vision Transformers Need Registers.

The model has 4 extra tokens in the ViT, but that alone wasn't enough for a mere fine-tune (vs. training with register tokens from scratch).
Hence, I implemented 'Fusion MLP Gates' for the late layers in which register tokens emerge [+20M params].
Judging by attention visualizations, this does indeed seem to have mitigated the issue of 'burnt-in background patch attention':

[Image: attention visualizations showing mitigated background-patch attention]
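The exact architecture of the gates isn't spelled out in this post (see the linked repo for the real implementation), but a gated fusion of register-token context into a late layer's patch tokens might look roughly like this sketch. All names here (`FusionMLPGate`, the mean-pooled register context, the sigmoid gate) are my assumptions, not the author's actual code:

```python
import torch
import torch.nn as nn


class FusionMLPGate(nn.Module):
    """Hypothetical sketch: blend register-token context into a late
    transformer layer's patch tokens via a learned, gated MLP."""

    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.GELU(),
            nn.Linear(dim, dim),
        )
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, patches: torch.Tensor, registers: torch.Tensor) -> torch.Tensor:
        # patches: (B, N, D) patch tokens; registers: (B, R, D) register tokens
        ctx = registers.mean(dim=1, keepdim=True).expand_as(patches)  # (B, N, D)
        fused = torch.cat([patches, ctx], dim=-1)                     # (B, N, 2D)
        g = torch.sigmoid(self.gate(fused))                           # per-dim gate in (0, 1)
        return patches + g * self.mlp(fused)                          # gated residual update
```

The residual form keeps the pre-trained patch representations intact at initialization, which matters for a fine-tune rather than training from scratch.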

Unfortunately, I am too 'GPU poor'** to apply your method myself. But my model inherits the original MIT license, so in case anybody is interested: anything goes!

huggingface.co/zer0int/CLIP-Registers-Gated_MLP-ViT-L-14

My code for modifying & fine-tuning the model, likewise with MIT license:

github.com/zer0int/CLIP-fine-tune-registers-gated

**: I fine-tuned the model on COCO-SPRIGHT 40k with a batch size of ~40 on a single RTX 4090, using Geometric Parametrization of the linear layers (.weight -> .theta, .r) to prevent the otherwise inevitable overfitting caused by the tiny batch size.
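As a rough illustration of the `.weight -> .theta, .r` decomposition mentioned above: a minimal sketch of a geometrically parametrized linear layer, assuming each weight row is split into a direction (`theta`) and a scalar length (`r`), so the effective weight is `W[i] = r[i] * theta[i] / ||theta[i]||`. The class name and initialization details are my assumptions, not the repo's actual code:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class GeoLinear(nn.Module):
    """Sketch: linear layer whose weight is stored as per-row
    direction (theta) and length (r) instead of a raw .weight."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        w = torch.empty(out_features, in_features)
        nn.init.kaiming_uniform_(w, a=math.sqrt(5))   # standard nn.Linear init
        self.theta = nn.Parameter(w)                  # direction (unnormalized)
        self.r = nn.Parameter(w.detach().norm(dim=1)) # per-row length
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Reconstruct the effective weight: unit rows scaled by r.
        w = self.r.unsqueeze(1) * F.normalize(self.theta, dim=1)
        return F.linear(x, w, self.bias)
```

Decoupling length from direction like this changes the optimization geometry, which is plausibly why it helps stabilize fine-tuning at tiny batch sizes.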

The model outperforms the pre-trained ViT-L/14 on MVT ImageNet/ObjectNet zero-shot: accuracy ~84.5% -> ~88%.
VOC-2007 multilabel (LAION CLIP benchmark): ~76% -> ~85%.
Notably, the modality gap was also reduced, from ~0.82 (pre-trained) to ~0.54 (fine-tuned).
More results on HF (link above; scroll to the very bottom for an overview table of benchmarks).
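For reference, the modality-gap numbers above are consistent with one common definition of the gap: the Euclidean distance between the centroids of the L2-normalized image and text embeddings. A minimal sketch, assuming that definition (the post doesn't state which one was used):

```python
import torch


def modality_gap(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> float:
    """Distance between the centroids of L2-normalized image and text
    embeddings -- one common way to measure CLIP's modality gap."""
    img = torch.nn.functional.normalize(img_emb, dim=-1)
    txt = torch.nn.functional.normalize(txt_emb, dim=-1)
    return (img.mean(dim=0) - txt.mean(dim=0)).norm().item()
```

Since the embeddings are unit-normalized, the gap is bounded by 2; a drop from ~0.82 to ~0.54 means the two modalities' embedding clouds moved noticeably closer on the hypersphere.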

PS: If anybody has an idea how to implement this model for use with the HuggingFace Transformers library, please let me know. It seems "max_position_embeddings" is only valid for the Text Encoder (?), and there are also extra keys (the Fusion MLP layers). For now, the model is only compatible with the OpenAI/CLIP code, i.e. 'import clip' - although I added .safetensors compatibility [see git repo linked above] and am offering the model as .safetensors, in addition to the original 'danger pickle'. :-)
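For anyone attempting a Transformers port: the core obstacle is the extra state-dict keys. As a toy illustration (not the actual CLIP classes), PyTorch's `load_state_dict(strict=False)` reports exactly which keys don't line up, which is a useful first step for mapping the Fusion MLP weights onto a new architecture. The `fusion_gate` key name here is hypothetical:

```python
import torch
import torch.nn as nn

# A plain module standing in for the target architecture.
base = nn.Linear(4, 4)

# A checkpoint with one extra key, analogous to the Fusion MLP layers.
state = base.state_dict()
state["fusion_gate.weight"] = torch.zeros(4, 4)  # hypothetical extra key

# strict=False loads the matching keys and reports the rest.
result = base.load_state_dict(state, strict=False)
print(result.unexpected_keys)  # → ['fusion_gate.weight']
```

A Transformers port would then need a model class that actually declares those extra modules, so the keys load strictly instead of being dropped.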

Kind regards!
