Hi everybody,
first of all - thank you for this amazing research / paper & code!
I modified and fine-tuned OpenAI's ViT-L/14 after reading results from Vision Transformers Need Registers.
The model has 4 extra register tokens in the ViT, but merely adding them and fine-tuning wasn't enough (vs. training with extra register tokens from scratch).
Hence, I implemented 'Fusion MLP Gates' for the late layers in which register tokens emerge [+20M params].
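In case it helps, here is a minimal sketch of the general idea (not my exact code - that is in the repo linked below; the class name FusionGate, the hidden size, and the initialization are illustrative): learnable register tokens are appended to the ViT token sequence, and a gated fusion MLP blends a summary of the registers back into the patch tokens in the late layers:

```python
import torch
import torch.nn as nn

class FusionGate(nn.Module):
    """Blend register information back into patch tokens via a gated MLP (illustrative)."""
    def __init__(self, dim: int, hidden: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim * 2, hidden),
            nn.GELU(),
            nn.Linear(hidden, dim),
        )
        # Learnable scalar gate, initialized at 0 so training starts from the
        # unmodified pre-trained features.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, patch_tokens, register_tokens):
        # patch_tokens: [B, N, D], register_tokens: [B, R, D]
        reg_summary = register_tokens.mean(dim=1, keepdim=True)  # [B, 1, D]
        fused = self.mlp(torch.cat(
            [patch_tokens, reg_summary.expand_as(patch_tokens)], dim=-1))
        return patch_tokens + self.gate * fused

# 4 learnable register tokens appended to the token sequence of ViT-L/14 (width 1024)
registers = nn.Parameter(torch.zeros(1, 4, 1024))
nn.init.normal_(registers, std=0.02)
```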
Judging by attention visualizations, this indeed seems to have mitigated the 'burnt-in background patch attention' issue.
Unfortunately, I am too 'GPU-poor'** to apply your method myself. But my model inherited the original MIT license, so in case anybody is interested: anything goes!
huggingface.co/zer0int/CLIP-Registers-Gated_MLP-ViT-L-14
My code for modifying & fine-tuning the model, likewise with MIT license:
github.com/zer0int/CLIP-fine-tune-registers-gated
**: I fine-tuned the model on COCO-SPRIGHT 40k with a batch size of ~40 on 1x RTX 4090. I used Geometric Parametrization of the linear layers (.weight -> .theta, .r) to prevent the otherwise inevitable overfitting due to the tiny batch size.
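To spell out what I mean by Geometric Parametrization, a hedged sketch (the class name GeometricLinear is illustrative; my actual training code is in the repo linked above): each Linear weight matrix is re-parametrized row-wise into a magnitude .r and a unit direction .theta, and the effective weight is rebuilt as r * theta on every forward pass:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometricLinear(nn.Module):
    """Wrap an nn.Linear, learning per-row magnitude r and direction theta (illustrative)."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        w = linear.weight.data
        norms = w.norm(dim=1, keepdim=True)
        self.r = nn.Parameter(norms.clone())   # radial component, shape [out, 1]
        self.theta = nn.Parameter(w / norms)   # angular component, shape [out, in]
        self.bias = linear.bias

    def forward(self, x):
        # Only the direction of theta matters; re-normalize on each forward pass.
        direction = self.theta / self.theta.norm(dim=1, keepdim=True)
        return F.linear(x, self.r * direction, self.bias)
```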
The model outperforms the pre-trained ViT-L/14 on MVT ImageNet/ObjectNet zero-shot accuracy: ~84.5% -> ~88%.
VOC-2007 multilabel (CLIP benchmark, LAION): ~76% -> ~85%.
Notably, the modality gap was also reduced, from ~0.82 (pre-trained) to ~0.54 (fine-tuned).
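For reference, 'modality gap' here is along the lines of the common centroid-distance definition from 'Mind the Gap' (Liang et al., 2022): the Euclidean distance between the means of L2-normalized image and text embeddings over a paired evaluation set. A minimal sketch of that measurement:

```python
import torch
import torch.nn.functional as F

def modality_gap(image_embs: torch.Tensor, text_embs: torch.Tensor) -> float:
    """Euclidean distance between the centroids of L2-normalized embeddings."""
    img_centroid = F.normalize(image_embs, dim=-1).mean(dim=0)
    txt_centroid = F.normalize(text_embs, dim=-1).mean(dim=0)
    return (img_centroid - txt_centroid).norm().item()
```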
More results on HF (link above, scroll to the very bottom for a table / overview of benchmarks).
PS: If anybody has an idea of how to make this model work with the HuggingFace Transformers library, please let me know; it seems "max_position_embeddings" only applies to the Text Encoder (?), and the checkpoint has extra keys (the Fusion MLP layers). For now, it's only compatible with the OpenAI/CLIP code, i.e. 'import clip' - although I added .safetensors compatibility [see git repo linked above] and am offering the model as .safetensors, in addition to the original 'danger pickle'. :-)
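If anyone wants to see why stock loaders choke without running any CLIP code, inspecting the .safetensors keys shows the extra Fusion MLP / register entries (the filename and key substrings below are placeholders, not the actual names):

```python
from safetensors.torch import load_file

# Placeholder filename; substring matching is a heuristic for spotting the extra keys.
state = load_file("ViT-L-14-REG-GATED.safetensors")
extra = [k for k in state if "fusion" in k.lower() or "register" in k.lower()]
print(f"{len(extra)} keys beyond vanilla CLIP ViT-L/14, e.g. {extra[:5]}")
```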
Kind regards!
