
Mech Interp for ViT-L/14 fine-tune with 'mitigated' register tokens (via MLP Gates)? #178

@zer0int


Hi everybody,

First of all, thank you for this amazing research, paper & code!

I modified and fine-tuned OpenAI's ViT-L/14 after reading results from Vision Transformers Need Registers.

The model has 4 extra tokens in the ViT, but that alone wasn't enough for a mere fine-tune (vs. training with register tokens from scratch).
Hence, I implemented 'Fusion MLP Gates' for the late layers in which register tokens emerge [+20M params].
Judging by attention visualizations, this does indeed seem to have mitigated the issue of 'burnt-in background patch attention':

[Image: attention visualizations showing mitigated background-patch attention]
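The exact architecture of the gates isn't spelled out in this post (see the linked repo for the real implementation), but a gated fusion of register-token context into a late layer's patch tokens might look roughly like this sketch. All names here (`FusionMLPGate`, the mean-pooled register context, the sigmoid gate) are my assumptions, not the author's actual code:

```python
import torch
import torch.nn as nn


class FusionMLPGate(nn.Module):
    """Hypothetical sketch: blend register-token context into a late
    transformer layer's patch tokens via a learned, gated MLP."""

    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.GELU(),
            nn.Linear(dim, dim),
        )
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, patches: torch.Tensor, registers: torch.Tensor) -> torch.Tensor:
        # patches: (B, N, D) patch tokens; registers: (B, R, D) register tokens
        ctx = registers.mean(dim=1, keepdim=True).expand_as(patches)  # (B, N, D)
        fused = torch.cat([patches, ctx], dim=-1)                     # (B, N, 2D)
        g = torch.sigmoid(self.gate(fused))                           # per-dim gate in (0, 1)
        return patches + g * self.mlp(fused)                          # gated residual update
```

The residual form keeps the pre-trained patch representations intact at initialization, which matters for a fine-tune rather than training from scratch.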

Unfortunately, I am too 'GPU poor'** to apply your method myself. But my model inherits the original MIT license, so in case anybody is interested: anything goes!

huggingface.co/zer0int/CLIP-Registers-Gated_MLP-ViT-L-14

My code for modifying & fine-tuning the model, likewise with MIT license:

github.com/zer0int/CLIP-fine-tune-registers-gated

**: I fine-tuned the model on COCO-SPRIGHT 40k with a batch size of ~40 on a single RTX 4090, using Geometric Parametrization of the linear layers (.weight -> .theta, .r) to prevent the otherwise inevitable overfitting caused by the tiny batch size.
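As a rough illustration of the `.weight -> .theta, .r` decomposition mentioned above: a minimal sketch of a geometrically parametrized linear layer, assuming each weight row is split into a direction (`theta`) and a scalar length (`r`), so the effective weight is `W[i] = r[i] * theta[i] / ||theta[i]||`. The class name and initialization details are my assumptions, not the repo's actual code:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class GeoLinear(nn.Module):
    """Sketch: linear layer whose weight is stored as per-row
    direction (theta) and length (r) instead of a raw .weight."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        w = torch.empty(out_features, in_features)
        nn.init.kaiming_uniform_(w, a=math.sqrt(5))   # standard nn.Linear init
        self.theta = nn.Parameter(w)                  # direction (unnormalized)
        self.r = nn.Parameter(w.detach().norm(dim=1)) # per-row length
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Reconstruct the effective weight: unit rows scaled by r.
        w = self.r.unsqueeze(1) * F.normalize(self.theta, dim=1)
        return F.linear(x, w, self.bias)
```

Decoupling length from direction like this changes the optimization geometry, which is plausibly why it helps stabilize fine-tuning at tiny batch sizes.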

The model outperforms the pre-trained ViT-L/14 on MVT ImageNet/ObjectNet zero-shot: accuracy ~84.5% -> ~88%.
VOC-2007 multilabel (LAION CLIP benchmark): ~76% -> ~85%.
Notably, the modality gap was also reduced, from ~0.82 (pre-trained) to ~0.54 (fine-tuned).
More results on HF (link above; scroll to the very bottom for an overview table of benchmarks).
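For reference, the modality-gap numbers above are consistent with one common definition of the gap: the Euclidean distance between the centroids of the L2-normalized image and text embeddings. A minimal sketch, assuming that definition (the post doesn't state which one was used):

```python
import torch


def modality_gap(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> float:
    """Distance between the centroids of L2-normalized image and text
    embeddings -- one common way to measure CLIP's modality gap."""
    img = torch.nn.functional.normalize(img_emb, dim=-1)
    txt = torch.nn.functional.normalize(txt_emb, dim=-1)
    return (img.mean(dim=0) - txt.mean(dim=0)).norm().item()
```

Since the embeddings are unit-normalized, the gap is bounded by 2; a drop from ~0.82 to ~0.54 means the two modalities' embedding clouds moved noticeably closer on the hypersphere.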

PS: If anybody has an idea how to implement this model for use with the HuggingFace Transformers library, please let me know. It seems "max_position_embeddings" is only valid for the Text Encoder (?), and there are also extra keys (the Fusion MLP layers). For now, the model is only compatible with the OpenAI/CLIP code, i.e. 'import clip' - although I added .safetensors compatibility [see git repo linked above] and am offering the model as .safetensors, in addition to the original 'danger pickle'. :-)
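For anyone attempting a Transformers port: the core obstacle is the extra state-dict keys. As a toy illustration (not the actual CLIP classes), PyTorch's `load_state_dict(strict=False)` reports exactly which keys don't line up, which is a useful first step for mapping the Fusion MLP weights onto a new architecture. The `fusion_gate` key name here is hypothetical:

```python
import torch
import torch.nn as nn

# A plain module standing in for the target architecture.
base = nn.Linear(4, 4)

# A checkpoint with one extra key, analogous to the Fusion MLP layers.
state = base.state_dict()
state["fusion_gate.weight"] = torch.zeros(4, 4)  # hypothetical extra key

# strict=False loads the matching keys and reports the rest.
result = base.load_state_dict(state, strict=False)
print(result.unexpected_keys)  # → ['fusion_gate.weight']
```

A Transformers port would then need a model class that actually declares those extra modules, so the keys load strictly instead of being dropped.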

Kind regards!
