feat: add SuperCLIP (NeurIPS 2025) implementation #1127
Summary
This PR integrates SuperCLIP, our work accepted at NeurIPS 2025, into the OpenCLIP framework.
SuperCLIP is a simple yet highly effective improvement over CLIP: by adding only a lightweight linear layer and introducing classification-based supervision, it enables CLIP to recover fine-grained semantic signals that contrastive learning typically overlooks.
SuperCLIP requires no additional annotated data, adds only 0.077% extra FLOPs, and substantially reduces CLIP's dependence on very large training batch sizes.
Overall, SuperCLIP delivers consistent and substantial gains across zero-shot classification, image-text retrieval, and purely visual tasks.
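To make the mechanism concrete for reviewers, here is a minimal sketch of the core idea described above: a single linear layer on top of CLIP's image embedding, trained with an auxiliary classification loss next to the usual contrastive loss. The names (`SuperCLIPHead`, `superclip_loss`, `alpha`) and the target construction (multi-hot labels built from the caption's own token ids, which is one way no extra annotation would be needed) are illustrative assumptions, not the paper's exact recipe; `superclip_model.py` and `loss.py` in this diff are authoritative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SuperCLIPHead(nn.Module):
    """Hypothetical lightweight head: one linear layer, the only new parameters."""

    def __init__(self, embed_dim: int, vocab_size: int):
        super().__init__()
        self.classifier = nn.Linear(embed_dim, vocab_size)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # (batch, embed_dim) -> (batch, vocab_size) per-token logits
        return self.classifier(embeddings)


def superclip_loss(image_emb, text_emb, token_ids, head, logit_scale, alpha=1.0):
    """Contrastive InfoNCE plus an assumed multi-label classification term."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Standard CLIP contrastive loss over the batch.
    logits = logit_scale * image_emb @ text_emb.t()
    labels = torch.arange(logits.size(0), device=logits.device)
    contrastive = (F.cross_entropy(logits, labels)
                   + F.cross_entropy(logits.t(), labels)) / 2

    # Assumed target construction: a multi-hot vector marking which vocabulary
    # tokens occur in each caption (free labels, no extra annotation).
    vocab_size = head.classifier.out_features
    targets = torch.zeros(token_ids.size(0), vocab_size, device=token_ids.device)
    targets.scatter_(1, token_ids, 1.0)  # a real version would mask padding ids

    # Classification supervision on the image embedding.
    classification = F.binary_cross_entropy_with_logits(head(image_emb), targets)

    return contrastive + alpha * classification
```

Under this reading, the only new parameters live in the single `nn.Linear`, which is consistent with the near-zero FLOPs overhead claimed above.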
Why this matters
Despite CLIP’s strong global alignment, it struggles with fine-grained semantics such as object states, spatial relations, and actions.
As shown in Figure 1 of the paper, SuperCLIP significantly improves such distinctions with almost no architectural overhead.
Key advantages
Overall, SuperCLIP provides stronger fine-grained visual–text alignment at effectively zero cost.
What’s included
- `superclip_model.py` (complete SuperCLIP architecture)
- `SuperCLIP-ViT-B-16.json` and `SuperCLIP-ViT-L-16.json` (model configs; see the loading sketch below)
- updates to `factory.py` and `__init__.py`
- updates to `transformer.py` and `loss.py` to support classifier-based supervision

All components remain fully optional and do not affect existing CLIP models.
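For reference, the new models are meant to load through the standard OpenCLIP factory path; below is a minimal usage sketch, assuming the model name matches the JSON config filename as is the OpenCLIP convention (weights would come from a local checkpoint or a later `pretrained` tag):

```python
import torch
from PIL import Image
import open_clip

# 'SuperCLIP-ViT-B-16' matches the config JSON added in this PR.
model, _, preprocess = open_clip.create_model_and_transforms('SuperCLIP-ViT-B-16')
tokenizer = open_clip.get_tokenizer('SuperCLIP-ViT-B-16')

image = preprocess(Image.open('example.jpg')).unsqueeze(0)
text = tokenizer(['a photo of a cat', 'a photo of a dog'])

with torch.no_grad():
    # The inference path is unchanged relative to stock CLIP models.
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
```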
Notes
We are happy to make any structural adjustments needed to align with OpenCLIP conventions.
Reference implementation: https://github.com/hustvl/SuperCLIP