
@SleepingWhz

Summary

This PR integrates SuperCLIP, our NeurIPS 2025 accepted work, into the OpenCLIP framework.
SuperCLIP is a simple yet highly effective improvement over CLIP: by adding only a lightweight linear layer and introducing classification-based supervision, it enables CLIP to recover fine-grained semantic signals that contrastive learning typically overlooks.

SuperCLIP requires no additional annotated data, increases computation by only 0.077% FLOPs, and also greatly reduces CLIP’s dependence on extremely large batch sizes.
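For readers skimming the diff, here is a minimal sketch of the idea as we read it from the description above: a single linear layer maps the pooled image embedding to vocabulary-sized logits, and the supervision comes for free from which tokens appear in the paired caption. The names (`SuperCLIPHead`, `token_presence_targets`) and the multi-hot target construction are illustrative assumptions, not the literal code in this PR; see the reference implementation for exact details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SuperCLIPHead(nn.Module):
    """Hypothetical name for the only added module: one linear classifier
    over the pooled image embedding (matching the tiny FLOPs overhead
    described above)."""
    def __init__(self, embed_dim: int, vocab_size: int):
        super().__init__()
        self.classifier = nn.Linear(embed_dim, vocab_size)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        return self.classifier(image_features)  # [batch, vocab_size]

def token_presence_targets(text_tokens: torch.Tensor, vocab_size: int) -> torch.Tensor:
    """Derive classification labels from the caption itself (no extra
    annotation): a multi-hot vector marking which token ids occur.
    In practice, padding/special tokens would need to be masked out."""
    targets = torch.zeros(text_tokens.size(0), vocab_size, device=text_tokens.device)
    targets.scatter_(1, text_tokens, 1.0)
    return targets

def classification_loss(cls_logits: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
    targets = token_presence_targets(text_tokens, cls_logits.size(1))
    return F.binary_cross_entropy_with_logits(cls_logits, targets)
```

One plausible reading of the small-batch robustness claim: per-sample token supervision like this gives every example a dense learning signal that does not depend on the batch's negatives.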

Overall, SuperCLIP delivers consistent and substantial gains across zero-shot classification, image-text retrieval, and purely visual tasks.

[teaser figure]

Why this matters

Despite CLIP’s strong global alignment, it struggles with fine-grained semantics such as object states, spatial relations, and actions.
As shown in Figure 1 of the paper, SuperCLIP significantly improves such distinctions with almost no architectural overhead.

Key advantages

  • Large gains without extra labeled data
  • +3% to +5% zero-shot classification improvement for ViT-B/L at 512M-scale training (Table 2)
  • Up to +5.4% COCO/Flickr retrieval improvement (Table 2)
  • Robust to long, rich captions: effectively utilizes detailed captions where CLIP degrades (Table 5)
  • Enhances purely visual tasks like PASCAL/ADE20K segmentation and NYUv2 depth (Table 7)
  • Alleviates small-batch degradation, maintaining accuracy where CLIP collapses (Figure 4)

Overall, SuperCLIP provides stronger fine-grained visual–text alignment at effectively zero cost.


What’s included

  • Added superclip_model.py (complete SuperCLIP architecture)
  • Added configs:
    • SuperCLIP-ViT-B-16.json
    • SuperCLIP-ViT-L-16.json
  • Registered the new models in factory.py and __init__.py
  • Modified transformer.py and loss.py to support classifier-based supervision (see the loss sketch below)

All components remain fully optional and do not affect existing CLIP models.
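To make the loss.py change concrete, below is a hedged sketch of how classifier-based supervision could compose with OpenCLIP's existing ClipLoss: the standard contrastive term plus a weighted classification term. The subclass name, the `cls_weight` knob, and the extra `forward()` arguments are our illustrative assumptions; the PR's actual wiring may differ.

```python
import torch.nn.functional as F
from open_clip.loss import ClipLoss

class SuperCLIPLoss(ClipLoss):
    """Sketch: standard CLIP contrastive loss plus an optional
    classification term from the linear head. `cls_weight` is an
    assumed hyperparameter, not necessarily one from the paper."""
    def __init__(self, cls_weight: float = 1.0, **kwargs):
        super().__init__(**kwargs)
        self.cls_weight = cls_weight

    def forward(self, image_features, text_features, logit_scale,
                cls_logits=None, cls_targets=None, output_dict=False):
        contrastive = super().forward(image_features, text_features, logit_scale)
        if cls_logits is None:
            # No classifier head attached: behaves exactly like ClipLoss.
            return contrastive
        cls_loss = F.binary_cross_entropy_with_logits(cls_logits, cls_targets)
        total = contrastive + self.cls_weight * cls_loss
        if output_dict:
            return {"contrastive_loss": contrastive, "cls_loss": cls_loss}
        return total
```

Keeping the classification term behind optional arguments, as in this sketch, is consistent with the claim above that existing CLIP models are unaffected when the head is absent.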


Notes

We are happy to make any structural adjustments needed to align with OpenCLIP conventions.
Reference implementation: https://github.com/hustvl/SuperCLIP

@speedinghzl

Hi @rwightman

Would appreciate it if you could take a look at this PR when you have a chance!

The change is minimal - just adding a single linear layer on top of CLIP - but we're seeing substantial performance gains without requiring any additional labeled data. Happy to discuss the approach or provide more details if needed.

Thanks!

@rwightman
Collaborator

rwightman commented Dec 12, 2025

@SleepingWhz @speedinghzl I have taken a look, but it's too much code to take on as is. I haven't taken time to figure out what the 'minimal' required addition would be that has low regression risk.

Also, there don't appear to be any pretrained weights available?

