
@SleepingWhz

Summary

This PR integrates SuperCLIP, our NeurIPS 2025 accepted work, into the OpenCLIP framework.
SuperCLIP is a simple yet highly effective improvement over CLIP: by adding only a lightweight linear layer and introducing classification-based supervision, it enables CLIP to recover fine-grained semantic signals that contrastive learning typically overlooks.

SuperCLIP requires no additional annotated data, increases computation by only 0.077% FLOPs, and also greatly reduces CLIP’s dependence on extremely large batch sizes.
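For readers skimming the diff, here is a minimal sketch of the idea as we read it from the description above: a single linear layer maps the pooled image embedding to vocabulary-sized logits, and the supervision comes for free from which tokens appear in the paired caption. The names (`SuperCLIPHead`, `token_presence_targets`) and the multi-hot target construction are illustrative assumptions, not the literal code in this PR; see the reference implementation for exact details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SuperCLIPHead(nn.Module):
    """Hypothetical name for the only added module: one linear classifier
    over the pooled image embedding (matching the tiny FLOPs overhead
    described above)."""
    def __init__(self, embed_dim: int, vocab_size: int):
        super().__init__()
        self.classifier = nn.Linear(embed_dim, vocab_size)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        return self.classifier(image_features)  # [batch, vocab_size]

def token_presence_targets(text_tokens: torch.Tensor, vocab_size: int) -> torch.Tensor:
    """Derive classification labels from the caption itself (no extra
    annotation): a multi-hot vector marking which token ids occur.
    In practice, padding/special tokens would need to be masked out."""
    targets = torch.zeros(text_tokens.size(0), vocab_size, device=text_tokens.device)
    targets.scatter_(1, text_tokens, 1.0)
    return targets

def classification_loss(cls_logits: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
    targets = token_presence_targets(text_tokens, cls_logits.size(1))
    return F.binary_cross_entropy_with_logits(cls_logits, targets)
```

One plausible reading of the small-batch robustness claim: per-sample token supervision like this gives every example a dense learning signal that does not depend on the batch's negatives.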

Overall, SuperCLIP delivers consistent and substantial gains across zero-shot classification, image-text retrieval, and purely visual tasks.

[teaser figure]

Why this matters

Despite CLIP’s strong global alignment, it struggles with fine-grained semantics such as object states, spatial relations, and actions.
As shown in Figure 1 of the paper, SuperCLIP significantly improves such distinctions with almost no architectural overhead.

Key advantages

  • Large gains without extra labeled data
  • +3% to +5% zero-shot classification improvement for ViT-B/L at 512M-scale training (Table 2)
  • Up to +5.4% COCO/Flickr retrieval improvement (Table 2)
  • Robust to long, rich captions: effectively utilizes detailed captions where CLIP degrades (Table 5)
  • Enhances purely visual tasks like PASCAL/ADE20K segmentation and NYUv2 depth (Table 7)
  • Alleviates small-batch degradation, maintaining accuracy where CLIP collapses (Figure 4)

Overall, SuperCLIP provides stronger fine-grained visual–text alignment at effectively zero cost.


What’s included

  • Added superclip_model.py (complete SuperCLIP architecture)
  • Added configs:
    • SuperCLIP-ViT-B-16.json
    • SuperCLIP-ViT-L-16.json
  • Registered the new models in factory.py and __init__.py
  • Modified transformer.py and loss.py to support classifier-based supervision (see the loss sketch below)

All components remain fully optional and do not affect existing CLIP models.
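To make the loss.py change concrete, below is a hedged sketch of how classifier-based supervision could compose with OpenCLIP's existing ClipLoss: the standard contrastive term plus a weighted classification term. The subclass name, the `cls_weight` knob, and the extra `forward()` arguments are our illustrative assumptions; the PR's actual wiring may differ.

```python
import torch.nn.functional as F
from open_clip.loss import ClipLoss

class SuperCLIPLoss(ClipLoss):
    """Sketch: standard CLIP contrastive loss plus an optional
    classification term from the linear head. `cls_weight` is an
    assumed hyperparameter, not necessarily one from the paper."""
    def __init__(self, cls_weight: float = 1.0, **kwargs):
        super().__init__(**kwargs)
        self.cls_weight = cls_weight

    def forward(self, image_features, text_features, logit_scale,
                cls_logits=None, cls_targets=None, output_dict=False):
        contrastive = super().forward(image_features, text_features, logit_scale)
        if cls_logits is None:
            # No classifier head attached: behaves exactly like ClipLoss.
            return contrastive
        cls_loss = F.binary_cross_entropy_with_logits(cls_logits, cls_targets)
        total = contrastive + self.cls_weight * cls_loss
        if output_dict:
            return {"contrastive_loss": contrastive, "cls_loss": cls_loss}
        return total
```

Keeping the classification term behind optional arguments, as in this sketch, is consistent with the claim above that existing CLIP models are unaffected when the head is absent.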


Notes

We are happy to make any structural adjustments needed to align with OpenCLIP conventions.
Reference implementation: https://github.com/hustvl/SuperCLIP

@speedinghzl

Hi @rwightman

Would appreciate it if you could take a look at this PR when you have a chance!

The change is minimal - just adding a single linear layer on top of CLIP - but we're seeing substantial performance gains without requiring any additional labeled data. Happy to discuss the approach or provide more details if needed.

Thanks!

@rwightman
Collaborator

rwightman commented Dec 12, 2025

@SleepingWhz @speedinghzl I have taken a look, but it's too much code to take on as is. I haven't taken time to figure out what the 'minimal' required addition would be that has low regression risk.

Also, there don't appear to be any pretrained weights available?

