creative-graphic-design/longclip-transformers

Long-CLIP

This repository is the official implementation of Long-CLIP

Long-CLIP: Unlocking the Long-Text Capability of CLIP
Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Jiaqi Wang

💡 Highlights

  • 🔥 Long input length: Increases the maximum input length of CLIP from 77 to 248 tokens.
  • 🔥 Strong performance: Improves R@5 on long-caption text-image retrieval by 20% and on traditional text-image retrieval by 6%.
  • 🔥 Plug-and-play: Can be applied directly in any work that requires long-text capability.
  • ✨ Transformers compatible: Integrates with the Hugging Face Transformers ecosystem.
  • 🚀 Easy to use: Load models directly from the Hugging Face Hub with one line of code.

📜 News

🚀 [2026/1/4] Repository restructured with a Transformers-compatible implementation! It now supports installation via pip and loading from the Hugging Face Hub.

🚀 [2024/7/3] Our paper has been accepted by ECCV 2024.

🚀 [2024/7/3] We released the code for using Long-CLIP in SDXL. For details, see SDXL/SDXL.md.

🚀 [2024/5/21] We updated the paper and checkpoints after fixing a bug in DDP, and added results on Urban-1k. Special thanks to @MajorDavidZhang for finding and fixing this bug! Fine-tuning now takes only 0.5 hours on 8 GPUs.

🚀 [2024/5/21] Urban-1k, a scaled-up version of the Urban-200 dataset in the paper, has been released at this page.

🚀 [2024/4/1] The training code is released!

🚀 [2024/3/25] The inference code and models (LongCLIP-B and LongCLIP-L) are released!

🚀 [2024/3/25] The paper is released!

👨‍💻 Todo

  • Training code for Long-CLIP based on OpenAI-CLIP
  • Evaluation code for Long-CLIP
  • Evaluation code for zero-shot classification and text-image retrieval tasks
  • Usage example of Long-CLIP
  • Checkpoints of Long-CLIP
  • Transformers-compatible implementation
  • Hugging Face Hub integration

📁 Repository Structure

Long-CLIP/
├── src/longclip/              # Transformers-compatible implementation (main package)
│   ├── configuration_longclip.py
│   ├── modeling_longclip.py
│   └── processing_longclip.py
├── longclip_original/         # Original CLIP-style implementation
│   ├── model/                 # Core model code
│   └── open_clip_long/        # OpenCLIP-based implementation
├── scripts/                   # Utility scripts
│   ├── convert_longclip_to_hf.py  # Convert .pt to Transformers format
│   └── push_to_hub.py         # Upload models to Hugging Face Hub
├── tests/                     # Test suite
├── train/                     # Training scripts
├── eval/                      # Evaluation scripts
├── SDXL/                      # SDXL integration
└── checkpoints/               # Model checkpoints (.pt files)

🛠️ Usage

Installation

Option 1: Using Transformers (Recommended)

Install via pip with transformers support:

pip install git+https://github.com/creative-graphic-design/longclip-transformers

Or using uv:

uv pip install git+https://github.com/creative-graphic-design/longclip-transformers

Option 2: Development Installation

Clone the repository and install:

git clone https://github.com/creative-graphic-design/longclip-transformers
cd longclip-transformers
uv sync  # or: pip install -e .

To include the original implementation for comparison:

uv sync --group original
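
A quick way to confirm that either installation route worked is to check that the package can be found by Python. The following is a minimal sketch, assuming the installed package exposes the top-level module name `longclip` (as the import statements in this README suggest). `is_installed` is a hypothetical helper for illustration and is not part of the package.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "longclip" is the module name used by the imports in this README.
    print("longclip installed:", is_installed("longclip"))
```

If this prints `False`, the install most likely went into a different environment than the one running the script.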

How to Use

Using Transformers (Recommended)

Load pre-converted models from Hugging Face Hub:

from longclip import LongCLIPModel, LongCLIPProcessor
from PIL import Image
import torch

# Load model and processor from Hub
model = LongCLIPModel.from_pretrained("creative-graphic-design/LongCLIP-B")
processor = LongCLIPProcessor.from_pretrained("creative-graphic-design/LongCLIP-B")

# Prepare inputs
image = Image.open("./img/demo.png")
texts = [
    "A man is crossing the street with a red car parked nearby.",
    "A man is driving a car in an urban scene."
]

# Process and get predictions
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)
    logits_per_image = outputs.logits_per_image
    probs = logits_per_image.softmax(dim=1)

print("Label probs:", probs)
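
The last step above turns the image-text logits into probabilities with a softmax over the candidate captions. As a rough illustration of what that step computes, here is a pure-Python sketch. The logit values are made up for the example and are not outputs of the model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    # Subtracting the maximum avoids overflow in exp() without changing the result.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one image against two captions (illustrative values only).
example_logits = [24.0, 21.0]
example_probs = softmax(example_logits)
print("Label probs:", example_probs)
```

The probabilities sum to 1 across the captions, so they rank the captions relative to each other and are not absolute confidence scores.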

Long text support (up to 248 tokens):

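# Repeat the sentence to build a long caption (the trailing space keeps the repeats separated)
long_text = "A very detailed description of a complex scene with many objects, people, and activities happening simultaneously in an urban environment with buildings, cars, and natural elements. " * 3

inputs = processor(text=long_text, images=image, return_tensors="pt")

# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**inputs)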

Using Original Implementation

If you prefer the original CLIP-style API, download the checkpoints from LongCLIP-B or LongCLIP-L and place them under ./checkpoints:

from longclip_original.model import longclip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = longclip.load("./checkpoints/longclip-B.pt", device=device)

text = longclip.tokenize([
    "A man is crossing the street with a red car parked nearby.",
    "A man is driving a car in an urban scene."
]).to(device)
image = preprocess(Image.open("./img/demo.png")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    logits_per_image = image_features @ text_features.T
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)
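
The snippet above applies the softmax to raw dot products of the two feature sets. The usual CLIP recipe L2-normalizes both sides first and multiplies the resulting cosine similarity by a temperature before the softmax. Below is a minimal pure-Python sketch of that scoring step, using toy vectors and an assumed scale of 100 (an illustrative value, not one read from a Long-CLIP checkpoint). `l2_normalize` and `clip_style_logits` are hypothetical helper names; whether the extra normalization is needed depends on what `encode_image` and `encode_text` return, which this README does not state.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def clip_style_logits(image_feat, text_feats, logit_scale=100.0):
    """Scaled cosine similarity between one image vector and several text vectors."""
    img = l2_normalize(image_feat)
    logits = []
    for text_feat in text_feats:
        txt = l2_normalize(text_feat)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(logit_scale * cosine)
    return logits


# Toy 2-D features: the first text points the same way as the image,
# the second is orthogonal to it.
toy_logits = clip_style_logits([3.0, 4.0], [[6.0, 8.0], [-4.0, 3.0]])
print(toy_logits)
```

Normalizing makes the score depend only on direction, so a feature vector with a large norm cannot dominate the ranking.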

Converting Checkpoints to Transformers Format

If you have .pt checkpoints and want to convert them to Transformers format:

python scripts/convert_longclip_to_hf.py \
    --checkpoint_path checkpoints/longclip-B.pt \
    --output_path ./longclip-base-hf

# Then use with transformers
python -c "from longclip import LongCLIPModel; model = LongCLIPModel.from_pretrained('./longclip-base-hf')"

See scripts/README.md for more details on conversion and uploading to Hugging Face Hub.

Comparison: Transformers vs Original

| Feature | Transformers (Recommended) | Original Implementation |
|---|---|---|
| API Style | HuggingFace standard | CLIP-style |
| Loading | from_pretrained() from Hub | Load from local .pt file |
| Processor | Unified LongCLIPProcessor | Separate tokenizer & preprocessor |
| Integration | Works with transformers ecosystem | Standalone |
| Model Format | SafeTensors/PyTorch | PyTorch only |
| Installation | pip install | Requires manual setup |
| Use Case | Production, easy deployment | Research, legacy compatibility |

Evaluation

Zero-shot classification

To run zero-shot classification on the ImageNet dataset, prepare the data and then run:

cd eval/classification/imagenet
python imagenet.py

Similarly, for the CIFAR datasets, run:

cd eval/classification/cifar
python cifar10.py               #cifar10
python cifar100.py              #cifar100
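
For orientation, zero-shot classification scores each image against one text prompt per class and predicts the class with the highest similarity. Top-1 accuracy is the fraction of images where that prediction matches the label. The sketch below illustrates the metric only, with made-up similarity scores. `predict_class` and `top1_accuracy` are hypothetical names and are not taken from the scripts in eval/classification.

```python
def predict_class(similarities):
    """Return the index of the class prompt with the highest similarity."""
    return max(range(len(similarities)), key=lambda i: similarities[i])


def top1_accuracy(similarity_rows, labels):
    """Fraction of images whose best-matching class prompt equals the true label."""
    correct = sum(
        1 for row, label in zip(similarity_rows, labels)
        if predict_class(row) == label
    )
    return correct / len(labels)


# Three images scored against two class prompts (illustrative values only).
toy_similarities = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
toy_labels = [0, 1, 1]
print("top-1 accuracy:", top1_accuracy(toy_similarities, toy_labels))
```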

Retrieval

To run text-image retrieval on COCO2017 or Flickr30k, prepare the data and then run:

cd eval/retrieval
python coco.py                  #COCO2017
python flickr30k.py             #Flickr30k
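
Retrieval quality is typically reported as Recall@K (the R@5 figure in the highlights is one instance): the fraction of queries whose correct match appears among the K highest-scoring candidates. The sketch below shows the metric in pure Python under a simplifying assumption of exactly one correct candidate per query; datasets with several captions per image need a slightly different match rule. `recall_at_k` is a hypothetical helper and is not the code in eval/retrieval.

```python
def recall_at_k(similarity_rows, ground_truth, k):
    """Fraction of queries whose correct candidate ranks within the top k.

    similarity_rows[i][j] is the score of candidate j for query i, and
    ground_truth[i] is the index of the single correct candidate for query i.
    """
    hits = 0
    for row, target in zip(similarity_rows, ground_truth):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(ground_truth)


# Three queries against three candidates (illustrative values only).
toy_scores = [
    [0.9, 0.5, 0.1],
    [0.2, 0.3, 0.8],
    [0.4, 0.6, 0.5],
]
toy_targets = [0, 1, 2]
print("R@1:", recall_at_k(toy_scores, toy_targets, 1))
print("R@2:", recall_at_k(toy_scores, toy_targets, 2))
```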

Training

Please refer to train/train.md for training details.

⭐ Demos

Long-CLIP-SDXL

Long-caption text-image retrieval

Plug-and-Play text to image generation

Citation

If you find our work helpful for your research, please consider giving a citation:

@article{zhang2024longclip,
        title={Long-CLIP: Unlocking the Long-Text Capability of CLIP},
        author={Beichen Zhang and Pan Zhang and Xiaoyi Dong and Yuhang Zang and Jiaqi Wang},
        journal={arXiv preprint arXiv:2403.15378},
        year={2024}
}
