GitHub - MCG-NJU/CaReBench: A Fine-grained Benchmark for Video Captioning and Retrieval

CaReBench: A Fine-grained Benchmark for Video Captioning and Retrieval

Yifan Xu, Xinhao Li, Yichun Yang, Desen Meng, Rui Huang, Limin Wang

🤗 Model | 🤗 Data ｜ 📑 Paper

📝 Introduction

🌟 CaReBench is a fine-grained benchmark comprising 1,000 high-quality videos with detailed human-annotated captions, including manually separated spatial and temporal descriptions for independent spatiotemporal bias evaluation.

📊 ReBias and CapST Metrics are designed specifically for retrieval and captioning tasks, providing a comprehensive evaluation framework for spatiotemporal understanding in video-language models.

⚡ CaRe: A Unified Baseline for fine-grained video retrieval and captioning, achieving competitive performance through two-stage Supervised Fine-Tuning (SFT). CaRe excels in both generating detailed video descriptions and extracting robust video features.

🚀 State-of-the-art performance on both detailed video captioning and fine-grained video retrieval. CaRe outperforms CLIP-based retrieval models and popular MLLMs in captioning tasks.

🥳 Get Started

Our code is simple to use. Follow the instructions below to get it running.

Prepare

Install the requirements.

pip install -r requirements.txt

Inference

Our framework supports auto-loadable inference for all the MLLMs mentioned in our paper, including CaRe, LLaVA NeXT Video, MiniCPM-V 2.6, InternVL2, Qwen2-VL and Tarsier. You only need to change the checkpoint path, and our model loader will load the matching model automatically.

For Video Captioning Task

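from utils.video import read_frames_decord
from models.modeling_captioners import AutoCaptioner

# Load the captioner from a checkpoint directory.
captioner = AutoCaptioner.from_pretrained('path/to/checkpoints/CaRe-7B')
# Read 32 frames from the demo video.
frames = read_frames_decord(video_path='assets/demo.mp4', num_frames=32)
# Add a batch dimension, then print the caption of the first (only) video.
description = captioner.describe(frames.unsqueeze(0))
print(description[0])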
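The call above asks for num_frames=32 frames from the clip. A common way to choose that many frames is to sample indices uniformly over the video. The sketch below illustrates that idea only: whether read_frames_decord samples this way is an assumption, and uniform_frame_indices is a hypothetical helper, not part of this repository.

```python
from typing import List


def uniform_frame_indices(total_frames: int, num_frames: int) -> List[int]:
    """Pick num_frames indices spread evenly over a clip of total_frames frames.

    Hypothetical helper: it shows uniform sampling in general and is not
    taken from read_frames_decord.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    # Split the clip into num_frames equal segments and take each midpoint.
    segment = total_frames / num_frames
    return [min(total_frames - 1, int(segment * (i + 0.5))) for i in range(num_frames)]


indices = uniform_frame_indices(total_frames=320, num_frames=32)
print(indices[:3], indices[-1])
```

If the clip has fewer frames than requested, this scheme repeats indices rather than failing, which keeps the output length fixed.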

For Video Retrieval Task

from utils.video import read_frames_decord
from models.modeling_encoders import AutoEncoder
from torch.nn.functional import cosine_similarity

# Load the encoder from a checkpoint directory.
encoder = AutoEncoder.from_pretrained('path/to/checkpoints/CaRe-7B')
# Read 32 frames from the demo video.
frames = read_frames_decord(video_path='assets/demo.mp4', num_frames=32)
text = "This video features a man slicing tomatoes in the kitchen."
# Embed the video (with a batch dimension added) and the text query.
vision_emb = encoder.encode_vision(frames.unsqueeze(0))
text_emb = encoder.encode_text(text)
print(f'Vision embedding shape: {vision_emb.shape}')
print(f'Text embedding shape: {text_emb.shape}')
# Higher cosine similarity means the text matches the video better.
print(f'Cosine similarity: {cosine_similarity(vision_emb, text_emb)}')
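Retrieval usually means comparing one video against several candidate captions and ranking them by similarity. The sketch below shows that ranking step in plain Python so it runs without a checkpoint. The small hand-written vectors are assumptions standing in for the outputs of encode_vision and encode_text, and rank_captions is a hypothetical helper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def rank_captions(video_emb, caption_embs):
    """Return caption indices sorted from most to least similar, plus the scores."""
    scores = [cosine(video_emb, c) for c in caption_embs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Toy embeddings (assumed values, not real model outputs).
video_emb = [0.9, 0.1, 0.0]
caption_embs = [
    [0.0, 1.0, 0.0],  # unrelated caption
    [1.0, 0.2, 0.0],  # close match
    [0.5, 0.5, 0.5],  # partial match
]
order, scores = rank_captions(video_emb, caption_embs)
print(order)  # [1, 2, 0]
```

With real embeddings the same ranking would normally be done on tensors, for example by normalizing both sides and taking a matrix product.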

Benchmark

Download the data from our Hugging Face repository.
Add our benchmark to data.config.
Check the arguments in scripts/captioning.sh or scripts/retrieval.sh, then run the script for your task.
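For orientation, retrieval benchmarks are commonly scored with Recall@K: the fraction of queries whose ground-truth match appears among the top K candidates. The sketch below computes it from a toy similarity matrix. Which metrics scripts/retrieval.sh actually reports is not stated here, so treat this as an illustration of the general idea; recall_at_k and the matrix values are hypothetical.

```python
def recall_at_k(sim, k):
    """Recall@K for a square similarity matrix.

    sim[i][j] is the similarity of query i to candidate j, and the
    ground-truth match for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Indices of the k most similar candidates for this query.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += i in top_k
    return hits / len(sim)


# Toy text-to-video similarities (assumed values).
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
print(recall_at_k(sim, 1))
print(recall_at_k(sim, 2))
```

In this toy matrix the second query ranks the wrong candidate first, so it only counts as a hit once K reaches 2.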

Training

Stage-I

We are preparing to release the Stage-I training code.

Stage-II

Download data

mkdir data && wget https://huggingface.co/datasets/princeton-nlp/datasets-for-simcse/resolve/main/nli_for_simcse.csv -O data/nli_for_simcse.csv

Check the arguments in the provided scripts/train.sh and run it.
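Before training, it can help to inspect the downloaded CSV. The sketch below reads the header and treats the first three columns of each row as an (anchor, positive, hard negative) text triplet. That column layout, and the header names in the inline sample, are assumptions about nli_for_simcse.csv, so check the printed header of the real file. read_triplets is a hypothetical helper and is not used by scripts/train.sh.

```python
import csv
import io


def read_triplets(fileobj):
    """Return the CSV header and a list of 3-tuples, one per data row."""
    reader = csv.reader(fileobj)
    header = next(reader)
    # Keep the first three columns; skip rows that are too short.
    triplets = [tuple(row[:3]) for row in reader if len(row) >= 3]
    return header, triplets


# Inline sample so the sketch runs without the download (assumed layout).
sample = io.StringIO(
    "sent0,sent1,hard_neg\n"
    '"A man is slicing tomatoes.","Someone is cutting vegetables.","A man is sleeping."\n'
)
header, triplets = read_triplets(sample)
print(header)
print(triplets[0])

# For the real file:
# with open("data/nli_for_simcse.csv", newline="", encoding="utf-8") as f:
#     header, triplets = read_triplets(f)
```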

Customize Your Own Model

Our framework was designed for our paper, but it is also extensible because the code follows a set of conventions. If you wish to have your retrieval model or caption model evaluated within our framework, please refer to the following guidelines.

Step 1: Add Base Model

Inherit your model from the BaseModel in models/modeling_basemodels.py and implement the __init__ function. Your model will automatically gain the from_pretrained method.
(Optional) To support all the auto methods, set the ARCHITECTURE attribute on your class. Make sure your model path contains a config.json with the structure below (similar to transformers models). ARCHITECTURE should be the same as architectures[0]. All the auto methods will then load your model according to this architecture.

{
  "architectures": [
    "CLIPModel"
  ],
  ...
}
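The matching between ARCHITECTURE and architectures[0] can be pictured as a registry lookup. The sketch below is a minimal, self-contained illustration of that idea under stated assumptions: MODEL_REGISTRY, register, ToyClipModel and auto_load are hypothetical names, and the repository's actual auto classes may work differently.

```python
import json
import os
import tempfile

MODEL_REGISTRY = {}  # hypothetical: maps an architecture name to a model class


def register(cls):
    """Class decorator that files a model class under its ARCHITECTURE."""
    MODEL_REGISTRY[cls.ARCHITECTURE] = cls
    return cls


@register
class ToyClipModel:
    ARCHITECTURE = "CLIPModel"  # must equal architectures[0] in config.json

    def __init__(self, model_path):
        self.model_path = model_path


def auto_load(model_path):
    """Read config.json and instantiate the class registered for architectures[0]."""
    with open(os.path.join(model_path, "config.json"), encoding="utf-8") as f:
        config = json.load(f)
    architecture = config["architectures"][0]
    try:
        model_cls = MODEL_REGISTRY[architecture]
    except KeyError:
        raise ValueError(f"No model registered for architecture {architecture!r}") from None
    return model_cls(model_path)


# Write a throwaway config.json and load a model from it.
with tempfile.TemporaryDirectory() as model_dir:
    with open(os.path.join(model_dir, "config.json"), "w", encoding="utf-8") as f:
        json.dump({"architectures": ["CLIPModel"]}, f)
    model = auto_load(model_dir)

print(type(model).__name__)  # ToyClipModel
```

Keying the lookup on a string from config.json means a new model only has to declare its architecture name; the loading code itself does not change.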

Step 2: Add Retrieval Model

Inherit your retrieval model from your custom base model and EncodeMixin in models/modeling_encoders.py.
Implement the encode_vision and encode_text methods.
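The shape of such a class can be sketched as follows. To keep the example runnable on its own, StandInBase and StandInEncodeMixin are hypothetical stand-ins for your custom base model and for EncodeMixin. The toy featurizers and the base-class order are likewise assumptions, not the repository's implementation.

```python
class StandInBase:
    """Stand-in for a custom base model (hypothetical)."""

    def __init__(self, dim=4):
        self.dim = dim


class StandInEncodeMixin:
    """Stand-in for EncodeMixin: names the two methods a retrieval model provides."""

    def encode_vision(self, frames):
        raise NotImplementedError

    def encode_text(self, text):
        raise NotImplementedError


class ToyRetriever(StandInBase, StandInEncodeMixin):
    def encode_vision(self, frames):
        # Average the per-frame feature vectors into one video embedding.
        return [sum(col) / len(frames) for col in zip(*frames)]

    def encode_text(self, text):
        # Toy featurizer: count words, bucketed by word length modulo dim.
        emb = [0.0] * self.dim
        for word in text.split():
            emb[len(word) % self.dim] += 1.0
        return emb


model = ToyRetriever(dim=4)
video_emb = model.encode_vision([[1.0, 0.0, 2.0, 0.0], [3.0, 0.0, 0.0, 0.0]])
text_emb = model.encode_text("a man slicing tomatoes")
print(video_emb, text_emb)
```

The point that carries over to a real model is that both methods return embeddings of the same dimension, so they can be compared directly.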

Step 3: Add Caption Model

Inherit your caption model from your custom base model and CaptionMixin in models/modeling_captioners.py.
Implement the describe method.
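A caption model follows the same pattern with a single method. In the sketch below, StandInBase and StandInCaptionMixin are hypothetical stand-ins for your custom base model and for CaptionMixin. Returning one string per video in the batch is an assumption based on the description[0] indexing in the inference example.

```python
class StandInBase:
    """Stand-in for a custom base model (hypothetical)."""


class StandInCaptionMixin:
    """Stand-in for CaptionMixin: names the method a caption model provides."""

    def describe(self, batch):
        raise NotImplementedError


class ToyCaptioner(StandInBase, StandInCaptionMixin):
    def describe(self, batch):
        # Return one caption per video; a real model would generate text here.
        return [f"A video with {len(frames)} frames." for frames in batch]


# A batch of two "videos", each represented as a list of frame placeholders.
captions = ToyCaptioner().describe([[0, 1, 2], [0, 1]])
print(captions[0])
```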
