Skip to content

aiming-lab/MJ-Video

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

MJ-VIDEO: Fine-Grained Benchmarking and Rewarding Video Preferences in Video Generation

This repository contains the implementation of the paper "MJ-VIDEO: Fine-Grained Benchmarking and Rewarding Video Preferences in Video Generation". We create a fine-grained video preference dataset MJ-BENCH-VIDEO and a MoE-structured video reward model MJ-VIDEO.

🎉 News

MJ-VIDEO-4B coming soon !

Aligned Video Generation Model coming soon !

✅ Abstract

Recent advancements in video generation have significantly improved the ability to synthesize videos from text instructions. However, existing models still struggle with key challenges such as instruction misalignment, content hallucination, safety concerns, and bias. Addressing these limitations, we introduce MJ-BENCH-VIDEO, a large-scale video preference benchmark designed to evaluate video generation across five critical aspects: Alignment, Safety, Fineness, Coherence & Consistency, and Bias & Fairness. This benchmark incorporates 28 fine-grained criteria to provide a comprehensive evaluation of video preference. Building upon this dataset, we propose MJ-VIDEO, a Mixture-of-Experts (MoE)-based video reward model designed to deliver fine-grained reward. MJ-VIDEO can dynamically select relevant experts to accurately judge the preference based on the input text-video pair. This architecture enables more precise and adaptable preference judgments. Through extensive benchmarking on MJ-BENCH-VIDEO, we analyze the limitations of existing video reward models and demonstrate the superior performance of MJ-VIDEO in video preference assessment, achieving 17.58% and 15.87% improvements in overall and fine-grained preference judgments, respectively. Additionally, introducing MJ-VIDEO for preference tuning in video generation enhances the alignment performance.

⛽ Installation

To install the necessary dependencies, run the following command:

For the testing of models other than InternVL2-4B and CogVLM2, use the following commands for installation:

git clone git@github.com:aiming-lab/MJ-Video.git
conda create -n mjvideo python=3.10
cd MJ-Video
pip install -r requirements.txt

For InternVL2-4B, modify the versions of some libraries and use the following commands to create the environment:

conda create -n mjvideo_4b python=3.10
cd MJ-Video
pip install -r requirements_4b.txt

For CogVLM2, modify the versions of some libraries and use the following commands to create the environment:

conda create -n mjvideo_cog python=3.10
cd MJ-Video
pip install -r requirements_cog.txt

For the environment of fine-tuning VADER on VideoCrafter2, please refer to VADER.

🚗Vide Preference Dataset

Our dataset is available at MJ-BENCH-VIDEO.

You can download our dataset from Hugging Face and use the code in scripts/data/dataset.py to load the dataset for training and evaluation.

🏭 Video Reward Model

Our reward model is available at MJ-VIDEO.

If you want to use your own dataset or our dataset, use the code in scripts/train to conduct the training.

For inference, you can refer to scritps/model/moe-playground.ipynb, or use the following code for inference.

from model import InternVLChatRewardModeling, InternVLChatRewardModelingConfig, prepare_chat_input
from data_processor import load_video
from torch import distributed as dist
import os

os.environ['WORLD_SIZE'] = str(1)
os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = str(12345)
os.environ['LOCAL_RANK'] = str(0)
os.environ['RANK'] = str(0)

dist.init_process_group(backend='nccl', world_size=1, rank=0)

### prepare model
tokenizer = AutoTokenizer.from_pretrained(args.model_name, trust_remote_code=True)
config = InternVLChatRewardModelingConfig.from_pretrained(
    args.model_name,
    num_objectives=args.num_objectives,
    num_aspects=args.num_aspects,
    aspect2criteria=args.aspect2criteria,
    gating_temperature=args.gating_temperature,
    gating_hidden_dim=args.gating_hidden_dim,
    gating_n_hidden=args.gating_n_hidden
)

generation_config = {
    "max_new_tokens": args.max_new_tokens,
    "do_sample": args.do_sample
}

model = InternVLChatRewardModeling(name=args.model_name, config=config).cuda()
model.config.pad_token_id = tokenizer.pad_token_id
model = model.to(torch.bfloat16).cuda()
IMG_CONTEXT_TOKEN = '<IMG_CONTEXT>'
model.model.img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
model.eval()

## prepare data
caption = "Generate a video of a tiger dancing."
pixel_values, num_patches_list = load_video(video, num_segments=8, max_num=1)
video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
pixel_values = pixel_values.to(torch.bfloat16).to(model.model.device)
prompt = video_prefix + caption
input_ids, attention_mask = prepare_chat_input(config, tokenizer, pixel_values, prompt, generation_config, device=model.model.device)

## inference
output = model.forward(pixel_values, input_ids, attention_mask_left)

## criteria score
output.reward

## aspect score
output.aspect_scores

## overall score
output.score

✈️ Preference Alignment

We trained our video generation model based on VADER; here are a few examples.

Case_study_Align

🚋Leadboard

Testing on Aspect Annotations in MJ-BENCH-VIDEO

The bolded numbers in the table represent the best results, while the italic numbers indicate the second-best results. The "C&C" in the table refers to "Coherence and Consistency," while "B&F" refers to "Bias and Fairness." In cases where certain models show strong bias, causing the F1 score to be NaN, a "/" is used in place of the result in the table. For preference comparison, we report the results of the "strict" metric.

Model Alignment (Acc) Alignment (F1) Alignment (strict) Safety (Acc) Safety (F1) Safety (strict) Fineness (Acc) Fineness (F1) Fineness (strict) C&C (Acc) C&C (F1) C&C (strict) B&F (Acc) B&F (F1) B&F (strict)
InternVL2-2B 70.75 60.42 17.71 66.67 55.02 16.67 63.59 49.87 3.125 71.81 46.04 10.34 74.11 63.19 54.54
InternVL2-4B 57.00 55.00 26.96 75.49 60.37 0.00 52.48 49.92 7.143 43.02 33.11 17.86 66.32 56.27 54.55
InternVL2-8B 44.21 44.21 33.33 76.72 72.60 16.67 47.71 47.27 18.75 27.76 24.29 12.07 15.51 13.88 50.00
InternVL2-26B 65.47 62.96 40.51 84.44 78.26 20.00 69.81 51.91 14.29 59.03 41.51 16.33 82.05 59.85 30.00
Qwen2-VL-2B 54.28 53.03 19.35 59.82 56.93 25.00 56.75 51.86 3.448 37.90 31.18 16.39 20.00 19.31 38.46
Qwen2-VL-7B 58.31 56.19 41.94 55.35 52.81 25.00 47.56 46.33 31.03 32.58 27.68 19.67 14.61 13.13 23.08
MiniCPM-8B 65.53 61.38 48.72 72.91 67.22 40.00 62.13 56.02 39.29 49.73 37.21 31.25 15.12 14.17 60.00
CogVLM2 26.71 23.80 7.692 31.67 30.09 16.67 35.61 29.79 11.76 7.87 7.86 4.615 14.61 / 7.692
Gemini-1.5-flash 27.45 25.72 8.421 83.64 77.34 0.0 32.80 25.27 12.90 5.01 4.88 12.07 15.18 / 9.091
GPT-4o 58.27 56.21 50.00 82.86 77.00 50.00 59.67 56.34 27.27 44.52 34.17 40.00 19.17 18.48 33.33
MJ-VIDEO 78.41 71.22 79.05 87.50 81.84 83.33 68.60 58.53 58.82 95.36 53.57 58.46 86.92 55.97 69.23

Results of Overall Video Preference Evaluation

The best test results are highlighted in bold, and the second-best results are in italic. Strict treats undecided cases as incorrect, while tie-aware assigns 0.5 for ties in calculating accuracy.

Model MJ-BENCH-VIDEO(Strict) MJ-BENCH-VIDEO(Tie-aware) Safesora-test (Strict) Safesora-test (Tie-aware) GenAI-Bench (Strict) GenAI-Bench (Tie-aware)
InternVL2-2B 5.93 47.88 4.60 50.30 13.71 55.43
InternVL2-4B 13.55 49.15 11.74 50.91 39.00 61.79
InternVL2-8B 16.95 47.88 14.29 53.09 36.85 62.43
InternVL2-26B 22.88 53.81 10.41 52.00 31.86 55.64
Qwen-VL-2B 13.33 48.09 13.18 51.27 27.29 56.71
Qwen-VL-7B 17.14 47.62 14.58 52.41 20.57 51.36
MiniCPM 30.51 53.39 25.30 52.54 47.43 60.21
CogVLM2 8.47 47.46 9.56 52.48 21.29 56.29
VideoScore 58.47 58.47 55.33 55.51 69.14 69.14
Gemini 2.66 48.67 2.66 48.67 21.45 50.71
GPT-4o 35.35 54.6 35.35 54.6 48.85 59.14
MJ-VIDEO 68.75 68.75 64.16 64.16 70.28 70.28

Evaluation of Video Models Across Human Evaluation and Automated Evaluation on VBench

Human evaluation assesses Video Quality and Text-to-Video Alignment. Automated evaluation on VBench evaluates Imaging Quality (IQ), Human Action (HA), Scene (S), and Overall Consistency (OC).

Model Quality (Human) Align (Human) IQ (Auto) HA (Auto) S (Auto) OC (Auto)
VideoCrafter2 56.30 68.80 67.04 90.00 54.00 28.39
VideoScore 64.50 74.80 65.03 92.00 54.79 28.38
MJ-VIDEO 69.90 79.20 67.89 94.00 55.09 28.19

Cite

Please cite us using the following bibtex

@misc{tong2025mjvideofinegrainedbenchmarkingrewarding,
      title={MJ-VIDEO: Fine-Grained Benchmarking and Rewarding Video Preferences in Video Generation}, 
      author={Haibo Tong and Zhaoyang Wang and Zhaorun Chen and Haonian Ji and Shi Qiu and Siwei Han and Kexin Geng and Zhongkai Xue and Yiyang Zhou and Peng Xia and Mingyu Ding and Rafael Rafailov and Chelsea Finn and Huaxiu Yao},
      year={2025},
      eprint={2502.01719},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.01719}, 
}

About

[NeurIPS'25 Spotlight] MJ-VIDEO: Fine-Grained Benchmarking and Rewarding Video Preferences in Video Generation

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors