π€ Benchmark | π€ Dataset | π€ Model | π Homepage
Reward models (RMs) play a critical role in aligning AI behaviors with human preferences. We propose Omni-Reward, a step toward generalist omni-modal reward modeling with support for free-form preferences, consisting of:
-
π Evaluation: We introduce OmniRewardBench, the first omni-modal reward benchmark with free-form preferences, covering nine tasks across five modalities including text, image, video, audio, and 3D.
-
π Data: We construct OmniRewardData , a multimodal preference dataset comprising 248K general preference pairs and 69K instruction-tuning pairs for training generalist omni-modal RMs.
-
π§ Model: We propose OmniRewardModel, which includes both discriminative and generative RMs, and achieves strong performance on Omni-RewardBench as well as other widely used RM benchmark.
Our dataset is hosted on huggingface and we recommend downloading them with the following command.
huggingface-cli download HongbangYuan/OmniRewardBench --repo-type dataset --local-dir ./OmniRewardBenchmedia_data.zip file (~3.5 GB), which contains all original image, audio, and video resources required for evaluation.
Depending on your internet speed, this step might take a while.
We recommend using the utility functions provided in ./dataset/OmniRewardBench/load_omni_reward_bench.py for loading the dataset. You should specify the task argument to load data for a particular task.
Next, we walk through the data format used in each task, highlighting the structure and key fields.
π‘ Note:
For each sample containing image or audio data:
- You can directly load media using
datasets.Imageordatasets.Audioobjects provided by the π€ Hugging Facedatasetslibrary.- Alternatively, you can use the
image_pathoraudio_pathfields to load files from disk.For video data, only local loading via the path in
videois supported.All media paths are relative paths, and should be resolved relative to the root directory where
media_data.zipis extracted.
We provide a summary of the key-value structure for each task below. Feel free to refer to this section when working with task-specific data samples.
| Key | Type | Description |
|---|---|---|
prompt |
str |
The user instruction or query to be evaluated. |
response1 |
str |
The response generated by Model 1 for the given prompt. |
response2 |
str |
The response generated by Model 2 for the same prompt. |
model1 |
str |
Name of Model1 |
model2 |
str |
Name of Model2 |
criteria |
str |
The evaluation criteria in textual form. |
criteria_preference |
str |
The human-annotated preference (either "response1" or "response2") under the given criterion. |
id |
str |
A unique identifier for this data sample within the dataset. |
| Key | Type | Description |
|---|---|---|
prompt |
str |
The user instruction, typically paired with an image input. |
image |
Image |
The image input of the user prompt. |
image_path |
str |
Path to the associated image file. |
response1 |
str |
The textual response generated by Model 1. |
response2 |
str |
The textual response generated by Model 2. |
model1 |
str |
Name of Model 1. |
model2 |
str |
Name of Model 2. |
criteria |
str |
The evaluation criteria in textual form. |
criteria_preference |
str |
The human-annotated preference (either "response1" or "response2") under the given criterion. |
id |
str |
A unique identifier for this data sample. |
| Key | Type | Description |
|---|---|---|
prompt |
str |
The user instruction, typically paired with a video input. |
video |
str |
Path to the associated video file. |
response1 |
str |
The textual response generated by Model 1. |
response2 |
str |
The textual response generated by Model 2. |
model1 |
str |
Name of Model 1. |
model2 |
str |
Name of Model 2. |
criteria |
str |
The evaluation criteria in textual form. |
criteria_preference |
str |
The human-annotated preference (either "response1" or "response2") under the given criterion. |
id |
str |
A unique identifier for this data sample. |
| Key | Type | Description |
|---|---|---|
prompt |
str |
The user instruction, typically paired with a audio input. |
audio |
audio |
A huggingface audio object. |
audio_path |
str |
Path to the associated audio file. |
response1 |
str |
The textual response generated by Model 1. |
response2 |
str |
The textual response generated by Model 2. |
model1 |
str |
Name of Model 1. |
model2 |
str |
Name of Model 2. |
criteria |
str |
The evaluation criteria in textual form. |
criteria_preference |
str |
The human-annotated preference (either "response1" or "response2") under the given criterion. |
id |
str |
A unique identifier for this data sample. |
| Key | Type | Description |
|---|---|---|
prompt |
str |
The image generation instruction. |
response1 |
Image |
The image generated by Model 1. |
response2 |
Image |
The image generated by Model 2. |
response1_path |
str |
Path to the image file generated by Model 1. |
response2_path |
str |
Path to the image file generated by Model 2. |
model1 |
str |
Name of Model 1. |
model2 |
str |
Name of Model 2. |
criteria |
str |
The evaluation criteria in textual form. |
criteria_preference |
str |
The human-annotated preference (either "response1" or "response2") under the given criterion. |
id |
str |
A unique identifier for this data sample. |
| Key | Type | Description |
|---|---|---|
prompt |
str |
The video generation instruction. |
response1 |
str |
The video file generated by Model 1. |
response2 |
str |
The video file generated by Model 2. |
model1 |
str |
Name of Model 1. |
model2 |
str |
Name of Model 2. |
criteria |
str |
The evaluation criteria in textual form. |
criteria_preference |
str |
The human-annotated preference (either "response1" or "response2") under the given criterion. |
id |
str |
A unique identifier for this data sample. |
| Key | Type | Description |
|---|---|---|
prompt |
str |
The audio generation instruction. |
response1 |
Audio |
The audio clip generated by Model 1. |
response2 |
Audio |
The audio clip generated by Model 2. |
response1_path |
str |
Path to the audio file generated by Model 1. |
response2_path |
str |
Path to the audio file generated by Model 2. |
model1 |
str |
Name of Model 1. |
model2 |
str |
Name of Model 2. |
criteria |
str |
The evaluation criteria in textual form. |
criteria_preference |
str |
The human-annotated preference (either "response1" or "response2") under the given criterion. |
id |
str |
A unique identifier for this data sample. |
| Key | Type | Description |
|---|---|---|
prompt |
str |
The 3D generation instruction. |
response1 |
Image |
The 3D image generated by Model 1. |
response2 |
Image |
The 3D image generated by Model 2. |
response1_path |
str |
Path to the 3D image generated by Model 1. |
response2_path |
str |
Path to the 3D image generated by Model 2. |
model1 |
str |
Name of Model 1. |
model2 |
str |
Name of Model 2. |
criteria |
str |
The evaluation criteria in textual form. |
criteria_preference |
str |
The human-annotated preference (either "response1" or "response2") under the given criterion. |
id |
str |
A unique identifier for this data sample. |
| Key | Type | Description |
|---|---|---|
prompt |
str |
The image edit instruction. |
image |
Image |
The image file to be edited. |
image_path |
str |
Path to the image file to be edited. |
response1 |
Image |
The final image generated by Model 1. |
response2 |
Image |
The final image generated by Model 2. |
response1_path |
str |
Path to the final image generated by Model 1. |
response2_path |
str |
Path to the final image generated by Model 2. |
model1 |
str |
Name of Model 1. |
model2 |
str |
Name of Model 2. |
criteria |
str |
The evaluation criteria in textual form. |
criteria_preference |
str |
The human-annotated preference (either "response1" or "response2") under the given criterion. |
id |
str |
A unique identifier for this data sample. |
To evaluate an API-accessible model on our full benchmark suite, you can run the provided launch script:
bash scripts/eval/run_eval_api.sh <your_model_name>Remember to Sspecifying the model name as a command-line argument (e.g., gpt-4, claude-3) for logging and tracking.
The scripts/eval/run_eval_api.sh script supports:
-
β Evaluating all tasks or selected ones By default, the script runs on all supported tasks. To evaluate only specific tasks, simply comment out the unused tasks in the
taskslist. -
β Two evaluation modes For each task, the script runs:
- Without Tie Evaluation (default)
- WithTie valuation (
--with_tie)
-
β Parallel execution Each pair of evaluations (w/ and w/o TIE) runs in parallel to speed up the process.
-
β Customizable API endpoint The API URL is set to
https://api.vveai.com/v1/chat/completionsby default. You can modify this value in the script to use any OpenAI-compatible endpoint. For example, if you are serving a local model using vLLM, you can set:api_url="http://localhost:8000/v1/chat/completions"This allows you to benchmark models hosted on your own machine.
To reproduce the training process in our paper, please make sure to set up the environment as described below. Our training code is built upon the llama-factory framework.
git clone https://github.com/HongbangYuan/OmniReward.git
conda create -n omnireward python=3.10
conda activate omnirewardWe recommend using torch==2.2.0 for best compatibility.
Install PyTorch (choose one based on your CUDA version):
# For CUDA 11.8:
pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 \
--index-url https://download.pytorch.org/whl/cu118
# For CUDA 12.1:
pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 \
--index-url https://download.pytorch.org/whl/cu121Then install the remaining dependencies:
cd OmniReward/OmniReward-Factory
pip install -r requirements.txtDownload all required training and evaluation datasets from OmniRewardData and OmniRewardBench:
cd OmniReward-Factory
bash scripts/download.shTo reproduce the training results described in our paper, please navigate to the OmniReward-Factory directory and run the following scripts:
cd OmniReward-Factory
bash scripts/train.sh
bash scripts/train_t2t.sh
bash scripts/train_ti2t.sh
bash scripts/train_t2iv.shYou can also directly use our pretrained Omni-Reward for evaluation without retraining.
The models are publicly available at:
π https://huggingface.co/jinzhuoran/OmniRewardModel
cd OmniReward-Factory
bash scripts/eval_t2t.sh
bash scripts/eval_t2t_tie.sh
bash scripts/eval_ti2t.sh
bash scripts/eval_ti2t_tie.sh-
--eval_dataset: Specifies the evaluation dataset (e.g.,omni_t2t,omni_t2i,omni_t2v, etc.). -
--eval_tie: Enables w/ Ties evaluation.
The following table provides an overview of the subsets in OmniRewardData, including their associated task types and dataset sizes.
βΉοΈ The asterisk (*) denotes the subset constructed in this work.
| Subset Name | Task Type | #Samples |
|---|---|---|
| Skywork-Reward-Preference | T2T | 50,000 |
| Omni-Skywork-Reward-Preference * | T2T | 16,376 |
| Omni-UltraFeedback * | T2T | 7,901 |
| RLAIF-V | TI2T | 83,124 |
| OmniAlign-V-DPO | TI2T | 50,000 |
| Omni-RLAIF-V * | TI2T | 15,867 |
| Omni-VLFeedback * | TI2T | 12,311 |
| HPDv2 | T2I | 50,000 |
| EvalMuse | T2I | 2,944 |
| Omni-HPDv2 * | T2I | 8,959 |
| Omni-Open-Image-Preferences * | T2I | 8,105 |
| VideoDPO | T2V | 10,000 |
| VisionRewardDB-Video | T2V | 1,795 |
