Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences

🤗 Benchmark | 🤗 Dataset | 🤗 Model | 🏠 Homepage

Reward models (RMs) play a critical role in aligning AI behaviors with human preferences. We propose Omni-Reward, a step toward generalist omni-modal reward modeling with support for free-form preferences, consisting of:

📈 Evaluation: We introduce OmniRewardBench, the first omni-modal reward benchmark with free-form preferences, covering nine tasks across five modalities including text, image, video, audio, and 3D.
📚 Data: We construct OmniRewardData , a multimodal preference dataset comprising 248K general preference pairs and 69K instruction-tuning pairs for training generalist omni-modal RMs.
🧠 Model: We propose OmniRewardModel, which includes both discriminative and generative RMs, and achieves strong performance on Omni-RewardBench as well as other widely used RM benchmark.

📈 Evaluation

🌐 Data Download

Our dataset is hosted on huggingface and we recommend downloading them with the following command.

huggingface-cli download HongbangYuan/OmniRewardBench --repo-type dataset --local-dir ./OmniRewardBench

⚠️ Note: The most time-consuming part of the download is the media_data.zip file (~3.5 GB), which contains all original image, audio, and video resources required for evaluation. Depending on your internet speed, this step might take a while.

We recommend using the utility functions provided in ./dataset/OmniRewardBench/load_omni_reward_bench.py for loading the dataset. You should specify the task argument to load data for a particular task.

📁 Dataset Format

Next, we walk through the data format used in each task, highlighting the structure and key fields.

💡 Note:

For each sample containing image or audio data:

You can directly load media using datasets.Image or datasets.Audio objects provided by the 🤗 Hugging Face datasets library.

Alternatively, you can use the image_path or audio_path fields to load files from disk.

For video data, only local loading via the path in video is supported.

All media paths are relative paths, and should be resolved relative to the root directory where media_data.zip is extracted.

We provide a summary of the key-value structure for each task below. Feel free to refer to this section when working with task-specific data samples.

Text-to-Text

Key	Type	Description
`prompt`	`str`	The user instruction or query to be evaluated.
`response1`	`str`	The response generated by Model 1 for the given prompt.
`response2`	`str`	The response generated by Model 2 for the same prompt.
`model1`	`str`	Name of Model1
`model2`	`str`	Name of Model2
`criteria`	`str`	The evaluation criteria in textual form.
`criteria_preference`	`str`	The human-annotated preference (either `"response1"` or `"response2"`) under the given criterion.
`id`	`str`	A unique identifier for this data sample within the dataset.

Text-Image-to-Text

Key	Type	Description
`prompt`	`str`	The user instruction, typically paired with an image input.
`image`	`Image`	The image input of the user prompt.
`image_path`	`str`	Path to the associated image file.
`response1`	`str`	The textual response generated by Model 1.
`response2`	`str`	The textual response generated by Model 2.
`model1`	`str`	Name of Model 1.
`model2`	`str`	Name of Model 2.
`criteria`	`str`	The evaluation criteria in textual form.
`criteria_preference`	`str`	The human-annotated preference (either `"response1"` or `"response2"`) under the given criterion.
`id`	`str`	A unique identifier for this data sample.

Text-Video-to-Text

Key	Type	Description
`prompt`	`str`	The user instruction, typically paired with a video input.
`video`	`str`	Path to the associated video file.
`response1`	`str`	The textual response generated by Model 1.
`response2`	`str`	The textual response generated by Model 2.
`model1`	`str`	Name of Model 1.
`model2`	`str`	Name of Model 2.
`criteria`	`str`	The evaluation criteria in textual form.
`criteria_preference`	`str`	The human-annotated preference (either `"response1"` or `"response2"`) under the given criterion.
`id`	`str`	A unique identifier for this data sample.

Text-Audio-to-Text

Key	Type	Description
`prompt`	`str`	The user instruction, typically paired with a audio input.
`audio`	`audio`	A huggingface audio object.
`audio_path`	`str`	Path to the associated audio file.
`response1`	`str`	The textual response generated by Model 1.
`response2`	`str`	The textual response generated by Model 2.
`model1`	`str`	Name of Model 1.
`model2`	`str`	Name of Model 2.
`criteria`	`str`	The evaluation criteria in textual form.
`criteria_preference`	`str`	The human-annotated preference (either `"response1"` or `"response2"`) under the given criterion.
`id`	`str`	A unique identifier for this data sample.

Text-to-Image

Key	Type	Description
`prompt`	`str`	The image generation instruction.
`response1`	`Image`	The image generated by Model 1.
`response2`	`Image`	The image generated by Model 2.
`response1_path`	`str`	Path to the image file generated by Model 1.
`response2_path`	`str`	Path to the image file generated by Model 2.
`model1`	`str`	Name of Model 1.
`model2`	`str`	Name of Model 2.
`criteria`	`str`	The evaluation criteria in textual form.
`criteria_preference`	`str`	The human-annotated preference (either `"response1"` or `"response2"`) under the given criterion.
`id`	`str`	A unique identifier for this data sample.

Text-to-Video

Key	Type	Description
`prompt`	`str`	The video generation instruction.
`response1`	`str`	The video file generated by Model 1.
`response2`	`str`	The video file generated by Model 2.
`model1`	`str`	Name of Model 1.
`model2`	`str`	Name of Model 2.
`criteria`	`str`	The evaluation criteria in textual form.
`criteria_preference`	`str`	The human-annotated preference (either `"response1"` or `"response2"`) under the given criterion.
`id`	`str`	A unique identifier for this data sample.

Text-to-Audio

Key	Type	Description
`prompt`	`str`	The audio generation instruction.
`response1`	`Audio`	The audio clip generated by Model 1.
`response2`	`Audio`	The audio clip generated by Model 2.
`response1_path`	`str`	Path to the audio file generated by Model 1.
`response2_path`	`str`	Path to the audio file generated by Model 2.
`model1`	`str`	Name of Model 1.
`model2`	`str`	Name of Model 2.
`criteria`	`str`	The evaluation criteria in textual form.
`criteria_preference`	`str`	The human-annotated preference (either `"response1"` or `"response2"`) under the given criterion.
`id`	`str`	A unique identifier for this data sample.

Text-to-3D

Key	Type	Description
`prompt`	`str`	The 3D generation instruction.
`response1`	`Image`	The 3D image generated by Model 1.
`response2`	`Image`	The 3D image generated by Model 2.
`response1_path`	`str`	Path to the 3D image generated by Model 1.
`response2_path`	`str`	Path to the 3D image generated by Model 2.
`model1`	`str`	Name of Model 1.
`model2`	`str`	Name of Model 2.
`criteria`	`str`	The evaluation criteria in textual form.
`criteria_preference`	`str`	The human-annotated preference (either `"response1"` or `"response2"`) under the given criterion.
`id`	`str`	A unique identifier for this data sample.

Text-Image-to-Image

Key	Type	Description
`prompt`	`str`	The image edit instruction.
`image`	`Image`	The image file to be edited.
`image_path`	`str`	Path to the image file to be edited.
`response1`	`Image`	The final image generated by Model 1.
`response2`	`Image`	The final image generated by Model 2.
`response1_path`	`str`	Path to the final image generated by Model 1.
`response2_path`	`str`	Path to the final image generated by Model 2.
`model1`	`str`	Name of Model 1.
`model2`	`str`	Name of Model 2.
`criteria`	`str`	The evaluation criteria in textual form.
`criteria_preference`	`str`	The human-annotated preference (either `"response1"` or `"response2"`) under the given criterion.
`id`	`str`	A unique identifier for this data sample.

🚀 Running Evaluation

To evaluate an API-accessible model on our full benchmark suite, you can run the provided launch script:

bash scripts/eval/run_eval_api.sh <your_model_name>

Remember to Sspecifying the model name as a command-line argument (e.g., gpt-4, claude-3) for logging and tracking.

The scripts/eval/run_eval_api.sh script supports:

✅ Evaluating all tasks or selected ones By default, the script runs on all supported tasks. To evaluate only specific tasks, simply comment out the unused tasks in the tasks list.
✅ Two evaluation modes For each task, the script runs:
- Without Tie Evaluation (default)
- WithTie valuation (--with_tie)
✅ Parallel execution Each pair of evaluations (w/ and w/o TIE) runs in parallel to speed up the process.
✅ Customizable API endpoint The API URL is set to https://api.vveai.com/v1/chat/completions by default. You can modify this value in the script to use any OpenAI-compatible endpoint. For example, if you are serving a local model using vLLM, you can set:
```
api_url="http://localhost:8000/v1/chat/completions"
```
This allows you to benchmark models hosted on your own machine.

⚙️ Training

🛠️ Environment Setup

To reproduce the training process in our paper, please make sure to set up the environment as described below. Our training code is built upon the llama-factory framework.

git clone https://github.com/HongbangYuan/OmniReward.git
conda create -n omnireward python=3.10
conda activate omnireward

We recommend using torch==2.2.0 for best compatibility.

Install PyTorch (choose one based on your CUDA version):

# For CUDA 11.8:
pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 \
    --index-url https://download.pytorch.org/whl/cu118

# For CUDA 12.1:
pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 \
    --index-url https://download.pytorch.org/whl/cu121

Then install the remaining dependencies:

cd OmniReward/OmniReward-Factory
pip install -r requirements.txt

📦 Data Preparation

Download all required training and evaluation datasets from OmniRewardData and OmniRewardBench:

cd OmniReward-Factory
bash scripts/download.sh

🏋️‍♀️ Training Omni-Reward

To reproduce the training results described in our paper, please navigate to the OmniReward-Factory directory and run the following scripts:

cd OmniReward-Factory
bash scripts/train.sh
bash scripts/train_t2t.sh
bash scripts/train_ti2t.sh
bash scripts/train_t2iv.sh

📈 Loading and Evaluating Omni-Reward

You can also directly use our pretrained Omni-Reward for evaluation without retraining.

The models are publicly available at:

👉 https://huggingface.co/jinzhuoran/OmniRewardModel

cd OmniReward-Factory
bash scripts/eval_t2t.sh
bash scripts/eval_t2t_tie.sh
bash scripts/eval_ti2t.sh
bash scripts/eval_ti2t_tie.sh

--eval_dataset: Specifies the evaluation dataset (e.g., omni_t2t, omni_t2i, omni_t2v, etc.).
--eval_tie: Enables w/ Ties evaluation.

📚 Training Data

The following table provides an overview of the subsets in OmniRewardData, including their associated task types and dataset sizes.

ℹ️ The asterisk (*) denotes the subset constructed in this work.

Subset Name	Task Type	#Samples
Skywork-Reward-Preference	T2T	50,000
Omni-Skywork-Reward-Preference *	T2T	16,376
Omni-UltraFeedback *	T2T	7,901
RLAIF-V	TI2T	83,124
OmniAlign-V-DPO	TI2T	50,000
Omni-RLAIF-V *	TI2T	15,867
Omni-VLFeedback *	TI2T	12,311
HPDv2	T2I	50,000
EvalMuse	T2I	2,944
Omni-HPDv2 *	T2I	8,959
Omni-Open-Image-Preferences *	T2I	8,105
VideoDPO	T2V	10,000
VisionRewardDB-Video	T2V	1,795

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
OmniReward-Factory		OmniReward-Factory
dataset/OmniRewardBench		dataset/OmniRewardBench
evaluation_experiments		evaluation_experiments
files		files
utils		utils
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences

📈 Evaluation

🌐 Data Download

📁 Dataset Format

Text-to-Text

Text-Image-to-Text

Text-Video-to-Text

Text-Audio-to-Text

Text-to-Image

Text-to-Video

Text-to-Audio

Text-to-3D

Text-Image-to-Image

🚀 Running Evaluation

⚙️ Training

🛠️ Environment Setup

📦 Data Preparation

🏋️‍♀️ Training Omni-Reward

📈 Loading and Evaluating Omni-Reward

📚 Training Data

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences

📈 Evaluation

🌐 Data Download

📁 Dataset Format

Text-to-Text

Text-Image-to-Text

Text-Video-to-Text

Text-Audio-to-Text

Text-to-Image

Text-to-Video

Text-to-Audio

Text-to-3D

Text-Image-to-Image

🚀 Running Evaluation

⚙️ Training

🛠️ Environment Setup

📦 Data Preparation

🏋️‍♀️ Training Omni-Reward

📈 Loading and Evaluating Omni-Reward

📚 Training Data

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages