Skip to content

HongbangYuan/OmniReward

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

21 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences

πŸ€— Benchmark | πŸ€— Dataset | πŸ€— Model | 🏠 Homepage

Reward models (RMs) play a critical role in aligning AI behaviors with human preferences. We propose Omni-Reward, a step toward generalist omni-modal reward modeling with support for free-form preferences, consisting of:

  • πŸ“ˆ Evaluation: We introduce OmniRewardBench, the first omni-modal reward benchmark with free-form preferences, covering nine tasks across five modalities including text, image, video, audio, and 3D.

  • πŸ“š Data: We construct OmniRewardData , a multimodal preference dataset comprising 248K general preference pairs and 69K instruction-tuning pairs for training generalist omni-modal RMs.

  • 🧠 Model: We propose OmniRewardModel, which includes both discriminative and generative RMs, and achieves strong performance on Omni-RewardBench as well as other widely used RM benchmark.

πŸ“ˆ Evaluation

🌐 Data Download

Our dataset is hosted on huggingface and we recommend downloading them with the following command.

huggingface-cli download HongbangYuan/OmniRewardBench --repo-type dataset --local-dir ./OmniRewardBench

⚠️ Note: The most time-consuming part of the download is the media_data.zip file (~3.5 GB), which contains all original image, audio, and video resources required for evaluation. Depending on your internet speed, this step might take a while.

We recommend using the utility functions provided in ./dataset/OmniRewardBench/load_omni_reward_bench.py for loading the dataset. You should specify the task argument to load data for a particular task.

πŸ“ Dataset Format

Next, we walk through the data format used in each task, highlighting the structure and key fields.

πŸ’‘ Note:

For each sample containing image or audio data:

  • You can directly load media using datasets.Image or datasets.Audio objects provided by the πŸ€— Hugging Face datasets library.
  • Alternatively, you can use the image_path or audio_path fields to load files from disk.

For video data, only local loading via the path in video is supported.

All media paths are relative paths, and should be resolved relative to the root directory where media_data.zip is extracted.

We provide a summary of the key-value structure for each task below. Feel free to refer to this section when working with task-specific data samples.

Text-to-Text

Key Type Description
prompt str The user instruction or query to be evaluated.
response1 str The response generated by Model 1 for the given prompt.
response2 str The response generated by Model 2 for the same prompt.
model1 str Name of Model1
model2 str Name of Model2
criteria str The evaluation criteria in textual form.
criteria_preference str The human-annotated preference (either "response1" or "response2") under the given criterion.
id str A unique identifier for this data sample within the dataset.

Text-Image-to-Text

Key Type Description
prompt str The user instruction, typically paired with an image input.
image Image The image input of the user prompt.
image_path str Path to the associated image file.
response1 str The textual response generated by Model 1.
response2 str The textual response generated by Model 2.
model1 str Name of Model 1.
model2 str Name of Model 2.
criteria str The evaluation criteria in textual form.
criteria_preference str The human-annotated preference (either "response1" or "response2") under the given criterion.
id str A unique identifier for this data sample.

Text-Video-to-Text

Key Type Description
prompt str The user instruction, typically paired with a video input.
video str Path to the associated video file.
response1 str The textual response generated by Model 1.
response2 str The textual response generated by Model 2.
model1 str Name of Model 1.
model2 str Name of Model 2.
criteria str The evaluation criteria in textual form.
criteria_preference str The human-annotated preference (either "response1" or "response2") under the given criterion.
id str A unique identifier for this data sample.

Text-Audio-to-Text

Key Type Description
prompt str The user instruction, typically paired with a audio input.
audio audio A huggingface audio object.
audio_path str Path to the associated audio file.
response1 str The textual response generated by Model 1.
response2 str The textual response generated by Model 2.
model1 str Name of Model 1.
model2 str Name of Model 2.
criteria str The evaluation criteria in textual form.
criteria_preference str The human-annotated preference (either "response1" or "response2") under the given criterion.
id str A unique identifier for this data sample.

Text-to-Image

Key Type Description
prompt str The image generation instruction.
response1 Image The image generated by Model 1.
response2 Image The image generated by Model 2.
response1_path str Path to the image file generated by Model 1.
response2_path str Path to the image file generated by Model 2.
model1 str Name of Model 1.
model2 str Name of Model 2.
criteria str The evaluation criteria in textual form.
criteria_preference str The human-annotated preference (either "response1" or "response2") under the given criterion.
id str A unique identifier for this data sample.

Text-to-Video

Key Type Description
prompt str The video generation instruction.
response1 str The video file generated by Model 1.
response2 str The video file generated by Model 2.
model1 str Name of Model 1.
model2 str Name of Model 2.
criteria str The evaluation criteria in textual form.
criteria_preference str The human-annotated preference (either "response1" or "response2") under the given criterion.
id str A unique identifier for this data sample.

Text-to-Audio

Key Type Description
prompt str The audio generation instruction.
response1 Audio The audio clip generated by Model 1.
response2 Audio The audio clip generated by Model 2.
response1_path str Path to the audio file generated by Model 1.
response2_path str Path to the audio file generated by Model 2.
model1 str Name of Model 1.
model2 str Name of Model 2.
criteria str The evaluation criteria in textual form.
criteria_preference str The human-annotated preference (either "response1" or "response2") under the given criterion.
id str A unique identifier for this data sample.

Text-to-3D

Key Type Description
prompt str The 3D generation instruction.
response1 Image The 3D image generated by Model 1.
response2 Image The 3D image generated by Model 2.
response1_path str Path to the 3D image generated by Model 1.
response2_path str Path to the 3D image generated by Model 2.
model1 str Name of Model 1.
model2 str Name of Model 2.
criteria str The evaluation criteria in textual form.
criteria_preference str The human-annotated preference (either "response1" or "response2") under the given criterion.
id str A unique identifier for this data sample.

Text-Image-to-Image

Key Type Description
prompt str The image edit instruction.
image Image The image file to be edited.
image_path str Path to the image file to be edited.
response1 Image The final image generated by Model 1.
response2 Image The final image generated by Model 2.
response1_path str Path to the final image generated by Model 1.
response2_path str Path to the final image generated by Model 2.
model1 str Name of Model 1.
model2 str Name of Model 2.
criteria str The evaluation criteria in textual form.
criteria_preference str The human-annotated preference (either "response1" or "response2") under the given criterion.
id str A unique identifier for this data sample.

πŸš€ Running Evaluation

To evaluate an API-accessible model on our full benchmark suite, you can run the provided launch script:

bash scripts/eval/run_eval_api.sh <your_model_name>

Remember to Sspecifying the model name as a command-line argument (e.g., gpt-4, claude-3) for logging and tracking.

The scripts/eval/run_eval_api.sh script supports:

  • βœ… Evaluating all tasks or selected ones By default, the script runs on all supported tasks. To evaluate only specific tasks, simply comment out the unused tasks in the tasks list.

  • βœ… Two evaluation modes For each task, the script runs:

    • Without Tie Evaluation (default)
    • WithTie valuation (--with_tie)
  • βœ… Parallel execution Each pair of evaluations (w/ and w/o TIE) runs in parallel to speed up the process.

  • βœ… Customizable API endpoint The API URL is set to https://api.vveai.com/v1/chat/completions by default. You can modify this value in the script to use any OpenAI-compatible endpoint. For example, if you are serving a local model using vLLM, you can set:

    api_url="http://localhost:8000/v1/chat/completions"

    This allows you to benchmark models hosted on your own machine.

βš™οΈ Training

πŸ› οΈ Environment Setup

To reproduce the training process in our paper, please make sure to set up the environment as described below. Our training code is built upon the llama-factory framework.

git clone https://github.com/HongbangYuan/OmniReward.git
conda create -n omnireward python=3.10
conda activate omnireward

We recommend using torch==2.2.0 for best compatibility.

Install PyTorch (choose one based on your CUDA version):

# For CUDA 11.8:
pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 \
    --index-url https://download.pytorch.org/whl/cu118

# For CUDA 12.1:
pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 \
    --index-url https://download.pytorch.org/whl/cu121

Then install the remaining dependencies:

cd OmniReward/OmniReward-Factory
pip install -r requirements.txt

πŸ“¦ Data Preparation

Download all required training and evaluation datasets from OmniRewardData and OmniRewardBench:

cd OmniReward-Factory
bash scripts/download.sh

πŸ‹οΈβ€β™€οΈ Training Omni-Reward

To reproduce the training results described in our paper, please navigate to the OmniReward-Factory directory and run the following scripts:

cd OmniReward-Factory
bash scripts/train.sh
bash scripts/train_t2t.sh
bash scripts/train_ti2t.sh
bash scripts/train_t2iv.sh

πŸ“ˆ Loading and Evaluating Omni-Reward

You can also directly use our pretrained Omni-Reward for evaluation without retraining.

The models are publicly available at:

πŸ‘‰ https://huggingface.co/jinzhuoran/OmniRewardModel

cd OmniReward-Factory
bash scripts/eval_t2t.sh
bash scripts/eval_t2t_tie.sh
bash scripts/eval_ti2t.sh
bash scripts/eval_ti2t_tie.sh
  • --eval_dataset: Specifies the evaluation dataset (e.g., omni_t2t, omni_t2i, omni_t2v, etc.).

  • --eval_tie: Enables w/ Ties evaluation.

πŸ“š Training Data

The following table provides an overview of the subsets in OmniRewardData, including their associated task types and dataset sizes.

ℹ️ The asterisk (*) denotes the subset constructed in this work.

Subset Name Task Type #Samples
Skywork-Reward-Preference T2T 50,000
Omni-Skywork-Reward-Preference * T2T 16,376
Omni-UltraFeedback * T2T 7,901
RLAIF-V TI2T 83,124
OmniAlign-V-DPO TI2T 50,000
Omni-RLAIF-V * TI2T 15,867
Omni-VLFeedback * TI2T 12,311
HPDv2 T2I 50,000
EvalMuse T2I 2,944
Omni-HPDv2 * T2I 8,959
Omni-Open-Image-Preferences * T2I 8,105
VideoDPO T2V 10,000
VisionRewardDB-Video T2V 1,795

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages