VLM Reward System

A Powerful Reward System for Vision-Language Model Reinforcement Learning

👋 Join our Discord or GitHub Issues
📍 Experience our models at ChatGLM and Zhipu AI Platform
🚀 Related Project: GLM-4.1V-Thinking
💡 Try the demo: python examples/reward_system_demo.py

About

VLM Reward System is developed by the CogVLM Team at Zhipu AI (智谱AI). As a core component of our vision-language model training infrastructure, this system powers the reinforcement learning training of our GLM-4.1V series models.

Key Features:

Production-Ready: Battle-tested in GLM-4.1V-Thinking training
Easy Integration: Works with any RL training pipeline
Multiple Verifiers: Math, general, chart, agent and more
Hybrid Verification: Combines rule-based verifiers with LLM-as-a-judge
Flexible Configuration: YAML-based setup for different use cases

Quick Start

Install the package:
```
pip install -e .
```
Set your API key:

   export ZHIPUAI_API_KEY='your_api_key_here'

Configure the reward system:

cp examples/configs/example.yaml.template examples/configs/example.yaml


Edit the `example.yaml` file to configure the reward system.

4. **Run the demo**:

```bash
python examples/reward_system_demo.py

Testing

Run the test suite to verify the installation:

pytest tests/

How It Works

The reward system takes three inputs and outputs a reward score:

Input:  Question + Ground Truth + Model Response
        ↓
Output: Reward Score (0.0 - 1.0)

Example Usage in RL Training:

from glmv_reward import RewardSystem

# Initialize the reward system
reward_system = RewardSystem("examples/configs/example.yaml")

# Evaluate model responses
rewards = reward_system.get_reward(
    prompts=["What is 15 + 27?"],
    answers=["<think>15 + 27 = 42</think><answer><|begin_of_box|>42<|end_of_box|></answer>"],
    gt_answers=["<think>15 + 27 = 42</think><answer><|begin_of_box|>42<|end_of_box|></answer>"],
    datasources=["math"]
)

# Use reward in your RL training
print(f"Reward: {rewards[0]}")  # Output: 1.0 (correct answer)

Configuration

The system uses YAML configuration files. For a complete configuration reference, see configs/full_config.yaml.

Example:

reward_configs:
  math_verifier_config:
    verifier_type: "math"
    enable_llm_judge_fallback: true
    llm_judge_url:
      - "https://open.bigmodel.cn/api/paas/v4/chat/completions"

Supported Verifiers

Our reward system includes multiple specialized verifiers, each optimized for different types of reasoning:

Core Verifiers

Math Verifier: Evaluates mathematical correctness using symbolic computation
Biology Verifier: Specialized for biological and life science questions
Chemistry Verifier: Handles chemistry problems and molecular reasoning
Physics Verifier: Evaluates physics problems and scientific reasoning
Geography Verifier: Specialized for geographical knowledge and spatial reasoning
Liberal Arts Verifier: Handles literature, history, and humanities questions

Multimodal Verifiers

Chart Verifier: Analyzes chart and visualization responses
OCR Verifier: Evaluates optical character recognition tasks
Multi-Image Verifier: Handles multi-image understanding tasks
VQA Verifier: Specialized for visual question answering
Counting Verifier: Evaluates counting and numerical reasoning in images

Agent Verifiers

AndroidWorld Verifier: Evaluates Android automation and interaction tasks
WebVoyager Verifier: Handles web navigation and interaction evaluation
OSWorld Verifier: Specialized for operating system interaction tasks

Specialized Task Verifiers

General Verifier: Handles general reasoning tasks with LLM judge fallback
Language Mix Verifier: Detects inappropriate language mixing patterns
GeoQuest Verifier: Handles geography-related question answering tasks
MMSI Verifier: Specialized for MMSI

Citation

If you find our work helpful, please consider citing:

@misc{glmvteam2025glm41vthinkingversatilemultimodalreasoning,
      title={GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning}, 
      author={GLM-V Team and Wenyi Hong and Wenmeng Yu and Xiaotao Gu and Guo Wang and Guobing Gan and Haomiao Tang and Jiale Cheng and Ji Qi and Junhui Ji and Lihang Pan and Shuaiqi Duan and Weihan Wang and Yan Wang and Yean Cheng and Zehai He and Zhe Su and Zhen Yang and Ziyang Pan and Aohan Zeng and Baoxu Wang and Boyan Shi and Changyu Pang and Chenhui Zhang and Da Yin and Fan Yang and Guoqing Chen and Jiazheng Xu and Jiali Chen and Jing Chen and Jinhao Chen and Jinghao Lin and Jinjiang Wang and Junjie Chen and Leqi Lei and Letian Gong and Leyi Pan and Mingzhi Zhang and Qinkai Zheng and Sheng Yang and Shi Zhong and Shiyu Huang and Shuyuan Zhao and Siyan Xue and Shangqin Tu and Shengbiao Meng and Tianshu Zhang and Tianwei Luo and Tianxiang Hao and Wenkai Li and Wei Jia and Xin Lyu and Xuancheng Huang and Yanling Wang and Yadong Xue and Yanfeng Wang and Yifan An and Yifan Du and Yiming Shi and Yiheng Huang and Yilin Niu and Yuan Wang and Yuanchang Yue and Yuchen Li and Yutao Zhang and Yuxuan Zhang and Zhanxiao Du and Zhenyu Hou and Zhao Xue and Zhengxiao Du and Zihan Wang and Peng Zhang and Debing Liu and Bin Xu and Juanzi Li and Minlie Huang and Yuxiao Dong and Jie Tang},
      year={2025},
      eprint={2507.01006},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.01006}, 
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

VLM Reward System

About

Quick Start

Testing

How It Works

Configuration

Supported Verifiers

Core Verifiers

Multimodal Verifiers

Agent Verifiers

Specialized Task Verifiers

Citation

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

VLM Reward System

About

Quick Start

Testing

How It Works

Configuration

Supported Verifiers

Core Verifiers

Multimodal Verifiers

Agent Verifiers

Specialized Task Verifiers

Citation