奖励模型

默认情况下，奖励模型是指具有分类头数值输出的模型，通常称为输出奖励模型（ORM）。这些模型会对其他模型的输出进行评分，从而生成一个标量值，表示模型响应的质量。

我们可以通过使用参数 reward_models 来加载具有分类头的奖励模型，或者加载经过奖励建模训练的奖励模型，进而使用模型的logits作为奖励。

自定义奖励模型

对于生成式奖励模型，有两种常见的调用方式：一种是在 Trainer 内部直接使用 reward_model_plugin 定义奖励模型的逻辑，可以使用PTEngine对奖励模型进行推理，另一种是通过外部部署的模型服务进行调用。

使用 reward_model_plugin 调用奖励模型时，模型会被内嵌在 Trainer 内部，无需额外占用计算资源。该方式优点是方便集成，但生成速度相对较慢，更适合参数量较小的奖励模型场景。
外部部署奖励模型时，可以通过诸如 swift deploy 或 vllm serve 等命令将模型服务部署于独立设备，大幅提升推理速度，适合参数量较大的模型。但这样需要预留额外的硬件资源。

内部插件

我们可以在 reward_model_plugin 中灵活地自定义奖励模型的处理逻辑。这使得实现诸如生成式奖励模型等技术成为可能，包括：

自定义模型的系统提示：定义特定的指令和上下文以指导评估过程。
处理模型交互历史：管理对话上下文，以提供有意义且具有上下文感知的评估。
定义自定义评估标准：设置独特的标准和度量，用于评估模型的响应，超越默认的准确性和相关性衡量标准。

通过reward_model_plugin，开发者可以针对其应用的特定需求定制奖励评估过程。这种灵活性允许更细致和有效的基于奖励的训练策略。

奖励模型通过plugin的__call__方法进行调用，该方法接受 inputs 作为参数，包含了模型输入输出的 messages 和数据集中的其他列

    def __call__(self, inputs):
        print(inputs)
        """
[
    {
        'messages': [
                {'role': 'system', 'content': 'system prompt'},
                {'role': 'query', 'content': 'query'},
                {'role': 'user', 'content': 'completions1'},
            ],
        'solution': "abc",
    },
    {
        'messages': [
                {'role': 'system', 'content': 'system prompt'},
                {'role': 'query', 'content': 'query'},
                {'role': 'user', 'content': 'completions2'},
            ],
        'solution': "abc",
    }
]

在插件中使用 PTEngine 进行奖励模型的推理，我们只需构造 messages ，并通过 infer 接口调用：

class RMPlugin(DefaultRMPlugin):

    def __init__(self, model, template):

        super().__init__(model, template)
        # initilize PTEngine to infer
        self.engine = PtEngine.from_model_template(self.model, self.template, max_batch_size=0)

    def __call__(self, inputs):
        system_prompt = ...
        query = ...
        messages = [{'role': 'system', 'content': system_prompt}, {'role': 'query', 'content': query}]
        result = self.engine.infer([messages], self.request_config, use_tqdm=False)
        rewards = ...
        return rewards

我们在 rm_plugin.py 中提供了一个简单的生成式奖励模型示例（GenRMPlugin）。

在 plugin.py 中自定义奖励模型插件，并使用 external_plugins 参数进行注册。

注意：

在 GRPOTrainer 中，reward_model 会依次append到 reward_funcs 中。因此，reward_weights 的顺序对应 [reward_funcs, reward_model]。
reward_model_plugin 默认为 default，即使用 ORM 处理逻辑。
对于参数量较大的模型，PTEngine 生成速度较慢，请使用外部部署方法

对于 BERT 这类无法通过 reward_model 加载的模型，我们可以内置在 reward_function 中进行加载，参考issue

外部部署

示例 2：使用 swift deploy 部署奖励模型并进行远程调用

这类方法则不需要使用 reward_model_plugin , 而是直接在奖励函数中进行调用即可

首先用如下命令启动模型服务：

# 注意部署的设备不要与训练设备重叠
CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift deploy \
    --model Qwen/Qwen2.5-72B-Instruct \
    --vllm_tensor_parallel_size 4

# [INFO:swift] model_list: ['Qwen2.5-72B-Instruct']
# INFO:     Started server process [xxxxxx]
# INFO:     Waiting for application startup.
# INFO:     Application startup complete.
# INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

在奖励函数中通过 OpenAI 库初始化客户端，指定模型服务的地址和端口，示例代码如下：

from openai import OpenAI

class RMReward(ORM):

    def __init__(self):
        super().__init__()
        try:
            self.client = OpenAI(
                api_key='EMPTY',
                base_url='http://127.0.0.1:8000/v1', # 若在本地部署则为 127.0.0.1
            )
            self.verify_model_name = self.client.models.list().data[0].id
        except Exception as e:
            raise RuntimeError('Failed to connect to the model service. Please deploy the model '
                               "using 'swift deploy' or 'vllm serve'.") from e


    def __call__(self, completions, messages, **kwargs) -> List[float]:
        rewards = []
        for completion, message in zip(completions, messages):
            rm_prompt = ... # 构建 reward model 的prompt
            chat_response = self.client.chat.completions.create(
                model=self.verify_model_name,
                messages=[
                    {
                        'role': 'system',
                        'content': 'You are a helpful assistant.'
                    },
                    {
                        'role': 'user',
                        'content': rm_prompt
                    },
                ],
            )
            response = chat_response.choices[0].message.content.strip()
            reward = ... # 根据奖励模型生成结果提取奖励值
            rewards.append(reward)
        return rewards

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

奖励模型

自定义奖励模型

内部插件

外部部署

FilesExpand file tree

奖励模型.md

Latest commit

History

奖励模型.md

File metadata and controls

奖励模型

自定义奖励模型

内部插件

外部部署