add ppo by dddd-d · Pull Request #10884 · PaddlePaddle/PaddleNLP

dddd-d · 2025-07-23T11:41:00Z

Before submitting

Lint code. If there are lint issues, please format the code first.

# Install and register `pre-commit` in the project folder
pip install pre-commit && pre-commit install

# Process previous code files separately
pre-commit run --file XXXX.py

Add test cases into tests folder. If there are codecov issues, please add test cases first.

PR types

New features

PR changes

Models

Description

Modify the original PPO algorithm implementation

paddle-bot · 2025-07-23T11:41:07Z

Thanks for your contribution!

CLAassistant · 2025-07-24T05:20:32Z

All committers have signed the CLA.

DesmonDay · 2025-07-28T03:50:40Z

+        critic_model_config.use_sparse_head_and_loss_fn = False
+        critic_model_config.num_labels = 1
+        critic_model_config.classifier_dropout = 0.0
+        critic_model_config.hidden_dropout = "0"


Why is this a string here?

class Qwen2ForTokenClassification(Qwen2PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.model = Qwen2Model(config)
        if getattr(config, "classifier_dropout", None) is not None:
            classifier_dropout = config.classifier_dropout
        elif getattr(config, "hidden_dropout", None) is not None:
            classifier_dropout = config.hidden_dropout
        else:
            classifier_dropout = 0.1
        self.dropout = nn.Dropout(classifier_dropout)

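The effect here is to set classifier_dropout to 0.0. config.hidden_dropout = "0" follows verl's convention, but I agree it should be a float. It did not raise an error because only the first branch of the if was taken.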
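To make the branch behavior concrete, here is a minimal sketch of the same fallback order in plain Python. The helper name `resolve_classifier_dropout` and the `SimpleNamespace` configs are hypothetical stand-ins for illustration, not code from the PR.

```python
from types import SimpleNamespace


def resolve_classifier_dropout(config, default=0.1):
    """Hypothetical helper mirroring the if/elif/else fallback quoted above."""
    if getattr(config, "classifier_dropout", None) is not None:
        return config.classifier_dropout
    if getattr(config, "hidden_dropout", None) is not None:
        return config.hidden_dropout
    return default


# classifier_dropout is set, so hidden_dropout is never read.
cfg = SimpleNamespace(classifier_dropout=0.0, hidden_dropout="0")
picked = resolve_classifier_dropout(cfg)

# Without classifier_dropout, the string would be what reaches the dropout layer.
cfg_fallback = SimpleNamespace(classifier_dropout=None, hidden_dropout="0")
picked_fallback = resolve_classifier_dropout(cfg_fallback)
```

In the first case the string is harmless only because it is shadowed; in the second it would be passed on unchanged, which is why a float is the safer value.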

DesmonDay · 2025-07-28T03:55:09Z

+        critic_model_config.num_labels = 1
+        critic_model_config.classifier_dropout = 0.0
+        critic_model_config.hidden_dropout = "0"
+        critic_model_config.summary_dropout_prob = 0.0


What is the summary_dropout_prob parameter?

The summary_dropout_prob parameter is redundant and needs to be removed.

critic_model_config.num_labels = 1
critic_model_config.classifier_dropout = 0.0
critic_model_config.hidden_dropout = 0.0
logger.info(f"Loading Critic model with config:\n\t{critic_model_config}\n")

Fixed.

DesmonDay · 2025-07-28T07:26:56Z

    def compute_metrics(eval_preds):
-        accuracy = (eval_preds.predictions == 3).astype("float32").mean().item()
+        if training_args.use_rule_reward:
+            accuracy = (eval_preds.predictions == 1).astype("float32").mean().item()


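This feels a bit hacky. By the way, does == 3 in the else branch below also mean that only a score of 3 counts as correct?

The else branch below is based on the reward server, whose scores range from -3 to 3; only 3 means fully correct (format + result).
The branch above is for the rule-based scoring function (currently implemented only for the gsm8k dataset), which scores 0 or 1.

Then how about adding a comment to explain this?

Fixed: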
'''
If "use_rm_server" is TRUE, the score ranges from -3 to 3, with 3 being the only correct score (format + result).
If using the "Regularized Matching Function (use_rule_reward=True)" (currently only implemented for the gsm8k dataset), the score ranges from 0 to 1.
'''
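As a rough illustration of the two scoring schemes, the sketch below computes accuracy over plain Python lists. The function name `accuracy` and the sample scores are made up for this example; the real code operates on eval_preds.predictions.

```python
def accuracy(scores, use_rule_reward):
    """Fraction of responses whose score equals the 'fully correct' value."""
    # Rule-based reward scores 0/1; the reward server scores in [-3, 3],
    # where only 3 means format and result are both correct.
    correct_score = 1 if use_rule_reward else 3
    if not scores:
        return 0.0
    return sum(1 for s in scores if s == correct_score) / len(scores)


rule_acc = accuracy([1, 0, 1, 1], use_rule_reward=True)      # 3 of 4 correct
server_acc = accuracy([3, -3, 1, 3], use_rule_reward=False)  # 2 of 4 correct
```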

DesmonDay · 2025-07-28T07:36:09Z

+@paddle.no_grad()
 def compute_gae_advantage_return(
-    token_level_rewards: paddle.Tensor,
+    rewards: paddle.Tensor,


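The original function signature seems more general. Is this change necessary?

The reward here is a sequence-level reward: each response corresponds to a single reward value. Both paddlenlp and verl rewards take this form.
Token-level rewards would require training a separate reward model, and I have not seen such an implementation so far.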
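For readers unfamiliar with how a single sequence-level reward feeds GAE, here is a minimal list-based sketch. It assumes the reward is placed on the last response token and that the value after the final token is 0; the function name and defaults are illustrative and this is not the PR's paddle implementation.

```python
def gae_from_sequence_reward(seq_reward, values, gamma=1.0, lam=1.0):
    """values: per-token critic values for the response tokens only."""
    n = len(values)
    if n == 0:
        return [], []
    # Sequence-level reward: zero everywhere except the final response token.
    rewards = [0.0] * n
    rewards[-1] = seq_reward

    advantages = [0.0] * n
    last_gae = 0.0
    for t in reversed(range(n)):
        # Assumption: no bootstrapping past the end of the response.
        next_value = values[t + 1] if t + 1 < n else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        last_gae = delta + gamma * lam * last_gae
        advantages[t] = last_gae

    # Returns are the regression targets for the critic.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


advantages, returns = gae_from_sequence_reward(1.0, [0.5, 0.2, 0.1])
```

With gamma = lam = 1 every return equals the sequence reward, which is a quick sanity check on the recursion.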

DesmonDay · 2025-07-28T11:27:15Z

+@paddle.no_grad()
 def compute_gae_advantage_return(
-    token_level_rewards: paddle.Tensor,
+    rewards: paddle.Tensor,


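It looks like both compute_gae_advantage_return and compute_advantage have dropped the use_tgt_len_return logic, but the original implementation seems more flexible. What was the reasoning here?

use_tgt_len_return controls whether the prompt (non-response) tokens are concatenated when computing advantages. If true, the advantage values for the prompt tokens are filled with 0, which serves no practical purpose. verl's implementation also ignores the prompt part, so use_tgt_len_return is removed here.
Also, I think the prompt is user input that the model cannot control, so there is no need to optimize over it.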
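A small sketch of why zero-padding the prompt positions changes nothing, assuming the loss is a masked mean whose mask excludes the prompt. `masked_mean` and the sample numbers are hypothetical and only illustrate the argument.

```python
def masked_mean(values, mask):
    """Mean of values over positions where mask is 1."""
    total = sum(v * m for v, m in zip(values, mask))
    count = sum(mask)
    return total / count if count else 0.0


prompt_len = 3
response_adv = [0.5, -0.25, 1.0]

# Response-only layout: every position is a response token.
resp_mean = masked_mean(response_adv, [1, 1, 1])

# Full-length layout: prompt positions padded with zeros and masked out.
full_adv = [0.0] * prompt_len + response_adv
full_mask = [0] * prompt_len + [1, 1, 1]
full_mean = masked_mean(full_adv, full_mask)
```

Both layouts give the same statistic under this assumption, so the padded form only adds length.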

DesmonDay · 2025-07-28T11:28:28Z

+        if self.use_fp32_compute and reward_values.dtype != paddle.float32:
+            reward_values = reward_values.cast(paddle.float32)
+
+        reward_values = reward_values.squeeze(axis=-1)[:, response_start:-1]


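Oh, so the earlier logic was moved into this forward function?

Both value and advantage are computed starting from response_start, ignoring the prompt.
The reward_values here are the critic's predictions; they are used together with the returns computed in the previous step to compute the value loss, so the shapes must match.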
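The shape arithmetic of `squeeze(axis=-1)[:, response_start:-1]` can be sketched with nested lists. The helper below is a hypothetical pure-Python analogue, not the paddle code in the PR.

```python
def slice_response_values(values, response_start):
    """values: [batch][seq_len][1] nested lists of critic outputs."""
    # Drop the trailing singleton axis, then keep positions
    # response_start .. seq_len - 2 (the last position is dropped).
    return [[tok[0] for tok in row][response_start:-1] for row in values]


batch = [[[float(t)] for t in range(6)]]  # one sequence of length 6
sliced = slice_response_values(batch, response_start=2)
```

For seq_len = 6 and response_start = 2 this leaves 6 - 2 - 1 = 3 values per sequence, which is the length the returns would need to have.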

DesmonDay · 2025-07-28T11:30:42Z


-        # config.architectures = [self.__class__.__name__]
-        self.init_score_head(config, hidden_size=config.hidden_size, **kwargs)
+    @classmethod


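So in the end AutoModelForScore is not really used anywhere in our code, right?

AutoModelForScore is in fact no longer used. However, I modified the original code, so using the modified AutoModelForScore should also work.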

DesmonDay · 2025-07-28T11:34:45Z

            config.tensor_parallel_rank = 0
-        with timers_scope_runtimer("Reward critic eval model loading time"):
+        with timers_scope_runtimer("Critic eval model loading time"):
            critic_eval_model = AutoModelForScore.from_config(config)


Shouldn't this also be changed to AutoModelForTokenClassification for consistency?

This has already been changed.

DesmonDay · 2025-07-28T11:35:24Z

    normalize_function: NormalizeFunction = "affine"
    _initialized: bool = False

+    @classmethod


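Hmm, what is this file doing?

This file was written for the earlier PPO implementation. Its main job is to normalize values, and it provides several normalization methods. However, the value predicted by the critic is used directly to optimize the critic model and is not the target (returns), so I don't think it should be normalized (verl does not implement this either).
The file also initializes the value head and provides get_score (which feeds hidden states into the value head). If AutoModelForScore is not enabled, this file has no effect.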

DesmonDay · 2025-07-29T06:30:08Z

-        attention_mask = rl_batch.batch["attention_mask"]  # length: src+tgt
        position_ids = rl_batch.batch["position_ids"]  # length: src+tgt
-        sequence_mask = rl_batch.batch["sequence_mask"]  # length: src+tgt(-1)
+        sequence_mask = rl_batch.batch["eos_mask"]  # length: src+tgt(-1)


Why was sequence_mask renamed to eos_mask?

batch.batch.update(
    {
        # "log_probs": old_log_probs,
        "reward_advantages": reward_advantages,
        "reward_advantages_clean": reward_advantages[eos_mask != 0],
        # "ref_log_probs": ref_log_probs,
        "rewards": rewards,
        "eos_mask": eos_mask,
    }
)
The packed data has always used eos_mask; only the name differs.
The eos_mask here likewise excludes the prompt part.
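The `reward_advantages[eos_mask != 0]` selection in the snippet above keeps only the unmasked entries and flattens them. A list-based sketch of that behavior, with a hypothetical helper name and made-up data:

```python
def select_by_mask(advantages, eos_mask):
    """Flatten the entries whose mask is non-zero, like x[mask != 0]."""
    return [
        a
        for adv_row, mask_row in zip(advantages, eos_mask)
        for a, m in zip(adv_row, mask_row)
        if m != 0
    ]


advantages = [[0.3, -0.1, 0.0], [0.7, 0.2, 0.4]]
eos_mask = [[1, 1, 0], [1, 1, 1]]  # first response ends one token early
clean = select_by_mask(advantages, eos_mask)
```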

DesmonDay · 2025-07-29T06:30:32Z


-        return DataProto(meta_info={"metrics": {"train_value_loss": reward_critic_loss}})
+        # return DataProto(meta_info={"metrics": {"train_value_loss": reward_critic_loss}})
+        return {"train_value_loss": reward_critic_loss}


If it was originally a DataProto, let's keep it as DataProto.

train_value_loss = self.critic_trainer.update_critic(micro_batch)
rl_info.meta_info["metrics"].update(train_value_loss)

Original code:
rl_info.union(train_value_loss)

The original code raises an error, so it was changed to return {"train_value_loss": reward_critic_loss}; rl_info is still a DataProto.
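The merge pattern described here can be sketched without the real classes. `MiniProto` and `update_critic_stub` below are hypothetical stand-ins that model only a meta_info dict and a dict-returning update; they are not the DataProto or update_critic from the repository.

```python
class MiniProto:
    """Hypothetical stand-in for DataProto: holds only a meta_info dict."""

    def __init__(self):
        self.meta_info = {"metrics": {}}


def update_critic_stub(loss):
    # Stand-in for update_critic: returns a plain metrics dict.
    return {"train_value_loss": loss}


rl_info = MiniProto()
# The container object is unchanged; only its metrics dict gains an entry.
rl_info.meta_info["metrics"].update(update_critic_stub(0.42))
```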

DesmonDay · 2025-08-01T05:45:41Z

-        mean = getattr(config, "mean", None)
-        var = getattr(config, "var", None)
-        self.normalizer.set_mean_var(mean, var)
+        # if config.score_type == "reward":


If the commented-out code below is not used, how about reverting it?

The commented-out part has been reverted.

add ppo

0bda8a7

paddle-bot Bot added the contributor label Jul 23, 2025

paddle-bot Bot assigned KB-Ding Jul 23, 2025

DesmonDay self-requested a review July 23, 2025 12:44

Update run_rl.py

faaa350

Update ppo_trainer.py

9f1b06c

DesmonDay reviewed Jul 28, 2025


DesmonDay reviewed Jul 29, 2025


dddd-d added 4 commits July 29, 2025 15:06

Update config_utils.py

f2664bc

Update run_rl.py

b517f02

Update gsm8k_processor.py

5d4c614

Update run_rl.py

1a574bd

DesmonDay reviewed Aug 1, 2025


dddd-d added 6 commits August 1, 2025 16:23

Update score_model_utils.py

d48283a

Update score_model_utils.py

d6d43a9

Update run_rl.py

a4705c4

Update ppo_trainer.py

01599d3

Update ppo_trainer.py

89f25f3

Update ppo_trainer.py

325d76c

dddd-d and others added 5 commits August 5, 2025 10:53

Update score_model_utils.py

c55a878

Update gsm8k_processor.py

2614d18

pre_commit

60486dc

Update advantage.py

da24bff

Update advantage.py

07a1dea

PaddlePaddle locked and limited conversation to collaborators Aug 6, 2025