Skip to content

add ppo#10884

Closed
dddd-d wants to merge 19 commits into
PaddlePaddle:developfrom
dddd-d:add_ppo
Closed

add ppo#10884
dddd-d wants to merge 19 commits into
PaddlePaddle:developfrom
dddd-d:add_ppo

Conversation

@dddd-d

@dddd-d dddd-d commented Jul 23, 2025

Copy link
Copy Markdown
Contributor

Before submitting

  • Lint code. If there are lint issues, please format the code first.
# Install and register `pre-commit` in the project folder
pip install pre-commit && pre-commit install

# Process previous code files separately
pre-commit run --file XXXX.py
  • Add test cases into tests folder. If there are codecov issues, please add tests cases first.

PR types

New features

PR changes

Models

Description

Modify the original PPO algorithm implementation

@paddle-bot

paddle-bot Bot commented Jul 23, 2025

Copy link
Copy Markdown

Thanks for your contribution!

@DesmonDay DesmonDay self-requested a review July 23, 2025 12:44
@CLAassistant

CLAassistant commented Jul 24, 2025

Copy link
Copy Markdown

CLA assistant check
All committers have signed the CLA.

Comment thread llm/alignment/rl/run_rl.py Outdated
critic_model_config.use_sparse_head_and_loss_fn = False
critic_model_config.num_labels = 1
critic_model_config.classifier_dropout = 0.0
critic_model_config.hidden_dropout = "0"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里为啥是个字符串?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

class Qwen2ForTokenClassification(Qwen2PreTrainedModel):
def init(self, config):
super().init(config)
self.num_labels = config.num_labels
self.model = Qwen2Model(config)
if getattr(config, "classifier_dropout", None) is not None:
classifier_dropout = config.classifier_dropout
elif getattr(config, "hidden_dropout", None) is not None:
classifier_dropout = config.hidden_dropout
else:
classifier_dropout = 0.1
self.dropout = nn.Dropout(classifier_dropout)

这里的作用就是让classifier_dropout=0.0。config.hidden_dropout = "0"是verl的写法,但我也觉得应该是float。没有报错的原因是只经过了if的第一个分支

Comment thread llm/alignment/rl/run_rl.py Outdated
critic_model_config.num_labels = 1
critic_model_config.classifier_dropout = 0.0
critic_model_config.hidden_dropout = "0"
critic_model_config.summary_dropout_prob = 0.0

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

summary_dropout_prob 是啥参数?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

summary_dropout_prob这个参数是多余的,需要删掉

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critic_model_config.num_labels = 1
critic_model_config.classifier_dropout = 0.0
critic_model_config.hidden_dropout = 0.0
logger.info(f"Loading Critic model with config:\n\t{critic_model_config}\n")

已修改

def compute_metrics(eval_preds):
accuracy = (eval_preds.predictions == 3).astype("float32").mean().item()
if training_args.use_rule_reward:
accuracy = (eval_preds.predictions == 1).astype("float32").mean().item()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

总觉得有点hack,话说下面else分支 == 3,也是指的打分为3才算正确吗?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

下面的else分支基于服务器奖励,打分范围在-3到3,只有3属于完全正确(格式+结果)。
上面的分支针对规则打分function(目前只实现了gsm8k数据集),打分:0/1

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

那要不这个加个注释说明一下吧

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

已修改:
'''
If "use_rm_server" is TRUE, the score ranges from -3 to 3, with 3 being the only correct score (format + result).
If using the "Regularized Matching Function (use_rule_reward=True)" (currently only implemented for the gsm8k dataset), the score ranges from 0 to 1.
'''

Comment thread paddlenlp/rl/algos/advantage.py Outdated
@paddle.no_grad()
def compute_gae_advantage_return(
token_level_rewards: paddle.Tensor,
rewards: paddle.Tensor,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

原来的函数写法感觉通用性更强一些,这块改动是否有必要?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里的reward是sequence_level_reward,就是一条response只对应一个奖励值。paddlenlp和verl的reward都基于这种形式。
token level reward需要单独训练一个奖励模型,目前还没有见过这种实现

Comment thread paddlenlp/rl/algos/advantage.py Outdated
@paddle.no_grad()
def compute_gae_advantage_return(
token_level_rewards: paddle.Tensor,
rewards: paddle.Tensor,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

目前看起来 compute_gae_advantage_return,compute_advantage函数都把 use_tgt_len_return 相关的逻辑去掉了,但是貌似原来的实现更加灵活些,这块是怎么考虑的?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use_tgt_len_return决定在计算优势的时候是否拼接prompt部分(非response)的token。如果true,就把prompt token对应的优势值填0, 但是实际上并没有什么意义。参考verl的实现也没有考虑prompt部分,因此,这里把use_tgt_len_return删除。
并且,我认为prompt属于用户的输入,模型无法控制,不需要对其进行优化。

if self.use_fp32_compute and reward_values.dtype != paddle.float32:
reward_values = reward_values.cast(paddle.float32)

reward_values = reward_values.squeeze(axis=-1)[:, response_start:-1]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

哦,前面的逻辑是挪到这个forward函数里头来处理了吗?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value和advantage的计算都从response_start开始,不考虑prompt。
这里的reward_values来自critic的预测,他需要和上一步计算的returns计算value loss,因此需要保持shape一致


# config.architectures = [self.__class__.__name__]
self.init_score_head(config, hidden_size=config.hidden_size, **kwargs)
@classmethod

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

所以最后我们的代码里面 AutoModelForScore 实际没啥用了是不

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AutoModelForScore实际上没有再用了。但是我修改了原始的代码,用修改后的AutoModelForScore应该也是可以的

Comment thread llm/alignment/rl/run_rl.py Outdated
config.tensor_parallel_rank = 0
with timers_scope_runtimer("Reward critic eval model loading time"):
with timers_scope_runtimer("Critic eval model loading time"):
critic_eval_model = AutoModelForScore.from_config(config)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个是不是也得统一改成 AutoModelForTokenClassification 呢?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里已经修改啦

normalize_function: NormalizeFunction = "affine"
_initialized: bool = False

@classmethod

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

emmm, 这个文件在做啥

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个文件是之前ppo写的,主要功能是对value进行归一化,设计了不同的归一化方法。但是 critic 预测的value要直接用来优化critic model,这并不是target值(returns),因此我认为不应该归一化(verl也并未实现)
这个文件还有一个功能是初始化value head以及get_score (将hidden states送入value head),如果不启用AutoModelForScore,这个文件是无效的

attention_mask = rl_batch.batch["attention_mask"] # length: src+tgt
position_ids = rl_batch.batch["position_ids"] # length: src+tgt
sequence_mask = rl_batch.batch["sequence_mask"] # length: src+tgt(-1)
sequence_mask = rl_batch.batch["eos_mask"] # length: src+tgt(-1)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

为啥 sequence_mask 名字换成 eos_mask了?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

batch.batch.update(
{
# "log_probs": old_log_probs,
"reward_advantages": reward_advantages,
"reward_advantages_clean": reward_advantages[eos_mask != 0],
# "ref_log_probs": ref_log_probs,
"rewards": rewards,
"eos_mask": eos_mask,
}
)
打包后的数据一直是eos_mask,实际上只是名字不同。
这里的eos_mask同样不包含prompt部分


return DataProto(meta_info={"metrics": {"train_value_loss": reward_critic_loss}})
# return DataProto(meta_info={"metrics": {"train_value_loss": reward_critic_loss}})
return {"train_value_loss": reward_critic_loss}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

原始就是DataProto的话,还是保持DataProto吧

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

train_value_loss = self.critic_trainer.update_critic(micro_batch)
rl_info.meta_info["metrics"].update(train_value_loss)

原始写法:
rl_info.union(train_value_loss)

原始写法会报错,修改为“ return {"train_value_loss": reward_critic_loss} ”,rl_info仍然是DataProto

mean = getattr(config, "mean", None)
var = getattr(config, "var", None)
self.normalizer.set_mean_var(mean, var)
# if config.score_type == "reward":

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

下面的注释如果没有用到的话,要不还原回去?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

注释部分已还原

@PaddlePaddle PaddlePaddle locked and limited conversation to collaborators Aug 6, 2025
@PaddlePaddle PaddlePaddle unlocked this conversation Aug 6, 2025
@PaddlePaddle PaddlePaddle locked and limited conversation to collaborators Aug 6, 2025
@PaddlePaddle PaddlePaddle unlocked this conversation Aug 6, 2025
@PaddlePaddle PaddlePaddle locked and limited conversation to collaborators Aug 6, 2025
@PaddlePaddle PaddlePaddle unlocked this conversation Aug 6, 2025
@Liujie0926 Liujie0926 marked this pull request as draft August 6, 2025 02:12
@Liujie0926 Liujie0926 marked this pull request as ready for review August 6, 2025 02:12
@PaddlePaddle PaddlePaddle locked and limited conversation to collaborators Aug 6, 2025
@PaddlePaddle PaddlePaddle unlocked this conversation Aug 6, 2025
@Liujie0926 Liujie0926 closed this Aug 6, 2025
@Liujie0926 Liujie0926 reopened this Aug 6, 2025
@PaddlePaddle PaddlePaddle locked and limited conversation to collaborators Aug 6, 2025
@PaddlePaddle PaddlePaddle unlocked this conversation Aug 6, 2025
@DesmonDay DesmonDay mentioned this pull request Aug 6, 2025
2 tasks
@DesmonDay DesmonDay closed this Aug 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants