add ppo #10884
Conversation
Thanks for your contribution!
| critic_model_config.use_sparse_head_and_loss_fn = False | ||
| critic_model_config.num_labels = 1 | ||
| critic_model_config.classifier_dropout = 0.0 | ||
| critic_model_config.hidden_dropout = "0" |
There was a problem hiding this comment.
class Qwen2ForTokenClassification(Qwen2PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.model = Qwen2Model(config)
        if getattr(config, "classifier_dropout", None) is not None:
            classifier_dropout = config.classifier_dropout
        elif getattr(config, "hidden_dropout", None) is not None:
            classifier_dropout = config.hidden_dropout
        else:
            classifier_dropout = 0.1
        self.dropout = nn.Dropout(classifier_dropout)
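The purpose of these settings is to make classifier_dropout equal 0.0. Writing config.hidden_dropout = "0" follows verl, but I agree it should be a float. It did not raise an error because only the first branch of the if statement is taken.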
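The branch order quoted above can be checked without loading a model. The following is a minimal standalone sketch, not code from the PR: `resolve_classifier_dropout` is a hypothetical helper that copies the quoted if/elif/else, and `SimpleNamespace` stands in for the model config.

```python
from types import SimpleNamespace


def resolve_classifier_dropout(config, default=0.1):
    """Hypothetical helper mirroring the if/elif/else quoted above."""
    if getattr(config, "classifier_dropout", None) is not None:
        return config.classifier_dropout
    if getattr(config, "hidden_dropout", None) is not None:
        return config.hidden_dropout
    return default


# classifier_dropout is set, so the string-valued hidden_dropout is never read.
as_in_diff = SimpleNamespace(classifier_dropout=0.0, hidden_dropout="0")
# Without classifier_dropout, the string would be passed through unchanged.
fallback = SimpleNamespace(classifier_dropout=None, hidden_dropout="0")

print(repr(resolve_classifier_dropout(as_in_diff)))
print(repr(resolve_classifier_dropout(fallback)))
```

The second case is why a float is the safer value for hidden_dropout: the string only stays harmless while classifier_dropout is also set.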
| critic_model_config.num_labels = 1 | ||
| critic_model_config.classifier_dropout = 0.0 | ||
| critic_model_config.hidden_dropout = "0" | ||
| critic_model_config.summary_dropout_prob = 0.0 |
There was a problem hiding this comment.
What is the summary_dropout_prob parameter?
There was a problem hiding this comment.
The summary_dropout_prob parameter is redundant and needs to be removed.
There was a problem hiding this comment.
critic_model_config.num_labels = 1
critic_model_config.classifier_dropout = 0.0
critic_model_config.hidden_dropout = 0.0
logger.info(f"Loading Critic model with config:\n\t{critic_model_config}\n")
Fixed.
| def compute_metrics(eval_preds): | ||
| accuracy = (eval_preds.predictions == 3).astype("float32").mean().item() | ||
| if training_args.use_rule_reward: | ||
| accuracy = (eval_preds.predictions == 1).astype("float32").mean().item() |
There was a problem hiding this comment.
This still feels a bit hacky. Also, in the else branch below, does == 3 mean that only a score of 3 counts as correct?
There was a problem hiding this comment.
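The else branch below is based on the reward server, whose scores range from -3 to 3; only 3 means fully correct (format + result).
The branch above is for the rule-based scoring function (currently implemented only for the gsm8k dataset), which scores 0 or 1.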
There was a problem hiding this comment.
Updated:
'''
If "use_rm_server" is True, the score ranges from -3 to 3, and 3 is the only correct score (format + result).
If the rule-based scoring function is used (use_rule_reward=True; currently implemented only for the gsm8k dataset), the score is either 0 or 1.
'''
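The two scoring conventions can be made explicit instead of comparing against bare literals. This is a sketch under the assumptions stated in the thread (1 is the correct score for the rule-based reward, 3 for the reward server); `accuracy` is a hypothetical helper operating on a plain list of scores, not the PR's `compute_metrics`.

```python
def accuracy(scores, use_rule_reward):
    """Fraction of responses whose score equals the 'fully correct' value.

    Assumption: the rule-based reward yields 0 or 1, the reward server
    yields integers from -3 to 3, and only the maximum counts as correct.
    """
    correct_score = 1 if use_rule_reward else 3
    if not scores:
        return 0.0
    return sum(1.0 for s in scores if s == correct_score) / len(scores)


print(accuracy([1, 0, 1, 1], use_rule_reward=True))
print(accuracy([3, -3, 1, 3], use_rule_reward=False))
```

Naming the threshold once keeps the two branches from drifting apart if a further reward source is added.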
| @paddle.no_grad() | ||
| def compute_gae_advantage_return( | ||
| token_level_rewards: paddle.Tensor, | ||
| rewards: paddle.Tensor, |
There was a problem hiding this comment.
The original function signature seems more general. Is this change necessary?
There was a problem hiding this comment.
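The reward here is a sequence-level reward, meaning each response maps to a single reward value. Both paddlenlp and verl base their rewards on this form.
A token-level reward would require training a separate reward model, and I have not seen such an implementation so far.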
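To illustrate how a single sequence-level reward can feed generalized advantage estimation, here is a textbook GAE sketch in pure Python. It is not the PR's `compute_gae_advantage_return`: placing the reward on the last response token, the default `gamma`/`lam` values, and both function names are assumptions made for this example.

```python
def sequence_to_token_rewards(seq_reward, response_len):
    """Put the single sequence-level reward on the last response token (assumption)."""
    if response_len < 1:
        raise ValueError("response_len must be at least 1")
    rewards = [0.0] * response_len
    rewards[-1] = float(seq_reward)
    return rewards


def gae(token_rewards, values, gamma=1.0, lam=1.0):
    """Standard GAE over response tokens only; the value after the last token is 0."""
    advantages = [0.0] * len(token_rewards)
    last_adv = 0.0
    next_value = 0.0
    for t in reversed(range(len(token_rewards))):
        # TD residual: r_t + gamma * V_{t+1} - V_t
        delta = token_rewards[t] + gamma * next_value - values[t]
        last_adv = delta + gamma * lam * last_adv
        advantages[t] = last_adv
        next_value = values[t]
    # Returns serve as the critic's regression target.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


rewards = sequence_to_token_rewards(1.0, response_len=3)
advantages, returns = gae(rewards, values=[0.5, 0.5, 0.5])
print(advantages, returns)
```

With `gamma = lam = 1`, every token's return equals the sequence reward, which is why a token-level reward tensor is not required for this setup.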
| @paddle.no_grad() | ||
| def compute_gae_advantage_return( | ||
| token_level_rewards: paddle.Tensor, | ||
| rewards: paddle.Tensor, |
There was a problem hiding this comment.
It looks like both compute_gae_advantage_return and compute_advantage have dropped the use_tgt_len_return logic, but the original implementation seems more flexible. What was the reasoning here?
There was a problem hiding this comment.
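use_tgt_len_return decides whether the prompt (non-response) tokens are concatenated when computing advantages. If it is true, the advantages for the prompt tokens are filled with 0, which serves no practical purpose. verl's implementation does not consider the prompt part either, so use_tgt_len_return is removed here.
In addition, I think the prompt is user input that the model cannot control, so there is no need to optimize over it.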
| if self.use_fp32_compute and reward_values.dtype != paddle.float32: | ||
| reward_values = reward_values.cast(paddle.float32) | ||
| reward_values = reward_values.squeeze(axis=-1)[:, response_start:-1] |
There was a problem hiding this comment.
Oh, so the earlier logic has been moved into this forward function?
There was a problem hiding this comment.
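Both the value and the advantage computation start at response_start and ignore the prompt.
The reward_values here come from the critic's predictions. They are compared against the returns computed in the previous step to obtain the value loss, so their shapes must match.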
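The shape requirement can be sketched with nested lists. This mirrors the `squeeze(axis=-1)[:, response_start:-1]` slice from the diff; `slice_response_values` and `mse_value_loss` are hypothetical names, and the plain mean-squared-error loss is an assumption, since the PR's actual value loss is not shown here.

```python
def slice_response_values(values, response_start):
    """values: nested list shaped [batch, seq_len, 1].

    Returns a [batch, seq_len - 1 - response_start] list, i.e. the last
    axis is squeezed and the prompt plus the final position are dropped.
    """
    squeezed = [[token[0] for token in seq] for seq in values]
    return [seq[response_start:-1] for seq in squeezed]


def mse_value_loss(pred_values, returns):
    """Mean squared error between predictions and returns (assumed loss form)."""
    total, count = 0.0, 0
    for pred_row, ret_row in zip(pred_values, returns):
        if len(pred_row) != len(ret_row):
            raise ValueError("critic values and returns must have the same shape")
        for p, r in zip(pred_row, ret_row):
            total += (p - r) ** 2
            count += 1
    return total / count


critic_out = [[[0.1], [0.2], [0.3], [0.4], [0.5]]]  # batch=1, seq_len=5
pred = slice_response_values(critic_out, response_start=2)
print(pred, mse_value_loss(pred, [[0.3, 0.6]]))
```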
| # config.architectures = [self.__class__.__name__] | ||
| self.init_score_head(config, hidden_size=config.hidden_size, **kwargs) | ||
| @classmethod |
There was a problem hiding this comment.
So in the end AutoModelForScore is not really used in our code, right?
There was a problem hiding this comment.
Correct, AutoModelForScore is no longer used. I did modify the original code, though, so the modified AutoModelForScore should also work.
| config.tensor_parallel_rank = 0 | ||
| with timers_scope_runtimer("Reward critic eval model loading time"): | ||
| with timers_scope_runtimer("Critic eval model loading time"): | ||
| critic_eval_model = AutoModelForScore.from_config(config) |
There was a problem hiding this comment.
Shouldn't this also be changed to AutoModelForTokenClassification for consistency?
| normalize_function: NormalizeFunction = "affine" | ||
| _initialized: bool = False | ||
| @classmethod |
There was a problem hiding this comment.
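This file was written for the earlier PPO implementation. Its main job is to normalize values, and it provides several normalization methods. However, the value predicted by the critic is used directly to optimize the critic model and is not the target (returns), so I think it should not be normalized (verl does not implement this either).
The file also initializes the value head and provides get_score (which feeds the hidden states into the value head). If AutoModelForScore is not enabled, this file has no effect.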
| attention_mask = rl_batch.batch["attention_mask"] # length: src+tgt | ||
| position_ids = rl_batch.batch["position_ids"] # length: src+tgt | ||
| sequence_mask = rl_batch.batch["sequence_mask"] # length: src+tgt(-1) | ||
| sequence_mask = rl_batch.batch["eos_mask"] # length: src+tgt(-1) |
There was a problem hiding this comment.
Why was sequence_mask renamed to eos_mask?
There was a problem hiding this comment.
batch.batch.update(
    {
        # "log_probs": old_log_probs,
        "reward_advantages": reward_advantages,
        "reward_advantages_clean": reward_advantages[eos_mask != 0],
        # "ref_log_probs": ref_log_probs,
        "rewards": rewards,
        "eos_mask": eos_mask,
    }
)
The packed data has always used the key eos_mask; only the name differs.
The eos_mask here likewise excludes the prompt part.
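The `reward_advantages[eos_mask != 0]` entry keeps only the positions where the mask is non-zero and flattens them into one dimension. A pure-Python sketch of that selection, with `masked_select` as a hypothetical helper and made-up numbers:

```python
def masked_select(values, mask):
    """Flatten `values`, keeping entries whose mask is non-zero.

    Both arguments are nested lists shaped [batch, response_len].
    """
    return [
        v
        for value_row, mask_row in zip(values, mask)
        for v, m in zip(value_row, mask_row)
        if m != 0
    ]


reward_advantages = [[0.5, -0.2, 0.0], [1.0, 0.3, 0.7]]
eos_mask = [[1, 1, 0], [1, 0, 0]]  # 0 marks positions after the end of the response
print(masked_select(reward_advantages, eos_mask))
```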
| return DataProto(meta_info={"metrics": {"train_value_loss": reward_critic_loss}}) | ||
| # return DataProto(meta_info={"metrics": {"train_value_loss": reward_critic_loss}}) | ||
| return {"train_value_loss": reward_critic_loss} |
There was a problem hiding this comment.
If it was originally a DataProto, let's keep it as a DataProto.
There was a problem hiding this comment.
train_value_loss = self.critic_trainer.update_critic(micro_batch)
rl_info.meta_info["metrics"].update(train_value_loss)
Original version:
rl_info.union(train_value_loss)
The original version raised an error, so the return statement was changed to return {"train_value_loss": reward_critic_loss}. rl_info is still a DataProto.
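The calling pattern described above can be sketched without the real classes. `MiniDataProto` and this `update_critic` are hypothetical stand-ins (only `meta_info` is modelled, and the loss value is made up); the point is that the container keeps its type while the critic step hands back a plain dict.

```python
from dataclasses import dataclass, field


@dataclass
class MiniDataProto:
    """Hypothetical stand-in for DataProto; only meta_info is modelled."""
    meta_info: dict = field(default_factory=lambda: {"metrics": {}})


def update_critic(micro_batch):
    """Placeholder for critic_trainer.update_critic; returns a plain metrics dict."""
    return {"train_value_loss": 0.25}  # made-up value for illustration


rl_info = MiniDataProto()
train_value_loss = update_critic(micro_batch=None)
# Merge the dict into the existing metrics instead of calling union().
rl_info.meta_info["metrics"].update(train_value_loss)
print(type(rl_info).__name__, rl_info.meta_info)
```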
| mean = getattr(config, "mean", None) | ||
| var = getattr(config, "var", None) | ||
| self.normalizer.set_mean_var(mean, var) | ||
| # if config.score_type == "reward": |
Before submitting
tests folder. If there are codecov issues, please add test cases first.
PR types
New features
PR changes
Models
Description
Modify the original PPO algorithm implementation