Consultation on the paper: questions about the training pipeline #2

@Kenwwww

Description
  1. The report mentions a self-analysis training stage: the generator produces a proof, then writes its own analysis and score, and a meta-verifier evaluates the quality of that analysis as part of the reward. After such training, the model's reasoning format should logically be "first produce the proof, then issue its own analysis and score." Can this really generalize to "the model directly produces highly reliable, highly accurate proofs"? Is it because self-analysis is only one sub-task within training, so it does not strongly affect the reasoning format? Or is the base model large enough that this training genuinely improves its "thinking" ability? Does the approach also work on small-parameter models? And if it truly does improve thinking ability, does that contradict the claim of a widely discussed earlier paper that a large model's capabilities are fixed in pre-training, and reinforcement learning merely sharpens the output distribution for specific tasks?
  2. On automatic annotation. As I understand the report, for harder answers you sample multiple evaluation results and vote, thereby filtering out invalid annotations and improving annotation accuracy. For other hard-to-evaluate settings, such as vertical or private domains where LLM-as-judge is now popular, can the same voting mechanism improve accuracy? In other words, does this voting mechanism generalize to other reward models, or even to general-purpose judge models?
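For concreteness, here is a minimal sketch of the voting mechanism described in question 2: sample several judge verdicts per answer, majority-vote them, and discard the annotation when agreement is too low. The function name `vote_label` and the `min_agreement` threshold are illustrative assumptions, not details taken from the report.

```python
from collections import Counter

def vote_label(verdicts, min_agreement=0.6):
    """Majority-vote over sampled judge verdicts.

    Returns the winning label if its vote share reaches
    `min_agreement`, otherwise None (i.e. the annotation is
    filtered out as unreliable).
    """
    counts = Counter(verdicts)
    label, n = counts.most_common(1)[0]
    if n / len(verdicts) >= min_agreement:
        return label
    return None  # low agreement -> discard this annotation

# Five sampled verdicts, four agree -> the label is kept.
print(vote_label(["correct", "correct", "incorrect", "correct", "correct"]))
# A 1-1 split never reaches 0.6 agreement -> annotation dropped.
print(vote_label(["correct", "incorrect"]))
```

Whether this helps in a given vertical or private domain presumably depends on the judge's errors being roughly independent across samples; if every sample shares the same systematic bias, voting cannot correct it.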
