feature(xjy): Refine PriorZero Implementation #441
Open
xiongjyu wants to merge 68 commits into opendilab:dev-multitask-balance-clean-rft from
Conversation
…_llm_prior, and SFT loss
xiongjyu
commented
Nov 24, 2025
puyuan1996
reviewed
Nov 24, 2025
…lect to cprofile.
…ed the REINFORCE-series loss computation.
…me. Single-GPU works; multi-GPU not tested yet.
puyuan1996
reviewed
Dec 15, 2025
for i in range(num_engines):
    bundle_indices = None
    if tensor_parallel_size > 1:
        bundle_indices = get_bundle_indices(shared_pg, i, tensor_parallel_size)
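For context, a helper like `get_bundle_indices` typically maps engine `i` to its slice of bundles in the shared Ray placement group. A minimal sketch, assuming a contiguous assignment of `tensor_parallel_size` bundles per engine (the actual OpenRLHF/PR implementation may differ, and `shared_pg` is unused in this simplified version):

```python
def get_bundle_indices(shared_pg, engine_idx, tensor_parallel_size):
    """Sketch: assign each vLLM engine a contiguous slice of placement-group bundles.

    Engine 0 gets bundles [0, tp), engine 1 gets [tp, 2*tp), and so on.
    `shared_pg` is unused here; a real implementation would validate the
    requested indices against the placement group's bundle count.
    """
    start = engine_idx * tensor_parallel_size
    return list(range(start, start + tensor_parallel_size))
```

With `tensor_parallel_size = 1` (the setting used in this PR), each engine simply gets its own single bundle.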
Collaborator
Author
This vllm_engine is essentially the same as the corresponding part of OpenRLHF; however, we currently use only a single vLLM instance with tensor_parallel_size = 1, since GPU memory is sufficient.
puyuan1996
reviewed
Dec 15, 2025
…ple for world-model training; train LLM only on latest trajectories
Force-pushed from 51292a4 to f88989b
…ffle and some fine-grained metrics
… as the advantage value.
…averages over parameter updates; add NaN debug logs; enable LLM checkpoint saving.
…policy_model and vllm.
…rZeroEvaluator including WM, WM_LLMPrior, and LLMPrior
This PR mainly refines the PriorZero implementation and development workflow, fixes several critical issues affecting training correctness and stability, and systematically strengthens the training logic, loss computation, and data collection.
Work completed in this PR
• Fixed multiple critical bugs in the PriorZero training pipeline, including errors in game segment construction, loss computation, log-prob alignment, and action handling.
• Completed the REINFORCE / RFT-style policy optimization: old_logprob is now correctly stored in the buffer and used during updates, ensuring correct policy updates.
• Added and standardized training statistics, including KL divergence and policy entropy, for better monitoring of training status.
• Improved the data flow between the Collector and the Replay Buffer, increasing data consistency and sampling stability and reducing silent errors.
• Introduced and verified the vLLM weight synchronization mechanism in the single-GPU setting.
Still pending: vLLM weight synchronization and stability verification in multi-GPU / multi-node settings.
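The REINFORCE / RFT-style update with a stored old_logprob can be sketched as follows. This is a minimal illustration, not the PR's actual code: the function name `reinforce_loss`, its signature, and the monitoring statistics are hypothetical, assuming per-sample log-probs under the current and behavior policies and a precomputed advantage.

```python
import torch

def reinforce_loss(new_logprob, old_logprob, advantage):
    """Sketch of a REINFORCE-style surrogate loss with monitoring stats.

    new_logprob: log pi_new(a|s) under the current policy, shape (B,)
    old_logprob: log pi_old(a|s) stored in the buffer at collection time, shape (B,)
    advantage:   per-sample advantage estimates, shape (B,)
    """
    # Importance ratio between current and behavior policy.
    ratio = torch.exp(new_logprob - old_logprob)
    # Policy-gradient surrogate: maximize ratio * advantage.
    loss = -(ratio * advantage.detach()).mean()
    # Monitoring statistics (not part of the loss):
    # a simple sample-based KL estimate E[log pi_old - log pi_new] ...
    approx_kl = (old_logprob - new_logprob).mean()
    # ... and an entropy proxy from sampled log-probs (exact entropy
    # would require the full action distribution).
    entropy_proxy = -new_logprob.mean()
    return loss, {"approx_kl": approx_kl.item(), "entropy": entropy_proxy.item()}
```

Storing old_logprob at collection time is what makes the ratio well-defined after several parameter updates; recomputing it from the current policy would silently turn the off-policy correction into a no-op (ratio ≡ 1).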