feature(xjy): Refine PriorZero Implementation #441
Open
xiongjyu wants to merge 68 commits into opendilab:dev-multitask-balance-clean-rft from
Conversation
…_llm_prior, and SFT loss
xiongjyu
commented
Nov 24, 2025
puyuan1996
reviewed
Nov 24, 2025
…lect to cprofile.
…ed the REINFORCE-series loss computation.
…me. Single-GPU works; multi-GPU not tested yet.
puyuan1996
reviewed
Dec 15, 2025
for i in range(num_engines):
    bundle_indices = None
    if tensor_parallel_size > 1:
        bundle_indices = get_bundle_indices(shared_pg, i, tensor_parallel_size)
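For context, a helper like `get_bundle_indices` typically maps engine `i` to its slice of bundles in the shared Ray placement group. A minimal sketch, assuming a contiguous assignment of `tensor_parallel_size` bundles per engine (the actual OpenRLHF/PR implementation may differ, and `shared_pg` is unused in this simplified version):

```python
def get_bundle_indices(shared_pg, engine_idx, tensor_parallel_size):
    """Sketch: assign each vLLM engine a contiguous slice of placement-group bundles.

    Engine 0 gets bundles [0, tp), engine 1 gets [tp, 2*tp), and so on.
    `shared_pg` is unused here; a real implementation would validate the
    requested indices against the placement group's bundle count.
    """
    start = engine_idx * tensor_parallel_size
    return list(range(start, start + tensor_parallel_size))
```

With `tensor_parallel_size = 1` (the setting used in this PR), each engine simply gets its own single bundle.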
Collaborator
Author
This vllm_engine is essentially the same as the corresponding part of OpenRLHF; however, we currently use only a single vLLM instance with tensor_parallel_size = 1, since GPU memory is sufficient.
puyuan1996
reviewed
Dec 15, 2025
…ple for world-model training; train LLM only on latest trajectories
Force-pushed from 51292a4 to f88989b
…ffle and some fine-grained metrics
… as the advantage value.
…averages over parameter updates; add NaN debug logs; enable LLM checkpoint saving.
…policy_model and vllm.
…rZeroEvaluator including WM, WM_LLMPrior, and LLMPrior
This PR mainly refines the PriorZero implementation and development workflow, fixes several critical issues affecting training correctness and stability, and systematically strengthens the training logic, loss computation, and data collection.
Work completed in this PR
• Fixed multiple critical bugs in the PriorZero training pipeline, including errors in game segment construction, loss computation, log-prob alignment, and action handling.
• Completed the REINFORCE / RFT-style policy optimization: old_logprob is now correctly stored in the buffer and used during updates, ensuring correct policy updates.
• Added and standardized training statistics, including KL divergence and policy entropy, for better monitoring of training status.
• Improved the data flow between the Collector and the Replay Buffer, increasing data consistency and sampling stability and reducing silent errors.
• Introduced and verified the vLLM weight synchronization mechanism in the single-GPU setting.
Still pending: vLLM weight synchronization and stability verification in multi-GPU / multi-node settings.
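The REINFORCE / RFT-style update with a stored old_logprob can be sketched as follows. This is a minimal illustration, not the PR's actual code: the function name `reinforce_loss`, its signature, and the monitoring statistics are hypothetical, assuming per-sample log-probs under the current and behavior policies and a precomputed advantage.

```python
import torch

def reinforce_loss(new_logprob, old_logprob, advantage):
    """Sketch of a REINFORCE-style surrogate loss with monitoring stats.

    new_logprob: log pi_new(a|s) under the current policy, shape (B,)
    old_logprob: log pi_old(a|s) stored in the buffer at collection time, shape (B,)
    advantage:   per-sample advantage estimates, shape (B,)
    """
    # Importance ratio between current and behavior policy.
    ratio = torch.exp(new_logprob - old_logprob)
    # Policy-gradient surrogate: maximize ratio * advantage.
    loss = -(ratio * advantage.detach()).mean()
    # Monitoring statistics (not part of the loss):
    # a simple sample-based KL estimate E[log pi_old - log pi_new] ...
    approx_kl = (old_logprob - new_logprob).mean()
    # ... and an entropy proxy from sampled log-probs (exact entropy
    # would require the full action distribution).
    entropy_proxy = -new_logprob.mean()
    return loss, {"approx_kl": approx_kl.item(), "entropy": entropy_proxy.item()}
```

Storing old_logprob at collection time is what makes the ratio well-defined after several parameter updates; recomputing it from the current policy would silently turn the off-policy correction into a no-op (ratio ≡ 1).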