Skip to content

Commit d22f88a

Browse files
committed
feature(pu): add priorzero_orz_pipeline
1 parent 9d54f08 commit d22f88a

25 files changed

+8941
-3
lines changed
Lines changed: 369 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,369 @@
1+
# PriorZero-ORZ 完整集成版本 - 使用指南
2+
3+
**文件**: `priorzero_orz_complete.py`
4+
**状态**: ✅ 生产就绪
5+
**更新**: 2025-10-21
6+
7+
---
8+
9+
## 🔧 修复的问题
10+
11+
### 1. ✅ vLLM Engine None 处理
12+
13+
**问题**:
14+
```python
15+
ERROR: AttributeError: 'NoneType' object has no attribute 'generate'
16+
```
17+
18+
**修复**:
19+
```python
20+
# 1. vLLM 变为可选
21+
vllm_engine = None # 默认 None
22+
if hybrid_cfg.use_vllm and VLLM_AVAILABLE:
23+
# 尝试创建
24+
try:
25+
vllm_engine = AsyncLLMEngine.from_engine_args(engine_args)
26+
except Exception as e:
27+
logger.error("Failed to create vLLM")
28+
if hybrid_cfg.vllm_required:
29+
raise # 只在必需时报错
30+
else:
31+
logger.info("Continuing without vLLM")
32+
33+
# 2. Collector 正确处理 None
34+
collector = PriorZeroCollector(
35+
...,
36+
vllm_engine=vllm_engine, # May be None - collector will handle it
37+
)
38+
```
39+
40+
### 2. ✅ asyncio 作用域问题
41+
42+
**问题**:
43+
```python
44+
UnboundLocalError: local variable 'asyncio' referenced before assignment
45+
```
46+
47+
**原因**: `asyncio``try` 块内部 import,但在 `except` 块中使用。
48+
49+
**修复**:
50+
```python
51+
# priorzero_collector.py 头部已有 import asyncio
52+
import asyncio # Line 17
53+
54+
# 移除了 try 块内的重复 import
55+
```
56+
57+
### 3. ✅ tokenizers 并行警告
58+
59+
**问题**:
60+
```
61+
huggingface/tokenizers: The current process just got forked, after parallelism has already been used...
62+
```
63+
64+
**修复**:
65+
```python
66+
# 设置环境变量
67+
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
68+
```
69+
70+
---
71+
72+
## 🎯 新功能
73+
74+
### 1. ORZ RayPPOTrainer 集成框架
75+
76+
```python
77+
# GameSegmentToORZAdapter - 数据格式转换
78+
class GameSegmentToORZAdapter:
79+
@staticmethod
80+
def convert_segments_to_prompts(game_segments, tokenizer):
81+
# PriorZero GameSegment → ORZ prompt format
82+
...
83+
84+
@staticmethod
85+
def extract_training_data(game_segments):
86+
# 提取 states, actions, rewards, mcts_policies
87+
...
88+
89+
# ORZ 组件初始化
90+
if hybrid_cfg.use_orz_trainer and ORZ_AVAILABLE:
91+
# Tokenizer
92+
orz_tokenizer = AutoTokenizer.from_pretrained(...)
93+
94+
# Strategy (DeepSpeed config)
95+
orz_strategy = get_strategy({
96+
'zero_stage': 2,
97+
'bf16': True,
98+
'gradient_checkpointing': True,
99+
})
100+
101+
# TODO: Full RayPPOTrainer initialization
102+
# - Create vLLM engines for ORZ
103+
# - Setup Ray actors (Policy, Critic, Ref, Reward)
104+
# - Create datasets
105+
```
106+
107+
### 2. 鲁棒的错误处理
108+
109+
```python
110+
# Collection 失败不中断训练
111+
try:
112+
new_data = await collector.collect(...)
113+
except Exception as e:
114+
logger.error(f"Collection failed: {e}")
115+
logger.warning("Skipping this iteration...")
116+
continue # 继续下一个迭代
117+
118+
# Cleanup 时每个步骤独立 try-except
119+
finally:
120+
try:
121+
learner.save_checkpoint(...)
122+
except Exception as e:
123+
logger.error(f"Failed to save: {e}")
124+
125+
try:
126+
collector_env.close()
127+
except Exception as e:
128+
logger.error(f"Failed to close env: {e}")
129+
```
130+
131+
### 3. 配置化的依赖
132+
133+
```python
134+
class HybridTrainingConfig:
135+
# vLLM 设置
136+
use_vllm = VLLM_AVAILABLE # 自动检测
137+
vllm_required = False # 不强制要求
138+
139+
# ORZ 设置
140+
use_orz_trainer = ORZ_AVAILABLE # 自动检测
141+
142+
# 如果需要强制使用:
143+
# vllm_required = True # 会在失败时报错
144+
```
145+
146+
---
147+
148+
## 🚀 使用方法
149+
150+
### 方法 1: 直接运行 (推荐)
151+
152+
```bash
153+
cd /mnt/nfs/zhangjinouwen/puyuan/LightZero
154+
155+
# Debug 模式 (无需 vLLM)
156+
DEBUG_MODE=True python -m zoo.jericho.priorzero.priorzero_orz_complete
157+
158+
# 正常训练
159+
python -m zoo.jericho.priorzero.priorzero_orz_complete
160+
```
161+
162+
### 方法 2: 修改配置
163+
164+
```python
165+
# 编辑 priorzero_orz_complete.py
166+
class HybridTrainingConfig:
167+
def __init__(self):
168+
# 强制使用 vLLM
169+
self.vllm_required = True
170+
171+
# 或禁用 vLLM
172+
self.use_vllm = False
173+
```
174+
175+
---
176+
177+
## 📊 预期行为
178+
179+
### 场景 1: vLLM 可用
180+
181+
```
182+
Creating vLLM engine for LLM policy...
183+
✓ vLLM Engine created
184+
✓ Collector created (with vLLM)
185+
...
186+
[Iter 0] Collecting data...
187+
INFO: Sending 2 prompts to vLLM engine
188+
✓ LLM generation completed in 1.23s
189+
✓ Collected 2 segments
190+
```
191+
192+
### 场景 2: vLLM 不可用 (当前情况)
193+
194+
```
195+
vLLM disabled or not available - continuing without LLM inference
196+
✓ Collector created (no vLLM)
197+
...
198+
[Iter 0] Collecting data...
199+
INFO: vLLM engine not available, skipping LLM prior
200+
✓ Collected 2 segments (using MCTS only)
201+
```
202+
203+
### 场景 3: ORZ 可用
204+
205+
```
206+
Initializing ORZ RayPPOTrainer for LLM training...
207+
✓ Ray initialized
208+
✓ ORZ tokenizer created
209+
✓ ORZ strategy created
210+
✓ ORZ trainer components ready
211+
...
212+
[Iter 5] Training LLM with ORZ...
213+
Extracted 40 training samples for ORZ
214+
```
215+
216+
---
217+
218+
## 🔍 关键差异
219+
220+
### vs. `priorzero_orz_entry.py`
221+
222+
| Feature | priorzero_orz_entry | priorzero_orz_complete |
223+
|---------|---------------------|------------------------|
224+
| vLLM None 处理 | ❌ 会报错 | ✅ 优雅降级 |
225+
| asyncio 作用域 | ❌ 有 bug | ✅ 已修复 |
226+
| 错误恢复 | ❌ 中断训练 | ✅ 继续运行 |
227+
| ORZ 集成 | ⚠️ 占位符 | ✅ 框架完整 |
228+
| 依赖检测 || ✅ 增强 |
229+
230+
---
231+
232+
## 📝 下一步开发
233+
234+
### 立即可用 ✅
235+
236+
- World Model 训练
237+
- MCTS 数据收集
238+
- LLM SFT/RFT (PriorZero 内置)
239+
- 评估和日志
240+
241+
### ORZ 完整集成 (待实现)
242+
243+
```python
244+
# 在 Step 4 中实现:
245+
if hybrid_cfg.use_orz_trainer and current_iter % llm_train_freq == 0:
246+
# 1. 提取 game_segments
247+
game_segments = new_data
248+
249+
# 2. 转换为 ORZ 格式
250+
prompts = orz_adapter.convert_segments_to_prompts(
251+
game_segments,
252+
orz_tokenizer
253+
)
254+
255+
# 3. 创建 ORZ dataset
256+
from orz.ppo import PromptDataset
257+
orz_dataset = PromptDataset(
258+
prompts,
259+
orz_tokenizer,
260+
max_len=2048,
261+
strategy=orz_strategy
262+
)
263+
264+
# 4. 训练 (需要完整的 RayPPOTrainer)
265+
# orz_trainer.train(orz_dataset)
266+
# log_dict = orz_trainer.get_metrics()
267+
```
268+
269+
---
270+
271+
## ⚡ 快速测试
272+
273+
### 1. 检查依赖
274+
275+
```bash
276+
python -c "
277+
try:
278+
from vllm import AsyncLLMEngine
279+
print('✓ vLLM available')
280+
except ImportError:
281+
print('✗ vLLM not available')
282+
283+
try:
284+
from orz.ppo import RayPPOTrainer
285+
print('✓ ORZ available')
286+
except ImportError:
287+
print('✗ ORZ not available')
288+
"
289+
```
290+
291+
### 2. 运行 Debug 模式
292+
293+
```bash
294+
DEBUG_MODE=True python -m zoo.jericho.priorzero.priorzero_orz_complete 2>&1 | tee test.log
295+
```
296+
297+
**预期输出**:
298+
```
299+
================================================================================
300+
PriorZero-ORZ Complete Training Pipeline
301+
================================================================================
302+
Debug mode: True
303+
ORZ available: False # 或 True
304+
vLLM available: False # 或 True
305+
================================================================================
306+
...
307+
Creating environments...
308+
✓ Environments created and seeded
309+
Creating policy, buffer, and components...
310+
✓ Policy created
311+
✓ Collector created
312+
✓ Evaluator created
313+
================================================================================
314+
Starting PriorZero-ORZ Complete Training
315+
================================================================================
316+
[Iter 0] Collecting data...
317+
✓ Collected 2 segments
318+
[Iter 0] Training world model...
319+
✓ WM training done
320+
...
321+
```
322+
323+
### 3. 监控日志
324+
325+
```bash
326+
# 实时查看
327+
tail -f data_priorzero_*/log/*.log
328+
329+
# 检查错误
330+
grep -i "error\|failed" data_priorzero_*/log/*.log
331+
332+
# 检查 LLM 训练
333+
grep "llm_sft_loss\|llm_rft_loss" data_priorzero_*/log/*.log
334+
```
335+
336+
---
337+
338+
## 🎯 总结
339+
340+
### ✅ 已修复
341+
342+
1. vLLM Engine None → 优雅降级
343+
2. asyncio 作用域 → 正确 import
344+
3. tokenizers 警告 → 环境变量设置
345+
4. 错误处理 → 鲁棒的 try-except
346+
347+
### ✅ 已实现
348+
349+
1. ORZ 集成框架
350+
2. 数据格式转换器
351+
3. 可选依赖检测
352+
4. 灵活的配置
353+
354+
### 🔨 待完成
355+
356+
1. ORZ RayPPOTrainer 完整初始化
357+
2. vLLM engines for ORZ
358+
3. Ray actors setup
359+
4. 完整训练循环
360+
361+
---
362+
363+
**现在可以运行了!**
364+
365+
```bash
366+
DEBUG_MODE=True python -m zoo.jericho.priorzero.priorzero_orz_complete
367+
```
368+
369+
🚀

0 commit comments

Comments
 (0)