Commit fefd62b

polish(pu): polish comments, docstring, readme

1 parent 50db85f · 37 files changed: +747 −334 lines
lzero/entry/README.md (193 additions, 0 deletions)
# LightZero Entry Functions

English | [中文](./README_zh.md)

This directory contains the training and evaluation entry functions for the algorithms in the LightZero framework. These entry functions are the main interfaces for launching different kinds of reinforcement learning experiments.

## 📁 Directory Structure

### 🎯 Training Entries

#### AlphaZero Family
- **`train_alphazero.py`** - Training entry for the AlphaZero algorithm
  - Suitable for perfect-information board games (e.g., Go, Chess)
  - Needs no environment model; learns through self-play
  - Uses Monte Carlo Tree Search (MCTS) for policy improvement
#### MuZero Family
- **`train_muzero.py`** - Standard training entry for the MuZero algorithm
  - Supports the MuZero, EfficientZero, Sampled EfficientZero, and Gumbel MuZero variants
  - Learns an implicit model of the environment (dynamics model)
  - Suitable for single-task reinforcement learning scenarios

- **`train_muzero_segment.py`** - MuZero training with a segment collector and buffer reanalyze
  - Uses `MuZeroSegmentCollector` for data collection
  - Supports the buffer-reanalyze trick for improved sample efficiency
  - Supported algorithms: MuZero, EfficientZero, Sampled MuZero, Sampled EfficientZero, Gumbel MuZero, StochasticMuZero

- **`train_muzero_with_gym_env.py`** - MuZero training adapted for Gym environments
  - Designed specifically for OpenAI Gym-style environments
  - Simplifies environment interface adaptation

- **`train_muzero_with_reward_model.py`** - MuZero training with a reward model
  - Integrates an external reward model
  - Suitable for scenarios that require learning complex reward functions

- **`train_muzero_multitask_segment_ddp.py`** - Multi-task distributed MuZero training
  - Supports multi-task learning
  - Uses DDP (Distributed Data Parallel) for distributed training
  - Uses the segment collector
#### UniZero Family
- **`train_unizero.py`** - Training entry for the UniZero algorithm
  - Based on the paper "UniZero: Generalized and Efficient Planning with Scalable Latent World Models"
  - Enhanced planning capabilities that better capture long-term dependencies
  - Uses scalable latent world models
  - Paper: https://arxiv.org/abs/2406.10667

- **`train_unizero_segment.py`** - UniZero training with a segment collector
  - Uses `MuZeroSegmentCollector` for efficient data collection
  - Supports the buffer-reanalyze trick

- **`train_unizero_multitask_segment_ddp.py`** - Multi-task distributed UniZero training
  - Supports multi-task learning and distributed training
  - Includes benchmark score definitions (e.g., Atari human-normalized scores)
  - Supports curriculum-learning strategies
  - Uses DDP to accelerate training

- **`train_unizero_multitask_balance_segment_ddp.py`** - Balanced multi-task distributed UniZero training
  - Balances sampling across tasks during multi-task training
  - Dynamically adjusts batch sizes per task
  - Suitable for scenarios with large differences in task difficulty

- **`train_unizero_multitask_segment_eval.py`** - Multi-task UniZero training with evaluation
  - Specialized for training with periodic evaluation in multi-task scenarios
  - Includes detailed evaluation-metric statistics

- **`train_unizero_with_loss_landscape.py`** - UniZero training with loss-landscape visualization
  - Visualizes the loss landscape during training
  - Helps understand the model's optimization process and generalization behavior
  - Integrates the `loss_landscapes` library
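The Atari human-normalized score mentioned above is conventionally defined as `(score - random) / (human - random)`: 0 corresponds to a random policy and 1 to human-level play. A minimal sketch of the idea (illustrative only; the statistics actually computed by `compute_unizero_mt_normalized_stats` may aggregate across tasks differently):

```python
def human_normalized_score(score: float, random_score: float, human_score: float) -> float:
    """Map a raw episode return onto the human-normalized scale:
    0.0 = random-policy performance, 1.0 = human performance."""
    return (score - random_score) / (human_score - random_score)

# e.g., for a hypothetical game with random return -21.0 and human return 15.0,
# an agent return of 15.0 maps to exactly 1.0
hns = human_normalized_score(15.0, random_score=-21.0, human_score=15.0)
```

Multi-task entries typically report the mean and median of this score over all tasks, since the median is more robust to a single task with an extreme score.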
#### ReZero Family
- **`train_rezero.py`** - Training entry for the ReZero algorithm
  - Supports ReZero-MuZero and ReZero-EfficientZero
  - Speeds up and stabilizes MCTS-based training as described in the linked paper
  - Paper: https://arxiv.org/pdf/2404.16364
### 🎓 Evaluation Entries

- **`eval_alphazero.py`** - Evaluation entry for AlphaZero
  - Loads trained AlphaZero models for evaluation
  - Can play against other agents to test performance

- **`eval_muzero.py`** - Evaluation entry for the MuZero family
  - Supports evaluation of all MuZero variants
  - Provides detailed performance statistics

- **`eval_muzero_with_gym_env.py`** - MuZero evaluation for Gym environments
  - Specialized for evaluating models trained in Gym environments
### 🛠️ Utility Modules

- **`utils.py`** - Common utility functions
  - **Math & Tensor Utilities**:
    - `symlog`, `inv_symlog` - Symmetric logarithm transform and its inverse
    - `initialize_zeros_batch`, `initialize_pad_batch` - Batch initialization
  - **LoRA Utilities**:
    - `freeze_non_lora_parameters` - Freeze all non-LoRA parameters
  - **Task & Curriculum-Learning Utilities**:
    - `compute_task_weights` - Compute per-task weights
    - `TemperatureScheduler` - Temperature scheduler
    - `tasks_per_stage` - Number of tasks per curriculum stage
    - `compute_unizero_mt_normalized_stats` - Compute normalized multi-task statistics
    - `allocate_batch_size` - Dynamically allocate per-task batch sizes
  - **Distributed Training Utilities (DDP)**:
    - `is_ddp_enabled` - Check whether DDP is enabled
    - `ddp_synchronize` - Synchronize DDP processes
    - `ddp_all_reduce_sum` - All-reduce sum across DDP processes
  - **RL Workflow Utilities**:
    - `calculate_update_per_collect` - Number of updates per collection phase
    - `random_collect` - Data collection with a random policy
    - `convert_to_batch_for_unizero` - Convert data into UniZero batch format
    - `create_unizero_loss_metrics` - Create the loss-metrics function
    - `UniZeroDataLoader` - Data loader for UniZero
  - **Logging Utilities**:
    - `log_module_trainable_status` - Log which modules are trainable
    - `log_param_statistics` - Log parameter statistics
    - `log_buffer_memory_usage` - Log replay-buffer memory usage
    - `log_buffer_run_time` - Log replay-buffer runtime

- **`__init__.py`** - Package initialization
  - Exports all training and evaluation entry functions
  - Exports commonly used helpers from the utility modules
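For reference, the symlog transform compresses large magnitudes while staying roughly linear near zero, which makes value and reward targets with wildly different scales easier to learn. A minimal scalar sketch of the idea (the actual `symlog`/`inv_symlog` in `utils.py` operate on tensors):

```python
import math

def symlog(x: float) -> float:
    """Symmetric log: sign(x) * log(1 + |x|). Compresses large |x|,
    is approximately the identity near 0, and is odd (symlog(-x) == -symlog(x))."""
    return math.copysign(math.log1p(abs(x)), x)

def inv_symlog(y: float) -> float:
    """Exact inverse of symlog: sign(y) * (exp(|y|) - 1)."""
    return math.copysign(math.expm1(abs(y)), y)
```

Because the two functions are exact inverses, a network can predict values in symlog space and decode them with `inv_symlog` at the end.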
## 📖 Usage Guide

### Basic Usage Pattern

All training entry functions follow a similar calling pattern:

```python
from lzero.entry import train_muzero

# Prepare the configuration
cfg = dict(...)         # User configuration
create_cfg = dict(...)  # Creation configuration

# Start training
policy = train_muzero(
    input_cfg=(cfg, create_cfg),
    seed=0,
    model=None,               # Optional: pre-initialized model
    model_path=None,          # Optional: path to a pretrained model
    max_train_iter=int(1e10), # Maximum number of training iterations
    max_env_step=int(1e10),   # Maximum number of environment steps
)
```
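Inside this loop, `calculate_update_per_collect` (from `utils.py`) decides how many gradient updates follow each collection phase. One common scheme ties the update count to the number of newly collected transitions through a replay ratio; the sketch below is a simplified illustration of that idea, not the actual helper, which may consult additional config fields:

```python
def calculate_update_per_collect_sketch(num_new_transitions: int, replay_ratio: float) -> int:
    """Roughly one gradient update per (1 / replay_ratio) newly collected
    transitions; a replay ratio of 0.25 means 1 update per 4 transitions.
    Always returns at least 1 so training never stalls on tiny collections."""
    return max(1, int(num_new_transitions * replay_ratio))

# e.g., 400 transitions collected at replay_ratio 0.25 -> 100 updates
updates = calculate_update_per_collect_sketch(400, replay_ratio=0.25)
```

Raising the replay ratio improves sample efficiency at the cost of more compute per environment step, which is the trade-off the `_segment` entries tune via reanalyze.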
### Choosing the Right Entry Function

1. **Single-Task Learning**:
   - Board games → `train_alphazero`
   - General RL tasks → `train_muzero` or `train_unizero`
   - Gym environments → `train_muzero_with_gym_env`

2. **Multi-Task Learning**:
   - Standard multi-task → `train_unizero_multitask_segment_ddp`
   - Balanced task sampling → `train_unizero_multitask_balance_segment_ddp`

3. **Distributed Training**:
   - Every entry function with the `_ddp` suffix supports distributed training

4. **Special Requirements**:
   - Loss-landscape visualization → `train_unizero_with_loss_landscape`
   - External reward model → `train_muzero_with_reward_model`
   - Improved training stability → `train_rezero`
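The dynamic batch sizing behind the balanced multi-task entry can be sketched as splitting a global batch across tasks in proportion to per-task weights. This only illustrates the idea behind `allocate_batch_size`; the real implementation may use different signals (e.g., task difficulty, solved status, or curriculum stage):

```python
def allocate_batch_sizes(total_batch: int, task_weights: dict) -> dict:
    """Split a global batch across tasks proportionally to their weights,
    guaranteeing every task at least one sample per batch."""
    weight_sum = sum(task_weights.values())
    return {
        task: max(1, round(total_batch * w / weight_sum))
        for task, w in task_weights.items()
    }

# e.g., a task weighted 3x gets 3x the batch share
sizes = allocate_batch_sizes(64, {"Pong": 1.0, "MsPacman": 3.0})
```

The `max(1, ...)` floor matters in practice: without it, a very easy (low-weight) task could be starved of samples entirely and forgotten.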
## 🔗 Related Resources

- **AlphaZero**: [Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm](https://arxiv.org/abs/1712.01815)
- **MuZero**: [Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model](https://arxiv.org/abs/1911.08265)
- **EfficientZero**: [Mastering Atari Games with Limited Data](https://arxiv.org/abs/2111.00210)
- **UniZero**: [Generalized and Efficient Planning with Scalable Latent World Models](https://arxiv.org/abs/2406.10667)
- **ReZero**: [Boosting MCTS-based Algorithms by Reconstructing the Terminal Reward](https://arxiv.org/abs/2404.16364)
## 💡 Tips

- Start with the standard `train_muzero` or `train_unizero` entries
- For large-scale experiments, use the DDP versions for faster training
- The `_segment` versions achieve better sample efficiency (via the reanalyze trick)
- See the configuration examples under the `zoo/` directory to learn how to set up each algorithm
## 📝 Notes

1. All path parameters should use **absolute paths**
2. Pretrained model paths typically follow the format `exp_name/ckpt/ckpt_best.pth.tar`
3. When using distributed training, make sure the `CUDA_VISIBLE_DEVICES` environment variable is set correctly
4. Some entry functions only support specific algorithm types - check each function's documentation

lzero/entry/README_zh.md (193 additions, 0 deletions)

(Chinese translation of `lzero/entry/README.md`; its content mirrors the English README above section for section.)
