# LightZero Entry Functions

English | [中文](./README_zh.md)

This directory contains the training and evaluation entry functions for the various algorithms in the LightZero framework. These entry functions serve as the main interfaces for launching different types of reinforcement learning experiments.

## 📁 Directory Structure

### 🎯 Training Entries

#### AlphaZero Family
- **`train_alphazero.py`** - Training entry for the AlphaZero algorithm
  - Suitable for perfect-information board games (e.g., Go, Chess)
  - No environment model needed; learns through self-play
  - Uses Monte Carlo Tree Search (MCTS) for policy improvement

#### MuZero Family
- **`train_muzero.py`** - Standard training entry for the MuZero algorithm
  - Supports the MuZero, EfficientZero, Sampled EfficientZero, and Gumbel MuZero variants
  - Learns an implicit model of the environment (dynamics model)
  - Suitable for single-task reinforcement learning scenarios

- **`train_muzero_segment.py`** - MuZero training with a segment collector and buffer reanalyze
  - Uses `MuZeroSegmentCollector` for data collection
  - Supports the buffer reanalyze trick for improved sample efficiency
  - Supported algorithms: MuZero, EfficientZero, Sampled MuZero, Sampled EfficientZero, Gumbel MuZero, Stochastic MuZero

- **`train_muzero_with_gym_env.py`** - MuZero training adapted for Gym environments
  - Specifically designed for OpenAI Gym-style environments
  - Simplifies environment interface adaptation

- **`train_muzero_with_reward_model.py`** - MuZero training with a reward model
  - Integrates an external reward model
  - Suitable for scenarios that require learning complex reward functions

- **`train_muzero_multitask_segment_ddp.py`** - MuZero multi-task distributed training
  - Supports multi-task learning
  - Uses DDP (Distributed Data Parallel) for distributed training
  - Uses the segment collector

#### UniZero Family
- **`train_unizero.py`** - Training entry for the UniZero algorithm
  - Based on the paper "UniZero: Generalized and Efficient Planning with Scalable Latent World Models"
  - Enhanced planning capabilities for better capture of long-term dependencies
  - Uses scalable latent world models
  - Paper: https://arxiv.org/abs/2406.10667

- **`train_unizero_segment.py`** - UniZero training with a segment collector
  - Uses `MuZeroSegmentCollector` for efficient data collection
  - Supports the buffer reanalyze trick

- **`train_unizero_multitask_segment_ddp.py`** - UniZero multi-task distributed training
  - Supports multi-task learning and distributed training
  - Includes benchmark score definitions (e.g., Atari human-normalized scores)
  - Supports curriculum learning strategies
  - Uses DDP for training acceleration

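For context, the Atari human-normalized score mentioned above follows the standard convention: `(agent - random) / (human - random)`. A minimal sketch (the function name here is illustrative, not the library's exact API):

```python
def human_normalized_score(agent_score: float, random_score: float, human_score: float) -> float:
    # Standard Atari convention: 0.0 corresponds to random play, 1.0 to human level.
    return (agent_score - random_score) / (human_score - random_score)

# An agent scoring halfway between the random and human baselines:
print(human_normalized_score(agent_score=550.0, random_score=100.0, human_score=1000.0))  # 0.5
```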
- **`train_unizero_multitask_balance_segment_ddp.py`** - UniZero balanced multi-task distributed training
  - Implements balanced sampling across tasks in multi-task training
  - Dynamically adjusts batch sizes for different tasks
  - Suitable for scenarios with large variations in task difficulty

- **`train_unizero_multitask_segment_eval.py`** - UniZero multi-task evaluation training
  - Specialized for training with periodic evaluation in multi-task scenarios
  - Includes detailed evaluation metric statistics

- **`train_unizero_with_loss_landscape.py`** - UniZero training with loss landscape visualization
  - Visualizes the loss landscape during training
  - Helps in understanding the model's optimization process and generalization behavior
  - Integrates the `loss_landscapes` library

#### ReZero Family
- **`train_rezero.py`** - Training entry for the ReZero algorithm
  - Supports ReZero-MuZero and ReZero-EfficientZero
  - Improves training efficiency through backward-view buffer reanalyze
  - Paper: https://arxiv.org/pdf/2404.16364

### 🎓 Evaluation Entries

- **`eval_alphazero.py`** - Evaluation entry for AlphaZero
  - Loads trained AlphaZero models for evaluation
  - Can play against other agents for performance testing

- **`eval_muzero.py`** - Evaluation entry for the MuZero family
  - Supports evaluation of all MuZero variants
  - Provides detailed performance statistics

- **`eval_muzero_with_gym_env.py`** - MuZero evaluation for Gym environments
  - Specialized for evaluating models trained in Gym environments

### 🛠️ Utility Modules

- **`utils.py`** - Common utility functions library
  - **Math & Tensor Utilities**:
    - `symlog`, `inv_symlog` - Symmetric logarithm transformations
    - `initialize_zeros_batch`, `initialize_pad_batch` - Batch initialization

  - **LoRA Utilities**:
    - `freeze_non_lora_parameters` - Freeze non-LoRA parameters

  - **Task & Curriculum Learning Utilities**:
    - `compute_task_weights` - Compute task weights
    - `TemperatureScheduler` - Temperature scheduler
    - `tasks_per_stage` - Calculate tasks per stage
    - `compute_unizero_mt_normalized_stats` - Compute normalized statistics
    - `allocate_batch_size` - Dynamically allocate batch sizes

  - **Distributed Training Utilities (DDP)**:
    - `is_ddp_enabled` - Check if DDP is enabled
    - `ddp_synchronize` - DDP synchronization
    - `ddp_all_reduce_sum` - DDP all-reduce sum

  - **RL Workflow Utilities**:
    - `calculate_update_per_collect` - Calculate updates per collection
    - `random_collect` - Random policy data collection
    - `convert_to_batch_for_unizero` - UniZero batch data conversion
    - `create_unizero_loss_metrics` - Create loss metrics function
    - `UniZeroDataLoader` - UniZero data loader

  - **Logging Utilities**:
    - `log_module_trainable_status` - Log module trainable status
    - `log_param_statistics` - Log parameter statistics
    - `log_buffer_memory_usage` - Log buffer memory usage
    - `log_buffer_run_time` - Log buffer runtime

- **`__init__.py`** - Package initialization file
  - Exports all training and evaluation entry functions
  - Exports commonly used functions from the utility modules

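The `symlog`/`inv_symlog` transforms listed above are conventionally defined as `sign(x) * ln(1 + |x|)` and its inverse. A scalar sketch for illustration (the library versions operate on tensors and may differ in detail):

```python
import math

def symlog(x: float) -> float:
    # Symmetric log: sign(x) * ln(1 + |x|); compresses large magnitudes
    # while staying roughly linear near zero.
    return math.copysign(math.log1p(abs(x)), x)

def inv_symlog(y: float) -> float:
    # Exact inverse: sign(y) * (exp(|y|) - 1)
    return math.copysign(math.expm1(abs(y)), y)

# Round-trip sanity check across scales
for v in (-1e6, -1.5, 0.0, 2.0, 1e6):
    assert abs(inv_symlog(symlog(v)) - v) <= 1e-9 * max(1.0, abs(v))
```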
## 📖 Usage Guide

### Basic Usage Pattern

All training entry functions follow a similar calling pattern:

```python
from lzero.entry import train_muzero

# Prepare configuration
cfg = dict(...)  # User configuration
create_cfg = dict(...)  # Creation configuration

# Start training
policy = train_muzero(
    input_cfg=(cfg, create_cfg),
    seed=0,
    model=None,  # Optional: pre-initialized model
    model_path=None,  # Optional: pretrained model path
    max_train_iter=int(1e10),  # Maximum training iterations
    max_env_step=int(1e10),  # Maximum environment steps
)
```

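Beyond these arguments, segment-based entries also decide how many gradient updates to run after each collection phase; that is the role of `calculate_update_per_collect` in `utils.py`. A rough sketch of the common replay-ratio heuristic (function name and config fields here are illustrative, not the helper's exact signature):

```python
def update_per_collect_sketch(num_collected_transitions: int, replay_ratio: float) -> int:
    # Heuristic: scale the number of gradient updates with the amount of
    # newly collected data, never dropping below one update.
    return max(1, int(num_collected_transitions * replay_ratio))

print(update_per_collect_sketch(400, 0.25))  # 100
```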
### Choosing the Right Entry Function

1. **Single-Task Learning**:
   - Board games → `train_alphazero`
   - General RL tasks → `train_muzero` or `train_unizero`
   - Gym environments → `train_muzero_with_gym_env`

2. **Multi-Task Learning**:
   - Standard multi-task → `train_unizero_multitask_segment_ddp`
   - Balanced task sampling → `train_unizero_multitask_balance_segment_ddp`

3. **Distributed Training**:
   - All entry functions with the `_ddp` suffix support distributed training

4. **Special Requirements**:
   - Loss landscape visualization → `train_unizero_with_loss_landscape`
   - External reward model → `train_muzero_with_reward_model`
   - More efficient reanalyze → `train_rezero`

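The multi-task entries anneal a temperature when weighting tasks (see `TemperatureScheduler` in `utils.py`). A minimal linear-decay sketch with an assumed interface, not the library's exact API:

```python
class LinearTemperatureScheduler:
    """Linearly anneal a temperature from `start` to `end` over `total_steps`.

    Sketch only: the real `TemperatureScheduler` may expose different
    parameters and decay schedules.
    """
    def __init__(self, start: float, end: float, total_steps: int):
        self.start, self.end, self.total_steps = start, end, total_steps

    def value(self, step: int) -> float:
        # Clamp progress to [0, 1] so the temperature stays in [end, start].
        frac = min(max(step / self.total_steps, 0.0), 1.0)
        return self.start + (self.end - self.start) * frac

sched = LinearTemperatureScheduler(start=1.0, end=0.25, total_steps=1000)
print(sched.value(0), sched.value(500), sched.value(1000))  # 1.0 0.625 0.25
```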
## 🔗 Related Resources

- **AlphaZero**: [Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm](https://arxiv.org/abs/1712.01815)
- **MuZero**: [Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model](https://arxiv.org/abs/1911.08265)
- **EfficientZero**: [Mastering Atari Games with Limited Data](https://arxiv.org/abs/2111.00210)
- **UniZero**: [Generalized and Efficient Planning with Scalable Latent World Models](https://arxiv.org/abs/2406.10667)
- **ReZero**: [Boosting MCTS-based Algorithms by Backward-view and Entire-buffer Reanalyze](https://arxiv.org/abs/2404.16364)

## 💡 Tips

- Start with the standard `train_muzero` or `train_unizero`
- For large-scale experiments, consider the DDP versions for faster training
- The `_segment` versions can achieve better sample efficiency (via the reanalyze trick)
- Check the configuration examples in the `zoo/` directory to learn how to set up each algorithm

## 📝 Notes

1. All path parameters should use **absolute paths**
2. Pretrained model paths typically follow the format `exp_name/ckpt/ckpt_best.pth.tar`
3. When using distributed training, ensure the `CUDA_VISIBLE_DEVICES` environment variable is set correctly
4. Some entry functions require specific algorithm types - check each function's documentation
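The `CUDA_VISIBLE_DEVICES` convention from note 3 can be sanity-checked before launching a distributed run. A small sketch of the parsing rule (helper name is illustrative):

```python
import os

def parse_visible_gpus(env_value):
    """Interpret CUDA_VISIBLE_DEVICES the way CUDA runtimes do (sketch).

    None (unset) means all GPUs are visible; an empty string means none;
    otherwise the value is a comma-separated list of device IDs.
    """
    if env_value is None:
        return "all"
    return [v.strip() for v in env_value.split(",") if v.strip() != ""]

print(parse_visible_gpus(os.environ.get("CUDA_VISIBLE_DEVICES")))
```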