Commit 2876bf8: [doc] cleanup redundant example and scripts (#1431)
1 parent: c97aff9


43 files changed: +6 -5835 lines

docs/en/get_started/quick_start.md

Lines changed: 0 additions & 9 deletions
````diff
@@ -105,15 +105,6 @@ PYTHONPATH=/root/Megatron-LM python tools/convert_torch_dist_to_hf.py \
 
 Note that as Megatron pads the embedding for better performance, the converted embedding may turn out incorrect. In that case, please manually set `--vocab-size` during conversion.
 
-For FSDP checkpoints (without `common.pt`), use the dedicated conversion script. Point `--input-dir` to the checkpoint directory (e.g. `iter_xxx` or `iter_xxx/model`) and provide the original Hugging Face directory:
-
-```bash
-python tools/convert_fsdp_to_hf.py \
-    --input-dir /path/to/fsdp_ckpt/iter_xxx \
-    --output-dir /root/fsdp-converted \
-    --origin-hf-dir /root/GLM-Z1-9B-0414
-```
-
 ## Training Script and Parameter Overview
 
 After completing the above preparation work, you can run the training script.
````
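The deleted passage identifies FSDP checkpoints by the absence of `common.pt`. A minimal sketch of that detection logic, should you want to automate it (assumption: `detect_checkpoint_format` is a hypothetical helper, not part of slime's tools):

```python
from pathlib import Path

def detect_checkpoint_format(ckpt_dir: str) -> str:
    """Guess the checkpoint format from directory contents.

    Per the docs above, Megatron torch-dist checkpoints contain a
    `common.pt` file, while FSDP checkpoints do not.
    """
    root = Path(ckpt_dir)
    # `common.pt` may sit at the top level or inside a `model/` subdirectory
    if (root / "common.pt").exists() or (root / "model" / "common.pt").exists():
        return "megatron"
    return "fsdp"
```

This only inspects the directory layout; it does not validate the checkpoint itself.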

docs/en/get_started/usage.md

Lines changed: 2 additions & 62 deletions
````diff
@@ -6,7 +6,7 @@
 When using slime, parameters are primarily passed for the following purposes:
 
 1. To allocate a portion of the GPUs in the cluster for training and another portion for inference.
-2. To load Megatron or FSDP for the training portion.
+2. To load Megatron for the training portion.
 3. To load SGLang for the inference portion.
 4. To configure the hyperparameters required for RL training.
 
@@ -35,7 +35,7 @@ Additionally, slime supports Prefill and Decode disaggregation (PD Disaggregatio
 slime supports multiple training backends, which can be selected via the `--train-backend` parameter:
 
 - `megatron` (default): Uses Megatron-LM as the training backend, supporting efficient training of large-scale models.
-- `fsdp`: Uses PyTorch FSDP as the training backend, allowing direct loading of HuggingFace-format weights without conversion.
+- `fsdp` (experimental): Uses PyTorch FSDP as the training backend, allowing direct loading of HuggingFace-format weights without conversion.
 
 ### Loading Megatron
 
@@ -322,63 +322,3 @@ In some customized Megatron implementations, special operations need to be perfo
 - `--custom-megatron-init-path`: Adds some initialization calls.
 - `--custom-megatron-before-log-prob-hook-path`: Is called before calculating the log probability.
 - `--custom-megatron-before-train-step-hook-path`: Is called before each training step. You could use this to mix in special training losses, for example.
-
-## How to Use FSDP
-
-slime also supports FSDP2 as the training backend; see the docs [here](https://lmsys.org/blog/2025-12-03-miles-fsdp/).
-
-> FSDP automatically reads all architecture information via `AutoModelForCausalLM.from_pretrained()`, without manual specification. Megatron requires manual configuration of parameters to describe the model architecture; FSDP can read everything from `config.json`, directly avoiding the weight-format conversion step.
-
-To run FSDP as the training backend, pass `--train-backend fsdp`.
-
-### Parameters
-
-The parameters FSDP uses are shown below in comparison to Megatron; more support is on the way.
-
-| Configuration Category | Megatron Parameter | FSDP Parameter | Description |
-| --- | --- | --- | --- |
-| **Model Loading** | `--load` (Megatron checkpoint) + architecture args (`--num-layers`, `--hidden-size`, etc.) | `--hf-checkpoint` (required) | **FSDP**: Directly uses HuggingFace format, no weight conversion needed; architecture inferred via `AutoConfig` |
-| **Tensor Parallel** | `--tensor-model-parallel-size` | Coming soon | |
-| **Pipeline Parallel** | `--pipeline-model-parallel-size` | Coming soon | |
-| **Expert Parallel** | `--expert-model-parallel-size` | Coming soon | |
-| **Context Parallel** | `--context-parallel-size` | `--context-parallel-size` | Both support CP |
-| **Initial Learning Rate** | `--lr` | `--lr` | Same parameter |
-| **Learning Rate Decay** | `--lr-decay-style` (linear/cosine, etc.) | `--lr-decay-style` | Same parameter |
-| **Warmup** | `--lr-warmup-iters` (steps) | `--lr-warmup-iters` | Same parameter |
-| **Min Learning Rate** | `--min-lr` | `--min-lr` | Same parameter |
-| **Optimizer Type** | `--optimizer` (adam/sgd, etc.) | `--optimizer` (default adam) | Basically the same |
-| **Distributed Optimizer** | `--use-distributed-optimizer` | Built into FSDP | FSDP uses a distributed optimizer by default |
-| **Gradient Checkpoint** | `--recompute-granularity`, `--recompute-method` | `--gradient-checkpointing` | **FSDP**: Simplified to a boolean switch |
-| **CPU Offload** | Implemented via the distributed optimizer | `--fsdp-cpu-offload` | **FSDP**: Offloads parameters/gradients/optimizer states to CPU |
-| **CPU Backend** | Implemented via the distributed optimizer | `--fsdp-cpu-backend` | **FSDP**: Specifies the CPU backend; a hybrid backend is used when CPU offload is enabled |
-| **Attention Backend** | Decided by Megatron Core | `--attn-implementation` (flash_attention_2/sdpa/eager) | **FSDP**: Passed directly to HuggingFace |
-| **Mixed Precision** | `--fp16` or `--bf16` | `--fp16` (bf16 inferred automatically) | Basically the same |
-| **Training Backend** | Default or `--train-backend megatron` | `--train-backend fsdp` (required) | Used to switch backends |
-| **Config** | | `--config` | **FSDP**: Sets additional parameters for the FSDP backend |
-
-### Quick Start
-
-```bash
-# If you need to use WANDB, set the environment variable WANDB_API_KEY in advance
-# Download model weights (Qwen3-4B)
-hf download Qwen/Qwen3-4B --local-dir /root/Qwen3-4B
-
-# Download training dataset (dapo-math-17k)
-hf download --repo-type dataset zhuzilin/dapo-math-17k \
-    --local-dir /root/dapo-math-17k
-
-# Download evaluation dataset (aime-2024)
-hf download --repo-type dataset zhuzilin/aime-2024 \
-    --local-dir /root/aime-2024
-
-# Clone code and install dependencies
-git clone https://github.com/THUDM/slime.git
-cd slime
-pip install -e . --no-deps
-
-
-# FSDP does not require weight conversion; it natively supports the HuggingFace format
-# Enable the reference model and train Qwen3-4B in colocated mode
-source /root/slime/scripts/run-qwen3-4B-fsdp.sh
-```
-
````
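The deleted note claims FSDP can infer the architecture entirely from `config.json` via `AutoConfig`, whereas Megatron needs explicit flags. A minimal standard-library sketch of that idea (assumption: `read_architecture` is a hypothetical helper, not slime's API; the field names follow the common HuggingFace `config.json` convention):

```python
import json
from pathlib import Path

def read_architecture(hf_checkpoint: str) -> dict:
    """Read model architecture from a HuggingFace-style config.json.

    This mirrors the fields Megatron would otherwise need as CLI args
    (--num-layers, --hidden-size, ...), read instead from the checkpoint.
    """
    cfg = json.loads(Path(hf_checkpoint, "config.json").read_text())
    return {
        "num_layers": cfg["num_hidden_layers"],
        "hidden_size": cfg["hidden_size"],
        "num_attention_heads": cfg["num_attention_heads"],
        "vocab_size": cfg["vocab_size"],
    }
```

In practice `AutoConfig.from_pretrained` handles many more fields and model-specific quirks; this only illustrates why no weight-format conversion step is needed.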

docs/zh/examples/qwen3-next-80B-A3B.md

Lines changed: 0 additions & 10 deletions
````diff
@@ -96,13 +96,3 @@ export BASE_FOLDER=/root
 export MASTER_ADDR=your_master_addr
 bash scripts/run-qwen3-next-80B-A3B.sh
 ```
-
-## Run Training (FSDP)
-
-```bash
-export BASE_FOLDER=./models/
-export MASTER_ADDR=127.0.0.1
-
-bash scripts/run-qwen3-next-80B-A3B-fsdp.sh
-```
-
````

docs/zh/get_started/quick_start.md

Lines changed: 0 additions & 9 deletions
````diff
@@ -104,15 +104,6 @@ PYTHONPATH=/root/Megatron-LM python tools/convert_torch_dist_to_hf.py \
 
 Since Megatron pads the embedding, the shape of the converted embedding weights may not match. In that case, set `--vocab-size` during conversion.
 
-For checkpoints trained and saved with the FSDP backend (directories without `common.pt`), use the dedicated conversion script. Point `--input-dir` to the checkpoint directory (e.g. `iter_xxx` or `iter_xxx/model`) and provide the path to the original Hugging Face model:
-
-```bash
-python tools/convert_fsdp_to_hf.py \
-    --input-dir /path/to/fsdp_ckpt/iter_xxx \
-    --output-dir /root/fsdp-converted \
-    --origin-hf-dir /root/GLM-Z1-9B-0414
-```
-
 ## Training Script and Parameter Overview
 
 After completing the above preparation, you can run the training script.
````
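The diff above keeps the note that Megatron pads the embedding, which is why conversion may need an explicit `--vocab-size`. A toy sketch of the padding and trimming involved (assumption: `pad_vocab` and `trim_embedding` are illustrative helpers, not slime's conversion code, and the padding multiple is an example value):

```python
def pad_vocab(vocab_size: int, multiple: int) -> int:
    """Megatron-style rounding of the vocab up to a multiple (e.g. TP size * 128)."""
    return ((vocab_size + multiple - 1) // multiple) * multiple

def trim_embedding(weight: list, vocab_size: int) -> list:
    """Drop the padding rows so the converted embedding matches the true vocab."""
    return weight[:vocab_size]
```

This is why a converter that only sees the padded tensor cannot always recover the original vocab size on its own.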

docs/zh/get_started/usage.md

Lines changed: 2 additions & 60 deletions
````diff
@@ -5,7 +5,7 @@
 When using slime, parameters are passed mainly for the following purposes:
 
 1. Allocate part of the cluster's GPUs for training and part for inference;
-2. Load Megatron or FSDP for the training part;
+2. Load Megatron for the training part;
 3. Load SGLang for the inference part;
 4. Configure the hyperparameters needed for RL training.
 
@@ -38,7 +38,7 @@
 slime supports multiple training backends, selectable via the `--train-backend` parameter:
 
 - `megatron` (default): uses Megatron-LM as the training backend, supporting efficient training of large-scale models;
-- `fsdp`: uses PyTorch FSDP as the training backend; it can load HuggingFace-format weights directly, with no conversion.
+- `fsdp` (experimental): uses PyTorch FSDP as the training backend; it can load HuggingFace-format weights directly, with no conversion.
 
 ### Loading Megatron
 
@@ -321,61 +321,3 @@ if __name__ == "__main__":
 - `--custom-megatron-init-path`: adds some init calls;
 - `--custom-megatron-before-log-prob-hook-path`: called before computing the log prob;
 - `--custom-megatron-before-train-step-hook-path`: called before each training step; can be used to mix in special training losses, for example.
-
-## How to Use FSDP
-
-slime also supports FSDP2 as a training backend; see the [documentation](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/slime/fsdp/readme.md).
-
-> FSDP automatically reads all architecture information via `AutoModelForCausalLM.from_pretrained()`, with no manual specification needed. Megatron requires manually configuring parameters to describe the model architecture; FSDP can read everything from `config.json` automatically, directly avoiding the weight-format conversion step.
-
-Pass `--train-backend fsdp` on the command line to launch FSDP as the training backend.
-
-### Parameters
-
-The table below compares the parameters supported by the FSDP and Megatron backends; more FSDP support is on the way.
-
-| Category | Megatron Parameter | FSDP Parameter | Notes |
-| --- | --- | --- | --- |
-| **Model Loading** | `--load` (Megatron checkpoint) + architecture args (`--num-layers`, `--hidden-size`, etc.) | `--hf-checkpoint` (required) | **FSDP**: uses HuggingFace format directly, no weight conversion; architecture inferred automatically via `AutoConfig` |
-| **Tensor Parallel** | `--tensor-model-parallel-size` | Coming soon | |
-| **Pipeline Parallel** | `--pipeline-model-parallel-size` | Coming soon | |
-| **Expert Parallel** | `--expert-model-parallel-size` | Coming soon | |
-| **Context Parallel** | `--context-parallel-size` | `--context-parallel-size` | Both support CP |
-| **Initial Learning Rate** | `--lr` | `--lr` | Same parameter |
-| **Learning Rate Decay** | `--lr-decay-style` (linear/cosine, etc.) | `--lr-decay-style` | Same parameter |
-| **Warmup** | `--lr-warmup-iters` (steps) | `--lr-warmup-iters` | Same parameter |
-| **Min Learning Rate** | `--min-lr` | `--min-lr` | Same parameter |
-| **Optimizer Type** | `--optimizer` (adam/sgd, etc.) | `--optimizer` (default adam) | Basically the same |
-| **Distributed Optimizer** | `--use-distributed-optimizer` | Built into FSDP | FSDP uses a distributed optimizer by default |
-| **Gradient Checkpoint** | `--recompute-granularity`, `--recompute-method` | `--gradient-checkpointing` | **FSDP**: simplified to a boolean switch |
-| **CPU Offload** | Implemented via the distributed optimizer | `--fsdp-cpu-offload` | **FSDP**: offloads parameters/gradients/optimizer states to CPU |
-| **CPU Backend** | Implemented via the distributed optimizer | `--fsdp-cpu-backend` | **FSDP**: specifies the CPU backend; a hybrid backend is used when CPU offload is enabled |
-| **Attention Backend** | Decided by Megatron Core | `--attn-implementation` (flash_attention_2/sdpa/eager) | **FSDP**: passed straight through to HuggingFace |
-| **Mixed Precision** | `--fp16` or `--bf16` | `--fp16` (bf16 inferred automatically) | Basically the same |
-| **Training Backend** | Default or `--train-backend megatron` | `--train-backend fsdp` (required) | Used to switch backends |
-| **Config** | | `--config` | **FSDP**: sets additional parameters for the FSDP backend |
-
-### FSDP Quick Start
-
-```bash
-# If you need WANDB, set the WANDB_API_KEY environment variable in advance
-# Download model weights (Qwen3-4B)
-hf download Qwen/Qwen3-4B --local-dir /root/Qwen3-4B
-
-# Download the training dataset (dapo-math-17k)
-hf download --repo-type dataset zhuzilin/dapo-math-17k \
-    --local-dir /root/dapo-math-17k
-
-# Download the evaluation dataset (aime-2024)
-hf download --repo-type dataset zhuzilin/aime-2024 \
-    --local-dir /root/aime-2024
-
-# Clone the code and install dependencies
-git clone https://github.com/THUDM/slime.git
-cd slime
-pip install -e . --no-deps
-
-
-# FSDP requires no weight conversion; it natively supports the HuggingFace format
-# Enable the reference model and train Qwen3-4B in colocated mode
-source /root/slime/scripts/run-qwen3-4B-fsdp.sh
````
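Both backends share the scheduler flags listed in the deleted table (`--lr`, `--lr-warmup-iters`, `--lr-decay-style`, `--min-lr`). A sketch of how such a linear-warmup-then-decay schedule is commonly computed (assumption: illustrative only, with `decay_iters` standing in for the total schedule length; this is not slime's or Megatron's implementation):

```python
import math

def lr_at_step(step, lr, min_lr, warmup_iters, decay_iters, decay_style="cosine"):
    """Learning rate with linear warmup followed by decay toward min_lr."""
    if warmup_iters > 0 and step < warmup_iters:
        return lr * (step + 1) / warmup_iters  # linear warmup
    # progress through the decay phase, clamped to [0, 1]
    t = min(max(step - warmup_iters, 0) / max(decay_iters - warmup_iters, 1), 1.0)
    if decay_style == "cosine":
        coeff = 0.5 * (1.0 + math.cos(math.pi * t))
    else:  # "linear"
        coeff = 1.0 - t
    return min_lr + (lr - min_lr) * coeff
```

For example, with `--lr 1e-3 --min-lr 1e-5 --lr-warmup-iters 10` the rate ramps linearly for 10 steps, then follows a cosine curve down to `1e-5`.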

examples/DrGRPO/README.md

Lines changed: 0 additions & 50 deletions
This file was deleted.

examples/DrGRPO/custom_reducer.py

Lines changed: 0 additions & 67 deletions
This file was deleted.

examples/README.md

Lines changed: 1 addition & 4 deletions
````diff
@@ -4,10 +4,7 @@ These examples provide concrete examples to leverage slime in your own RL workfl
 
 ## Directory Structure
 
-- **[DrGRPO](./DrGRPO)**: Custom reducer for the Dr.GRPO algorithm.
-- **[eval](./eval)**: Documentation and setup for evaluation environments using NeMo-Skills.
-- **[eval_multi_task](./eval_multi_task)**: Example for supporting OOD evaluation tasks, e.g., GPQA, IFBench.
-- **[formal_math](./formal_math)**: Examples related to formal math reasoning tasks, including a single-round demo.
+- **[eval_multi_task](./eval_multi_task)**: Example for supporting evaluation of multiple tasks with different configs.
 - **[fully_async](./fully_async)**: Demonstrates fully asynchronous rollout generation for higher efficiency.
 - **[geo3k_vlm](./geo3k_vlm)**: Training VLMs with FSDP on a single-turn reasoning task using GRPO on the GEO3K dataset.
 - **[geo3k_vlm_multi_turn](./geo3k_vlm_multi_turn)**: VLM multi-turn training (FSDP backend) on the Geo3k dataset.
````
