
Commit 0e1c4b2

[doc] some clean up for the doc (#238)

* [doc] some clean up for the doc
* remove redundant scripts

1 parent: fd7c64e

11 files changed (+93, -1268 lines)

README.md

Lines changed: 2 additions & 109 deletions

@@ -38,116 +38,9 @@
  For a comprehensive quick start guide covering environment setup, data preparation, training startup, and key code analysis, please refer to:
  - [Quick Start Guide](./docs/en/quick_start.md)

- ## Checkpoint Format Conversion
+ ## Arguments Walk Through

- Since slime uses Megatron, and Megatron does not support loading Hugging Face checkpoints directly, we need to convert the model to the `torch_dist` format that Megatron supports.
-
- #### HF → Megatron torch_dist ckpt
-
- We use [mbridge](https://github.com/ISEEKYAN/mbridge.git) for the conversion:
-
- ```bash
- cd slime/
-
- source scripts/models/glm4-9B.sh
- PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
-     ${MODEL_ARGS[@]} \
-     --hf-checkpoint /root/GLM-Z1-9B-0414 \
-     --save /root/GLM-Z1-9B-0414_torch_dist
- ```
-
- This conversion requires a GPU. For large models, you can use the following method to convert with multiple GPUs; note that you can add the parallelism configuration the same way as in training:
-
- ```bash
- source scripts/models/glm4.5-355B-A32B.sh
- PYTHONPATH=/root/Megatron-LM/ torchrun \
-     --nproc-per-node 8 \
-     --master-addr ${MASTER_ADDR} --master-port 12345 \
-     --nnodes=2 --node-rank ${NODE_RANK} \
-     tools/convert_hf_to_torch_dist.py \
-     ${MODEL_ARGS[@]} \
-     --hf-checkpoint $BASE_DIR/GLM-4.5-355B-A32B/ \
-     --save $BASE_DIR/GLM-4.5-355B-A32B_torch_dist/
- ```
-
- ⚠️ If you encounter an issue where slime cannot be found, please run `pip install -e .` in the slime directory.
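The commands above `source` a model script to populate `MODEL_ARGS`. As a rough sketch (the flag values below are hypothetical, not the real contents of `scripts/models/glm4-9B.sh`), such a script defines a bash array of Megatron architecture flags that is then splatted into the conversion command line:

```shell
# Hypothetical illustration of a scripts/models/*.sh model file: it defines a
# MODEL_ARGS bash array of Megatron architecture flags (values are made up).
MODEL_ARGS=(
    --num-layers 40
    --hidden-size 4096
    --num-attention-heads 32
)

# Expanded into a command line as ${MODEL_ARGS[@]}
echo "${MODEL_ARGS[@]}"
```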
- #### Megatron torch_dist → HF ckpt
-
- To convert a `torch_dist` checkpoint saved during training back to a Hugging Face checkpoint:
-
- ```bash
- cd slime/
- PYTHONPATH=/root/Megatron-LM python tools/convert_torch_dist_to_hf.py \
-     --input-dir /path/to/torch_dist_ckpt/iter_xxx/ \
-     --output-dir /root/GLM-Z1-9B-0414-iter_xxx \
-     --origin-hf-dir /root/GLM-Z1-9B-0414
- ```
-
- Megatron sometimes pads the embedding; you can pass `--vocab-size` to make sure the embedding size of the converted HF checkpoint is correct.
-
- ⚠️ Since the `torch_dist` checkpoint converted by mbridge does not currently save args, you cannot convert the checkpoint from the previous step back to HF format.
-
- #### Any Megatron ckpt → HF
-
- Applicable for custom save formats (e.g., `--ckpt-format torch`).
-
- This conversion works by reusing the function that updates parameters from Megatron to SGLang during training. That is, reuse the training script and change the original command from:
-
- ```bash
- ray job submit --address="http://127.0.0.1:8265" \
-     --runtime-env-json='{
-         "env_vars": { ... }
-     }' \
-     -- python3 train.py \
-     ... # Other training args
- ```
-
- to:
-
- ```bash
- torchrun --nproc_per_node ${NUM_GPU} tools/convert_to_hf.py \
-     --load /your/saved/megatron_ckpt \
-     --output-dir /your/converted/hf_ckpt \
-     ... # Other training args
- ```
-
- That is, keep all other arguments the same, and:
-
- 1. Change the task launcher from `ray` to `torchrun`. Set the number of GPUs to the minimum required for Megatron's parallelism without data parallelism (DP). For example, if you are using `tp4`, set it to 4.
- 2. Make sure to change `--load` to the path of the checkpoint you want to load.
- 3. Add the `--output-dir` argument to specify where the converted Hugging Face checkpoint should be saved.
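The GPU-count rule in step 1 can be sketched as a small calculation (the `TP`/`PP` values here are hypothetical; use your own parallel sizes):

```shell
# The torchrun-based conversion needs only the model-parallel world size:
# no data parallelism, so GPUs = tensor parallel size x pipeline parallel size.
TP=4   # e.g. --tensor-model-parallel-size 4
PP=2   # e.g. --pipeline-model-parallel-size 2 (hypothetical)
NUM_GPU=$((TP * PP))
echo "torchrun --nproc_per_node ${NUM_GPU} tools/convert_to_hf.py ..."
```

With `tp4` and no pipeline parallelism, this yields 4, matching the example in the text.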
- ## Starting the Training Process
-
- The entire program needs to be launched using Ray. First, you need to start a Ray cluster. On node 0, run:
-
- ```bash
- # Node0 (HEAD)
- ray start --head --node-ip-address ${MASTER_ADDR} \
-     --num-gpus 8 --disable-usage-stats
-
- # Other Nodes
- ray start --address=${MASTER_ADDR}:6379 --num-gpus 8
- ```
-
- After the Ray cluster has started, you can submit a job from node 0, for example:
-
- ```bash
- ray job submit --address="http://127.0.0.1:8265" \
-     --runtime-env-json='{
-         "env_vars": {
-             "PYTHONPATH": "/root/Megatron-LM/",
-             ... # e.g., no_proxy, API variables, etc.
-         }
-     }' \
-     -- python3 train.py \
-     --... # Other Megatron/SGLang/slime arguments
- ```
-
- ### Argument Descriptions
-
- Arguments are divided into three categories:
+ Arguments in slime are divided into three categories:

  1. **Megatron arguments**: slime reads all arguments set in Megatron via `PYTHONPATH`. You can configure Megatron by passing arguments like `--tensor-model-parallel-size 2`.
  2. **SGLang arguments**: All arguments for the installed SGLang are supported. These arguments must be prefixed with `--sglang-`. For example, `--mem-fraction-static` should be passed as `--sglang-mem-fraction-static`.
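The `--sglang-` prefixing convention can be illustrated with a tiny helper (hypothetical, not part of slime; it only shows the textual mapping applied to a native SGLang flag):

```shell
# Map a native SGLang flag name to the form slime expects (illustrative only).
sglang_arg() {
    local flag="$1"
    # Strip the leading "--" and re-prefix with "--sglang-".
    printf -- '--sglang-%s\n' "${flag#--}"
}

sglang_arg --mem-fraction-static   # prints --sglang-mem-fraction-static
```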

README_zh.md

Lines changed: 1 addition & 108 deletions

@@ -33,114 +33,7 @@
  - [Quick Start Guide](./docs/zh/quick_start.md)

- ## Checkpoint Format Conversion
-
- Since slime uses Megatron, and Megatron does not support loading Hugging Face checkpoints, we need to convert the model to the `torch_dist` format that Megatron supports.
-
- #### HF → Megatron torch_dist ckpt
-
- We use [mbridge](https://github.com/ISEEKYAN/mbridge.git) for checkpoint conversion, as follows:
-
- ```bash
- cd slime/
-
- source scripts/models/glm4-9B.sh
- PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
-     ${MODEL_ARGS[@]} \
-     --hf-checkpoint /root/GLM-Z1-9B-0414 \
-     --save /root/GLM-Z1-9B-0414_torch_dist
- ```
-
- The conversion requires a GPU. If the model is large, you can convert across multiple nodes and GPUs as follows, configuring suitable parallelism just as in training, for example:
-
- ```bash
- source scripts/models/glm4.5-355B-A32B.sh
- PYTHONPATH=/root/Megatron-LM/ torchrun \
-     --nproc-per-node 8 \
-     --master-addr ${MASTER_ADDR} --master-port 12345 \
-     --nnodes=2 --node-rank ${NODE_RANK} \
-     tools/convert_hf_to_torch_dist.py \
-     ${MODEL_ARGS[@]} \
-     --hf-checkpoint $BASE_DIR/GLM-4.5-355B-A32B/ \
-     --save $BASE_DIR/GLM-4.5-355B-A32B_torch_dist/
- ```
-
- ⚠️ If slime cannot be found, run `pip install -e .` in the slime directory.
-
- #### Megatron torch_dist → HF ckpt
-
- To convert a `torch_dist` ckpt saved during training to an HF ckpt:
-
- ```bash
- cd slime/
- PYTHONPATH=/root/Megatron-LM python tools/convert_torch_dist_to_hf.py \
-     --input-dir /path/to/torch_dist_ckpt/iter_xxx/ \
-     --output-dir /root/GLM-Z1-9B-0414-iter_xxx \
-     --origin-hf-dir /root/GLM-Z1-9B-0414
- ```
-
- Since Megatron pads the embedding, the embedding shape of the converted HF checkpoint may not match. In that case, set `--vocab-size` during conversion.
-
- ⚠️ Since the `torch_dist` ckpt converted by mbridge does not currently save args, you cannot convert the `torch_dist` ckpt from the previous step back to HF.
-
- #### Any Megatron ckpt → HF
-
- Applicable for custom save formats (e.g., `--ckpt-format torch`).
-
- This conversion works by directly reusing the function that updates parameters from Megatron to SGLang during training; that is, reuse the training script and change the original:
-
- ```bash
- ray job submit --address="http://127.0.0.1:8265" \
-     --runtime-env-json='{
-         "env_vars": { ... }
-     }' \
-     -- python3 train.py \
-     ... # Other training args
- ```
-
- to:
-
- ```bash
- torchrun --nproc_per_node ${NUM_GPU} tools/convert_to_hf.py \
-     --load /your/saved/megatron_ckpt \
-     --output-dir /your/converted/hf_ckpt \
-     ... # Other training args
- ```
-
- That is, keep all other arguments unchanged, and:
-
- 1. Change the task launcher from ray to torchrun, and set the number of GPUs to the minimum required for Megatron's parallelism without DP; for example, with tp4, set it to 4.
- 2. Make sure `--load` points to the checkpoint path you want to load.
- 3. Add `--output-dir` for where the HF ckpt should be saved.
-
- ## Starting the Training Process
-
- The entire program needs to be launched using Ray. First start a Ray cluster, i.e. run on node 0:
-
- ```bash
- # Node0 (HEAD)
- ray start --head --node-ip-address ${MASTER_ADDR} \
-     --num-gpus 8 --disable-usage-stats
-
- # Other Nodes
- ray start --address=${MASTER_ADDR}:6379 --num-gpus 8
- ```
-
- After the Ray cluster has started, you can submit a job from node 0, for example:
-
- ```bash
- ray job submit --address="http://127.0.0.1:8265" \
-     --runtime-env-json='{
-         "env_vars": {
-             "PYTHONPATH": "/root/Megatron-LM/",
-             ... # e.g., no_proxy, API variables, etc.
-         }
-     }' \
-     -- python3 train.py \
-     --... # Other Megatron/SGLang/slime arguments
- ```
-
- #### Argument Descriptions
+ ## Argument Descriptions

  Arguments are divided into three categories:

docs/en/quick_start.md

Lines changed: 44 additions & 1 deletion

@@ -1,5 +1,7 @@
  # slime Quick Start Guide

+ [中文版](../zh/quick_start.md)
+
  This document will guide you through setting up the environment and getting started with slime within one hour, covering environment configuration, data preparation, training startup, and key code analysis and modifications.

  ## Basic Environment Setup

@@ -78,6 +80,22 @@ PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
      --save /root/GLM-Z1-9B-0414_torch_dist
  ```

+ For larger models, you can use `torchrun` to launch the conversion script and convert with multiple GPUs or even multiple nodes.
+
+ ### Convert from Megatron Format to Hugging Face Format
+
+ You can use the following script to convert the saved Megatron checkpoints back to Hugging Face format:
+
+ ```bash
+ PYTHONPATH=/root/Megatron-LM python tools/convert_torch_dist_to_hf.py \
+     --input-dir /path/to/torch_dist_ckpt/iter_xxx/ \
+     --output-dir /root/GLM-Z1-9B-0414-iter_xxx \
+     --origin-hf-dir /root/GLM-Z1-9B-0414
+ ```
+
+ Note that since Megatron pads the embedding for better performance, the converted embedding may be incorrect. In that case, set `--vocab-size` manually during conversion.
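The padding behavior behind this note can be sketched numerically. The vocabulary size and divisor below are hypothetical; Megatron's actual divisor depends on its `--make-vocab-size-divisible-by` setting and the tensor-parallel size:

```shell
# Illustration: Megatron rounds the embedding rows up to a multiple of a
# divisor, so the saved embedding can be larger than the true vocabulary.
TRUE_VOCAB=151329   # hypothetical HF tokenizer vocab size
DIVISOR=128         # cf. Megatron's --make-vocab-size-divisible-by default
PADDED=$(( (TRUE_VOCAB + DIVISOR - 1) / DIVISOR * DIVISOR ))
echo "${PADDED}"    # 151424; passing --vocab-size trims it back
```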
  ## Training Script and Parameter Overview

  After completing the above preparation work, you can run the training script.

@@ -506,7 +524,32 @@ ROLLOUT_ARGS+=(
  )
  ```

- ## Multi-Machine Training for Large-Scale MOE Models
+ ## Multi-Node Training for Large-Scale MoE Models
+
+ To start a multi-node task, you first need to start a Ray cluster. On node 0, run:
+
+ ```bash
+ # Node0 (HEAD)
+ ray start --head --node-ip-address ${MASTER_ADDR} \
+     --num-gpus 8 --disable-usage-stats
+
+ # Other Nodes
+ ray start --address=${MASTER_ADDR}:6379 --num-gpus 8
+ ```
+
+ After the Ray cluster has started, you can submit a job from node 0, for example:
+
+ ```bash
+ ray job submit --address="http://127.0.0.1:8265" \
+     --runtime-env-json='{
+         "env_vars": {
+             "PYTHONPATH": "/root/Megatron-LM/",
+             ... # e.g., no_proxy, API variables, etc.
+         }
+     }' \
+     -- python3 train.py \
+     --... # Other Megatron/SGLang/slime arguments
+ ```

  slime has been deeply optimized for distributed training of large-scale Mixture of Experts (MoE) models. We provide some end-to-end training cases for reference:
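The `--runtime-env-json` payload in the submit command above must be valid JSON (the `...` is a placeholder, not literal syntax). A minimal sketch, assuming `python3` is available for JSON validation; the env var values are hypothetical:

```shell
# Build the runtime env payload once, validate it, then reuse it in
# `ray job submit --runtime-env-json "${RUNTIME_ENV}" ...`.
RUNTIME_ENV='{"env_vars": {"PYTHONPATH": "/root/Megatron-LM/", "no_proxy": "localhost,127.0.0.1"}}'
python3 -c 'import json, sys; json.loads(sys.argv[1])' "${RUNTIME_ENV}" \
    && echo "runtime env is valid JSON"
```

Validating up front avoids a failed job submission caused by a stray comma or unquoted key.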

docs/zh/quick_start.md

Lines changed: 43 additions & 1 deletion

@@ -1,5 +1,7 @@
  # slime Quick Start Guide

+ [English](../en/quick_start.md)
+
  This document takes you from environment setup to getting started with slime within one hour, covering environment configuration, data preparation, training startup, and key code analysis and modification.

  ## Basic Environment Setup

@@ -79,6 +81,21 @@ PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
      --save /root/GLM-Z1-9B-0414_torch_dist
  ```

+ For larger models, you can use `torchrun` to launch the conversion script and convert the weights with multiple GPUs or even multiple nodes.
+
+ ### Convert from Megatron Format to Hugging Face Format
+
+ You can convert the Megatron-format weights saved during training back to Hugging Face format like this:
+
+ ```bash
+ PYTHONPATH=/root/Megatron-LM python tools/convert_torch_dist_to_hf.py \
+     --input-dir /path/to/torch_dist_ckpt/iter_xxx/ \
+     --output-dir /root/GLM-Z1-9B-0414-iter_xxx \
+     --origin-hf-dir /root/GLM-Z1-9B-0414
+ ```
+
+ Since Megatron pads the embedding, the embedding shape of the converted weights may not match. In that case, set `--vocab-size` during conversion.

  ## Training Script and Parameter Overview

  After completing the above preparation work, you can run the training script.

@@ -515,8 +532,33 @@ ROLLOUT_ARGS+=(
  )
  ```

-
  ## Multi-Node Training for Large-Scale MoE Models
+
+ To start a multi-node task, you first need to start a Ray cluster, i.e. run on node 0:
+
+ ```bash
+ # Node0 (HEAD)
+ ray start --head --node-ip-address ${MASTER_ADDR} \
+     --num-gpus 8 --disable-usage-stats
+
+ # Other Nodes
+ ray start --address=${MASTER_ADDR}:6379 --num-gpus 8
+ ```
+
+ After the Ray cluster has started, you can submit a job from node 0, for example:
+
+ ```bash
+ ray job submit --address="http://127.0.0.1:8265" \
+     --runtime-env-json='{
+         "env_vars": {
+             "PYTHONPATH": "/root/Megatron-LM/",
+             ... # e.g., no_proxy, API variables, etc.
+         }
+     }' \
+     -- python3 train.py \
+     --... # Other Megatron/SGLang/slime arguments
+ ```

  slime has been deeply optimized for distributed training of large-scale Mixture of Experts (MoE) models. We provide some end-to-end training cases for reference:

  - [Example: Training GLM-4.5 on 64xH100](models/glm4.5-355B-A32B.md)
