
Commit 0e1c4b2

[doc] some clean up for the doc (#238)

* [doc] some clean up for the doc
* remove redundant scripts

1 parent: fd7c64e

11 files changed (+93, -1268 lines)

README.md

Lines changed: 2 additions & 109 deletions

@@ -38,116 +38,9 @@
  For a comprehensive quick start guide covering environment setup, data preparation, training startup, and key code analysis, please refer to:
  - [Quick Start Guide](./docs/en/quick_start.md)

- ## Checkpoint Format Conversion
+ ## Arguments Walk Through

- Since slime uses Megatron, and Megatron does not support loading Hugging Face checkpoints directly, we need to convert the model to the `torch_dist` format that Megatron supports.
-
- #### HF → Megatron torch_dist ckpt
-
- We use [mbridge](https://github.com/ISEEKYAN/mbridge.git) for the conversion:
-
- ```bash
- cd slime/
-
- source scripts/models/glm4-9B.sh
- PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
-     ${MODEL_ARGS[@]} \
-     --hf-checkpoint /root/GLM-Z1-9B-0414 \
-     --save /root/GLM-Z1-9B-0414_torch_dist
- ```
-
- This conversion requires a GPU. For large models, you can use the following method to convert with multiple GPUs; note that you can add the parallelism configuration the same way as in training:
-
- ```bash
- source scripts/models/glm4.5-355B-A32B.sh
- PYTHONPATH=/root/Megatron-LM/ torchrun \
-     --nproc-per-node 8 \
-     --master-addr ${MASTER_ADDR} --master-port 12345 \
-     --nnodes=2 --node-rank ${NODE_RANK} \
-     tools/convert_hf_to_torch_dist.py \
-     ${MODEL_ARGS[@]} \
-     --hf-checkpoint $BASE_DIR/GLM-4.5-355B-A32B/ \
-     --save $BASE_DIR/GLM-4.5-355B-A32B_torch_dist/
- ```
-
- ⚠️ If you encounter an issue where slime cannot be found, please run `pip install -e .` in the slime directory.
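The commands above `source` a model script to populate `MODEL_ARGS`. As a rough sketch (the flag values below are hypothetical, not the real contents of `scripts/models/glm4-9B.sh`), such a script defines a bash array of Megatron architecture flags that is then splatted into the conversion command line:

```shell
# Hypothetical illustration of a scripts/models/*.sh model file: it defines a
# MODEL_ARGS bash array of Megatron architecture flags (values are made up).
MODEL_ARGS=(
    --num-layers 40
    --hidden-size 4096
    --num-attention-heads 32
)

# Expanded into a command line as ${MODEL_ARGS[@]}
echo "${MODEL_ARGS[@]}"
```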
- #### Megatron torch_dist → HF ckpt
-
- To convert a `torch_dist` checkpoint saved during training back to a Hugging Face checkpoint:
-
- ```bash
- cd slime/
- PYTHONPATH=/root/Megatron-LM python tools/convert_torch_dist_to_hf.py \
-     --input-dir /path/to/torch_dist_ckpt/iter_xxx/ \
-     --output-dir /root/GLM-Z1-9B-0414-iter_xxx \
-     --origin-hf-dir /root/GLM-Z1-9B-0414
- ```
-
- Megatron sometimes pads the embedding; you can pass `--vocab-size` to make sure the embedding size of the converted HF checkpoint is correct.
-
- ⚠️ Since the `torch_dist` checkpoint converted by mbridge does not currently save args, you cannot convert the checkpoint from the previous step back to HF format.
-
- #### Any Megatron ckpt → HF
-
- Applicable for custom save formats (e.g., `--ckpt-format torch`).
-
- This conversion works by reusing the function that updates parameters from Megatron to SGLang during training. That is, reuse the training script and change the original command from:
-
- ```bash
- ray job submit --address="http://127.0.0.1:8265" \
-     --runtime-env-json='{
-         "env_vars": { ... }
-     }' \
-     -- python3 train.py \
-     ... # Other training args
- ```
-
- to:
-
- ```bash
- torchrun --nproc_per_node ${NUM_GPU} tools/convert_to_hf.py \
-     --load /your/saved/megatron_ckpt \
-     --output-dir /your/converted/hf_ckpt \
-     ... # Other training args
- ```
-
- That is, keep all other arguments the same, and:
-
- 1. Change the task launcher from `ray` to `torchrun`. Set the number of GPUs to the minimum required for Megatron's parallelism without data parallelism (DP). For example, if you are using `tp4`, set it to 4.
- 2. Make sure to change `--load` to the path of the checkpoint you want to load.
- 3. Add the `--output-dir` argument to specify where the converted Hugging Face checkpoint should be saved.
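The GPU-count rule in step 1 can be sketched as a small calculation (the `TP`/`PP` values here are hypothetical; use your own parallel sizes):

```shell
# The torchrun-based conversion needs only the model-parallel world size:
# no data parallelism, so GPUs = tensor parallel size x pipeline parallel size.
TP=4   # e.g. --tensor-model-parallel-size 4
PP=2   # e.g. --pipeline-model-parallel-size 2 (hypothetical)
NUM_GPU=$((TP * PP))
echo "torchrun --nproc_per_node ${NUM_GPU} tools/convert_to_hf.py ..."
```

With `tp4` and no pipeline parallelism, this yields 4, matching the example in the text.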
- ## Starting the Training Process
-
- The entire program needs to be launched using Ray. First, you need to start a Ray cluster. On node 0, run:
-
- ```bash
- # Node0 (HEAD)
- ray start --head --node-ip-address ${MASTER_ADDR} \
-     --num-gpus 8 --disable-usage-stats
-
- # Other Nodes
- ray start --address=${MASTER_ADDR}:6379 --num-gpus 8
- ```
-
- After the Ray cluster has started, you can submit a job from node 0, for example:
-
- ```bash
- ray job submit --address="http://127.0.0.1:8265" \
-     --runtime-env-json='{
-         "env_vars": {
-             "PYTHONPATH": "/root/Megatron-LM/",
-             ... # e.g., no_proxy, API variables, etc.
-         }
-     }' \
-     -- python3 train.py \
-     --... # Other Megatron/SGLang/slime arguments
- ```
-
- ### Argument Descriptions
-
- Arguments are divided into three categories:
+ Arguments in slime are divided into three categories:

  1. **Megatron arguments**: slime reads all arguments set in Megatron via `PYTHONPATH`. You can configure Megatron by passing arguments like `--tensor-model-parallel-size 2`.
  2. **SGLang arguments**: All arguments for the installed SGLang are supported. These arguments must be prefixed with `--sglang-`. For example, `--mem-fraction-static` should be passed as `--sglang-mem-fraction-static`.
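The `--sglang-` prefixing convention can be illustrated with a tiny helper (hypothetical, not part of slime; it only shows the textual mapping applied to a native SGLang flag):

```shell
# Map a native SGLang flag name to the form slime expects (illustrative only).
sglang_arg() {
    local flag="$1"
    # Strip the leading "--" and re-prefix with "--sglang-".
    printf -- '--sglang-%s\n' "${flag#--}"
}

sglang_arg --mem-fraction-static   # prints --sglang-mem-fraction-static
```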

README_zh.md

Lines changed: 1 addition & 108 deletions

@@ -33,114 +33,7 @@
  - [Quick Start Guide](./docs/zh/quick_start.md)

- ## Checkpoint Format Conversion
-
- Since slime uses Megatron, and Megatron does not support loading Hugging Face checkpoints, we need to convert the model to the `torch_dist` format that Megatron supports.
-
- #### HF → Megatron torch_dist ckpt
-
- We use [mbridge](https://github.com/ISEEKYAN/mbridge.git) for checkpoint conversion, as follows:
-
- ```bash
- cd slime/
-
- source scripts/models/glm4-9B.sh
- PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
-     ${MODEL_ARGS[@]} \
-     --hf-checkpoint /root/GLM-Z1-9B-0414 \
-     --save /root/GLM-Z1-9B-0414_torch_dist
- ```
-
- The conversion requires a GPU. If the model is large, you can convert across multiple nodes and GPUs as follows, configuring suitable parallelism just as in training, for example:
-
- ```bash
- source scripts/models/glm4.5-355B-A32B.sh
- PYTHONPATH=/root/Megatron-LM/ torchrun \
-     --nproc-per-node 8 \
-     --master-addr ${MASTER_ADDR} --master-port 12345 \
-     --nnodes=2 --node-rank ${NODE_RANK} \
-     tools/convert_hf_to_torch_dist.py \
-     ${MODEL_ARGS[@]} \
-     --hf-checkpoint $BASE_DIR/GLM-4.5-355B-A32B/ \
-     --save $BASE_DIR/GLM-4.5-355B-A32B_torch_dist/
- ```
-
- ⚠️ If slime cannot be found, run `pip install -e .` in the slime directory.
-
- #### Megatron torch_dist → HF ckpt
-
- To convert a `torch_dist` ckpt saved during training to an HF ckpt:
-
- ```bash
- cd slime/
- PYTHONPATH=/root/Megatron-LM python tools/convert_torch_dist_to_hf.py \
-     --input-dir /path/to/torch_dist_ckpt/iter_xxx/ \
-     --output-dir /root/GLM-Z1-9B-0414-iter_xxx \
-     --origin-hf-dir /root/GLM-Z1-9B-0414
- ```
-
- Since Megatron pads the embedding, the embedding shape of the converted HF checkpoint may not match. In that case, set `--vocab-size` during conversion.
-
- ⚠️ Since the `torch_dist` ckpt converted by mbridge does not currently save args, you cannot convert the `torch_dist` ckpt from the previous step back to HF.
-
- #### Any Megatron ckpt → HF
-
- Applicable for custom save formats (e.g., `--ckpt-format torch`).
-
- This conversion works by directly reusing the function that updates parameters from Megatron to SGLang during training; that is, reuse the training script and change the original:
-
- ```bash
- ray job submit --address="http://127.0.0.1:8265" \
-     --runtime-env-json='{
-         "env_vars": { ... }
-     }' \
-     -- python3 train.py \
-     ... # Other training args
- ```
-
- to:
-
- ```bash
- torchrun --nproc_per_node ${NUM_GPU} tools/convert_to_hf.py \
-     --load /your/saved/megatron_ckpt \
-     --output-dir /your/converted/hf_ckpt \
-     ... # Other training args
- ```
-
- That is, keep all other arguments unchanged, and:
-
- 1. Change the task launcher from ray to torchrun, and set the number of GPUs to the minimum required for Megatron's parallelism without DP; for example, with tp4, set it to 4.
- 2. Make sure `--load` points to the checkpoint path you want to load.
- 3. Add `--output-dir` for where the HF ckpt should be saved.
-
- ## Starting the Training Process
-
- The entire program needs to be launched using Ray. First start a Ray cluster, i.e. run on node 0:
-
- ```bash
- # Node0 (HEAD)
- ray start --head --node-ip-address ${MASTER_ADDR} \
-     --num-gpus 8 --disable-usage-stats
-
- # Other Nodes
- ray start --address=${MASTER_ADDR}:6379 --num-gpus 8
- ```
-
- After the Ray cluster has started, you can submit a job from node 0, for example:
-
- ```bash
- ray job submit --address="http://127.0.0.1:8265" \
-     --runtime-env-json='{
-         "env_vars": {
-             "PYTHONPATH": "/root/Megatron-LM/",
-             ... # e.g., no_proxy, API variables, etc.
-         }
-     }' \
-     -- python3 train.py \
-     --... # Other Megatron/SGLang/slime arguments
- ```
-
- #### Argument Descriptions
+ ## Argument Descriptions

  Arguments are divided into three categories:

docs/en/quick_start.md

Lines changed: 44 additions & 1 deletion

@@ -1,5 +1,7 @@
  # slime Quick Start Guide

+ [中文版](../zh/quick_start.md)
+
  This document will guide you through setting up the environment and getting started with slime within one hour, covering environment configuration, data preparation, training startup, and key code analysis and modifications.

  ## Basic Environment Setup

@@ -78,6 +80,22 @@ PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
      --save /root/GLM-Z1-9B-0414_torch_dist
  ```

+ For larger models, you can use `torchrun` to launch the conversion script and convert with multiple GPUs or even multiple nodes.
+
+ ### Convert from Megatron Format to Hugging Face Format
+
+ You can use the following script to convert the saved Megatron checkpoints back to Hugging Face format:
+
+ ```bash
+ PYTHONPATH=/root/Megatron-LM python tools/convert_torch_dist_to_hf.py \
+     --input-dir /path/to/torch_dist_ckpt/iter_xxx/ \
+     --output-dir /root/GLM-Z1-9B-0414-iter_xxx \
+     --origin-hf-dir /root/GLM-Z1-9B-0414
+ ```
+
+ Note that since Megatron pads the embedding for better performance, the converted embedding may be incorrect. In that case, set `--vocab-size` manually during conversion.
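The padding behavior behind this note can be sketched numerically. The vocabulary size and divisor below are hypothetical; Megatron's actual divisor depends on its `--make-vocab-size-divisible-by` setting and the tensor-parallel size:

```shell
# Illustration: Megatron rounds the embedding rows up to a multiple of a
# divisor, so the saved embedding can be larger than the true vocabulary.
TRUE_VOCAB=151329   # hypothetical HF tokenizer vocab size
DIVISOR=128         # cf. Megatron's --make-vocab-size-divisible-by default
PADDED=$(( (TRUE_VOCAB + DIVISOR - 1) / DIVISOR * DIVISOR ))
echo "${PADDED}"    # 151424; passing --vocab-size trims it back
```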
  ## Training Script and Parameter Overview

  After completing the above preparation work, you can run the training script.

@@ -506,7 +524,32 @@ ROLLOUT_ARGS+=(
  )
  ```

- ## Multi-Machine Training for Large-Scale MOE Models
+ ## Multi-Node Training for Large-Scale MoE Models
+
+ To start a multi-node task, you first need to start a Ray cluster. On node 0, run:
+
+ ```bash
+ # Node0 (HEAD)
+ ray start --head --node-ip-address ${MASTER_ADDR} \
+     --num-gpus 8 --disable-usage-stats
+
+ # Other Nodes
+ ray start --address=${MASTER_ADDR}:6379 --num-gpus 8
+ ```
+
+ After the Ray cluster has started, you can submit a job from node 0, for example:
+
+ ```bash
+ ray job submit --address="http://127.0.0.1:8265" \
+     --runtime-env-json='{
+         "env_vars": {
+             "PYTHONPATH": "/root/Megatron-LM/",
+             ... # e.g., no_proxy, API variables, etc.
+         }
+     }' \
+     -- python3 train.py \
+     --... # Other Megatron/SGLang/slime arguments
+ ```

  slime has been deeply optimized for distributed training of large-scale Mixture of Experts (MoE) models. We provide some end-to-end training cases for reference:
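The `--runtime-env-json` payload in the submit command above must be valid JSON (the `...` is a placeholder, not literal syntax). A minimal sketch, assuming `python3` is available for JSON validation; the env var values are hypothetical:

```shell
# Build the runtime env payload once, validate it, then reuse it in
# `ray job submit --runtime-env-json "${RUNTIME_ENV}" ...`.
RUNTIME_ENV='{"env_vars": {"PYTHONPATH": "/root/Megatron-LM/", "no_proxy": "localhost,127.0.0.1"}}'
python3 -c 'import json, sys; json.loads(sys.argv[1])' "${RUNTIME_ENV}" \
    && echo "runtime env is valid JSON"
```

Validating up front avoids a failed job submission caused by a stray comma or unquoted key.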

docs/zh/quick_start.md

Lines changed: 43 additions & 1 deletion

@@ -1,5 +1,7 @@
  # slime Quick Start Guide

+ [English](../en/quick_start.md)
+
  This document takes you from environment setup to getting started with slime within one hour, covering environment configuration, data preparation, training startup, and key code analysis and modification.

  ## Basic Environment Setup

@@ -79,6 +81,21 @@ PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
      --save /root/GLM-Z1-9B-0414_torch_dist
  ```

+ For larger models, you can use `torchrun` to launch the conversion script and convert the weights with multiple GPUs or even multiple nodes.
+
+ ### Convert from Megatron Format to Hugging Face Format
+
+ You can convert the Megatron-format weights saved during training back to Hugging Face format like this:
+
+ ```bash
+ PYTHONPATH=/root/Megatron-LM python tools/convert_torch_dist_to_hf.py \
+     --input-dir /path/to/torch_dist_ckpt/iter_xxx/ \
+     --output-dir /root/GLM-Z1-9B-0414-iter_xxx \
+     --origin-hf-dir /root/GLM-Z1-9B-0414
+ ```
+
+ Since Megatron pads the embedding, the embedding shape of the converted weights may not match. In that case, set `--vocab-size` during conversion.

  ## Training Script and Parameter Overview

  After completing the above preparation work, you can run the training script.

@@ -515,8 +532,33 @@ ROLLOUT_ARGS+=(
  )
  ```

-
  ## Multi-Node Training for Large-Scale MoE Models
+
+ To start a multi-node task, you first need to start a Ray cluster, i.e. run on node 0:
+
+ ```bash
+ # Node0 (HEAD)
+ ray start --head --node-ip-address ${MASTER_ADDR} \
+     --num-gpus 8 --disable-usage-stats
+
+ # Other Nodes
+ ray start --address=${MASTER_ADDR}:6379 --num-gpus 8
+ ```
+
+ After the Ray cluster has started, you can submit a job from node 0, for example:
+
+ ```bash
+ ray job submit --address="http://127.0.0.1:8265" \
+     --runtime-env-json='{
+         "env_vars": {
+             "PYTHONPATH": "/root/Megatron-LM/",
+             ... # e.g., no_proxy, API variables, etc.
+         }
+     }' \
+     -- python3 train.py \
+     --... # Other Megatron/SGLang/slime arguments
+ ```

  slime has been deeply optimized for distributed training of large-scale Mixture of Experts (MoE) models. We provide some end-to-end training cases for reference:

  - [Example: Training GLM-4.5 on 64xH100](models/glm4.5-355B-A32B.md)
