 When using slime, parameters are primarily passed for the following purposes:

 1. To allocate a portion of the GPUs in the cluster for training and another portion for inference.
-2. To load Megatron or FSDP for the training portion.
+2. To load Megatron for the training portion.
 3. To load SGLang for the inference portion.
 4. To configure the hyperparameters required for RL training.

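Purpose 1 above amounts to partitioning the cluster's GPU ranks into a training group and an inference (rollout) group. A minimal standalone sketch of that split, assuming a simple "first N ranks train, the rest serve" policy (hypothetical helper, not slime's actual API — slime derives the partition from its own CLI flags):

```python
# Hypothetical illustration of splitting cluster GPU ranks between
# training and inference. slime does this internally; this only shows
# the idea behind purpose 1.

def split_gpus(num_gpus: int, num_train: int) -> tuple[list[int], list[int]]:
    """Assign the first `num_train` ranks to training, the rest to inference."""
    if not 0 < num_train < num_gpus:
        raise ValueError("training partition must leave at least one inference GPU")
    ranks = list(range(num_gpus))
    return ranks[:num_train], ranks[num_train:]

train_gpus, infer_gpus = split_gpus(8, 6)
print(train_gpus)  # [0, 1, 2, 3, 4, 5]
print(infer_gpus)  # [6, 7]
```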
@@ -35,7 +35,7 @@ Additionally, slime supports Prefill and Decode disaggregation (PD Disaggregatio |
 slime supports multiple training backends, which can be selected via the `--train-backend` parameter:

 - `megatron` (default): Uses Megatron-LM as the training backend, supporting efficient training of large-scale models.
-- `fsdp`: Uses PyTorch FSDP as the training backend, allowing direct loading of HuggingFace format weights without conversion.
+- `fsdp` (experimental): Uses PyTorch FSDP as the training backend, allowing direct loading of HuggingFace format weights without conversion.

 ### Loading Megatron

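The backend switch above can be illustrated with a small `argparse` sketch. The flag name and choices mirror `--train-backend`, but this parser is illustrative only, not slime's real CLI:

```python
# Minimal sketch of a backend-selection flag like --train-backend.
# Illustrative only; slime's actual argument parser has many more options.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--train-backend",
    choices=["megatron", "fsdp"],
    default="megatron",  # megatron is the default backend
)

args = parser.parse_args(["--train-backend", "fsdp"])
print(args.train_backend)  # fsdp
```

An unrecognized value (e.g. `--train-backend deepspeed`) would be rejected by `choices` at parse time.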
@@ -322,63 +322,3 @@ In some customized Megatron implementations, special operations need to be perfo |
 - `--custom-megatron-init-path`: Adds some initialization calls.
 - `--custom-megatron-before-log-prob-hook-path`: Is called before calculating the log probability.
 - `--custom-megatron-before-train-step-hook-path`: Is called before each training step. You could use this to mix in special training losses, for example.
-
-## How to Use FSDP
-
-slime also support FSDP2 as the training backend, docs [here](https://lmsys.org/blog/2025-12-03-miles-fsdp/).
-
-> FSDP automatically reads all architecture information via `AutoModelForCausalLM.from_pretrained()`, without manual specification. Megatron requires manual configuration of parameters to read model architecture information. FSDP can read entirely from `config.json`, directly avoiding the weight format conversion step.
-
-To run FSDP as the training backend, pass `--train-backend fsdp` to enable.
-
-### Parameters
-
-Parameters that FSDP used are shown as below in comparison to Megatron, more supports are coming on the way.
-
-| Configuration Category | Megatron Parameter | FSDP Parameter | Description |
-| --- | --- | --- | --- |
-| **Model Loading** | `--load` (Megatron checkpoint) + architecture args (`--num-layers`, `--hidden-size` etc.) | `--hf-checkpoint` (Required) | **FSDP**: Directly uses HuggingFace format, no weight conversion needed, architecture inferred via `AutoConfig` |
-| **Tensor Parallel** | `--tensor-model-parallel-size` | Coming Soon | |
-| **Pipeline Parallel** | `--pipeline-model-parallel-size` | Coming Soon | |
-| **Expert Parallel** | `--expert-model-parallel-size` | Coming Soon | |
-| **Context Parallel** | `--context-parallel-size` | `--context-parallel-size` | Both support CP |
-| **Initial Learning Rate** | `--lr` | `--lr` | Same parameter |
-| **Learning Rate Decay** | `--lr-decay-style` (linear/cosine etc.) | `--lr-decay-style` | Same parameter |
-| **Warmup** | `--lr-warmup-iters` (steps) | `--lr-warmup-iters` | Same parameter |
-| **Min Learning Rate** | `--min-lr` | `--min-lr` | Same parameter |
-| **Optimizer Type** | `--optimizer` (adam/sgd etc.) | `--optimizer` (default adam) | Basically same |
-| **Distributed Optimizer** | `--use-distributed-optimizer` | Built-in to FSDP | FSDP uses distributed optimizer by default |
-| **Gradient Checkpoint** | `--recompute-granularity`, `--recompute-method` | `--gradient-checkpointing` | **FSDP**: Simplified to boolean switch |
-| **CPU Offload** | Implemented via distributed optimizer | `--fsdp-cpu-offload` | **FSDP**: Offload parameters/gradients/optimizer states to CPU |
-| **CPU Backend** | Implemented via distributed optimizer | `--fsdp-cpu-backend` | **FSDP**: Specify CPU backend and use hybrid backend when CPU offload is enabled |
-| **Attention Backend** | Decided by Megatron Core | `--attn-implementation` (flash_attention_2/sdpa/eager) | **FSDP**: Directly passed to HuggingFace |
-| **Mixed Precision** | `--fp16` or `--bf16` | `--fp16` (bf16 inferred automatically) | Basically same |
-| **Training Backend** | Default or `--train-backend megatron` | `--train-backend fsdp` (Required) | Used to switch backend |
-| **Config** | | `--config` | **FSDP**: Set additional parameters for FSDP backend |
-
-### Quick Start
-
-```bash
-# If you need to use WANDB, you need to set the environment variable WANDB_API_KEY in advance
-# Download model weights (Qwen3-4B)
-hf download Qwen/Qwen3-4B --local-dir /root/Qwen3-4B
-
-# Download training dataset (dapo-math-17k)
-hf download --repo-type dataset zhuzilin/dapo-math-17k \
-  --local-dir /root/dapo-math-17k
-
-# Download evaluation dataset (aime-2024)
-hf download --repo-type dataset zhuzilin/aime-2024 \
-  --local-dir /root/aime-2024
-
-# Clone code and install dependencies
-git clone https://github.com/THUDM/slime.git
-cd slime
-pip install -e . --no-deps
-
-
-# FSDP does not require weight conversion, natively supports huggingface format
-# Enable reference model, train Qwen3-4B in colocate mode
-source /root/slime/scripts/run-qwen3-4B-fsdp.sh
-```
-
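As an illustration of the custom Megatron hook flags retained earlier in this diff (e.g. `--custom-megatron-before-train-step-hook-path`, which "could be used to mix in special training losses"), here is a hypothetical hook module. The function name and signature are assumed for illustration only; consult slime's source for the real hook interface:

```python
# Hypothetical before-train-step hook that mixes an auxiliary loss into
# the main RL loss. The signature is an assumption, not slime's
# documented interface.

def before_train_step(loss_terms: dict, aux_weight: float = 0.1) -> float:
    """Combine the main policy loss with an optional auxiliary term."""
    policy_loss = loss_terms["policy_loss"]
    aux_loss = loss_terms.get("aux_loss", 0.0)  # absent -> no extra term
    return policy_loss + aux_weight * aux_loss

total = before_train_step({"policy_loss": 2.0, "aux_loss": 0.5})
```

Such a module would then be referenced by path via the corresponding `--custom-megatron-...-hook-path` flag.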