 When using slime, parameters are primarily passed for the following purposes:

 1. To allocate a portion of the GPUs in the cluster for training and another portion for inference.
-2. To load Megatron or FSDP for the training portion.
+2. To load Megatron for the training portion.
 3. To load SGLang for the inference portion.
 4. To configure the hyperparameters required for RL training.

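Purpose 1 above amounts to partitioning the cluster's GPU ranks into a training group and an inference (rollout) group. A minimal standalone sketch of that split, assuming a simple "first N ranks train, the rest serve" policy (hypothetical helper, not slime's actual API — slime derives the partition from its own CLI flags):

```python
# Hypothetical illustration of splitting cluster GPU ranks between
# training and inference. slime does this internally; this only shows
# the idea behind purpose 1.

def split_gpus(num_gpus: int, num_train: int) -> tuple[list[int], list[int]]:
    """Assign the first `num_train` ranks to training, the rest to inference."""
    if not 0 < num_train < num_gpus:
        raise ValueError("training partition must leave at least one inference GPU")
    ranks = list(range(num_gpus))
    return ranks[:num_train], ranks[num_train:]

train_gpus, infer_gpus = split_gpus(8, 6)
print(train_gpus)  # [0, 1, 2, 3, 4, 5]
print(infer_gpus)  # [6, 7]
```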
@@ -35,7 +35,7 @@ Additionally, slime supports Prefill and Decode disaggregation (PD Disaggregatio |
 slime supports multiple training backends, which can be selected via the `--train-backend` parameter:

 - `megatron` (default): Uses Megatron-LM as the training backend, supporting efficient training of large-scale models.
-- `fsdp`: Uses PyTorch FSDP as the training backend, allowing direct loading of HuggingFace format weights without conversion.
+- `fsdp` (experimental): Uses PyTorch FSDP as the training backend, allowing direct loading of HuggingFace format weights without conversion.

 ### Loading Megatron

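The backend switch above can be illustrated with a small `argparse` sketch. The flag name and choices mirror `--train-backend`, but this parser is illustrative only, not slime's real CLI:

```python
# Minimal sketch of a backend-selection flag like --train-backend.
# Illustrative only; slime's actual argument parser has many more options.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--train-backend",
    choices=["megatron", "fsdp"],
    default="megatron",  # megatron is the default backend
)

args = parser.parse_args(["--train-backend", "fsdp"])
print(args.train_backend)  # fsdp
```

An unrecognized value (e.g. `--train-backend deepspeed`) would be rejected by `choices` at parse time.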
@@ -322,63 +322,3 @@ In some customized Megatron implementations, special operations need to be perfo |
 - `--custom-megatron-init-path`: Adds some initialization calls.
 - `--custom-megatron-before-log-prob-hook-path`: Is called before calculating the log probability.
 - `--custom-megatron-before-train-step-hook-path`: Is called before each training step. You could use this to mix in special training losses, for example.
-
-## How to Use FSDP
-
-slime also support FSDP2 as the training backend, docs [here](https://lmsys.org/blog/2025-12-03-miles-fsdp/).
-
-> FSDP automatically reads all architecture information via `AutoModelForCausalLM.from_pretrained()`, without manual specification. Megatron requires manual configuration of parameters to read model architecture information. FSDP can read entirely from `config.json`, directly avoiding the weight format conversion step.
-
-To run FSDP as the training backend, pass `--train-backend fsdp` to enable.
-
-### Parameters
-
-Parameters that FSDP used are shown as below in comparison to Megatron, more supports are coming on the way.
-
-| Configuration Category | Megatron Parameter | FSDP Parameter | Description |
-| --- | --- | --- | --- |
-| **Model Loading** | `--load` (Megatron checkpoint) + architecture args (`--num-layers`, `--hidden-size` etc.) | `--hf-checkpoint` (Required) | **FSDP**: Directly uses HuggingFace format, no weight conversion needed, architecture inferred via `AutoConfig` |
-| **Tensor Parallel** | `--tensor-model-parallel-size` | Coming Soon | |
-| **Pipeline Parallel** | `--pipeline-model-parallel-size` | Coming Soon | |
-| **Expert Parallel** | `--expert-model-parallel-size` | Coming Soon | |
-| **Context Parallel** | `--context-parallel-size` | `--context-parallel-size` | Both support CP |
-| **Initial Learning Rate** | `--lr` | `--lr` | Same parameter |
-| **Learning Rate Decay** | `--lr-decay-style` (linear/cosine etc.) | `--lr-decay-style` | Same parameter |
-| **Warmup** | `--lr-warmup-iters` (steps) | `--lr-warmup-iters` | Same parameter |
-| **Min Learning Rate** | `--min-lr` | `--min-lr` | Same parameter |
-| **Optimizer Type** | `--optimizer` (adam/sgd etc.) | `--optimizer` (default adam) | Basically same |
-| **Distributed Optimizer** | `--use-distributed-optimizer` | Built-in to FSDP | FSDP uses distributed optimizer by default |
-| **Gradient Checkpoint** | `--recompute-granularity`, `--recompute-method` | `--gradient-checkpointing` | **FSDP**: Simplified to boolean switch |
-| **CPU Offload** | Implemented via distributed optimizer | `--fsdp-cpu-offload` | **FSDP**: Offload parameters/gradients/optimizer states to CPU |
-| **CPU Backend** | Implemented via distributed optimizer | `--fsdp-cpu-backend` | **FSDP**: Specify CPU backend and use hybrid backend when CPU offload is enabled |
-| **Attention Backend** | Decided by Megatron Core | `--attn-implementation` (flash_attention_2/sdpa/eager) | **FSDP**: Directly passed to HuggingFace |
-| **Mixed Precision** | `--fp16` or `--bf16` | `--fp16` (bf16 inferred automatically) | Basically same |
-| **Training Backend** | Default or `--train-backend megatron` | `--train-backend fsdp` (Required) | Used to switch backend |
-| **Config** | | `--config` | **FSDP**: Set additional parameters for FSDP backend |
-
-### Quick Start
-
-```bash
-# If you need to use WANDB, you need to set the environment variable WANDB_API_KEY in advance
-# Download model weights (Qwen3-4B)
-hf download Qwen/Qwen3-4B --local-dir /root/Qwen3-4B
-
-# Download training dataset (dapo-math-17k)
-hf download --repo-type dataset zhuzilin/dapo-math-17k \
-  --local-dir /root/dapo-math-17k
-
-# Download evaluation dataset (aime-2024)
-hf download --repo-type dataset zhuzilin/aime-2024 \
-  --local-dir /root/aime-2024
-
-# Clone code and install dependencies
-git clone https://github.com/THUDM/slime.git
-cd slime
-pip install -e . --no-deps
-
-
-# FSDP does not require weight conversion, natively supports huggingface format
-# Enable reference model, train Qwen3-4B in colocate mode
-source /root/slime/scripts/run-qwen3-4B-fsdp.sh
-```
-
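As an illustration of the custom Megatron hook flags retained earlier in this diff (e.g. `--custom-megatron-before-train-step-hook-path`, which "could be used to mix in special training losses"), here is a hypothetical hook module. The function name and signature are assumed for illustration only; consult slime's source for the real hook interface:

```python
# Hypothetical before-train-step hook that mixes an auxiliary loss into
# the main RL loss. The signature is an assumption, not slime's
# documented interface.

def before_train_step(loss_terms: dict, aux_weight: float = 0.1) -> float:
    """Combine the main policy loss with an optional auxiliary term."""
    policy_loss = loss_terms["policy_loss"]
    aux_loss = loss_terms.get("aux_loss", 0.0)  # absent -> no extra term
    return policy_loss + aux_weight * aux_loss

total = before_train_step({"policy_loss": 2.0, "aux_loss": 0.5})
```

Such a module would then be referenced by path via the corresponding `--custom-megatron-...-hook-path` flag.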