Commit cae77c5

docs(training): clarify workflow and checkpoint guidance (#4259)
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
1 parent ac69176 commit cae77c5

11 files changed

Lines changed: 364 additions & 61 deletions

docs/index.md

Lines changed: 1 addition & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -59,6 +59,7 @@ models/stepfun/index.md
5959
6060
training/config-container-overview.md
6161
training/entry-points.md
62+
training/data-preparation.md
6263
training/training-loop-settings.md
6364
training/optimizer-scheduler.md
6465
training/logging.md

docs/megatron-lm-to-megatron-bridge.md

Lines changed: 6 additions & 4 deletions
@@ -55,7 +55,7 @@ uv run python scripts/translate_mlm_to_bridge.py --reverse \
5555
| `--ffn-hidden-size N` | `model.ffn_hidden_size=N` | |
5656
| `--num-attention-heads N` | `model.num_attention_heads=N` | |
5757
| `--num-query-groups N` | `model.num_query_groups=N` | |
58-
| `--seq-length N` | `model.seq_length=N dataset.sequence_length=N` | Dual mapping |
58+
| `--seq-length N` | `model.seq_length=N dataset.seq_length=N` | Dual mapping |
5959
| `--swiglu` | `model.gated_linear_unit=true model.activation_func=silu` | Expanded to two keys |
6060
| `--squared-relu` | `model.activation_func=squared_relu` | |
6161
| `--data-path PATH [W PATH...]` | `dataset.data_path=PATH` | Space-separated paths (and optional weights) |
@@ -95,15 +95,17 @@ Flags not present in Bridge (e.g., `--use-mcore-models`, `--use-flash-attn`) are
9595
9696
## Quick start
9797

98-
Run your example training entrypoint and override config keys directly:
98+
Run the generic recipe launcher and override config keys directly:
9999

100100
```bash
101-
uv run python examples/models/llama/pretrain_llama3_8b.py \
101+
uv run python scripts/training/run_recipe.py \
102+
--recipe llama3_8b_pretrain_config \
103+
--dataset llm-pretrain \
102104
train.micro_batch_size=2 \
103105
train.global_batch_size=128 \
104106
model.num_layers=32 model.hidden_size=4096 model.num_attention_heads=32 \
105107
model.max_position_embeddings=4096 \
106-
dataset.sequence_length=4096 \
108+
dataset.seq_length=4096 \
107109
checkpoint.save=/workspace/ckpts checkpoint.save_interval=1000 \
108110
logger.wandb_project=my_proj logger.wandb_exp_name=exp1
109111
```
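
The flag-to-key mapping in the table above can be sketched as a small lookup. This is an illustrative, standard-library-only sketch and not the logic of `scripts/translate_mlm_to_bridge.py`; `FLAG_TO_KEYS`, `SWITCH_FLAGS`, and `translate_flags` are hypothetical names, and only flags visible in this hunk are covered. Rejecting unmapped flags is a choice made for the sketch.

```python
# Hypothetical sketch: only the flags shown in the table excerpt are mapped.
# Value-taking flags map to one or more Bridge config keys.
FLAG_TO_KEYS = {
    "--ffn-hidden-size": ["model.ffn_hidden_size"],
    "--num-attention-heads": ["model.num_attention_heads"],
    "--num-query-groups": ["model.num_query_groups"],
    # Dual mapping: one flag sets both the model and dataset sequence length.
    "--seq-length": ["model.seq_length", "dataset.seq_length"],
}

# Switch flags take no value and expand to fixed overrides.
SWITCH_FLAGS = {
    "--swiglu": ["model.gated_linear_unit=true", "model.activation_func=silu"],
    "--squared-relu": ["model.activation_func=squared_relu"],
}


def translate_flags(argv):
    """Turn a list of Megatron-LM style flags into Bridge-style overrides."""
    overrides = []
    i = 0
    while i < len(argv):
        flag = argv[i]
        if flag in SWITCH_FLAGS:
            overrides.extend(SWITCH_FLAGS[flag])
            i += 1
        elif flag in FLAG_TO_KEYS:
            if i + 1 >= len(argv):
                raise ValueError(f"{flag} expects a value")
            value = argv[i + 1]
            overrides.extend(f"{key}={value}" for key in FLAG_TO_KEYS[flag])
            i += 2
        else:
            # Sketch-only behavior: fail loudly on anything not in the tables.
            raise ValueError(f"unmapped flag: {flag}")
    return overrides


if __name__ == "__main__":
    print(" ".join(translate_flags(["--seq-length", "4096", "--swiglu"])))
```

The resulting strings have the same `key=value` shape as the overrides passed to the launcher in the quick start command.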

docs/nemo2-migration-guide.md

Lines changed: 8 additions & 7 deletions
@@ -286,12 +286,12 @@ recipe = llm.llama3_8b.pretrain_recipe(name="my_run", num_nodes=2)
286286

287287
**Megatron Bridge**: Recipes in `megatron.bridge.recipes/`
288288
```python
289-
from megatron.bridge.recipes.llama.llama3_8b import pretrain_config
290-
from megatron.bridge.training import pretrain
289+
from megatron.bridge.recipes.llama import llama3_8b_pretrain_config
291290
from megatron.bridge.training.gpt_step import forward_step
291+
from megatron.bridge.training.pretrain import pretrain
292292

293293
# Use pre-built recipe
294-
cfg = pretrain_config()
294+
cfg = llama3_8b_pretrain_config()
295295

296296
# Customize as needed
297297
cfg.train.train_iters = 10000
@@ -384,7 +384,7 @@ from megatron.bridge.training.config import (
384384
)
385385
from megatron.core.optimizer import OptimizerConfig
386386
from megatron.bridge.models import GPTModelProvider
387-
from megatron.bridge.training import pretrain
387+
from megatron.bridge.training.pretrain import pretrain
388388

389389
def llama3_8b_config(
390390
    # Model/parallelism params
@@ -1221,8 +1221,9 @@ result = llm.finetune(
12211221
In Megatron Bridge, training entry points take a single `ConfigContainer` and a `forward_step_func`:
12221222

12231223
```python
1224-
from megatron.bridge.training import pretrain, finetune
12251224
from megatron.bridge.training.config import ConfigContainer
1225+
from megatron.bridge.training.finetune import finetune
1226+
from megatron.bridge.training.pretrain import pretrain
12261227

12271228
# Create unified configuration
12281229
cfg = ConfigContainer(
@@ -1285,8 +1286,8 @@ For GPT models, use the provided {py:func}`bridge.training.gpt_step.forward_step
12851286
Use `pretrain()` with `GPTDatasetConfig` for training models from scratch:
12861287

12871288
```python
1288-
from megatron.bridge.training import pretrain
12891289
from megatron.bridge.training.gpt_step import forward_step
1290+
from megatron.bridge.training.pretrain import pretrain
12901291

12911292
config = ConfigContainer(
12921293
    model=GPTModelProvider(
@@ -1321,8 +1322,8 @@ Use `finetune()` with `FinetuningDatasetConfig` for both full fine-tuning (SFT)
13211322
Full fine-tuning without PEFT - all model parameters are updated:
13221323

13231324
```python
1324-
from megatron.bridge.training import finetune
13251325
from megatron.bridge.training.gpt_step import forward_step
1326+
from megatron.bridge.training.finetune import finetune
13261327

13271328
config = ConfigContainer(
13281329
    model=GPTModelProvider(),

docs/recipe-usage.md

Lines changed: 55 additions & 19 deletions
@@ -1,6 +1,6 @@
11
# Using Recipes
22

3-
Megatron Bridge provides production-ready training recipes for several popular models. You can find an overview of supported recipes and 🤗 HuggingFace bridges [here](index.md#supported-models).
3+
Megatron Bridge provides production-ready training recipes for several popular models. You can find an overview of supported recipes and 🤗 Hugging Face bridges [here](index.md#supported-models).
44
This guide will cover the next steps to make use of a training recipe, including how to [override configuration](#overriding-configuration) and how to [launch a job](#launch-methods).
55

66
## Overview
@@ -10,23 +10,41 @@ This guide will cover the next steps to make use of a training recipe, including
1010
- **Integration**: Recipes return a single `ConfigContainer` that plugs directly into our training [entry points](training/entry-points.md) (see the published docs as well: https://docs.nvidia.com/nemo/megatron-bridge/latest/training/entry-points.html).
1111
- **Customization**: You can override any part of the recipe (Python, YAML, CLI) to adapt to your data, scale, and objectives.
1212

13+
## Choosing a recipe or a new config
14+
15+
Start from an exported recipe when the model family and workflow already exist in `megatron.bridge.recipes`. Recipe functions such as `llama3_8b_pretrain_config`, `llama32_1b_sft_config`, and `qwen3_8b_peft_config` provide model, optimizer, scheduler, precision, dataset, logger, and checkpoint defaults in one `ConfigContainer`. Override those defaults for your dataset, checkpoint paths, run length, parallelism, or precision before creating a new recipe.
16+
17+
Create a new recipe or config when the base model architecture is not represented by an existing model provider, the checkpoint conversion needs a new bridge, the forward step or dataset provider is model-specific, or you need a reusable configuration that will be shared across jobs. If the Hugging Face model is already supported by `AutoBridge`, you usually only need to start from the closest recipe and override the model provider or `hf_path`.
18+
19+
Training mode follows the recipe and dataset type:
20+
21+
| Workflow | Typical config | Entry point | Checkpoint expectation |
22+
|----------|----------------|-------------|------------------------|
23+
| LLM pretraining or continued pretraining | `GPTDatasetConfig` | `pretrain()` | No checkpoint for from-scratch runs; use `checkpoint.load` for full resume or `checkpoint.pretrained_checkpoint` for model-weight initialization |
24+
| Full SFT | `FinetuningDatasetConfig`, `HFDatasetConfig`, or a dataset provider | `finetune()` | Use `checkpoint.pretrained_checkpoint` for the base model, or `checkpoint.load` for a full native Megatron resume |
25+
| PEFT / LoRA / DoRA | Same as SFT, plus `cfg.peft` | `finetune()` | `checkpoint.pretrained_checkpoint` is required for the frozen base model; `checkpoint.load` resumes adapter training |
26+
| VLM SFT or PEFT | VLM dataset provider such as Energon, HF, or preloaded JSON provider | `finetune()` with a VLM step function | Use the model-specific checkpoint guidance in the recipe or model docs |
27+
28+
For dataset fields, prefer `seq_length` in Bridge examples. LLM pretraining uses `GPTDatasetConfig` with `data_path`, `blend`, or `blend_per_split`; SFT and PEFT use `dataset_root` for local JSONL data. Do not use `data_path` for SFT/PEFT JSONL roots.
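
The checkpoint expectations in the workflow table can be expressed as a small pre-flight check. This is a minimal sketch under stated assumptions: `CheckpointPlan` and `check_plan` are hypothetical names that only mirror the two fields named in the table (`checkpoint.pretrained_checkpoint` and `checkpoint.load`); they are not part of Megatron Bridge, and the VLM row is left out because the table defers to model-specific guidance.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckpointPlan:
    """Hypothetical stand-in for the two checkpoint fields named in the table."""

    pretrained_checkpoint: Optional[str] = None  # base model weights
    load: Optional[str] = None  # full resume (or adapter resume for PEFT)


def check_plan(workflow: str, plan: CheckpointPlan) -> list:
    """Return a list of problems for the given workflow; empty means OK."""
    problems = []
    if workflow == "pretrain":
        # From-scratch pretraining needs no checkpoint at all.
        return problems
    if workflow == "sft":
        # Full SFT expects either base weights or a full native resume.
        if plan.pretrained_checkpoint is None and plan.load is None:
            problems.append("SFT expects pretrained_checkpoint or load")
    elif workflow == "peft":
        # PEFT always needs the frozen base model; load only resumes adapters.
        if plan.pretrained_checkpoint is None:
            problems.append("PEFT requires pretrained_checkpoint")
    else:
        problems.append(f"unknown workflow: {workflow}")
    return problems


if __name__ == "__main__":
    # Paths below are placeholders for illustration only.
    print(check_plan("peft", CheckpointPlan(load="/ckpts/adapter_run")))
```

A check of this kind would run before calling the entry point, so that a PEFT job with only `load` set is caught before any model is built.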
29+
1330
## Overriding configuration
1431

1532
Recipes are provided through a {py:class}`~bridge.training.config.ConfigContainer` object. This is a dataclass that holds all configuration objects needed for training. You can find a more detailed overview of the `ConfigContainer` [here](training/config-container-overview.md).
1633
The benefit of providing the full recipe through a pythonic structure is that it is agnostic to any configuration approach that a user may prefer, whether that's YAML, `argparse` or something else. In other words, the user may override the recipe however they see fit.
1734

18-
The following sections detail a few different ways to override the configuration recipe. For a complete training script, please see [this example](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/examples/models/llama/pretrain_llama3_8b.py).
35+
The following sections detail a few different ways to override the configuration recipe. For a generic recipe launcher, see [`scripts/training/run_recipe.py`](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/scripts/training/run_recipe.py).
1936

2037

2138
### Python
2239

2340
If you prefer to manage configuration in Python, you can directly modify attributes of the `ConfigContainer`:
2441

2542
```python
26-
from megatron.bridge.recipes.llama.llama3_8b import pretrain_config
43+
from megatron.bridge.recipes.llama import llama3_8b_pretrain_config
44+
from megatron.bridge.training.config import ConfigContainer
2745

2846
# Get the base ConfigContainer from the recipe
29-
cfg: ConfigContainer = pretrain_config()
47+
cfg: ConfigContainer = llama3_8b_pretrain_config()
3048

3149
# Apply overrides. Note the hierarchical structure
3250
cfg.train.train_iters = 20
@@ -38,32 +56,28 @@ cfg.logger.log_interval = 1
3856
You can also replace entire sub-configs of the `ConfigContainer`:
3957

4058
```python
41-
from megatron.bridge.recipes.llama.llama3_8b import pretrain_config
42-
from megatron.bridge.models.llama import Llama3ModelProvider
59+
from megatron.bridge.recipes.llama import llama32_1b_pretrain_config, llama3_8b_pretrain_config
60+
from megatron.bridge.training.config import ConfigContainer
4361

44-
cfg: ConfigContainer = pretrain_config()
62+
cfg: ConfigContainer = llama3_8b_pretrain_config()
4563

46-
small_llama = Llama3ModelProvider(
47-
    num_layers=2,
48-
    hidden_size=768,
49-
    ffn_hidden_size=2688,
50-
    num_attention_heads=16,
51-
)
52-
cfg.model = small_llama
64+
small_cfg: ConfigContainer = llama32_1b_pretrain_config()
65+
cfg.model = small_cfg.model
5366
```
5467

5568
### YAML
5669
Overriding a configuration recipe with a YAML file can be done using OmegaConf utilities:
5770

5871
```python
5972
from omegaconf import OmegaConf
60-
from megatron.bridge.recipes.llama.llama3_8b import pretrain_config
73+
from megatron.bridge.recipes.llama import llama3_8b_pretrain_config
74+
from megatron.bridge.training.config import ConfigContainer
6175
from megatron.bridge.training.utils.omegaconf_utils import (
6276
    apply_overrides,
6377
    create_omegaconf_dict_config,
6478
)
6579

66-
cfg: ConfigContainer = pretrain_config()
80+
cfg: ConfigContainer = llama3_8b_pretrain_config()
6781
yaml_filepath = "conf/llama3-8b-benchmark-cfg.yaml"
6882

6983
# Convert the initial Python dataclass to an OmegaConf DictConfig for merging
@@ -88,14 +102,15 @@ Megatron Bridge provides some utilities to update the ConfigContainer using Hydr
88102
```python
89103
import sys
90104
from omegaconf import OmegaConf
91-
from megatron.bridge.recipes.llama.llama3_8b import pretrain_config
105+
from megatron.bridge.recipes.llama import llama3_8b_pretrain_config
106+
from megatron.bridge.training.config import ConfigContainer
92107
from megatron.bridge.training.utils.omegaconf_utils import (
93108
    apply_overrides,
94109
    create_omegaconf_dict_config,
95110
    parse_hydra_overrides,
96111
)
97112

98-
cfg: ConfigContainer = pretrain_config()
113+
cfg: ConfigContainer = llama3_8b_pretrain_config()
99114
cli_overrides = sys.argv[1:]
100115

101116
# Convert the initial Python dataclass to an OmegaConf DictConfig for merging
@@ -117,6 +132,27 @@ A script containing the above code could be called like so:
117132
uv run python -m torch.distributed.run <torchrun arguments> pretrain_cli_overrides.py model.tensor_model_parallel_size=4 train.train_iters=100000 ...
118133
```
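
To make the dotted `section.field=value` form in the command above concrete, here is a minimal sketch of how such overrides can be applied to nested dataclasses. It does not reproduce the Bridge utilities (`parse_hydra_overrides`, `apply_overrides`); the `Cfg`, `TrainCfg`, and `ModelCfg` classes and the `apply_dotted_overrides` helper are hypothetical stand-ins, and value casting is deliberately simplistic.

```python
from dataclasses import dataclass, field


@dataclass
class TrainCfg:
    train_iters: int = 10
    micro_batch_size: int = 1


@dataclass
class ModelCfg:
    tensor_model_parallel_size: int = 1


@dataclass
class Cfg:
    """Hypothetical, much smaller stand-in for a ConfigContainer."""

    train: TrainCfg = field(default_factory=TrainCfg)
    model: ModelCfg = field(default_factory=ModelCfg)


def apply_dotted_overrides(cfg, overrides):
    """Apply 'a.b=value' strings to nested attributes of cfg, in order."""
    for item in overrides:
        path, sep, raw = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got: {item}")
        *parents, leaf = path.split(".")
        target = cfg
        for name in parents:
            target = getattr(target, name)  # AttributeError on unknown section
        if not hasattr(target, leaf):
            raise AttributeError(f"unknown config key: {path}")
        current = getattr(target, leaf)
        # Cast using the existing value's type. Adequate for the int fields
        # in this sketch; booleans and None would need dedicated handling.
        setattr(target, leaf, type(current)(raw))
    return cfg


if __name__ == "__main__":
    cfg = apply_dotted_overrides(
        Cfg(), ["model.tensor_model_parallel_size=4", "train.train_iters=100000"]
    )
    print(cfg)
```

Rejecting unknown keys, rather than silently creating them, keeps a typo in an override from going unnoticed.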
119134

135+
Common dataset overrides:
136+
137+
```python
138+
from megatron.bridge.recipes.llama import llama32_1b_sft_config, llama3_8b_pretrain_config
139+
140+
pretrain_cfg = llama3_8b_pretrain_config()
141+
finetune_cfg = llama32_1b_sft_config()
142+
143+
# LLM pretraining data on a pretrain recipe:
144+
# prefix path without .bin/.idx suffixes
145+
pretrain_cfg.dataset.data_path = "/data/dclm/preprocessed_text_document"
146+
pretrain_cfg.dataset.seq_length = 8192
147+
148+
# SFT/PEFT local JSONL data on a finetune recipe:
149+
# directory containing training.jsonl, validation.jsonl, and optionally test.jsonl
150+
finetune_cfg.dataset.dataset_root = "/data/sft_jsonl"
151+
finetune_cfg.dataset.seq_length = 4096
152+
```
153+
154+
For more detail on accepted dataset layouts, see [Data Preparation](training/data-preparation.md).
155+
120156
## Launch methods
121157

122158
Megatron Bridge supports launching scripts with both `torchrun` and [NeMo-Run](https://github.com/NVIDIA-NeMo/Run).
@@ -184,7 +220,7 @@ if __name__ == "__main__":
184220
train_script = run.Script(..., args=args_to_fwd)
185221
```
186222

187-
For a complete example of the `run.Script` API, including argument forwarding, please see [this script](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/examples/models/llama/pretrain_llama3_8b_nemo_run_script.py).
223+
For a complete example of the `run.Script` API, including argument forwarding, see [`scripts/training/launch_with_nemo_run.py`](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/scripts/training/launch_with_nemo_run.py).
188224

189225
#### Plugins
190226

docs/training/README.md

Lines changed: 7 additions & 5 deletions
@@ -7,7 +7,7 @@ This directory contains comprehensive documentation for training and customizing
77
### I want to
88

99
**🚀 Get started with training**
10-
→ Start with [Configuration Container Overview](config-container-overview.md) to understand the training setup
10+
→ Start with [Configuration Container Overview](config-container-overview.md) and [Data Preparation](data-preparation.md) to understand the training setup
1111

1212
**⚙️ Configure training parameters**
1313
→ See [Training Loop Settings](training-loop-settings.md) and [Optimizer & Scheduler](optimizer-scheduler.md)
@@ -32,6 +32,7 @@ This directory contains comprehensive documentation for training and customizing
3232
|----------|---------|--------------|
3333
| **[Configuration Container Overview](config-container-overview.md)** | Central configuration object for all training settings | First time setting up training |
3434
| **[Entry Points](entry-points.md)** | Training entry points and execution flow | Understanding how training starts |
35+
| **[Data Preparation](data-preparation.md)** | Dataset formats for pretraining, SFT, PEFT, and VLM fine-tuning | Preparing data or choosing dataset config fields |
3536
| **[Training Loop Settings](training-loop-settings.md)** | Training loop parameters and configuration | Configuring batch sizes, iterations, validation |
3637

3738
### Optimization and Performance
@@ -71,7 +72,7 @@ This directory contains comprehensive documentation for training and customizing
7172
A typical training workflow involves:
7273

7374
1. **Configure Training** - Set up `ConfigContainer` with model, data, and training parameters
74-
2. **Prepare Data** - Configure dataset loading and preprocessing
75+
2. **Prepare Data** - Configure dataset loading and preprocessing with the right data format
7576
3. **Set Optimization** - Configure optimizer, scheduler, and mixed precision
7677
4. **Enable Monitoring** - Set up logging and profiling
7778
5. **Configure Checkpointing** - Set up checkpoint saving and resuming
@@ -93,9 +94,10 @@ A typical training workflow involves:
9394
### 🆕 First-Time Training Setup
9495

9596
1. [Configuration Container Overview](config-container-overview.md) - Understand the configuration system
96-
2. [Entry Points](entry-points.md) - Learn how to start training
97-
3. [Training Loop Settings](training-loop-settings.md) - Configure basic training parameters
98-
4. [Logging](logging.md) - Set up monitoring
97+
2. [Data Preparation](data-preparation.md) - Choose the right dataset format and config fields
98+
3. [Entry Points](entry-points.md) - Learn how to start training
99+
4. [Training Loop Settings](training-loop-settings.md) - Configure basic training parameters
100+
5. [Logging](logging.md) - Set up monitoring
99101

100102
### ⚡ Performance Optimization
101103
