docs/recipe-usage.md
Lines changed: 55 additions & 19 deletions
@@ -1,6 +1,6 @@
# Using Recipes

-Megatron Bridge provides production-ready training recipes for several popular models. You can find an overview of supported recipes and 🤗 HuggingFace bridges [here](index.md#supported-models).
+Megatron Bridge provides production-ready training recipes for several popular models. You can find an overview of supported recipes and 🤗 Hugging Face bridges [here](index.md#supported-models).
This guide will cover the next steps to make use of a training recipe, including how to [override configuration](#overriding-configuration) and how to [launch a job](#launch-methods).

## Overview
@@ -10,23 +10,41 @@ This guide will cover the next steps to make use of a training recipe, including
- **Integration**: Recipes return a single `ConfigContainer` that plugs directly into our training [entry points](training/entry-points.md) (see the published docs as well: https://docs.nvidia.com/nemo/megatron-bridge/latest/training/entry-points.html).
- **Customization**: You can override any part of the recipe (Python, YAML, CLI) to adapt to your data, scale, and objectives.

+## Choosing a recipe or a new config
+
+Start from an exported recipe when the model family and workflow already exist in `megatron.bridge.recipes`. Recipe functions such as `llama3_8b_pretrain_config`, `llama32_1b_sft_config`, and `qwen3_8b_peft_config` provide model, optimizer, scheduler, precision, dataset, logger, and checkpoint defaults in one `ConfigContainer`. Override those defaults for your dataset, checkpoint paths, run length, parallelism, or precision before creating a new recipe.
+
+Create a new recipe or config when the base model architecture is not represented by an existing model provider, the checkpoint conversion needs a new bridge, the forward step or dataset provider is model-specific, or you need a reusable configuration that will be shared across jobs. If the Hugging Face model is already supported by `AutoBridge`, you usually only need to start from the closest recipe and override the model provider or `hf_path`.
+
+Training mode follows the recipe and dataset type:
+| LLM pretraining or continued pretraining | `GPTDatasetConfig` | `pretrain()` | No checkpoint for from-scratch runs; use `checkpoint.load` for full resume or `checkpoint.pretrained_checkpoint` for model-weight initialization |
+| Full SFT | `FinetuningDatasetConfig`, `HFDatasetConfig`, or a dataset provider | `finetune()` | Use `checkpoint.pretrained_checkpoint` for the base model, or `checkpoint.load` for a full native Megatron resume |
+| PEFT / LoRA / DoRA | Same as SFT, plus `cfg.peft` | `finetune()` | `checkpoint.pretrained_checkpoint` is required for the frozen base model; `checkpoint.load` resumes adapter training |
+| VLM SFT or PEFT | VLM dataset provider such as Energon, HF, or preloaded JSON provider | `finetune()` with a VLM step function | Use the model-specific checkpoint guidance in the recipe or model docs |
+
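
The rows above can be read as a lookup from workflow to dataset config, entry point, and checkpoint requirement. The sketch below encodes three of them in plain Python so a launcher script could sanity-check its choices before building a config. `Workflow`, `WORKFLOWS`, and `describe_workflow` are hypothetical names for illustration and are not part of Megatron Bridge; only the strings are taken from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workflow:
    dataset_configs: tuple[str, ...]   # dataset config types named in the table
    entry_point: str                   # training entry point named in the table
    needs_pretrained_checkpoint: bool  # True only where the table says "required"


# Hypothetical lookup: the keys are illustrative labels, the values mirror the table.
WORKFLOWS = {
    "pretrain": Workflow(("GPTDatasetConfig",), "pretrain", False),
    "sft": Workflow(("FinetuningDatasetConfig", "HFDatasetConfig"), "finetune", False),
    "peft": Workflow(("FinetuningDatasetConfig", "HFDatasetConfig"), "finetune", True),
}


def describe_workflow(name: str) -> str:
    """Summarize one row of the table as a single line of text."""
    wf = WORKFLOWS[name]
    summary = f"{name}: {wf.entry_point}() with {' or '.join(wf.dataset_configs)}"
    if wf.needs_pretrained_checkpoint:
        summary += "; checkpoint.pretrained_checkpoint required"
    return summary
```

Calling `describe_workflow("peft")` reports the `finetune()` entry point and flags the frozen base checkpoint, which is the one hard requirement the table states.
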
+For dataset fields, prefer `seq_length` in Bridge examples. LLM pretraining uses `GPTDatasetConfig` with `data_path`, `blend`, or `blend_per_split`; SFT and PEFT use `dataset_root` for local JSONL data. Do not use `data_path` for SFT/PEFT JSONL roots.
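
A minimal sketch of the field rule above, assuming dataset overrides arrive as a plain dict keyed by field name. `check_dataset_fields` is a hypothetical helper, not a Megatron Bridge API, and the path in the usage note is made up. It only encodes the guidance that `data_path`, `blend`, and `blend_per_split` belong to LLM pretraining while `dataset_root` belongs to SFT and PEFT.

```python
# Field names come from the guidance above; the helper itself is illustrative.
PRETRAIN_ONLY = {"data_path", "blend", "blend_per_split"}
FINETUNE_ONLY = {"dataset_root"}


def check_dataset_fields(mode: str, fields: dict) -> list[str]:
    """Return one message per misplaced field; an empty list means no conflict."""
    if mode == "pretrain":
        misplaced = FINETUNE_ONLY & set(fields)
        hint = "data_path, blend, or blend_per_split"
    elif mode in ("sft", "peft"):
        misplaced = PRETRAIN_ONLY & set(fields)
        hint = "dataset_root"
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return [f"{name} does not apply to {mode}; use {hint}" for name in sorted(misplaced)]
```

For example, `check_dataset_fields("sft", {"data_path": "/data/train.jsonl"})` returns a single message pointing at `dataset_root`, while a pretraining dict that only sets `blend` and `seq_length` returns an empty list.
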
+
## Overriding configuration

Recipes are provided through a {py:class}`~bridge.training.config.ConfigContainer` object. This is a dataclass that holds all configuration objects needed for training. You can find a more detailed overview of the `ConfigContainer` [here](training/config-container-overview.md).
The benefit of providing the full recipe through a pythonic structure is that it is agnostic to any configuration approach that a user may prefer, whether that's YAML, `argparse` or something else. In other words, the user may override the recipe however they see fit.
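
Because the recipe is a plain Python object, any front end that can produce attribute paths and values can drive it. The sketch below applies dotted-path overrides to stand-in dataclasses. `TrainConfig`, `LoggerConfig`, `DemoContainer`, and `apply_overrides` are hypothetical stand-ins written for this example; they are not the real `ConfigContainer` or a Megatron Bridge utility, and the default values are arbitrary.

```python
from dataclasses import dataclass, field


@dataclass
class TrainConfig:
    train_iters: int = 100  # arbitrary default for the demo


@dataclass
class LoggerConfig:
    log_interval: int = 10  # arbitrary default for the demo


@dataclass
class DemoContainer:
    """Stand-in for a nested config object; not the real ConfigContainer."""
    train: TrainConfig = field(default_factory=TrainConfig)
    logger: LoggerConfig = field(default_factory=LoggerConfig)


def apply_overrides(cfg, overrides: dict) -> None:
    """Set dotted attribute paths such as "train.train_iters" on a nested config."""
    for path, value in overrides.items():
        *parents, leaf = path.split(".")
        target = cfg
        for name in parents:
            target = getattr(target, name)
        if not hasattr(target, leaf):
            # Reject typos instead of silently creating a new attribute.
            raise AttributeError(f"unknown config field: {path}")
        setattr(target, leaf, value)


cfg = DemoContainer()
apply_overrides(cfg, {"train.train_iters": 20, "logger.log_interval": 1})
```

The same dict could be produced by a YAML file or by `argparse`, which is the point of keeping the recipe independent of the configuration front end.
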

-The following sections detail a few different ways to override the configuration recipe. For a complete training script, please see [this example](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/examples/models/llama/pretrain_llama3_8b.py).
+The following sections detail a few different ways to override the configuration recipe. For a generic recipe launcher, see [`scripts/training/run_recipe.py`](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/scripts/training/run_recipe.py).


### Python

If you prefer to manage configuration in Python, you can directly modify attributes of the `ConfigContainer`:

```python
-from megatron.bridge.recipes.llama.llama3_8b import pretrain_config
+from megatron.bridge.recipes.llama import llama3_8b_pretrain_config
+from megatron.bridge.training.config import ConfigContainer

# Get the base ConfigContainer from the recipe
-cfg: ConfigContainer = pretrain_config()
+cfg: ConfigContainer = llama3_8b_pretrain_config()

# Apply overrides. Note the hierarchical structure
cfg.train.train_iters = 20
@@ -38,32 +56,28 @@ cfg.logger.log_interval = 1
You can also replace entire sub-configs of the `ConfigContainer`:

```python
-from megatron.bridge.recipes.llama.llama3_8b import pretrain_config
-from megatron.bridge.models.llama import Llama3ModelProvider
+from megatron.bridge.recipes.llama import llama32_1b_pretrain_config, llama3_8b_pretrain_config
+from megatron.bridge.training.config import ConfigContainer
For more detail on accepted dataset layouts, see [Data Preparation](training/data-preparation.md).
+
## Launch methods

Megatron Bridge supports launching scripts with both `torchrun` and [NeMo-Run](https://github.com/NVIDIA-NeMo/Run).
@@ -184,7 +220,7 @@ if __name__ == "__main__":
train_script = run.Script(..., args=args_to_fwd)
```

-For a complete example of the `run.Script` API, including argument forwarding, please see [this script](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/examples/models/llama/pretrain_llama3_8b_nemo_run_script.py).
+For a complete example of the `run.Script` API, including argument forwarding, see [`scripts/training/launch_with_nemo_run.py`](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/scripts/training/launch_with_nemo_run.py).
docs/training/README.md
Lines changed: 7 additions & 5 deletions
@@ -7,7 +7,7 @@ This directory contains comprehensive documentation for training and customizing
### I want to

**🚀 Get started with training**
-→ Start with [Configuration Container Overview](config-container-overview.md) to understand the training setup
+→ Start with [Configuration Container Overview](config-container-overview.md) and [Data Preparation](data-preparation.md) to understand the training setup

**⚙️ Configure training parameters**
→ See [Training Loop Settings](training-loop-settings.md) and [Optimizer & Scheduler](optimizer-scheduler.md)
@@ -32,6 +32,7 @@ This directory contains comprehensive documentation for training and customizing
|----------|---------|--------------|
| **[Configuration Container Overview](config-container-overview.md)** | Central configuration object for all training settings | First time setting up training |
| **[Entry Points](entry-points.md)** | Training entry points and execution flow | Understanding how training starts |
+| **[Data Preparation](data-preparation.md)** | Dataset formats for pretraining, SFT, PEFT, and VLM fine-tuning | Preparing data or choosing dataset config fields |
| **[Training Loop Settings](training-loop-settings.md)** | Training loop parameters and configuration | Configuring batch sizes, iterations, validation |

### Optimization and Performance
@@ -71,7 +72,7 @@ This directory contains comprehensive documentation for training and customizing
A typical training workflow involves:

1. **Configure Training** - Set up `ConfigContainer` with model, data, and training parameters
-2. **Prepare Data** - Configure dataset loading and preprocessing
+2. **Prepare Data** - Configure dataset loading and preprocessing with the right data format
3. **Set Optimization** - Configure optimizer, scheduler, and mixed precision
4. **Enable Monitoring** - Set up logging and profiling
5. **Configure Checkpointing** - Set up checkpoint saving and resuming
@@ -93,9 +94,10 @@ A typical training workflow involves:
### 🆕 First-Time Training Setup

1. [Configuration Container Overview](config-container-overview.md) - Understand the configuration system
-2. [Entry Points](entry-points.md) - Learn how to start training
-3. [Training Loop Settings](training-loop-settings.md) - Configure basic training parameters
-4. [Logging](logging.md) - Set up monitoring
+2. [Data Preparation](data-preparation.md) - Choose the right dataset format and config fields
+3. [Entry Points](entry-points.md) - Learn how to start training
+4. [Training Loop Settings](training-loop-settings.md) - Configure basic training parameters