
Commit 8473ee6

docs: add all copywrite changes from #1120 (#1123)

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

1 parent: 8401c54

2 files changed: +17, −27 lines


docs/guides/configuration.md

Lines changed: 0 additions & 5 deletions
````diff
@@ -7,7 +7,6 @@ NeMo Automodel recipes are configured with YAML. Under the hood, YAML is parsed
 - Supports environment variable interpolation inside YAML strings.
 - Tries to make config printing safe by preserving original placeholders (to avoid leaking secrets).
 
----
 
 ## Load Model and Dataset Configs
 
@@ -30,7 +29,6 @@ Only **strings** are translated. Examples:
 
 YAML-native types (like `step_size: 10` without quotes) are already typed by the YAML parser and remain unchanged.
 
----
 
 ## Use `_target_` for Instantiation
 
@@ -55,7 +53,6 @@ By default, resolving targets is restricted:
 - Accessing private or dunder attributes is blocked by default.
 - Loading out-of-tree user code can be enabled with `NEMO_ENABLE_USER_MODULES=1` or by calling `set_enable_user_modules(True)`.
 
----
 
 ## Interpolate Environment Variables in YAML
 
````
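For readers skimming the diff, the `_target_` mechanism referenced in the hunks above works roughly as follows. The sketch below is illustrative only: the `instantiate` helper is a hypothetical name, not NeMo Automodel's actual resolver. It shows the pattern the docs describe, namely importing a dotted path, rejecting private/dunder attribute names, and passing the remaining keys as keyword arguments.

```python
# Illustrative sketch, not NeMo Automodel's actual code: a minimal `_target_` resolver.
import importlib

def instantiate(node: dict):
    """Hypothetical helper: build an object from a config node with a `_target_` key."""
    module_path, _, attr = node["_target_"].rpartition(".")
    if attr.startswith("_"):
        # Mirrors the documented restriction: private/dunder attribute access is blocked.
        raise ValueError(f"blocked private/dunder attribute: {attr!r}")
    cls = getattr(importlib.import_module(module_path), attr)
    kwargs = {k: v for k, v in node.items() if k != "_target_"}
    return cls(**kwargs)

# Example: instantiate({"_target_": "torch.nn.Linear", "in_features": 16, "out_features": 8})
```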

````diff
@@ -92,7 +89,6 @@ dataset:
     DATABRICKS_HTTP_PATH: ${DATABRICKS_HTTP_PATH}
 ```
 
----
 
 ## Prevent Secret Leakage in Logs
 
@@ -105,7 +101,6 @@ When an env var placeholder is resolved, the config keeps the original placehold
 Printing a **leaf value** (for example, `print(cfg.dataset.delta_storage_options.DATABRICKS_TOKEN)`) outputs the resolved secret. Instead, print the full config or use a redacted YAML dict.
 :::
 
----
 
 ## Configure Slurm (`automodel` CLI)
 
````
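The interpolation and redaction behaviour documented in the two hunks above can be pictured with a small, self-contained sketch. Everything here (the `resolve` helper and the `EnvStr` type) is a hypothetical illustration of the described behaviour, not the library's implementation: `${VAR}` placeholders inside YAML strings are resolved from the environment, while the placeholder text is what gets shown when the config is printed.

```python
# Hedged illustration only -- not NeMo Automodel's code. It mimics the documented
# behaviour: resolve ${VAR} from the environment, but keep the placeholder for printing.
import os
import re

_PLACEHOLDER = re.compile(r"\$\{(\w+)\}")

def resolve(raw: str) -> str:
    """Substitute ${VAR} with os.environ['VAR']; unknown variables are left untouched."""
    return _PLACEHOLDER.sub(lambda m: os.environ.get(m.group(1), m.group(0)), raw)

class EnvStr(str):
    """A resolved string whose repr() still shows the original placeholder."""
    def __new__(cls, raw: str):
        obj = super().__new__(cls, resolve(raw))
        obj.raw = raw
        return obj
    def __repr__(self) -> str:
        return self.raw  # printing the config shows ${DATABRICKS_TOKEN}, not the secret

token = EnvStr("${DATABRICKS_TOKEN}")
print(repr(token))  # -> ${DATABRICKS_TOKEN} (safe to log)
print(str(token))   # -> resolved value (or the placeholder if the env var is unset)
```

Printing the leaf value directly (the `str` form) is exactly the case the `:::` note above warns about.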

docs/performance-summary.md

Lines changed: 17 additions & 22 deletions
````diff
@@ -1,12 +1,12 @@
-# NeMo AutoModel Performance Summary
+# NeMo Automodel Performance Summary
 
-This document provides performance benchmarks for various large language models using NeMo Pytorch backend - i.e. NeMo Automodel.
+This document provides performance benchmarks for various large language models using NeMo Automodel with the PyTorch backend.
 
 ## Pre-Training Performance
 
 The table below shows training performance for full sequences with no padding across different model architectures and scales.
 
-#### System: DGX-H100, Precision: BF16
+### System: DGX-H100, Precision: BF16
 
 | Model | #GPUs | GBS | MBS | LBS | GA | Seq Length | TP | PP | CP | EP | VP | FSDP | Kernel Optimizations | Time per Global Step (s) | Model TFLOPs/sec/GPU | Tokens/sec/GPU |
 |-------|------:|----:|----:|----:|---:|-----------:|---:|---:|---:|---:|---:|-----:|---------|-------------------------:|---------------------:|---------------:|
@@ -17,12 +17,11 @@ The table below shows training performance for full sequences with no padding ac
 | GPT-OSS 20B | 8 | 256 | 2 | 2 | 16 | 4096 | 1 | 1 | 1 | - | - | 8 | TE + DeepEP + FlexAttn | 10.04 | 279 | 13,058 |
 | GPT-OSS 120B | 64 | 512 | 2 | 2 | 4 | 4096 | 1 | 1 | 1 | - | - | 64 | TE + DeepEP + FlexAttn | 4.30 | 231 | 7,626 |
 
----
-
-## Finetuning (LoRA) Performance
+## Fine-Tuning (LoRA) Performance
 
 The table below shows finetuning (LoRA) performance for full sequences with no padding across different model architectures and scales.
 
+### System: DGX-H100, Precision: BF16
 | Model | #GPUs | GBS | MBS | LBS | GA | Seq Length | TP | PP | CP | EP | VP | FSDP | Kernel Optimizations | Time per Global Step (s) | Model TFLOPs/sec/GPU | Tokens/sec/GPU |
 |-------|------:|----:|----:|----:|---:|-----------:|---:|---:|---:|---:|---:|-----:|---------|-------------------------:|---------------------:|---------------:|
 | Llama3 8B | 1 | 32 | 2 | 2 | 16 | 4096 | 1 | 1 | 1 | - | 1 | 1 | - | 10.51 | 402 | 12472.87 |
@@ -31,6 +30,7 @@ The table below shows finetuning (LoRA) performance for full sequences with no p
 | Qwen2.5 32B | 8 | 32 | 1 | 8 | 2 | 4096 | 1 | 4 | 1 | - | 8 | 1 | 2 | 8.40 | 261 | 1950.93 |
 | Llama3 70B 2-node | 16 | 32 | 1 | 4 | 2 | 4096 | 2 | 4 | 1 | - | 10 | 1 | 2 | 12.03 | 197 | 680.74 |
 | Qwen2.5 32B 2-node | 16 | 32 | 1 | 8 | 1 | 4096 | 1 | 4 | 1 | - | 8 | 1 | 4 | 4.48 | 244 | 1826.49 |
+
 ## Glossary
 
 - **MFU**: Model FLOPs Utilization - ratio of achieved compute to peak hardware capability
@@ -45,9 +45,7 @@ The table below shows finetuning (LoRA) performance for full sequences with no p
 - **GA**: Gradient Accumulation - number of local-batches before optimizer step
 - **TE**: Transformer Engine kernel optimizations - RMSNorm, Linear and DotProductAttention
 - **DeepEP**: Deep Expert Parallelism - advanced EP routing for MoE models
-- **FlexAttn**: Pytorch's [Flex Attention](https://docs.pytorch.org/docs/stable/nn.attention.flex_attention.html)
-
----
+- **FlexAttn**: PyTorch's [Flex Attention](https://docs.pytorch.org/docs/stable/nn.attention.flex_attention.html)
 
 ## Configuration Files
 
@@ -65,18 +63,15 @@ All benchmark configurations are available in [`examples/benchmark/configs/`](ht
 - [`Llama_70b_lora_2nodes.yaml`](https://github.com/NVIDIA-NeMo/Automodel/tree/main/examples/llm_finetune/llama3_3/custom_llama3_3_70b_instruct_peft_benchmark_2nodes.yaml) - Llama-70B Finetuning (LoRA) optimized on 2 nodes
 - [`Qwen2_5_32b_lora_2nodes.yaml`](https://github.com/NVIDIA-NeMo/Automodel/tree/main/examples/llm_finetune/qwen/qwen2_5_32b_peft_benchmark_2nodes.yaml) - Qwen2.5-32B Finetuning (LoRA) optimized on 2 nodes
 
----
-
-## Notes
-
-- All benchmarks use mock data for consistent performance measurement
-- Fake balanced gate is enabled to simulate ideal expert routing
-- No gradient clipping applied for pure performance measurement
-- MFU calculated using peak TFLOPs for the system (989 for BF16 H100)
-- Step times include forward and backward passes + optimizer step for the global batch
-
----
+:::{note}
+- All benchmarks use mock data for consistent performance measurement.
+- Fake balanced gate is enabled to simulate ideal expert routing.
+- No gradient clipping applied for pure performance measurement.
+- MFU calculated using peak TFLOPs for the system (989 for BF16 H100).
+- Step times include forward and backward passes + optimizer step for the global batch.
+:::
 
 
-**Last Updated**: 2025-10-02
-**NeMo AutoModel Version**: `main` Branch
+## Version Information
+- **Last Updated**: 2025-10-02
+- **NeMo AutoModel Version**: `main` Branch
````
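As a quick way to read the tables touched by this diff, the throughput and MFU columns can be cross-checked against the glossary definitions and the 989 BF16 TFLOPs H100 peak quoted in the note. The sketch below is a reader-side sanity check, not a script from the repository; it assumes Tokens/sec/GPU is defined as global-batch tokens per step divided by step time and GPU count.

```python
# Reader-side sanity check (not part of the Automodel repo). Assumes the column
# definitions below; numbers are taken from the GPT-OSS 20B pre-training row.
PEAK_BF16_TFLOPS_H100 = 989  # peak used for MFU in the note above

def tokens_per_sec_per_gpu(gbs: int, seq_len: int, step_time_s: float, num_gpus: int) -> float:
    """Assumed definition: tokens in one global batch / (step time * #GPUs)."""
    return gbs * seq_len / (step_time_s * num_gpus)

def mfu(achieved_tflops_per_gpu: float, peak: float = PEAK_BF16_TFLOPS_H100) -> float:
    """Glossary definition: achieved compute over peak hardware capability."""
    return achieved_tflops_per_gpu / peak

print(round(tokens_per_sec_per_gpu(gbs=256, seq_len=4096, step_time_s=10.04, num_gpus=8)))
# -> 13055, close to the 13,058 reported (difference is step-time rounding)
print(f"{mfu(279):.1%}")  # -> 28.2% MFU for GPT-OSS 20B on DGX-H100
```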
