
Commit 8473ee6

docs: add all copywrite changes from #1120 (#1123)

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

1 parent: 8401c54

2 files changed: +17, −27 lines


docs/guides/configuration.md

Lines changed: 0 additions & 5 deletions
````diff
@@ -7,7 +7,6 @@ NeMo Automodel recipes are configured with YAML. Under the hood, YAML is parsed
 - Supports environment variable interpolation inside YAML strings.
 - Tries to make config printing safe by preserving original placeholders (to avoid leaking secrets).
 
----
 
 ## Load Model and Dataset Configs
 
@@ -30,7 +29,6 @@ Only **strings** are translated. Examples:
 
 YAML-native types (like `step_size: 10` without quotes) are already typed by the YAML parser and remain unchanged.
 
----
 
 ## Use `_target_` for Instantiation
 
@@ -55,7 +53,6 @@ By default, resolving targets is restricted:
 - Accessing private or dunder attributes is blocked by default.
 - Loading out-of-tree user code can be enabled with `NEMO_ENABLE_USER_MODULES=1` or by calling `set_enable_user_modules(True)`.
 
----
 
 ## Interpolate Environment Variables in YAML
 
````
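For readers skimming the diff, the `_target_` mechanism referenced in the hunks above works roughly as follows. The sketch below is illustrative only: the `instantiate` helper is a hypothetical name, not NeMo Automodel's actual resolver. It shows the pattern the docs describe, namely importing a dotted path, rejecting private/dunder attribute names, and passing the remaining keys as keyword arguments.

```python
# Illustrative sketch, not NeMo Automodel's actual code: a minimal `_target_` resolver.
import importlib

def instantiate(node: dict):
    """Hypothetical helper: build an object from a config node with a `_target_` key."""
    module_path, _, attr = node["_target_"].rpartition(".")
    if attr.startswith("_"):
        # Mirrors the documented restriction: private/dunder attribute access is blocked.
        raise ValueError(f"blocked private/dunder attribute: {attr!r}")
    cls = getattr(importlib.import_module(module_path), attr)
    kwargs = {k: v for k, v in node.items() if k != "_target_"}
    return cls(**kwargs)

# Example: instantiate({"_target_": "torch.nn.Linear", "in_features": 16, "out_features": 8})
```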

````diff
@@ -92,7 +89,6 @@ dataset:
     DATABRICKS_HTTP_PATH: ${DATABRICKS_HTTP_PATH}
 ```
 
----
 
 ## Prevent Secret Leakage in Logs
 
@@ -105,7 +101,6 @@ When an env var placeholder is resolved, the config keeps the original placehold
 Printing a **leaf value** (for example, `print(cfg.dataset.delta_storage_options.DATABRICKS_TOKEN)`) outputs the resolved secret. Instead, print the full config or use a redacted YAML dict.
 :::
 
----
 
 ## Configure Slurm (`automodel` CLI)
 
````
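The interpolation and redaction behaviour documented in the two hunks above can be pictured with a small, self-contained sketch. Everything here (the `resolve` helper and the `EnvStr` type) is a hypothetical illustration of the described behaviour, not the library's implementation: `${VAR}` placeholders inside YAML strings are resolved from the environment, while the placeholder text is what gets shown when the config is printed.

```python
# Hedged illustration only -- not NeMo Automodel's code. It mimics the documented
# behaviour: resolve ${VAR} from the environment, but keep the placeholder for printing.
import os
import re

_PLACEHOLDER = re.compile(r"\$\{(\w+)\}")

def resolve(raw: str) -> str:
    """Substitute ${VAR} with os.environ['VAR']; unknown variables are left untouched."""
    return _PLACEHOLDER.sub(lambda m: os.environ.get(m.group(1), m.group(0)), raw)

class EnvStr(str):
    """A resolved string whose repr() still shows the original placeholder."""
    def __new__(cls, raw: str):
        obj = super().__new__(cls, resolve(raw))
        obj.raw = raw
        return obj
    def __repr__(self) -> str:
        return self.raw  # printing the config shows ${DATABRICKS_TOKEN}, not the secret

token = EnvStr("${DATABRICKS_TOKEN}")
print(repr(token))  # -> ${DATABRICKS_TOKEN} (safe to log)
print(str(token))   # -> resolved value (or the placeholder if the env var is unset)
```

Printing the leaf value directly (the `str` form) is exactly the case the `:::` note above warns about.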

docs/performance-summary.md

Lines changed: 17 additions & 22 deletions
````diff
@@ -1,12 +1,12 @@
-# NeMo AutoModel Performance Summary
+# NeMo Automodel Performance Summary
 
-This document provides performance benchmarks for various large language models using NeMo Pytorch backend - i.e. NeMo Automodel.
+This document provides performance benchmarks for various large language models using NeMo Automodel with the PyTorch backend.
 
 ## Pre-Training Performance
 
 The table below shows training performance for full sequences with no padding across different model architectures and scales.
 
-#### System: DGX-H100, Precision: BF16
+### System: DGX-H100, Precision: BF16
 
 | Model | #GPUs | GBS | MBS | LBS | GA | Seq Length | TP | PP | CP | EP | VP | FSDP | Kernel Optimizations | Time per Global Step (s) | Model TFLOPs/sec/GPU | Tokens/sec/GPU |
 |-------|------:|----:|----:|----:|---:|-----------:|---:|---:|---:|---:|---:|-----:|---------|-------------------------:|---------------------:|---------------:|
@@ -17,12 +17,11 @@ The table below shows training performance for full sequences with no padding ac
 | GPT-OSS 20B | 8 | 256 | 2 | 2 | 16 | 4096 | 1 | 1 | 1 | - | - | 8 | TE + DeepEP + FlexAttn | 10.04 | 279 | 13,058 |
 | GPT-OSS 120B | 64 | 512 | 2 | 2 | 4 | 4096 | 1 | 1 | 1 | - | - | 64 | TE + DeepEP + FlexAttn | 4.30 | 231 | 7,626 |
 
----
-
-## Finetuning (LoRA) Performance
+## Fine-Tuning (LoRA) Performance
 
 The table below shows finetuning (LoRA) performance for full sequences with no padding across different model architectures and scales.
 
+### System: DGX-H100, Precision: BF16
 | Model | #GPUs | GBS | MBS | LBS | GA | Seq Length | TP | PP | CP | EP | VP | FSDP | Kernel Optimizations | Time per Global Step (s) | Model TFLOPs/sec/GPU | Tokens/sec/GPU |
 |-------|------:|----:|----:|----:|---:|-----------:|---:|---:|---:|---:|---:|-----:|---------|-------------------------:|---------------------:|---------------:|
 | Llama3 8B | 1 | 32 | 2 | 2 | 16 | 4096 | 1 | 1 | 1 | - | 1 | 1 | - | 10.51 | 402 | 12472.87 |
@@ -31,6 +30,7 @@ The table below shows finetuning (LoRA) performance for full sequences with no p
 | Qwen2.5 32B | 8 | 32 | 1 | 8 | 2 | 4096 | 1 | 4 | 1 | - | 8 | 1 | 2 | 8.40 | 261 | 1950.93 |
 | Llama3 70B 2-node | 16 | 32 | 1 | 4 | 2 | 4096 | 2 | 4 | 1 | - | 10 | 1 | 2 | 12.03 | 197 | 680.74 |
 | Qwen2.5 32B 2-node | 16 | 32 | 1 | 8 | 1 | 4096 | 1 | 4 | 1 | - | 8 | 1 | 4 | 4.48 | 244 | 1826.49 |
+
 ## Glossary
 
 - **MFU**: Model FLOPs Utilization - ratio of achieved compute to peak hardware capability
@@ -45,9 +45,7 @@ The table below shows finetuning (LoRA) performance for full sequences with no p
 - **GA**: Gradient Accumulation - number of local-batches before optimizer step
 - **TE**: Transformer Engine kernel optimizations - RMSNorm, Linear and DotProductAttention
 - **DeepEP**: Deep Expert Parallelism - advanced EP routing for MoE models
-- **FlexAttn**: Pytorch's [Flex Attention](https://docs.pytorch.org/docs/stable/nn.attention.flex_attention.html)
-
----
+- **FlexAttn**: PyTorch's [Flex Attention](https://docs.pytorch.org/docs/stable/nn.attention.flex_attention.html)
 
 ## Configuration Files
 
@@ -65,18 +63,15 @@ All benchmark configurations are available in [`examples/benchmark/configs/`](ht
 - [`Llama_70b_lora_2nodes.yaml`](https://github.com/NVIDIA-NeMo/Automodel/tree/main/examples/llm_finetune/llama3_3/custom_llama3_3_70b_instruct_peft_benchmark_2nodes.yaml) - Llama-70B Finetuning (LoRA) optimized on 2 nodes
 - [`Qwen2_5_32b_lora_2nodes.yaml`](https://github.com/NVIDIA-NeMo/Automodel/tree/main/examples/llm_finetune/qwen/qwen2_5_32b_peft_benchmark_2nodes.yaml) - Qwen2.5-32B Finetuning (LoRA) optimized on 2 nodes
 
----
-
-## Notes
-
-- All benchmarks use mock data for consistent performance measurement
-- Fake balanced gate is enabled to simulate ideal expert routing
-- No gradient clipping applied for pure performance measurement
-- MFU calculated using peak TFLOPs for the system (989 for BF16 H100)
-- Step times include forward and backward passes + optimizer step for the global batch
-
----
+:::{note}
+- All benchmarks use mock data for consistent performance measurement.
+- Fake balanced gate is enabled to simulate ideal expert routing.
+- No gradient clipping applied for pure performance measurement.
+- MFU calculated using peak TFLOPs for the system (989 for BF16 H100).
+- Step times include forward and backward passes + optimizer step for the global batch.
+:::
 
 
-**Last Updated**: 2025-10-02
-**NeMo AutoModel Version**: `main` Branch
+## Version Information
+- **Last Updated**: 2025-10-02
+- **NeMo AutoModel Version**: `main` Branch
````
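As a quick way to read the tables touched by this diff, the throughput and MFU columns can be cross-checked against the glossary definitions and the 989 BF16 TFLOPs H100 peak quoted in the note. The sketch below is a reader-side sanity check, not a script from the repository; it assumes Tokens/sec/GPU is defined as global-batch tokens per step divided by step time and GPU count.

```python
# Reader-side sanity check (not part of the Automodel repo). Assumes the column
# definitions below; numbers are taken from the GPT-OSS 20B pre-training row.
PEAK_BF16_TFLOPS_H100 = 989  # peak used for MFU in the note above

def tokens_per_sec_per_gpu(gbs: int, seq_len: int, step_time_s: float, num_gpus: int) -> float:
    """Assumed definition: tokens in one global batch / (step time * #GPUs)."""
    return gbs * seq_len / (step_time_s * num_gpus)

def mfu(achieved_tflops_per_gpu: float, peak: float = PEAK_BF16_TFLOPS_H100) -> float:
    """Glossary definition: achieved compute over peak hardware capability."""
    return achieved_tflops_per_gpu / peak

print(round(tokens_per_sec_per_gpu(gbs=256, seq_len=4096, step_time_s=10.04, num_gpus=8)))
# -> 13055, close to the 13,058 reported (difference is step-time rounding)
print(f"{mfu(279):.1%}")  # -> 28.2% MFU for GPT-OSS 20B on DGX-H100
```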
