feat(recipe): DSV3 GB200 MXFP8 full-iter CG recipe by dingqingy-nv · Pull Request #4226 · NVIDIA-NeMo/Megatron-Bridge

dingqingy-nv · 2026-06-09T04:28:50Z

Summary

Mirrors the GB300 MXFP8 full-iter CG recipe (PR #3983) on GB200. Recipe shape matches GB300 except for recompute_modules=["mla_up_proj"] to fit GB200's smaller HBM budget; GB300 can run with recompute_modules=[].

Changes

scripts/performance/configs/deepseek/deepseek_workload_base_configs.py

DEEPSEEK_V3_PRETRAIN_CONFIG_GB200_FP8_MX_V1 no longer aliases the bf16 V1 config. It now enables:
- cuda_graph_impl="full_iteration", cuda_graph_scope=[]
- moe_a2a_overlap=True
- cutedsl_fused_grouped_mlp=True
- recompute_modules=["mla_up_proj"]
V2 (GBS=4096) and VR200 mxfp8 V1/V2 inherit via the existing replace(..., global_batch_size=4096) / alias chain.
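The inheritance chain described above can be sketched with `dataclasses.replace`. This is a minimal illustration rather than the repository's code: `WorkloadConfig`, its default values, and the shortened constant names are hypothetical stand-ins, and only the field names and overridden values come from this description.

```python
from dataclasses import dataclass, field, replace


@dataclass
class WorkloadConfig:
    """Hypothetical stand-in for the real workload base config class."""

    global_batch_size: int = 2048  # placeholder, not the real V1 value
    cuda_graph_impl: str = "none"  # placeholder default
    cuda_graph_scope: list = field(default_factory=list)
    moe_a2a_overlap: bool = False
    cutedsl_fused_grouped_mlp: bool = False
    recompute_modules: list = field(default_factory=list)


# bf16 V1 baseline (placeholder values).
GB200_V1 = WorkloadConfig()

# The mxfp8 V1 config overrides the baseline instead of aliasing it.
GB200_FP8_MX_V1 = replace(
    GB200_V1,
    cuda_graph_impl="full_iteration",
    cuda_graph_scope=[],
    moe_a2a_overlap=True,
    cutedsl_fused_grouped_mlp=True,
    recompute_modules=["mla_up_proj"],
)

# V2 changes only the batch size, so it keeps every mxfp8 override.
GB200_FP8_MX_V2 = replace(GB200_FP8_MX_V1, global_batch_size=4096)

# A plain alias picks up whatever the aliased config carries.
VR200_FP8_MX_V2 = GB200_FP8_MX_V2
```

Because `replace` copies every field that is not named in the call, a V2 built from the mxfp8 V1 config inherits the full-iteration CUDA graph settings, whereas a V2 aliased to the bf16 V2 config would not.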

scripts/performance/configs/deepseek/deepseek_llm_pretrain.py

Adds the fp8_output_proj=True gate for the GB200 mxfp8 recipe, mirroring the GB300 gate that landed in PR #3983 ([perf, recipe] feat: DSV3 GB300 MXFP8 full-iter CG recipe).

Measured impact

64 GB200 nodes / 256 GPUs, DSv3-671B mxfp8, GBS=4096 (V2). Steady-state iters 14-19, averaged.

| variant | iter (s) | TF/s/GPU |
|---|---:|---:|
| partial CG + `[core_attn]` offload + `[mlp]` recompute | 13.89 | 1226 |
| full CG + `[core_attn,attn_proj]` offload + no recompute | 15.10 | 1128 |
| full CG + no offload + `[mla_up_proj]` recompute (this PR) | 12.94 | 1316 |

The third row is what this recipe ships: +7.3% throughput vs the prior partial-CG baseline, and a close match to the MLPerf reference config (~14 s/iter on similar parallelism shape).
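The +7.3% figure follows directly from the table. A short check, using only the measured values quoted above (the dictionary keys and the helper name `relative_gain` are labels chosen for this sketch):

```python
# (iter time in seconds, TF/s/GPU), copied from the table above.
results = {
    "partial_cg_offload_recompute": (13.89, 1226),
    "full_cg_offload_no_recompute": (15.10, 1128),
    "full_cg_mla_up_proj_recompute": (12.94, 1316),
}


def relative_gain(new: float, base: float) -> float:
    """Percentage change of `new` relative to `base`."""
    return (new / base - 1.0) * 100.0


base_time, base_tflops = results["partial_cg_offload_recompute"]
for name, (iter_time, tflops) in results.items():
    gain = relative_gain(tflops, base_tflops)
    speedup = base_time / iter_time
    print(f"{name}: throughput {gain:+.1f}%, iter-time speedup {speedup:.3f}x")
```

The same ratio appears in both columns (1316 / 1226 and 13.89 / 12.94 are each about 1.073), which is expected when TF/s/GPU is derived from iteration time at a fixed workload.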

Test plan

- Empirical perf measured at 64 GB200 nodes, 256 GPUs (see iter-time / TF/s/GPU table above)
- Loss curve healthy through iter 20 (iter 20 lm loss 8.149, mtp_1 0.081, grad norm 0.391)
- `ruff check` and `ruff format --check` clean on both files
- NeMo CI L0 (will run on push)

Mirror the GB300 MXFP8 full-iter CG recipe (PR NVIDIA-NeMo#3983) on GB200, with mla_up_proj recompute substituted for the no-recompute / no-offload strategy GB300 can afford with its larger HBM budget.

GB200 mxfp8 V1 (and by inheritance V2 / VR200 mxfp8 V1/V2):
- cuda_graph_impl=full_iteration, cuda_graph_scope=[]
- moe_a2a_overlap=True, cutedsl_fused_grouped_mlp=True
- recompute_modules=["mla_up_proj"]
- fp8_output_proj=True gated on mxfp8 recipe in deepseek_v3_pretrain_config_gb200 (mirrors GB300 gate)

Measured on 64 GB200 nodes / 256 GPUs, DSv3-671B mxfp8, GBS=4096:

| variant | iter (s) | TF/s/GPU |
|------------------------------------------------------|---------:|---------:|
| partial CG + [core_attn] offload + [mlp] recompute | 13.89 | 1226 |
| full CG + [core_attn,attn_proj] offload + no recomp | 15.10 | 1128 |
| full CG + no offload + [mla_up_proj] recompute | 12.94 | 1316 |

The third row is what this recipe ships: +7.3% throughput vs the partial-CG baseline (and the closest match to the MLPerf reference config that runs ~14 s/iter).

Signed-off-by: Dingqing Yang <dingqingy@nvidia.com>

V2 was aliased directly to GB200_V2 (transformer_engine CG), bypassing the full-iteration CG / a2a overlap / cutedsl-fused-grouped-mlp / mla_up_proj-recompute overrides on FP8_MX_V1. Restore the replace(FP8_MX_V1, global_batch_size=4096) form so CONFIG_VARIANT=v2 exercises the same recipe as v1, with the V2 batch size.

Signed-off-by: Dingqing Yang <dingqingy@nvidia.com>

dingqingy-nv · 2026-06-09T23:39:20Z

/claude review

claude · 2026-06-09T23:43:51Z

+DEEPSEEK_V3_PRETRAIN_CONFIG_GB200_FP8_MX_V1 = replace(
+    DEEPSEEK_V3_PRETRAIN_CONFIG_GB200_V1,
+    cuda_graph_impl="full_iteration",
+    cuda_graph_scope=[],
+    moe_a2a_overlap=True,
+    cutedsl_fused_grouped_mlp=True,
+    recompute_modules=["mla_up_proj"],
+)


Question: GB300 MX V1 sets fp8_dot_product_attention=True (line 74) but the new GB200 MX V1 does not. The PR description says "Recipe shape matches GB300 except for recompute_modules=["mla_up_proj"]" — is the omission of fp8_dot_product_attention intentional (e.g. incompatible with GB200's PP=4/VP=4 layout), or should it be added here?

claude · 2026-06-09T23:45:33Z

Light Code Review

Clean, well-scoped change. The GB200 MX V1 config correctly mirrors the GB300 full-iter CG recipe shape with the expected recompute_modules adjustment for GB200's smaller HBM, and V2 properly inherits from V1 with only global_batch_size=4096 changed. The fp8_output_proj gate in the GB200 config function mirrors the existing GB300 gate.

Questions

- Missing fp8_dot_product_attention=True vs GB300 (inline comment posted): DEEPSEEK_V3_PRETRAIN_CONFIG_GB300_FP8_MX_V1 sets fp8_dot_product_attention=True (line 74) but the new GB200_FP8_MX_V1 does not. The PR description states the recipe matches GB300 except for recompute_modules. Is this omission intentional?
- VR200 missing fp8_output_proj gate (pre-existing, but now more impactful): DEEPSEEK_V3_PRETRAIN_CONFIG_VR200_FP8_MX_V2 (line 218) aliases GB200_FP8_MX_V2, which now carries full-iter CG + mxfp8 settings. However, deepseek_v3_pretrain_config_vr200() (line 144) lacks the fp8_output_proj=True gate that both GB300 (line 81) and GB200 (line 121) have. If fp8_output_proj is needed for correctness with mxfp8, VR200 MX V2 would be missing it. This is pre-existing but worth noting since VR200 MX V2 now actually uses the mxfp8-tuned config rather than a plain alias.

Suggested test cases

No perf tests impacted. The changed configs (DEEPSEEK_V3_PRETRAIN_CONFIG_GB200_FP8_MX_V1, V2, and transitive VR200_FP8_MX_V2) are 256-GPU / 128-GPU scale perf recipes with no corresponding L0/L1/L2 functional test launch scripts. The existing test_deepseek_v3_perf_config_instantiation unit test only covers H100 BF16. Consider extending it to cover GB200 mxfp8 instantiation to catch config wiring issues in CI.
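One way such an instantiation test could guard against the V2 aliasing regression is to assert that V2 differs from V1 in exactly one field. The sketch below is an assumption about how that check might look, not the existing test: `WorkloadConfig` is a hypothetical stand-in for the real config class, `changed_fields` is a helper introduced here, and the base batch size of 2048 is a placeholder.

```python
from dataclasses import dataclass, field, fields, replace


@dataclass
class WorkloadConfig:
    """Hypothetical stand-in; the real class has many more fields."""

    global_batch_size: int = 2048  # placeholder value
    cuda_graph_impl: str = "none"  # placeholder default
    cuda_graph_scope: list = field(default_factory=list)
    moe_a2a_overlap: bool = False
    cutedsl_fused_grouped_mlp: bool = False
    recompute_modules: list = field(default_factory=list)


def changed_fields(a, b) -> set:
    """Names of dataclass fields whose values differ between a and b."""
    return {f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)}


def test_v2_differs_from_v1_only_in_batch_size():
    v1 = replace(
        WorkloadConfig(),
        cuda_graph_impl="full_iteration",
        moe_a2a_overlap=True,
        cutedsl_fused_grouped_mlp=True,
        recompute_modules=["mla_up_proj"],
    )
    v2 = replace(v1, global_batch_size=4096)
    # An alias to a different base config would change more than one field.
    assert changed_fields(v1, v2) == {"global_batch_size"}
```

In the real test, `v1` and `v2` would be the imported `DEEPSEEK_V3_PRETRAIN_CONFIG_GB200_FP8_MX_V1` and its V2 counterpart rather than locally built objects.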

dingqingy-nv requested review from ko3n1g and malay-nagda June 9, 2026 04:31

yaoyu-33 added the area:perf (Performance optimizations and benchmarking), feature (New capabilities, enhancements, or enablement work), and needs-review (PR is ready for code review and waiting on a reviewer) labels Jun 9, 2026

malay-nagda previously approved these changes Jun 9, 2026

dingqingy-nv dismissed malay-nagda’s stale review via dece659 June 9, 2026 21:21

claude Bot reviewed Jun 9, 2026

dingqingy-nv force-pushed the dsv3-gb200-mxfp8-fullcg branch from 20c8fb7 to dece659 June 10, 2026 00:48

malay-nagda approved these changes Jun 10, 2026

dingqingy-nv merged commit afb5dd4 into NVIDIA-NeMo:main Jun 10, 2026
86 checks passed

svcnvidia-nemo-ci pushed a commit that referenced this pull request Jun 10, 2026

feat(recipe): DSV3 GB200 MXFP8 full-iter CG recipe (#4226)

567d024

Signed-off-by: Dingqing Yang <dingqingy@nvidia.com>

dingqingy-nv mentioned this pull request Jun 10, 2026

cp: GB300/GB200 MXFP8 full-iter CG recipes (#3983 + #4226) into r0.5.0 #4187

Merged

vasunvidia pushed a commit to vasunvidia/Megatron-Bridge that referenced this pull request Jun 10, 2026

feat(recipe): DSV3 GB200 MXFP8 full-iter CG recipe (NVIDIA-NeMo#4226)

cd3fc0c

Signed-off-by: Dingqing Yang <dingqingy@nvidia.com> Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

dingqingy-nv merged 2 commits into NVIDIA-NeMo:main from dingqingy-nv:dsv3-gb200-mxfp8-fullcg
