feat(model): thread weight_dtype through HF export for plain-dtype DeepSeek-V4 output by Meirtz · Pull Request #4301 · NVIDIA-NeMo/Megatron-Bridge

Meirtz · 2026-06-11T11:01:32Z

What

Thread weight_dtype: Optional[torch.dtype] = None through the HF export path — export_hf_weights / save_hf_pretrained / stream_weights_megatron_to_hf — carried per-task via a new optional WeightConversionTask.weight_dtype field. When set, the DeepSeek-V4 bridge emits plain weights in that dtype (no *.scale companions) instead of re-creating the source repo's quantized layout. Default (None) keeps today's behavior. CLI: --export-weight-dtype on the export subcommand.
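A minimal sketch of how a `--export-weight-dtype` option on an `export` subcommand could stay optional end to end, with `None` meaning "keep today's behavior". The program name, the allowed choices, and the `resolve_weight_dtype` helper are assumptions for illustration, not the PR's actual CLI code.

```python
import argparse
from typing import Optional

# Hypothetical set of accepted names; the PR only shows `bfloat16` in use.
ALLOWED_DTYPES = ("bfloat16", "float16", "float32")


def build_parser() -> argparse.ArgumentParser:
    """Build a toy CLI with an `export` subcommand and the optional dtype flag."""
    parser = argparse.ArgumentParser(prog="convert")  # hypothetical program name
    subcommands = parser.add_subparsers(dest="command", required=True)
    export = subcommands.add_parser("export")
    # Default None: the exporter keeps re-creating the quantized layout.
    export.add_argument("--export-weight-dtype", choices=ALLOWED_DTYPES, default=None)
    return parser


def resolve_weight_dtype(name: Optional[str]):
    """Map the CLI string to a torch dtype, or pass None through unchanged."""
    if name is None:
        return None
    import torch  # imported lazily so the default path needs no torch

    return getattr(torch, name)


default_args = build_parser().parse_args(["export"])
bf16_args = build_parser().parse_args(["export", "--export-weight-dtype", "bfloat16"])
```

Keeping the default at `None` rather than a concrete dtype is what lets the flag be additive: callers who never pass it see no change.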

Why

DSv4 HF export unconditionally re-creates the source repo's quantized weight/scale layout (maybe_modify_converted_hf_weight → requantize_hf_weight_scale_pairs, from #3969). That's right for checkpoint conversion, but bf16-SFT'd weights get silently post-hoc quantized — a user found *.scale tensors in their SFT export and asked about train/inference parity.

Design (revised after reviewer feedback): the requantize hook runs on both export consumers — online weight streaming to rollout engines (export_hf_weights, e.g. verl RL weight sync) and on-disk checkpoints (save_hf_pretrained) — so a bridge-level boolean cannot configure them independently. A dtype-typed parameter on each public API lets callers choose per path (e.g. bf16 to rollout for RL parity, quantized to disk for serving-format artifacts, or vice versa). Hook signatures are unchanged (the dtype rides on the task), so the other bridges overriding this hook (dsv3, gemma4, kimi, mimo, flux) are unaffected; DSv3 can adopt the same field later.
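The design above can be sketched roughly as follows. This is a torch-free toy, not the bridge's code: `Tensor`, `Task`, `requantize`, and `export` are hypothetical stand-ins, dtypes are plain strings, and the hook's `(task, weights)` signature and the `.weight` to `.scale` naming are assumptions. What it tries to show is that the dtype travels on the task, so the hook signature does not change and each export call can choose independently.

```python
from dataclasses import dataclass, replace
from typing import Dict, Iterator, List, Optional, Tuple

FLOAT_DTYPES = {"float32", "bfloat16", "float16"}


@dataclass(frozen=True)
class Tensor:
    """Toy tensor: only the dtype matters for this sketch."""
    dtype: str
    data: tuple = ()

    def to(self, dtype: str) -> "Tensor":
        return Tensor(dtype, self.data)

    def is_floating_point(self) -> bool:
        return self.dtype in FLOAT_DTYPES


@dataclass(frozen=True)
class Task:
    """Toy analogue of WeightConversionTask with the new optional field."""
    param_name: str
    weight_dtype: Optional[str] = None


def requantize(weights: Dict[str, Tensor]) -> Dict[str, Tensor]:
    """Placeholder for the default path: emit weight/scale pairs."""
    out: Dict[str, Tensor] = {}
    for name, tensor in weights.items():
        if tensor.is_floating_point() and name.endswith(".weight"):
            out[name] = tensor.to("float8_e4m3fn")
            out[name[: -len(".weight")] + ".scale"] = Tensor("float8_e8m0fnu")
        else:
            out[name] = tensor
    return out


def maybe_modify_converted_hf_weight(task: Task, weights: Dict[str, Tensor]) -> Dict[str, Tensor]:
    """Hook with an unchanged signature: the dtype is read from the task."""
    if task.weight_dtype is None:
        return requantize(weights)
    # Plain export: cast floating-point tensors, leave everything else untouched.
    return {
        name: tensor.to(task.weight_dtype) if tensor.is_floating_point() else tensor
        for name, tensor in weights.items()
    }


def export(tasks: List[Task], weights_by_param: Dict[str, Dict[str, Tensor]],
           weight_dtype: Optional[str] = None) -> Iterator[Tuple[str, Tensor]]:
    """Public entry point: stamps the caller's dtype onto each frozen task."""
    for task in tasks:
        if weight_dtype is not None:
            task = replace(task, weight_dtype=weight_dtype)
        yield from maybe_modify_converted_hf_weight(task, weights_by_param[task.param_name]).items()


tasks = [Task("q_proj")]
weights = {"q_proj": {"q_proj.weight": Tensor("float32"), "tid2eid": Tensor("int32")}}
to_rollout = dict(export(tasks, weights, weight_dtype="bfloat16"))  # plain bf16
to_disk = dict(export(tasks, weights))                              # quantized layout
```

The two calls at the end use the same tasks and the same hook yet produce different layouts, which a single bridge-level boolean could not do.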

Verified

- The saver streams only yielded tensors, so omitting .scale keys is safe.
- The exported config.json is built fresh (torch_dtype: bfloat16, no quantization fields).
- The safetensors index is regenerated from the written tensors.
- Unit tests cover the dtype-set path (plain weights, non-float tensors untouched) and the default (requantize) path.

Notes

AI-assisted (Claude); analysis, validation and review by the human author.

🤖 Generated with Claude Code

Meirtz · 2026-06-11T11:44:04Z

Reworked per reviewer feedback (offline discussion): the hook serves two export consumers — online weight streaming to rollout engines and on-disk checkpoints — so the bridge-level boolean is gone. Now weight_dtype: Optional[torch.dtype] on export_hf_weights / save_hf_pretrained, carried per-task (WeightConversionTask.weight_dtype), hook signatures unchanged. d10a8e7e.

Meirtz · 2026-06-11T13:59:07Z

Full-model E2E validation (DeepSeek-V4-Flash, 43 layers, real weights, TP1/PP4/EP8 on 8×GB300; same imported Megatron checkpoint for both runs):

| export | tensors | `.scale` tensors | dtypes |
|---|---|---|---|
| default (quantized) | 69,187 | 34,167 | F8_E4M3, F8_E8M0, I8 (MXFP4), BF16, F32, I32 |
| `--export-weight-dtype bfloat16` | 35,020 | 0 | BF16, I32 |

35,020 + 34,167 = 69,187 — the bf16 artifact contains exactly every weight with no scale companions (I32 = tid2eid routing buffers, correctly left untouched).
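One way such a count could be reproduced is by reading the `weight_map` of each export's `model.safetensors.index.json`. The sketch below assumes scale companions are identified by a `.scale` name suffix; the helper names and the toy tensor names are illustrative, not taken from the run.

```python
import json
from pathlib import Path
from typing import Dict, Iterable

SCALE_SUFFIX = ".scale"  # assumed naming of the companion tensors


def summarize_tensor_names(names: Iterable[str]) -> Dict[str, int]:
    """Count all tensors, scale companions, and the remaining plain tensors."""
    names = list(names)
    scale = sum(1 for name in names if name.endswith(SCALE_SUFFIX))
    return {"total": len(names), "scale": scale, "plain": len(names) - scale}


def summarize_index(index_path: str) -> Dict[str, int]:
    """Summarize a safetensors index file via its tensor-name to shard map."""
    weight_map = json.loads(Path(index_path).read_text())["weight_map"]
    return summarize_tensor_names(weight_map.keys())


# Toy data standing in for the two exports.
quantized = summarize_tensor_names(["a.weight", "a.scale", "b.weight", "b.scale", "tid2eid"])
plain = summarize_tensor_names(["a.weight", "b.weight", "tid2eid"])

# Parity condition: the plain export holds exactly the non-scale tensors.
parity_ok = plain["scale"] == 0 and plain["total"] == quantized["plain"]
```

With the figures in the table, the same condition reads 69,187 - 34,167 = 35,020.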

Two notes from the run: (1) the smoke test caught a real bug in the first version of this PR: WeightConversionTask is a frozen dataclass, so the dtype is now applied via dataclasses.replace (ce66e82), and the unit tests now use real task instances instead of mocks; (2) with weight_dtype set, the saver's completeness check counts the source's .scale entries as "not written" (the export proceeds with --not-strict). This is cosmetic and can be polished in a follow-up by excluding scale keys from the expected set.
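The follow-up mentioned in note (2) might look something like the sketch below. It is not the saver's actual check: `missing_keys`, its parameters, and the `.scale` suffix test are assumptions used only to illustrate dropping scale keys from the expected set when a plain dtype was requested.

```python
from typing import Iterable, List


def missing_keys(expected: Iterable[str], written: Iterable[str],
                 plain_export: bool, scale_suffix: str = ".scale") -> List[str]:
    """Return expected tensor names that were not written.

    When a plain weight dtype was requested, scale companions are
    intentionally absent, so they are removed from the expected set first.
    """
    expected_set = set(expected)
    if plain_export:
        expected_set = {key for key in expected_set if not key.endswith(scale_suffix)}
    return sorted(expected_set - set(written))


source_keys = ["a.weight", "a.scale", "tid2eid"]
written_keys = ["a.weight", "tid2eid"]

# Current behavior as described above: the scale key is reported as missing.
reported_today = missing_keys(source_keys, written_keys, plain_export=False)
# With the proposed exclusion, the plain export counts as complete.
reported_after = missing_keys(source_keys, written_keys, plain_export=True)
```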

DSv4 HF export unconditionally re-creates the source repo's quantized weight/scale layout (FP8 attention / MXFP4 experts), so bf16-SFT'd weights get post-hoc quantized and the artifact carries *.scale tensors the training never saw. Add DeepSeekV4Bridge.export_quantized (default True, behavior unchanged) and a --no-quantized-export flag on the export CLI so SFT products can be exported as plain bf16 with exact train/inference parity.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Lingrui Mei <lmei@nvidia.com>

…epSeek-V4 output

Rework after review: the requantize hook runs on BOTH export consumers, namely online weight streaming to rollout engines (export_hf_weights) and on-disk checkpoints (save_hf_pretrained), so a bridge-level boolean cannot configure them independently. Add weight_dtype: Optional[torch.dtype] to export_hf_weights / save_hf_pretrained / stream_weights_megatron_to_hf, carried per-task via a new optional WeightConversionTask.weight_dtype field (hook signatures unchanged; other bridges unaffected). The DeepSeek-V4 bridge emits plain weights in that dtype (no *.scale) when set, and keeps re-creating the source repo's quantized layout by default. CLI: --export-weight-dtype on the export subcommand.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Lingrui Mei <lmei@nvidia.com>

…s.replace

Caught by a full-model export smoke (the unit tests used MagicMock tasks, which do not enforce frozen). Tests now use real WeightConversionTask instances.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Lingrui Mei <lmei@nvidia.com>
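The failure mode this commit describes can be reproduced in a few lines. The `Task` class below is a hypothetical stand-in for WeightConversionTask, and the assignment-based helper is a guess at what the first version presumably did. It shows why a MagicMock hides the problem while a real frozen dataclass exposes it, and how `dataclasses.replace` avoids it.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from typing import Optional
from unittest.mock import MagicMock


@dataclass(frozen=True)
class Task:
    """Hypothetical stand-in for the frozen WeightConversionTask."""
    param_name: str
    weight_dtype: Optional[str] = None


def set_dtype_by_assignment(task, dtype):
    """Mutating version: fails on a frozen dataclass."""
    task.weight_dtype = dtype
    return task


def set_dtype_by_replace(task, dtype):
    """Non-mutating version: builds a new task with the field changed."""
    return replace(task, weight_dtype=dtype)


# A MagicMock accepts any attribute assignment, so a test built on it passes.
mock_task = set_dtype_by_assignment(MagicMock(), "bfloat16")

# The real frozen dataclass rejects the same assignment.
real_task = Task("q_proj")
try:
    set_dtype_by_assignment(real_task, "bfloat16")
    assignment_failed = False
except FrozenInstanceError:
    assignment_failed = True

updated_task = set_dtype_by_replace(real_task, "bfloat16")
```

Because `replace` returns a new instance, the original task is left untouched, which also keeps one export call from leaking its dtype into another.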

Meirtz · 2026-06-11T18:29:36Z

/ok to test 1b93e3c

The SimpleNamespace stand-in lacks the new weight_dtype field, so the export hook's task.weight_dtype access raises AttributeError in the pre-existing quantized-export tests. Constructing the real (frozen) dataclass keeps the helper in sync with future field additions.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Lingrui Mei <lmei@nvidia.com>

Meirtz · 2026-06-11T19:52:00Z

/ok to test 979e77d

export_hf_weights/save_hf_pretrained/save_hf_weights now forward the new weight_dtype kwarg, so the exact-call mock assertions need it.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Lingrui Mei <lmei@nvidia.com>
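Why exact-call assertions break when a kwarg starts being forwarded can be seen in a small, self-contained case. The wrapper below only borrows the function names from this commit message; its signature and arguments are assumptions for illustration.

```python
from unittest.mock import MagicMock


def save_hf_pretrained(bridge, model, path, weight_dtype=None):
    """Toy wrapper that forwards the new kwarg even when it is None."""
    bridge.save_hf_weights(model, path, weight_dtype=weight_dtype)


bridge = MagicMock()
save_hf_pretrained(bridge, "model", "/tmp/out")

# Updated assertion: matches because the forwarded kwarg is listed.
bridge.save_hf_weights.assert_called_once_with("model", "/tmp/out", weight_dtype=None)

# Pre-change assertion: fails, since the recorded call now carries an extra kwarg.
try:
    bridge.save_hf_weights.assert_called_once_with("model", "/tmp/out")
    old_assertion_passes = True
except AssertionError:
    old_assertion_passes = False
```

`assert_called_once_with` compares positional and keyword arguments exactly, so an explicitly passed `weight_dtype=None` is not equivalent to omitting it.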

Meirtz · 2026-06-11T21:36:20Z

/ok to test 5aeb8d1

Meirtz added the feature (New capabilities, enhancements, or enablement work) and area:model (Model implementations and HF bridge logic) labels Jun 11, 2026

yaoyu-33 added the needs-review (PR is ready for code review and waiting on a reviewer) label Jun 11, 2026

Meirtz changed the title ~~feat(model): optional bf16 (non-quantized) HF export for DeepSeek-V4~~ feat(model): thread weight_dtype through HF export for plain-dtype DeepSeek-V4 output Jun 11, 2026

Meirtz and others added 3 commits June 12, 2026 02:22

Meirtz force-pushed the fix/dsv4-bf16-export branch from ce66e82 to 1b93e3c Compare June 11, 2026 18:28

Meirtz requested review from cuichenx and yaoyu-33 June 11, 2026 18:38

Meirtz wants to merge 5 commits into NVIDIA-NeMo:main from Meirtz:fix/dsv4-bf16-export
