fix: save_hf_weights drops boundary shards for MTP-less GLM-4.x glm4_moe_lite export by dinhxuanvu · Pull Request #4189 · NVIDIA-NeMo/Megatron-Bridge

dinhxuanvu · 2026-06-07T00:56:44Z

Summary

Exporting a GLM-4.x glm4_moe_lite model to HuggingFace safetensors via AutoBridge.save_hf_weights can silently write an incomplete checkpoint when the model was built without an MTP/nextn head.

GLM keeps its nextn layer as a regular decoder layer (model.layers.{num_hidden_layers}.*), not under the usual mtp.* prefix. An MTP-less model never produces those tensors, but they're still expected in the shard map — so the per-shard completeness check drops any shard that shares space with them, including ones holding embed_tokens, lm_head, and model.norm. The index stays self-consistent and no error is raised, so the loss is silent. For GLM-4.7-Flash this drops 3 of 48 shards.
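The failure mode can be reproduced in miniature. The following is a minimal sketch under stated assumptions, not the Megatron-Bridge implementation: `shards_written`, the toy shard map, and the toy key names are all hypothetical, and the gate is reduced to "write a shard only if every key it expects was produced".

```python
from collections import defaultdict


def shards_written(expected_map, produced_keys, ignored_prefixes=()):
    """Return the shard files a strict per-shard completeness gate would write.

    expected_map: source key -> shard filename (the expected sharding map).
    produced_keys: keys the export generator actually yields.
    ignored_prefixes: expected keys starting with these are not required.
    """
    required = defaultdict(set)
    for key, shard in expected_map.items():
        if not key.startswith(tuple(ignored_prefixes)):
            required[shard].add(key)
    produced = set(produced_keys)
    # A shard is written only if every key it still requires was produced.
    return sorted(s for s, keys in required.items() if keys <= produced)


NEXTN = "model.layers.2."  # toy model with num_hidden_layers = 2
expected = {
    "model.embed_tokens.weight": "shard-1",
    "model.layers.2.a.weight": "shard-1",  # nextn key sharing a shard with embed_tokens
    "model.layers.0.a.weight": "shard-2",
    "model.layers.1.a.weight": "shard-3",
    "model.norm.weight": "shard-3",
    "lm_head.weight": "shard-3",
    "model.layers.2.b.weight": "shard-3",  # nextn key sharing a shard with norm/lm_head
    "model.layers.2.c.weight": "shard-4",  # pure-nextn shard
}
# An MTP-less model never yields the nextn keys.
produced = [k for k in expected if not k.startswith(NEXTN)]

before = shards_written(expected, produced)
after = shards_written(expected, produced, ignored_prefixes=[NEXTN])
print(before)  # ['shard-2']
print(after)   # ['shard-1', 'shard-2', 'shard-3']
```

In this toy, the boundary shards disappear without any error until the nextn prefix is ignored, and the pure-nextn shard is omitted either way.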

See #4188 for the full breakdown.

Changes

Strip the GLM nextn layer (model.layers.{num_hidden_layers}.*) in addition to mtp.* when a model is exported without an MTP head.
Detect MTP-less models from the built model's mtp_num_layers (the old config-based check misread GLM's num_nextn_predict_layers=1 as MTP-enabled).
No-op for non-GLM models and for MTP-enabled exports.
Add unit and integration tests in test_auto_bridge.py.
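To illustrate why the detection source matters, here is a small sketch with hypothetical helper names (`config_says_mtp_enabled`, `built_model_omits_mtp`). It assumes only the attribute names quoted in this PR (`num_nextn_predict_layers`, `mtp_num_layers`) and is not the actual code in the change.

```python
from types import SimpleNamespace


def config_says_mtp_enabled(hf_config):
    """Config-based view (illustrative): trusts the HF config field."""
    return bool(getattr(hf_config, "num_nextn_predict_layers", 0))


def built_model_omits_mtp(model_config):
    """Built-model view (illustrative): a falsy or unset mtp_num_layers means no MTP head."""
    return not getattr(model_config, "mtp_num_layers", None)


hf_config = SimpleNamespace(num_nextn_predict_layers=1)  # value the PR cites for GLM
model_config = SimpleNamespace(mtp_num_layers=None)      # model built without an MTP head

config_view = config_says_mtp_enabled(hf_config)  # True: looks MTP-enabled
model_view = built_model_omits_mtp(model_config)  # True: no MTP head was built
```

The two views disagree for this case, which is why the sketch keys the decision on the built model rather than on the HF config.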

Test plan

New test_auto_bridge.py cases are mock-based and need no GPU.
Verified end-to-end: a GLM-4.7-Flash MTP-less export now writes a complete checkpoint, with only the un-built nextn layer omitted.
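Because the index stays self-consistent when shards are dropped, checking that every referenced file exists would not catch the bug. One possible spot check, sketched below, is to look for the boundary parameters in the index's `weight_map`. The helper name `missing_boundary_params` and the substring matching are assumptions for illustration; the `weight_map` layout is the standard HuggingFace sharded-checkpoint index format.

```python
import json
import tempfile
from pathlib import Path


def missing_boundary_params(index_path, required=("embed_tokens", "lm_head", "model.norm")):
    """Return the required name fragments that no key in the index contains."""
    weight_map = json.loads(Path(index_path).read_text())["weight_map"]
    return [name for name in required if not any(name in key for key in weight_map)]


# Toy index that only lists an interior layer, as an incomplete export might.
with tempfile.TemporaryDirectory() as tmp:
    index = Path(tmp) / "model.safetensors.index.json"
    index.write_text(
        json.dumps({"weight_map": {"model.layers.0.mlp.weight": "model-00002-of-00048.safetensors"}})
    )
    missing = missing_boundary_params(index)

print(missing)  # ['embed_tokens', 'lm_head', 'model.norm']
```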

Additional Information

Related to [bug] save_hf_weights drops boundary shards when exporting an MTP-less GLM-4.x glm4_moe_lite model #4188

copy-pr-bot · 2026-06-07T00:56:47Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

yaoyu-33 · 2026-06-09T05:16:39Z

/ok to test 96b7b68

dinhxuanvu · 2026-06-09T07:06:41Z

Pushed a commit to fix a ruff lint issue.

adityavavreNVDA · 2026-06-09T16:03:01Z

/ok to test dafd239

dinhxuanvu · 2026-06-09T21:23:54Z

Had to rebase the PR due to conflicts with recently merged PRs. The rebase only affected the test code.

gautham-kollu · 2026-06-11T18:22:43Z

/ok to test c5acab5

gautham-kollu · 2026-06-11T18:23:23Z

Enabled auto-merge. Once tests pass, should get merged automatically.

dinhxuanvu · 2026-06-12T04:26:42Z

The Launch_Unit_Tests_Core runner timed out because it took too long. Everything else passed. This looks like a flaky infrastructure failure. Unfortunately, I don't have permission to rerun the test.

gautham-kollu · 2026-06-15T17:31:10Z

/ok to test fff2d82

When a Megatron model is built without an MTP head (provider mtp_num_layers is None or 0, e.g. SkyRL's GLM-4.x glm4_moe_lite override), the export generator never yields the MTP/nextn tensors. For GLM the nextn layer is stored in the source safetensors index as a regular decoder layer at index num_hidden_layers (model.layers.{N}.*), not under a dedicated mtp.* prefix.

Because those keys remain in the expected source sharding map, the strict non-distributed save_generator completeness gate can never complete any shard that co-locates a nextn key with real (non-MTP) tensors. Those shards are dropped wholesale, taking pipeline-boundary params (embed_tokens, layer 0, last layer, final norm, lm_head) with them. For GLM-4.7-Flash this deterministically produced a 45/48-shard checkpoint missing shards 00001/00047/00048.

Generalize the existing mtp.* ignore mechanism:

- _model_omits_mtp(model_config): detect an MTP-less built model from the provider's mtp_num_layers (falsy => omitted), distinct from _config_disables_mtp which only inspects HF config fields.
- _mtp_source_key_prefixes(source, *configs): resolve the source-key prefixes to strip, covering both DeepSeek-style mtp.* and GLM's model.layers.{num_hidden_layers}.*, each gated on source.has_glob so only prefixes actually present are stripped.

Stripping model.layers.{N}. lets the boundary shards complete with their real keys (47-shard self-consistent index) while the pure-nextn shard is correctly omitted.

Signed-off-by: Vu Dinh <vudinh@outlook.com>
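The prefix resolution described in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `ToySource` and `resolve_mtp_prefixes` are hypothetical, `has_glob` is assumed to take a glob pattern and return a bool, and "first config that defines num_hidden_layers wins" is an assumed reading of the config ordering.

```python
from fnmatch import fnmatchcase
from types import SimpleNamespace


class ToySource:
    """Hypothetical stand-in for a source state that can answer glob queries."""

    def __init__(self, keys):
        self.keys = list(keys)

    def has_glob(self, pattern):
        return any(fnmatchcase(key, pattern) for key in self.keys)


def resolve_mtp_prefixes(source, *configs):
    """Return the source-key prefixes to strip, keeping only those actually present."""
    prefixes = []
    if source.has_glob("mtp.*"):
        prefixes.append("mtp.")
    for config in configs:
        n = getattr(config, "num_hidden_layers", None)
        if n is None:
            continue
        # Layers are 0-indexed, so index n is one past the last regular layer.
        prefix = f"model.layers.{n}."
        if source.has_glob(prefix + "*"):
            prefixes.append(prefix)
        break  # assumption: the first config defining num_hidden_layers wins
    return prefixes


cfg = SimpleNamespace(num_hidden_layers=2)
glm_like = ToySource(["model.layers.0.w", "model.layers.1.w", "model.layers.2.w"])
no_nextn = ToySource(["model.layers.0.w", "model.layers.1.w"])
mtp_style = ToySource(["model.layers.0.w", "mtp.0.w"])

glm_prefixes = resolve_mtp_prefixes(glm_like, cfg)    # ['model.layers.2.']
plain_prefixes = resolve_mtp_prefixes(no_nextn, cfg)  # []
mtp_prefixes = resolve_mtp_prefixes(mtp_style, cfg)   # ['mtp.']
```

Gating each prefix on the glob check keeps the function a no-op for sources that have neither layout.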

- fold ignored_source_key_prefixes selection inline
- document hf_config-before-model_config ordering and 0-indexed layer assumption; note _mtp_source_key_prefixes precondition
- add save_hf_weights integration tests asserting the nextn prefix is stripped only when the built model omits the MTP head
- use SimpleNamespace instead of Mock(spec=[]) for unset-attr check

Signed-off-by: Vu Dinh <vudinh@outlook.com>

Collapse a multi-line assert that ruff-format (v0.9.9) wants on a single line; no logic change. Signed-off-by: Vu Dinh <vudinh@outlook.com>

…tests

The _run_save_hf_weights helper assigned hf_pretrained.state directly, but state is a read-only property on PreTrainedBase (no setter), raising AttributeError. Patch it via PropertyMock on the mock subclass instead.

Signed-off-by: Vu Dinh <vudinh@outlook.com>
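For readers unfamiliar with the pattern, here is a self-contained sketch of patching a read-only property with PropertyMock. `Base` and `FakePretrained` are hypothetical stand-ins, not the real PreTrainedBase or the test helper from this PR.

```python
from unittest.mock import PropertyMock


class Base:
    """Stand-in for a class exposing a read-only property (no setter)."""

    @property
    def state(self):
        raise RuntimeError("would load real weights")


class FakePretrained(Base):
    """Test-only subclass; patching it leaves Base untouched."""


obj = FakePretrained()

# Direct assignment fails because the property defines no setter.
try:
    obj.state = {"w": 1}
    direct_assignment_failed = False
except AttributeError:
    direct_assignment_failed = True

# Properties live on the class, so the mock is installed on the subclass.
FakePretrained.state = PropertyMock(return_value={"w": 1})
patched_state = obj.state
```

Installing the mock on a subclass rather than on the base keeps the override local to the test object's type.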

dinhxuanvu · 2026-06-16T01:11:11Z

@gautham-kollu The CI revealed a failure on the test code. I pushed a commit to correct it. PTAL. Thanks.

github-actions Bot added the community-request label Jun 7, 2026

yaoyu-33 added the area:ckpt (Checkpoint conversion, loading, export, and save paths), bug (Something isn't working), and needs-review (PR is ready for code review and waiting on a reviewer) labels Jun 7, 2026

yaoyu-33 approved these changes Jun 8, 2026

yaoyu-33 added the ready-to-merge (PR is approved, current, and only waiting for CI to pass before merge) label Jun 8, 2026

yaoyu-33 previously approved these changes Jun 8, 2026

yaoyu-33 removed the needs-review (PR is ready for code review and waiting on a reviewer) label Jun 9, 2026

copy-pr-bot Bot temporarily deployed to public June 9, 2026 05:17 Inactive

copy-pr-bot Bot temporarily deployed to public June 9, 2026 06:21 Inactive

copy-pr-bot Bot temporarily deployed to public June 9, 2026 06:22 Inactive

copy-pr-bot Bot temporarily deployed to public June 9, 2026 06:43 Inactive

dinhxuanvu dismissed yaoyu-33’s stale review via beb4ae6 June 9, 2026 07:03

copy-pr-bot Bot temporarily deployed to public June 9, 2026 16:03 Inactive

copy-pr-bot Bot temporarily deployed to test June 9, 2026 16:04 Inactive

copy-pr-bot Bot temporarily deployed to public June 9, 2026 16:50 Inactive

copy-pr-bot Bot temporarily deployed to public June 9, 2026 16:51 Inactive

copy-pr-bot Bot temporarily deployed to public June 9, 2026 17:09 Inactive

dinhxuanvu force-pushed the vdinh/glm4moe-mtp-boundary-shard-fix branch from dafd239 to c5acab5 Compare June 9, 2026 21:22

gautham-kollu previously approved these changes Jun 11, 2026

gautham-kollu enabled auto-merge (squash) June 11, 2026 18:22

copy-pr-bot Bot temporarily deployed to public June 11, 2026 18:23 Inactive

copy-pr-bot Bot temporarily deployed to test June 11, 2026 18:23 Inactive

copy-pr-bot Bot temporarily deployed to public June 11, 2026 19:16 Inactive

copy-pr-bot Bot temporarily deployed to public June 11, 2026 19:36 Inactive

copy-pr-bot Bot temporarily deployed to public June 15, 2026 17:31 Inactive

copy-pr-bot Bot temporarily deployed to test June 15, 2026 17:32 Inactive

copy-pr-bot Bot temporarily deployed to public June 15, 2026 18:37 Inactive

copy-pr-bot Bot temporarily deployed to public June 15, 2026 18:40 Inactive

copy-pr-bot Bot temporarily deployed to public June 15, 2026 19:04 Inactive

auto-merge was automatically disabled June 16, 2026 01:05
Head branch was pushed to by a user without write access

dinhxuanvu dismissed gautham-kollu’s stale review via 0a34d2e June 16, 2026 01:05

dinhxuanvu added 4 commits June 15, 2026 21:07

style: apply ruff-format to test_auto_bridge.py

3a7f9a1

Collapse a multi-line assert that ruff-format (v0.9.9) wants on a single line; no logic change. Signed-off-by: Vu Dinh <vudinh@outlook.com>

dinhxuanvu force-pushed the vdinh/glm4moe-mtp-boundary-shard-fix branch from 0a34d2e to 2d19db5 Compare June 16, 2026 01:08

dinhxuanvu wants to merge 4 commits into NVIDIA-NeMo:main from dinhxuanvu:vdinh/glm4moe-mtp-boundary-shard-fix


5 participants
