
fix: save_hf_weights drops boundary shards for MTP-less GLM-4.x glm4_moe_lite export#4189

Open
dinhxuanvu wants to merge 4 commits into
NVIDIA-NeMo:mainfrom
dinhxuanvu:vdinh/glm4moe-mtp-boundary-shard-fix

@dinhxuanvu dinhxuanvu commented Jun 7, 2026


Summary

Exporting a GLM-4.x glm4_moe_lite model to HuggingFace safetensors via AutoBridge.save_hf_weights can silently write an incomplete checkpoint when the model was built without an MTP/nextn head.

GLM keeps its nextn layer as a regular decoder layer (model.layers.{num_hidden_layers}.*), not under the usual mtp.* prefix. An MTP-less model never produces those tensors, but they're still expected in the shard map — so the per-shard completeness check drops any shard that shares space with them, including ones holding embed_tokens, lm_head, and model.norm. The index stays self-consistent and no error is raised, so the loss is silent. For GLM-4.7-Flash this drops 3 of 48 shards.

See #4188 for the full breakdown.
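The failure mode described above can be reproduced on a toy shard map. The sketch below is not the bridge's code: `complete_shards`, the shard names, and the nextn tensor names are hypothetical, and `NUM_HIDDEN_LAYERS = 2` is a toy value. It assumes only what the summary states, namely that a shard is written when every key the index expects in it has been produced.

```python
from collections import defaultdict

NUM_HIDDEN_LAYERS = 2  # toy value; real decoder layers are 0 and 1


def complete_shards(shard_map, produced_keys):
    """Return the shards whose expected keys were all produced (strict gate)."""
    expected = defaultdict(set)
    for key, shard in shard_map.items():
        expected[shard].add(key)
    return sorted(s for s, keys in expected.items() if keys <= produced_keys)


# Hypothetical source index: the nextn layer sits at index NUM_HIDDEN_LAYERS
# and shares shards with boundary tensors.
shard_map = {
    "model.embed_tokens.weight": "shard-1",
    "model.layers.0.mlp.weight": "shard-1",
    "model.layers.2.nextn_a.weight": "shard-1",  # nextn key, co-located
    "model.layers.1.mlp.weight": "shard-2",
    "model.norm.weight": "shard-3",
    "lm_head.weight": "shard-3",
    "model.layers.2.nextn_b.weight": "shard-3",  # nextn key, co-located
}

# An MTP-less model never yields the nextn tensors.
nextn_prefix = f"model.layers.{NUM_HIDDEN_LAYERS}."
produced = {k for k in shard_map if not k.startswith(nextn_prefix)}

written = complete_shards(shard_map, produced)
print(written)
```

In this toy case only `shard-2` passes the gate, so `embed_tokens`, `model.norm`, and `lm_head` are lost even though all three were produced.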

Changes

  • Strip the GLM nextn layer (model.layers.{num_hidden_layers}.*) in addition to mtp.* when a model is exported without an MTP head.
  • Detect MTP-less models from the built model's mtp_num_layers (the old config-based check misread GLM's num_nextn_predict_layers=1 as MTP-enabled).
  • No-op for non-GLM models and for MTP-enabled exports.
  • Add unit and integration tests in test_auto_bridge.py.
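The difference between the two detection strategies in the list above can be illustrated with a small sketch. The function bodies here are assumptions written for illustration, not the PR's implementation; only the attribute names `num_nextn_predict_layers` and `mtp_num_layers` come from the description.

```python
from types import SimpleNamespace


def config_says_mtp_disabled(hf_config):
    """Config-based check (hypothetical): looks only at the HF config field."""
    return not getattr(hf_config, "num_nextn_predict_layers", 0)


def model_omits_mtp(model_config):
    """Built-model check (hypothetical): falsy mtp_num_layers means no MTP head."""
    return not getattr(model_config, "mtp_num_layers", None)


# GLM's HF config advertises one nextn layer...
hf_config = SimpleNamespace(num_nextn_predict_layers=1)
# ...but the model was built without an MTP head.
built_model_config = SimpleNamespace(mtp_num_layers=None)

config_result = config_says_mtp_disabled(hf_config)
model_result = model_omits_mtp(built_model_config)
print(config_result, model_result)
```

The config-based check reports MTP as enabled, while the built-model check reports that the head is absent, which is the case the export has to handle.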

Test plan

  • New test_auto_bridge.py cases are mock-based and need no GPU.
  • Verified end-to-end: a GLM-4.7-Flash MTP-less export now writes a complete checkpoint, with only the un-built nextn layer omitted.

Additional Information

@copy-pr-bot

copy-pr-bot Bot commented Jun 7, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@yaoyu-33 yaoyu-33 added the area:ckpt (Checkpoint conversion, loading, export, and save paths), bug (Something isn't working), and needs-review (PR is ready for code review and waiting on a reviewer) labels Jun 7, 2026
@yaoyu-33 yaoyu-33 added the ready-to-merge (PR is approved, current, and only waiting for CI to pass before merge) label Jun 8, 2026
yaoyu-33
yaoyu-33 previously approved these changes Jun 8, 2026
@yaoyu-33 yaoyu-33 removed the needs-review (PR is ready for code review and waiting on a reviewer) label Jun 9, 2026
@yaoyu-33

yaoyu-33 commented Jun 9, 2026

Contributor

/ok to test 96b7b68

@dinhxuanvu

Author

Pushed a commit to fix a ruff lint issue.

@adityavavreNVDA

Contributor

/ok to test dafd239

@dinhxuanvu

Author

Had to rebase the PR because of conflicts with recently merged PRs. The rebase only affected the test code.

gautham-kollu
gautham-kollu previously approved these changes Jun 11, 2026
@gautham-kollu

Contributor

/ok to test c5acab5

@gautham-kollu gautham-kollu enabled auto-merge (squash) June 11, 2026 18:22
@gautham-kollu

Contributor

Enabled auto-merge. Once the tests pass, this should get merged automatically.

@dinhxuanvu

dinhxuanvu commented Jun 12, 2026

Author

The Launch_Unit_Tests_Core runner timed out because it took too long. Everything else passed. This looks like a flaky-infra failure. Unfortunately, I don't have permission to rerun the test.

@gautham-kollu

Contributor

/ok to test fff2d82

auto-merge was automatically disabled June 16, 2026 01:05

Head branch was pushed to by a user without write access

When a Megatron model is built without an MTP head (provider
mtp_num_layers is None or 0, e.g. SkyRL's GLM-4.x glm4_moe_lite
override), the export generator never yields the MTP/nextn tensors.
For GLM the nextn layer is stored in the source safetensors index as a
regular decoder layer at index num_hidden_layers (model.layers.{N}.*),
not under a dedicated mtp.* prefix.

Because those keys remain in the expected source sharding map, the
strict non-distributed save_generator completeness gate can never
complete any shard that co-locates a nextn key with real (non-MTP)
tensors. Those shards are dropped wholesale, taking pipeline-boundary
params (embed_tokens, layer 0, last layer, final norm, lm_head) with
them. For GLM-4.7-Flash this deterministically produced a 45/48-shard
checkpoint missing shards 00001/00047/00048.

Generalize the existing mtp.* ignore mechanism:
- _model_omits_mtp(model_config): detect an MTP-less built model from the
  provider's mtp_num_layers (falsy => omitted), distinct from
  _config_disables_mtp which only inspects HF config fields.
- _mtp_source_key_prefixes(source, *configs): resolve the source-key
  prefixes to strip, covering both DeepSeek-style mtp.* and GLM's
  model.layers.{num_hidden_layers}.*, each gated on source.has_glob so
  only prefixes actually present are stripped.

Stripping model.layers.{N}. lets the boundary shards complete with their
real keys (47-shard self-consistent index) while the pure-nextn shard is
correctly omitted.
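The prefix resolution and stripping described in this commit message can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the PR's code: the helper names are hypothetical, a plain scan over the keys stands in for the `source.has_glob` gate, and the shard and nextn tensor names are made up.

```python
def mtp_source_key_prefixes(source_keys, num_hidden_layers):
    """Return the MTP/nextn prefixes that actually occur in the source keys."""
    candidates = [
        "mtp.",                                # DeepSeek-style dedicated prefix
        f"model.layers.{num_hidden_layers}.",  # GLM: nextn at 0-indexed layer N
    ]
    # Stand-in for the has_glob gate: keep only prefixes present in the source.
    return [p for p in candidates if any(k.startswith(p) for k in source_keys)]


def strip_prefixes(shard_map, prefixes):
    """Drop expected keys that start with any ignored prefix."""
    ignored = tuple(prefixes)  # an empty tuple matches nothing
    return {k: s for k, s in shard_map.items() if not k.startswith(ignored)}


shard_map = {
    "model.embed_tokens.weight": "shard-1",
    "model.layers.2.nextn_a.weight": "shard-1",  # nextn key in a boundary shard
    "model.layers.0.mlp.weight": "shard-2",
    "model.layers.1.mlp.weight": "shard-2",
    "model.layers.2.nextn_b.weight": "shard-3",  # pure-nextn shard
    "lm_head.weight": "shard-4",
    "model.layers.2.nextn_c.weight": "shard-4",  # nextn key in a boundary shard
}

prefixes = mtp_source_key_prefixes(shard_map, num_hidden_layers=2)
expected = strip_prefixes(shard_map, prefixes)
remaining_shards = sorted(set(expected.values()))
print(prefixes, remaining_shards)
```

The trailing dot in `model.layers.{N}.` matters: without it, a prefix for layer 2 would also match layer 20. After stripping, the boundary shards are expected to contain only their real keys, and the shard that held nothing but nextn tensors no longer appears in the expected set.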

Signed-off-by: Vu Dinh <vudinh@outlook.com>
- fold ignored_source_key_prefixes selection inline
- document hf_config-before-model_config ordering and 0-indexed layer
  assumption; note _mtp_source_key_prefixes precondition
- add save_hf_weights integration tests asserting the nextn prefix is
  stripped only when the built model omits the MTP head
- use SimpleNamespace instead of Mock(spec=[]) for unset-attr check

Signed-off-by: Vu Dinh <vudinh@outlook.com>
Collapse a multi-line assert that ruff-format (v0.9.9) wants on a single
line; no logic change.

Signed-off-by: Vu Dinh <vudinh@outlook.com>
…tests

The _run_save_hf_weights helper assigned hf_pretrained.state directly, but
state is a read-only property on PreTrainedBase (no setter), raising
AttributeError. Patch it via PropertyMock on the mock subclass instead.
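A minimal sketch of the test fix described above, using stand-in classes rather than the real `PreTrainedBase`. `Base` and `MockPretrained` are hypothetical names; `patch.object` with `new_callable=PropertyMock` is the standard `unittest.mock` idiom for replacing a property.

```python
from unittest.mock import Mock, PropertyMock, patch


class Base:
    """Stand-in for a base class exposing a read-only `state` property."""

    @property
    def state(self):
        raise RuntimeError("would load real weights")


class MockPretrained(Base):
    """Stand-in for the mock subclass used in the tests."""


# Direct assignment fails because the property defines no setter.
obj = MockPretrained()
try:
    obj.state = object()
    assigned = True
except AttributeError:
    assigned = False

# Patching the property on the class works and is undone on exit.
fake_state = Mock(name="state")
with patch.object(
    MockPretrained, "state", new_callable=PropertyMock, return_value=fake_state
):
    patched_value = MockPretrained().state

print(assigned, patched_value is fake_state)
```

Patching on the subclass rather than on the base class keeps the replacement scoped to the test double.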

Signed-off-by: Vu Dinh <vudinh@outlook.com>
@dinhxuanvu dinhxuanvu force-pushed the vdinh/glm4moe-mtp-boundary-shard-fix branch from 0a34d2e to 2d19db5 Compare June 16, 2026 01:08
@dinhxuanvu

Author

@gautham-kollu The CI revealed a failure in the test code. I pushed a commit to correct it. PTAL. Thanks.


Labels

area:ckpt (Checkpoint conversion, loading, export, and save paths), bug (Something isn't working), community-request, ready-to-merge (PR is approved, current, and only waiting for CI to pass before merge)


5 participants