
fix(deepseek): gate H100 fused kernel defaults #4338

Open: cuichenx wants to merge 2 commits into main from codex/dsv4-h100-disable-fused-kernels


@cuichenx cuichenx commented Jun 12, 2026

Contributor

Summary

  • Add a DeepSeek V4 helper that enables Blackwell-only fused kernels only when CUDA capability is 10.x or newer.
  • Use that helper for AutoBridge defaults so fused mHC and fused DSA default off on Hopper/H100 while staying on for Blackwell.
  • Apply the same mHC gating in DeepSeek V4 recipes; DSA remains explicitly unfused in recipes and RoPE behavior is unchanged.
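
The capability gate described in the Summary can be sketched as follows. This is a minimal illustration, not the PR's code: the function names below are hypothetical, and treating "no CUDA device" as unsupported is an assumption (the actual helper may choose a different fallback). Only `torch.cuda.is_available` and `torch.cuda.get_device_capability` are real PyTorch calls.

```python
from typing import Dict, Optional, Tuple

# SM100 (compute capability 10.x) is the Blackwell threshold named in the PR.
BLACKWELL_MAJOR = 10


def capability_supports_blackwell_fused_kernels(
    capability: Optional[Tuple[int, int]],
) -> bool:
    """Return True only for compute capability 10.x or newer.

    Assumption: None (no CUDA device visible) is treated as unsupported.
    """
    return capability is not None and capability[0] >= BLACKWELL_MAJOR


def current_cuda_capability() -> Optional[Tuple[int, int]]:
    """Return (major, minor) for the current CUDA device, or None if unavailable."""
    try:
        import torch  # imported lazily so the sketch also runs without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability()


def fused_kernel_defaults(capability: Optional[Tuple[int, int]]) -> Dict[str, bool]:
    """Hypothetical default flags: both fused paths follow the same gate."""
    enabled = capability_supports_blackwell_fused_kernels(capability)
    return {"use_fused_mhc": enabled, "apply_dsa_kernel_fusion": enabled}
```

Under these assumptions, `fused_kernel_defaults((9, 0))` turns both flags off for H100, while `fused_kernel_defaults((10, 0))` leaves them on for Blackwell.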

Validation

  • H100 inference comparison on DFW, 2 nodes / 16 H100 GPUs, TP=1 PP=1 EP=16, Bridge bfb60bb8, MCore dev 2f1004963dcb1718804f3b858f5fb2fc73819694, NeMo 26.06 rc3 container:
    • 12760509, use_fused_mhc=False, apply_rope_fusion=False, apply_dsa_kernel_fusion=True: failed before model construction with AssertionError: apply_dsa_kernel_fusion requires SM100+ (Blackwell or later), but current device has compute capability 9.0.
    • 12760510, use_fused_mhc=False, apply_rope_fusion=False, apply_dsa_kernel_fusion=False: completed one-token inference, after_model: cuda_allocated_gib=47.57 cuda_reserved_gib=48.36, generated Hello2, exit 0:0.
  • Containerized targeted tests passed: 58 tests from tests/unit_tests/models/deepseek/test_deepseek_v4_bridge.py and tests/unit_tests/recipes/test_deepseek_recipes.py in nvcr.io/nvidia/pytorch:26.04-py3.
  • uv run --no-sync pre-commit run --all-files passed.
  • uv run --no-sync ruff check src/megatron/bridge/models/deepseek/deepseek_v4_bridge.py src/megatron/bridge/recipes/deepseek/deepseek_v4.py tests/unit_tests/models/deepseek/test_deepseek_v4_bridge.py tests/unit_tests/recipes/test_deepseek_recipes.py passed.
  • uv run --no-sync ruff format --check src/megatron/bridge/models/deepseek/deepseek_v4_bridge.py src/megatron/bridge/recipes/deepseek/deepseek_v4.py tests/unit_tests/models/deepseek/test_deepseek_v4_bridge.py tests/unit_tests/recipes/test_deepseek_recipes.py passed.
  • git diff --check passed.
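
The two H100 runs above differ only in `apply_dsa_kernel_fusion`. A rough sketch of the kind of fail-fast check that separates them is shown below. It is not the MCore implementation: `check_dsa_fusion_supported` and `is_supported` are hypothetical names, and the sketch raises `ValueError` where the run above reported an `AssertionError`. The message text mirrors the one quoted in the failed run.

```python
from typing import Mapping, Tuple


def check_dsa_fusion_supported(
    flags: Mapping[str, bool], capability: Tuple[int, int]
) -> None:
    """Raise if fused DSA is requested on a device older than SM100."""
    major, minor = capability
    if flags.get("apply_dsa_kernel_fusion", False) and major < 10:
        raise ValueError(
            "apply_dsa_kernel_fusion requires SM100+ (Blackwell or later), "
            f"but current device has compute capability {major}.{minor}."
        )


def is_supported(flags: Mapping[str, bool], capability: Tuple[int, int]) -> bool:
    """Convenience wrapper that reports the check result as a boolean."""
    try:
        check_dsa_fusion_supported(flags, capability)
    except ValueError:
        return False
    return True


H100 = (9, 0)  # compute capability reported in the failed run

# Flag combinations taken from the two validation runs.
failing_run = {
    "use_fused_mhc": False,
    "apply_rope_fusion": False,
    "apply_dsa_kernel_fusion": True,
}
passing_run = {
    "use_fused_mhc": False,
    "apply_rope_fusion": False,
    "apply_dsa_kernel_fusion": False,
}
```

With these definitions, `is_supported(failing_run, H100)` is `False` and `is_supported(passing_run, H100)` is `True`, matching the outcome of the two runs.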

Signed-off-by: Chen Cui <chcui@nvidia.com>

copy-pr-bot Bot commented Jun 12, 2026


Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@cuichenx cuichenx marked this pull request as ready for review June 12, 2026 21:08
@cuichenx cuichenx requested review from Meirtz and weijiac0619 June 12, 2026 21:08

claude Bot commented Jun 12, 2026

Contributor

Code Review

[High] Existing recipe tests will fail on H100 CI runners.

  • Location: tests/unit_tests/recipes/test_deepseek_recipes.py:315-318 (_build_deepseek_v4_recipe).
  • Problem: _build_deepseek_v4_recipe monkeypatches AutoBridge but not deepseek_v4_supports_blackwell_fused_kernels. After this PR, use_fused_mhc is set dynamically by that helper. On H100 CI runners (sm_90, CUDA available), the helper returns False, so use_fused_mhc will be False, but five existing tests assert it is True.
  • Affected tests:
    • test_deepseek_v4_adam_mxfp8_recipe_uses_validated_optimizer_defaults (line 342)
    • test_deepseek_v4_muon_recipe_uses_validated_optimizer_defaults (line 372)
    • test_deepseek_v4_base_recipe_uses_blackwell_defaults (line 387)
    • test_deepseek_v4_flash_sft_recipe_uses_fused_mhc (line 423)
    • test_deepseek_v4_flash_no_mtp_sft_recipe_disables_mtp (line 435)
  • Suggested fix: Add monkeypatch.setattr for deepseek_v4_supports_blackwell_fused_kernels (lambda: True) inside _build_deepseek_v4_recipe.
  • Suggested test cases: No perf tests impacted.

weijiac0619 previously approved these changes Jun 12, 2026
Signed-off-by: Chen Cui <chcui@nvidia.com>

@Meirtz Meirtz left a comment

Member

LGTM, thanks!

@yaoyu-33 added labels Jun 12, 2026: area:perf (Performance optimizations and benchmarking), bug (Something isn't working), ready-to-merge (PR is approved, current, and only waiting for CI to pass before merge)
