
feat: support ascend npu sft #43

Merged
Randomizez merged 5 commits into stepfun-ai:dev from donpromax:ascend_sft_patch on Mar 27, 2026

@donpromax

Contributor

Summary

This PR adds Ascend/NPU support for SFT workloads in StepTronOSS.

It introduces an NPU runtime patch entrypoint, NPU-specific optimization backends for MoE dispatch and grouped GEMM, Ascend SFT experiment entrypoints, benchmark scripts, tests, and user-facing documentation.

Main Changes

  • add manual NPU runtime patching through steptronoss.utils.npu_patch.apply_npu_patch()
  • add grouped_gemm="npu_gmm" backed by MindSpeed npu_gmm_v2
  • add TokenDispatcher="npu_alltoall" for EP token routing on NPU
  • patch several CUDA-specific helper paths to work on NPU runtime
  • register NPU backends when apply_npu_patch() is called
  • add Ascend SFT experiment entrypoints for:
    • Qwen3-1.7B
    • Qwen3-30A3B
    • Step3 toy SFT
    • Step3.5-Flash Midtrain SFT with Muon
  • add a real-data preparation script for Qwen3 SFT smoke and validation runs
  • add NPU benchmarks and test cases for dispatcher and grouped GEMM
  • add user documentation:
    • docs/ASCEND.md
    • docs/ASCEND_ZH.md
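
A minimal sketch of the patch-then-select flow listed above, assuming a simple name-to-backend registry. Only the names `apply_npu_patch`, `npu_gmm`, and `npu_gmm_v2` come from this PR; the registry dictionary, `apply_npu_patch_stub`, and `select_backend` are hypothetical stand-ins so the example runs without Ascend hardware, and the project's real registration mechanism may differ.

```python
from typing import Callable, Dict

# Hypothetical backend registry; not the project's actual data structure.
GROUPED_GEMM_BACKENDS: Dict[str, Callable[[], str]] = {
    "default": lambda: "baseline grouped GEMM",
}


def apply_npu_patch_stub() -> None:
    """Stand-in for steptronoss.utils.npu_patch.apply_npu_patch().

    Registers the NPU backend only when called, mirroring the
    "register NPU backends when apply_npu_patch() is called" item.
    """
    GROUPED_GEMM_BACKENDS.setdefault("npu_gmm", lambda: "MindSpeed npu_gmm_v2")


def select_backend(name: str) -> Callable[[], str]:
    """Look up a backend and fail loudly if the patch was not applied first."""
    try:
        return GROUPED_GEMM_BACKENDS[name]
    except KeyError:
        raise RuntimeError(
            f"backend {name!r} is not registered; call apply_npu_patch() first"
        ) from None
```

In this sketch, selecting `"npu_gmm"` before the patch call raises an error, which is the ordering constraint the PR describes: patch first, then pick NPU backends.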

Implementation Notes

  • Ascend patching is now explicitly enabled by calling apply_npu_patch() before importing modules that depend on NPU runtime behavior or before selecting NPU backends
  • npu_alltoall builds a unique token-rank routing layout before communication to avoid duplicated routing to the same remote rank
  • the fast path uses MindSpeed npu_moe_token_permute / npu_moe_token_unpermute on the NPU bf16 path
  • the dispatcher keeps an index_select / index_add fallback path for compatibility
  • npu_gmm keeps the same semantic interface as the existing grouped_gemm path and normalizes batch_sizes internally
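
The unique token-rank layout mentioned above can be sketched in plain Python. This is an illustration of the idea, not the dispatcher's code: the function name `build_token_rank_layout` is hypothetical, and it assumes experts are placed contiguously, so expert `e` lives on rank `e // experts_per_rank`.

```python
from typing import Dict, List


def build_token_rank_layout(
    topk_experts: List[List[int]], experts_per_rank: int
) -> Dict[int, List[int]]:
    """Map each destination rank to the de-duplicated tokens it must receive."""
    layout: Dict[int, List[int]] = {}
    for token, experts in enumerate(topk_experts):
        # A token routed to several experts on one rank is sent to that rank once.
        # Assumption: expert e is hosted on rank e // experts_per_rank.
        for rank in sorted({e // experts_per_rank for e in experts}):
            layout.setdefault(rank, []).append(token)
    return layout


# token -> chosen expert ids (top-2), with 2 experts per rank
topk = [[0, 1], [1, 2], [3, 3]]
layout = build_token_rank_layout(topk, experts_per_rank=2)
naive_sends = sum(len(experts) for experts in topk)
unique_sends = sum(len(tokens) for tokens in layout.values())
```

Here token 0 picks two experts on rank 0 but is sent there once, so the example needs 4 token transfers instead of the 6 a per-expert send would issue.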

Tested Environment

Validated with:

  • CANN 8.3.RC2
  • torch_npu 2.8
  • MindSpeed v2.2.0_core_r0.12.1

All other dependencies are kept aligned with the project's uv environment.

Validation

  • added dispatcher unit tests in tests/test_npu_alltoall_dispatcher.py
  • added grouped GEMM correctness test in tests/test_grouped_gemm_npu.py
  • added microbenchmarks:
    • benchmarks/benchmark_dispatcher_npu.py
    • benchmarks/benchmark_grouped_gemm_npu.py
  • ran end-to-end SFT validation on representative Qwen3 and Step3.5 configs
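
For readers unfamiliar with grouped GEMM, the semantics a correctness test would compare against can be sketched as a per-group loop. This is a pure-Python reference under assumed conventions (`batch_sizes` holds per-group row counts and `b` holds one weight matrix per group); it is not the content of tests/test_grouped_gemm_npu.py, and the function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matmul: (m x k) @ (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def grouped_gemm_reference(a: Matrix, b: List[Matrix], batch_sizes: List[int]) -> Matrix:
    """Multiply consecutive row blocks of `a` by the matching per-group weight."""
    if sum(batch_sizes) != len(a) or len(batch_sizes) != len(b):
        raise ValueError("batch_sizes must cover all rows of a, one entry per group")
    out: Matrix = []
    start = 0
    for size, weight in zip(batch_sizes, b):
        # Rows [start, start + size) belong to this group (expert).
        out.extend(matmul(a[start:start + size], weight))
        start += size
    return out


a = [[1, 2], [3, 4], [5, 6]]
b = [[[1, 0], [0, 1]], [[2, 0], [0, 2]]]  # identity, then 2 * identity
out = grouped_gemm_reference(a, b, batch_sizes=[1, 2])
```

An optimized backend can then be checked element-wise against such a loop, within a dtype-appropriate tolerance.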

Training Results

Qwen3-1.7B SFT smoke test (910B compared to L40S):
[Figure: 910b_compare_to_l40s]

Step-3.5-Flash-Base-Midtrain SFT: tp8 pp8 vpp3 ep8, seq_len=8192, with recompute, sequence_parallel, and offload_optimizer_state
[Figure: image]

Kernel Optimization

NPU grouped GEMM speedup vs baseline:

[npu_grouped_gemm] name=moe_like_large group=36 batch=3256 k=4096 n=2560 dtype=bf16 trans_b=True
backend, fw_ms, bw_ms, total_ms
baseline, 11.976, 235.589, 247.565
npu_gmm_v2, 8.257, 19.858, 28.115
speedup_vs_baseline, fw=1.45x, bw=11.86x, total=8.81x
metric, close, max_abs_diff
forward, True, 0.000000
forward_bw_run, True, 0.000000
grad_a, True, 0.000000
grad_b, True, 0.000000
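
The speedup line in the log above follows directly from the timing rows. A short sketch of the arithmetic, using only the numbers reported in the log (the dictionary layout and the `speedup` helper are illustrative):

```python
# Timings in milliseconds, copied from the benchmark log above.
baseline = {"fw_ms": 11.976, "bw_ms": 235.589, "total_ms": 247.565}
npu_gmm_v2 = {"fw_ms": 8.257, "bw_ms": 19.858, "total_ms": 28.115}


def speedup(base: dict, new: dict) -> dict:
    """Speedup is baseline time divided by optimized time, per column."""
    return {key: round(base[key] / new[key], 2) for key in base}


result = speedup(baseline, npu_gmm_v2)
print(result)  # {'fw_ms': 1.45, 'bw_ms': 11.86, 'total_ms': 8.81}
```

Most of the total gain comes from the backward pass, which dominates the baseline's runtime.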

NPU MoE AllToAll Dispatcher speedup vs baseline:
[Figure: dispatcher_npu_seq_len_speedup_20260325]

@donpromax donpromax requested a review from a team March 27, 2026 07:58
@donpromax donpromax force-pushed the ascend_sft_patch branch 2 times, most recently from d6245e1 to 688c0e1 March 27, 2026 08:16

@Randomizez Randomizez left a comment

Collaborator


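I think there are quite a few places that could be simplified; please take a look. Also, try to register alternatives directly where possible; registering them inside the patch is a bit too implicit.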

Comment thread benchmarks/benchmark_dispatcher_npu.py Outdated
Comment thread benchmarks/benchmark_dispatcher_npu.py
Comment thread steptronoss/utils/npu_patch.py Outdated
Comment thread steptronoss/utils/npu_patch.py Outdated
Comment thread steptronoss/utils/npu_patch.py Outdated
Comment thread playground/sft/step3/npu/step3p5_flash_sft_step3_data_muon_npu.py

@Randomizez Randomizez left a comment

Collaborator


LGTM

@Randomizez Randomizez merged commit 21b0ecf into stepfun-ai:dev Mar 27, 2026
7 checks passed

2 participants