feat: support ascend npu sft by donpromax · Pull Request #43 · stepfun-ai/SteptronOss

donpromax · 2026-03-27T07:58:33Z

Summary

This PR adds Ascend/NPU support for SFT workloads in StepTronOSS.

It introduces an NPU runtime patch entrypoint, NPU-specific optimization backends for MoE dispatch and grouped GEMM, Ascend SFT experiment entrypoints, benchmark scripts, tests, and user-facing documentation.

Main Changes

add manual NPU runtime patching through steptronoss.utils.npu_patch.apply_npu_patch()
add grouped_gemm="npu_gmm" backed by MindSpeed npu_gmm_v2
add TokenDispatcher="npu_alltoall" for EP token routing on NPU
patch several CUDA-specific helper paths so they work on the NPU runtime
register NPU backends when apply_npu_patch() is called
add Ascend SFT experiment entrypoints for:
- Qwen3-1.7B
- Qwen3-30A3B
- Step3 toy SFT
- Step3.5-Flash Midtrain SFT with Muon
add a real-data preparation script for Qwen3 SFT smoke and validation runs
add NPU benchmarks and test cases for dispatcher and grouped GEMM
add user documentation:
- docs/ASCEND.md
- docs/ASCEND_ZH.md
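
As a rough illustration of what selecting a backend by string key (such as grouped_gemm="npu_gmm") implies, here is a minimal, framework-free sketch of a backend registry. The names `BACKENDS`, `register_backend` and `resolve_backend` are hypothetical and are not taken from StepTronOSS; only the keys `npu_gmm` and `npu_alltoall` come from this PR, and the factories return placeholder strings rather than real kernels.

```python
from typing import Callable, Dict, Tuple

# Hypothetical registry: maps (component kind, backend name) to a factory.
BACKENDS: Dict[Tuple[str, str], Callable[[], object]] = {}


def register_backend(kind: str, name: str):
    """Decorator that registers a factory under (kind, name)."""
    def decorator(factory):
        key = (kind, name)
        if key in BACKENDS:
            raise ValueError(f"backend already registered: {key}")
        BACKENDS[key] = factory
        return factory
    return decorator


def resolve_backend(kind: str, name: str):
    """Look up a backend and fail loudly if it was never registered."""
    try:
        factory = BACKENDS[(kind, name)]
    except KeyError:
        known = sorted(n for k, n in BACKENDS if k == kind)
        raise KeyError(f"unknown {kind} backend {name!r}; known: {known}") from None
    return factory()


@register_backend("grouped_gemm", "npu_gmm")
def _make_npu_gmm():
    return "npu_gmm placeholder"  # stand-in for a real kernel wrapper


@register_backend("token_dispatcher", "npu_alltoall")
def _make_npu_alltoall():
    return "npu_alltoall placeholder"  # stand-in for a real dispatcher
```

In this sketch registration runs when the module is imported. The same `register_backend` calls could equally be made from inside a patch function, at the cost of the key only existing after that function has run.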

Implementation Notes

Ascend patching is now explicitly enabled by calling apply_npu_patch() before importing modules that depend on NPU runtime behavior or before selecting NPU backends
npu_alltoall builds a unique token-rank routing layout before communication to avoid duplicated routing to the same remote rank
the fast path uses MindSpeed npu_moe_token_permute / npu_moe_token_unpermute on the NPU bf16 path
the dispatcher keeps an index_select / index_add fallback path for compatibility
npu_gmm keeps the same semantic interface as the existing grouped_gemm path and normalizes batch_sizes internally
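
The unique token-rank routing idea can be sketched in plain Python. This is an assumption-laden toy, not the dispatcher from this PR: the function name `build_unique_token_rank_layout` is hypothetical, experts are assumed to be placed contiguously (expert `e` lives on rank `e // experts_per_rank`), and lists stand in for tensors. It shows that when several of a token's top-k experts sit on the same rank, the token only needs to be sent to that rank once.

```python
from typing import Dict, List, Set, Tuple


def build_unique_token_rank_layout(
    topk_experts: List[List[int]], experts_per_rank: int
) -> Tuple[Dict[int, List[int]], int, int]:
    """Return rank -> sorted token ids, plus naive and deduplicated send counts."""
    tokens_for_rank: Dict[int, Set[int]] = {}
    naive_sends = 0
    for token_id, experts in enumerate(topk_experts):
        for expert_id in experts:
            # Assumption: contiguous expert placement across ranks.
            rank = expert_id // experts_per_rank
            tokens_for_rank.setdefault(rank, set()).add(token_id)
            naive_sends += 1  # one send per (token, expert) pair
    layout = {rank: sorted(tokens) for rank, tokens in sorted(tokens_for_rank.items())}
    unique_sends = sum(len(tokens) for tokens in layout.values())
    return layout, naive_sends, unique_sends


# 3 tokens, top-2 routing, 4 experts split across 2 ranks (2 experts per rank).
topk = [[0, 1], [1, 2], [2, 3]]
layout, naive, unique = build_unique_token_rank_layout(topk, experts_per_rank=2)
print(layout, naive, unique)  # {0: [0, 1], 1: [1, 2]} 6 4
```

Here token 0 routes to experts 0 and 1, both on rank 0, so it is sent there once instead of twice. The receiving rank would still need the per-expert assignment to run the right expert on each token, which this sketch leaves out.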

Tested Environment

Validated with:

CANN 8.3.RC2
torch_npu 2.8
MindSpeed v2.2.0_core_r0.12.1

All other dependencies are kept aligned with the project's uv environment.

Validation

added dispatcher unit tests in tests/test_npu_alltoall_dispatcher.py
added grouped GEMM correctness test in tests/test_grouped_gemm_npu.py
added microbenchmarks:
- benchmarks/benchmark_dispatcher_npu.py
- benchmarks/benchmark_grouped_gemm_npu.py
ran end-to-end SFT validation on representative Qwen3 and Step3.5 configs
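
For readers unfamiliar with what a grouped GEMM correctness check compares against, the following is a minimal pure-Python reference. It is not the code in tests/test_grouped_gemm_npu.py; the shape convention (`a` is [sum(batch_sizes), k], `b` is one [k, n] matrix per group, no trans_b) and all function names are assumptions made for illustration.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matmul: a is [m, k], b is [k, n]."""
    n = len(b[0])
    return [[sum(row[i] * b[i][j] for i in range(len(b))) for j in range(n)] for row in a]


def grouped_gemm_reference(a: Matrix, b: List[Matrix], batch_sizes: List[int]) -> Matrix:
    """Multiply consecutive row blocks of `a` by the matching matrix in `b`."""
    if len(batch_sizes) != len(b):
        raise ValueError("need one batch size per group")
    if sum(batch_sizes) != len(a):
        raise ValueError("batch_sizes must sum to the number of rows in a")
    out: Matrix = []
    start = 0
    for size, weight in zip(batch_sizes, b):
        out.extend(matmul(a[start:start + size], weight))
        start += size
    return out


def allclose(x: Matrix, y: Matrix, atol: float = 1e-6) -> bool:
    """Element-wise comparison with an absolute tolerance."""
    return len(x) == len(y) and all(
        len(rx) == len(ry) and all(abs(p - q) <= atol for p, q in zip(rx, ry))
        for rx, ry in zip(x, y)
    )


a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
b = [
    [[1.0, 0.0], [0.0, 1.0]],  # group 0: identity
    [[2.0, 0.0], [0.0, 2.0]],  # group 1: scale by 2
]
result = grouped_gemm_reference(a, b, batch_sizes=[1, 2])
print(result)  # [[1.0, 2.0], [6.0, 8.0], [10.0, 12.0]]
```

A test in this style would run the optimized backend on the same inputs and compare its output, and its gradients, against the reference with a tolerance suited to the dtype.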

Training Results

Qwen3-1.7B SFT smoke test (compared to L40S):

Step-3.5-Flash-Base-Midtrain SFT: tp8pp8vpp3ep8 seq_len=8192 with recompute, sequence_parallel and offload_optimizer_state

Kernel Optimization

NPU grouped GEMM speedup vs baseline:

[npu_grouped_gemm] name=moe_like_large group=36 batch=3256 k=4096 n=2560 dtype=bf16 trans_b=True
backend, fw_ms, bw_ms, total_ms
baseline, 11.976, 235.589, 247.565
npu_gmm_v2, 8.257, 19.858, 28.115
speedup_vs_baseline, fw=1.45x, bw=11.86x, total=8.81x
metric, close, max_abs_diff
forward, True, 0.000000
forward_bw_run, True, 0.000000
grad_a, True, 0.000000
grad_b, True, 0.000000
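
The speedup row can be reproduced from the timing rows. The short check below assumes that speedup is defined as baseline time divided by npu_gmm_v2 time and that total_ms is fw_ms plus bw_ms, which matches the figures printed above.

```python
# Timings in milliseconds, copied from the benchmark output above.
baseline = {"fw": 11.976, "bw": 235.589}
npu_gmm_v2 = {"fw": 8.257, "bw": 19.858}


def speedups(base: dict, fast: dict) -> dict:
    """Return per-phase and total speedup as base_time / fast_time."""
    out = {phase: base[phase] / fast[phase] for phase in base}
    out["total"] = sum(base.values()) / sum(fast.values())
    return out


ratios = speedups(baseline, npu_gmm_v2)
summary = ", ".join(f"{name}={value:.2f}x" for name, value in ratios.items())
print(summary)  # fw=1.45x, bw=11.86x, total=8.81x
```

Most of the total gain comes from the backward pass, which drops from 235.589 ms to 19.858 ms, while the forward pass improves by a more modest 1.45x.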

NPU MoE AllToAll Dispatcher speedup vs baseline:

Randomizez

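I feel there are many places that could be simplified; please take a look. Also, alternatives should be registered directly wherever possible; registering them inside the patch is a bit too implicit.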

Randomizez

LGTM

donpromax requested a review from a team March 27, 2026 07:58

donpromax force-pushed the ascend_sft_patch branch 2 times, most recently from d6245e1 to 688c0e1 Compare March 27, 2026 08:16

feat: support ascend npu sft

606636d

donpromax force-pushed the ascend_sft_patch branch from 688c0e1 to 606636d Compare March 27, 2026 08:27

Randomizez reviewed Mar 27, 2026

lvdong added 4 commits March 27, 2026 17:41

feat(npu): add native flash-attn, alltoall dispatcher and grouped-gemm optimizations

3917222

feat: raise exception when npu not available

d6df66a

feat(ascend): simplify npu playground code

5cd559d

docs(ascend): clarify native NPU alternative registration and patch scope

2c8b285

Randomizez approved these changes Mar 27, 2026

Randomizez merged commit 21b0ecf into stepfun-ai:dev Mar 27, 2026
7 checks passed

Randomizez merged 5 commits into stepfun-ai:dev from donpromax:ascend_sft_patch
