
option to limit number of SMs #26

Open
mingruimingrui wants to merge 2 commits into Dao-AILab:main from mingruimingrui:feat/limit-sms

Conversation

@mingruimingrui
Contributor

@mingruimingrui mingruimingrui commented Feb 1, 2026

This MR adds a feature to limit the number of SMs that sonic-moe uses. It is useful to avoid oversubscribing the GPU when running sonic-moe concurrently with other persistent kernels such as DeepEP.
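For intuition, capping the SM count for a persistent kernel usually amounts to capping the launch grid size (one CTA per SM). A minimal sketch of the idea; the function name and parameters here are illustrative, not sonic-moe's actual API:

```python
def persistent_grid_size(num_tiles, device_sm_count, max_sms=None):
    """Grid size for a persistent kernel: one CTA per SM, optionally
    capped so concurrently running kernels keep some SMs free.

    Hypothetical helper for illustration only.
    """
    sms = device_sm_count if max_sms is None else min(max_sms, device_sm_count)
    # Never launch more CTAs than there are tiles of work.
    return min(sms, num_tiles)

# E.g. with 132 SMs (H100) and a cap of 100, a large launch uses
# 100 CTAs, leaving 32 SMs for other kernels such as DeepEP.
```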

@mingruimingrui mingruimingrui changed the title from [Feat] limit number of SMs to option to limit number of SMs Feb 1, 2026
@tridao
Member

tridao commented Feb 1, 2026

This might help for some of the kernels; some other kernels (varlen_k in the backward) use a dynamic persistent scheduler that won't oversubscribe.

I think in the case of other kernels running concurrently we can also just set dynamic_persistent=True instead of limiting the number of SMs (which requires tuning). I haven't tested this though.

@mingruimingrui
Contributor Author

mingruimingrui commented Feb 1, 2026

Dynamic scheduling is excellent; it would make things far more convenient.
I was actually looking for this exact feature so that it's not necessary to "reserve SMs" all the time.

But I have a noob question... I'm kinda new to CUTLASS. On Blackwell there's the CLC (cluster launch control) to do this:
https://github.com/NVIDIA/cutlass/blob/main/examples/python/CuTeDSL/blackwell/dense_gemm_persistent_dynamic.py

I've tried to do some research, but I wasn't able to quite figure out how to do this on Hopper.
Could you help point me in the right direction?

@tridao
Member

tridao commented Feb 2, 2026

For Hopper: we maintain a counter / semaphore in gmem, initialized to zero. Whenever a cluster finishes its work, it atomically increments the counter to get the index of the next piece of work it should do, until there's no more work to do.
https://github.com/Dao-AILab/quack/blob/9e774f8225fa890aa0654a9a2350991407b9f2c2/quack/tile_scheduler.py#L263
https://github.com/Dao-AILab/quack/blob/9e774f8225fa890aa0654a9a2350991407b9f2c2/quack/tile_scheduler.py#L296
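The scheme described above can be modeled on the host side as a sketch: a shared counter that every worker atomically increments to claim its next tile, stopping once the counter passes the total tile count. This is a simulation of the scheduling logic, not the actual CuTe DSL kernel code (a lock stands in for the gmem atomicAdd):

```python
import threading

def run_dynamic_persistent(num_tiles, num_workers):
    """Simulate Hopper-style dynamic persistent scheduling: each worker
    atomically grabs the next tile index from a shared counter until all
    work is exhausted. Illustrative only."""
    counter = 0
    lock = threading.Lock()           # stands in for a gmem atomicAdd
    assignments = [[] for _ in range(num_workers)]

    def worker(wid):
        nonlocal counter
        while True:
            with lock:                # atomic fetch-and-increment
                tile = counter
                counter += 1
            if tile >= num_tiles:     # no more work to do
                return
            assignments[wid].append(tile)

    threads = [threading.Thread(target=worker, args=(w,))
               for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return assignments
```

Because workers only claim work as they finish, a slow (or partially occupied) worker naturally receives fewer tiles, which is the load-balancing property that avoids oversubscription.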

@mingruimingrui
Contributor Author

I had misunderstood the overallocation problem.
The intuition is actually really simple...

When some SMs are already occupied, if we still schedule assuming that all SMs are available, some SMs will receive double the work of others.

Currently, the scheduling is pre-calculated so that each persistent worker receives exactly one piece of work.
So to implement a dynamic persistent kernel properly, I will have to split the work into multiple smaller parts (and maybe also modify the prologue/epilogue). This will also have to be benchmarked against the existing implementation. I hope my interpretation is correct.

Mind if I shelve this MR until the weekend, when I'd have more time to play with this?
Or perhaps I could reference wentao's SM100 implementation.

@tridao
Member

tridao commented Feb 3, 2026

I think dynamic_persistent (already implemented) should just work for this case.

@mingruimingrui
Copy link
Contributor Author

mingruimingrui commented Feb 3, 2026

Thank you @tridao for the advice. I've managed to enable dynamic_persistent mode on my other branch: https://github.com/mingruimingrui/sonic-moe/tree/feat/enable-dynamic-scheduling
However, I realize the perfect solution might not be as simple as I had hoped...

Using dynamic persistent kernels, I noticed they can sometimes block the timely execution of other kernels on higher-priority streams. Let me show you what I mean.

When launching communication kernels alongside sonic-moe, ideally we want the communication kernel to execute ASAP.

However, with dynamic persistent kernels, the launch can greedily occupy all available SMs, leading to late execution of other kernels.
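A toy timeline model of the effect: once a persistent kernel occupies every SM, a later-arriving kernel (even on a higher-priority stream) cannot start until the persistent kernel drains, whereas reserving a few SMs lets it start on arrival. Everything here is an idealized model for illustration, not a profiler trace:

```python
def comm_start_time(persistent_work, total_sms, reserved_sms, arrival):
    """When can a communication kernel arriving at time `arrival` start?
    A persistent kernel holds (total_sms - reserved_sms) SMs for its
    whole lifetime; SMs cannot be preempted mid-kernel."""
    used = total_sms - reserved_sms
    finish = persistent_work / used       # persistent kernel runtime
    if reserved_sms > 0:
        return arrival                    # free SMs: starts on time
    return max(arrival, finish)           # must wait for SMs to drain

# All 100 SMs taken: a comm kernel arriving at t=1.0 waits until t=10.0.
# Reserving 20 SMs: it starts at t=1.0, at the cost of a slower GEMM.
```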

@tridao
Member

tridao commented Feb 4, 2026

I see. In this case I don't think even CLC can help (I'm not certain, as I haven't tested). The gemm kernel is launched first and will run to completion; it cannot be preempted (since it launches on all SMs)?
I guess one would have to limit the number of SMs, or use CUDA green contexts (which seem equivalent to limiting the number of SMs?):
https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/green-contexts.html
