
option to limit number of SMs #26

Open
mingruimingrui wants to merge 2 commits into Dao-AILab:main from mingruimingrui:feat/limit-sms

Conversation

@mingruimingrui
Contributor

@mingruimingrui mingruimingrui commented Feb 1, 2026

This MR adds a feature to limit the number of SMs that sonic-moe uses. It is useful to avoid oversubscribing the GPU when running sonic-moe concurrently with other persistent kernels such as DeepEP.
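For intuition, capping the SM count for a persistent kernel usually amounts to capping the launch grid size (one CTA per SM). A minimal sketch of the idea; the function name and parameters here are illustrative, not sonic-moe's actual API:

```python
def persistent_grid_size(num_tiles, device_sm_count, max_sms=None):
    """Grid size for a persistent kernel: one CTA per SM, optionally
    capped so concurrently running kernels keep some SMs free.

    Hypothetical helper for illustration only.
    """
    sms = device_sm_count if max_sms is None else min(max_sms, device_sm_count)
    # Never launch more CTAs than there are tiles of work.
    return min(sms, num_tiles)

# E.g. with 132 SMs (H100) and a cap of 100, a large launch uses
# 100 CTAs, leaving 32 SMs for other kernels such as DeepEP.
```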

@mingruimingrui mingruimingrui changed the title from [Feat] limit number of SMs to option to limit number of SMs Feb 1, 2026
@tridao
Member

tridao commented Feb 1, 2026

This might help for some of the kernels; some other kernels (varlen_k in the backward) use a dynamic persistent scheduler that won't oversubscribe.

I think in the case of other kernels running concurrently we can also just set dynamic_persistent=True instead of limiting the number of SMs (which requires tuning). I haven't tested this though.

@mingruimingrui
Contributor Author

mingruimingrui commented Feb 1, 2026

Dynamic scheduling is excellent; it would make things far more convenient.
I was actually looking for this exact feature so that it's not necessary to "reserve SMs" all the time.

But I have a noob question... I'm kinda new to CUTLASS. On Blackwell there's the CLC (cluster launch control) to do this:
https://github.com/NVIDIA/cutlass/blob/main/examples/python/CuTeDSL/blackwell/dense_gemm_persistent_dynamic.py

I've tried to do some research, but I wasn't able to quite figure out how to do this on Hopper.
Could you help point me in the right direction?

@tridao
Member

tridao commented Feb 2, 2026

For Hopper: we maintain a counter / semaphore in gmem, initialized to zero. Whenever a cluster finishes its work, it atomically increments the counter to get the index of the next piece of work it should do, until there's no more work to do.
https://github.com/Dao-AILab/quack/blob/9e774f8225fa890aa0654a9a2350991407b9f2c2/quack/tile_scheduler.py#L263
https://github.com/Dao-AILab/quack/blob/9e774f8225fa890aa0654a9a2350991407b9f2c2/quack/tile_scheduler.py#L296
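The scheme described above can be modeled on the host side as a sketch: a shared counter that every worker atomically increments to claim its next tile, stopping once the counter passes the total tile count. This is a simulation of the scheduling logic, not the actual CuTe DSL kernel code (a lock stands in for the gmem atomicAdd):

```python
import threading

def run_dynamic_persistent(num_tiles, num_workers):
    """Simulate Hopper-style dynamic persistent scheduling: each worker
    atomically grabs the next tile index from a shared counter until all
    work is exhausted. Illustrative only."""
    counter = 0
    lock = threading.Lock()           # stands in for a gmem atomicAdd
    assignments = [[] for _ in range(num_workers)]

    def worker(wid):
        nonlocal counter
        while True:
            with lock:                # atomic fetch-and-increment
                tile = counter
                counter += 1
            if tile >= num_tiles:     # no more work to do
                return
            assignments[wid].append(tile)

    threads = [threading.Thread(target=worker, args=(w,))
               for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return assignments
```

Because workers only claim work as they finish, a slow (or partially occupied) worker naturally receives fewer tiles, which is the load-balancing property that avoids oversubscription.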

@mingruimingrui
Contributor Author

I had misunderstood the overallocation problem.
The intuition is actually really simple...

When some SMs are already occupied, if we still schedule assuming that all SMs are available, some SMs will receive double the work of others.

Currently, the scheduling is pre-calculated so that each persistent worker receives exactly one piece of work.
So to implement a dynamic persistent kernel properly, I will have to split the work into multiple smaller parts (and maybe also modify the prologue/epilogue). This will also have to be benchmarked against the existing implementation. I hope my interpretation is correct.

Mind if I shelve this MR until the weekend, when I'd have more time to play with this?
Or perhaps I could reference wentao's SM100 implementation.

@tridao
Member

tridao commented Feb 3, 2026

I think dynamic_persistent (already implemented) should just work for this case.

@mingruimingrui
Copy link
Contributor Author

mingruimingrui commented Feb 3, 2026

Thank you @tridao for the advice. I've managed to enable dynamic_persistent mode on my other branch: https://github.com/mingruimingrui/sonic-moe/tree/feat/enable-dynamic-scheduling
However, I realize the perfect solution might not be as simple as I had hoped...

Using dynamic persistent kernels, I noticed they can sometimes block the timely execution of other kernels on higher-priority streams. Let me show you what I mean.

When launching communication kernels alongside sonic-moe, ideally we want the communication kernel to execute ASAP.

However, with dynamic persistent kernels, the launch can greedily occupy all available SMs, leading to late execution of other kernels.
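A toy timeline model of the effect: once a persistent kernel occupies every SM, a later-arriving kernel (even on a higher-priority stream) cannot start until the persistent kernel drains, whereas reserving a few SMs lets it start on arrival. Everything here is an idealized model for illustration, not a profiler trace:

```python
def comm_start_time(persistent_work, total_sms, reserved_sms, arrival):
    """When can a communication kernel arriving at time `arrival` start?
    A persistent kernel holds (total_sms - reserved_sms) SMs for its
    whole lifetime; SMs cannot be preempted mid-kernel."""
    used = total_sms - reserved_sms
    finish = persistent_work / used       # persistent kernel runtime
    if reserved_sms > 0:
        return arrival                    # free SMs: starts on time
    return max(arrival, finish)           # must wait for SMs to drain

# All 100 SMs taken: a comm kernel arriving at t=1.0 waits until t=10.0.
# Reserving 20 SMs: it starts at t=1.0, at the cost of a slower GEMM.
```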

@tridao
Member

tridao commented Feb 4, 2026

I see. In this case I don't think even CLC can help (I'm not certain, as I haven't tested). The gemm kernel is launched first and will run to completion; it cannot be preempted (since it launches on all SMs)?
I guess one would have to limit the number of SMs, or use CUDA green contexts (which seem equivalent to limiting the number of SMs?):
https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/green-contexts.html
