Conversation
|
This might help for some of the kernels. Some other kernels (varlen_k in the backward) use a dynamic persistent scheduler that won't oversubscribe. I think in the case of other kernels running concurrently we can also just set dynamic_persistent=True instead of limiting the number of SMs (which requires tuning). I haven't tested this though. |
|
Dynamic scheduling is excellent, it would make things far more convenient. But I have a noob question... I'm kinda new to CUTLASS, and on Blackwell there's the CLC to do this. I've tried to do some research, but I wasn't able to quite figure out how to do this on Hopper. |
|
For Hopper: we maintain a counter / semaphore in gmem, initialized to zero. Whenever a cluster finishes its work, it atomically increments the counter and gets the index of the next work item it should be doing, until there's no more work to do. |
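To make the counter scheme above concrete, here is a minimal CPU-side sketch in Python. It is not the kernel code: threads stand in for clusters, a lock-protected integer stands in for the gmem counter and its atomic add, and the names `WorkCounter` and `run_persistent` are hypothetical. It also assumes every worker fetches its first tile from the counter, which a real kernel may do differently.

```python
import threading


class WorkCounter:
    """Stand-in for the counter in global memory; the lock emulates an atomic add."""

    def __init__(self):
        self._value = 0  # initialized to zero, as described above
        self._lock = threading.Lock()

    def fetch_add(self, n=1):
        # Return the old value and advance the counter in one indivisible step.
        with self._lock:
            old = self._value
            self._value += n
            return old


def run_persistent(num_tiles, num_workers):
    """Each worker (a stand-in for a cluster) pulls tile indices until none remain."""
    counter = WorkCounter()
    done = [[] for _ in range(num_workers)]

    def worker(wid):
        while True:
            tile = counter.fetch_add(1)  # claim the next work index
            if tile >= num_tiles:  # no more work to do
                break
            done[wid].append(tile)  # placeholder for the real tile computation

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done


if __name__ == "__main__":
    per_worker = run_persistent(num_tiles=37, num_workers=4)
    for wid, tiles in enumerate(per_worker):
        print(f"worker {wid} processed {len(tiles)} tiles")
```

Because each index is handed out exactly once, every tile is processed by exactly one worker, and faster workers naturally end up taking more tiles.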
|
I think |
|
Thank you @tridao for the advice. I've managed to enable dynamic_persistent mode on my other branch: https://github.com/mingruimingrui/sonic-moe/tree/feat/enable-dynamic-scheduling. With dynamic persistent kernels, I noticed they can sometimes block the timely execution of other kernels on higher-priority streams. Let me show you what I mean. When launching communication kernels alongside sonic-moe, ideally we want the communication kernel to execute ASAP. However, a dynamic persistent kernel can greedily occupy all available SMs, leading to late execution of other kernels. |
|
I see. In this case I don't think even CLC can help (I'm not certain as I haven't tested). The gemm kernel is launched first and will run to completion and cannot be preempted (as it launches on all SMs)? |



This MR provides a feature to limit the number of SMs that sonic-moe uses. It is useful for avoiding oversubscribing the GPU when you run sonic-moe concurrently with other persistent kernels like DeepEP.
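
As a rough illustration of what an SM limit means for a persistent kernel, the sketch below computes how many CTAs to launch under an SM budget. It assumes one persistent CTA per SM and a launch in whole clusters. The function name `persistent_num_ctas` and its parameters are hypothetical and are not taken from this MR or from sonic-moe.

```python
def persistent_num_ctas(num_work_units, sm_count, max_sms=None, cluster_size=1):
    """Number of CTAs to launch for a persistent kernel under an SM budget.

    num_work_units: number of cluster-level work units (tiles) to process.
    sm_count:       SMs available on the device.
    max_sms:        optional user cap on SMs (hypothetical knob); None means no cap.
    cluster_size:   CTAs per cluster; the launch must be a whole number of clusters.
    """
    if max_sms is not None and max_sms <= 0:
        raise ValueError("max_sms must be positive")
    budget = sm_count if max_sms is None else min(max_sms, sm_count)
    max_clusters = budget // cluster_size  # round down to whole clusters
    if max_clusters == 0:
        raise ValueError("SM budget is smaller than one cluster")
    # Never launch more clusters than there are work units.
    num_clusters = min(max_clusters, num_work_units)
    return num_clusters * cluster_size


if __name__ == "__main__":
    # 132 is used only as an example device SM count.
    print(persistent_num_ctas(1000, sm_count=132))
    print(persistent_num_ctas(1000, sm_count=132, max_sms=100, cluster_size=2))
```

Under these assumptions, the SMs left outside the budget stay free for other concurrently running kernels, at the cost of having to tune the cap.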