[Bug Fix]Bypass launch grids for SM120 Kernel with SM90 Mainloop & SM100 TileScheduler #2865

HydraQYH · 2025-12-09T11:48:34Z

Follow-up to #2719. Some SM120 kernels use the SM90 Cooperative/Pingpong mainloop together with the SM100 TileScheduler. The SM100 TileScheduler does not have an is_last_tile function, which causes compilation errors once the unnecessary conditional compilation is removed.
In this PR, if an SM120 blockscaled kernel is detected at compile time, the call to launch_dependent_grids is bypassed.
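
To make the failure mode concrete: a kernel template that calls is_last_tile unconditionally cannot be instantiated with a scheduler that lacks the member, while a compile-time guard discards the call for that configuration. The following is a minimal sketch of that idea, not the actual diff of this PR. All names in it (Sm90StyleScheduler, Sm100StyleScheduler, Sm120BlockScaledPolicy, is_sm120_blockscaled_v, count_launch_hints) are hypothetical stand-ins, and the launch hint is replaced by a counter.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-ins; none of these are the real CUTLASS types.
struct Sm90StyleScheduler {
  int num_tiles;
  bool is_last_tile(int tile_idx) const { return tile_idx == num_tiles - 1; }
};
struct Sm100StyleScheduler {};  // deliberately has no is_last_tile member

struct Sm90Policy {};
struct Sm120BlockScaledPolicy {};

// Hypothetical compile-time trait on the dispatch policy.
template <class Policy>
inline constexpr bool is_sm120_blockscaled_v =
    std::is_same_v<Policy, Sm120BlockScaledPolicy>;

// Counts how often the (stubbed) dependent-grid launch hint would fire.
template <class Policy, class Scheduler>
int count_launch_hints([[maybe_unused]] Scheduler const& scheduler, int num_tiles) {
  int hints = 0;
  for (int tile = 0; tile < num_tiles; ++tile) {
    if constexpr (!is_sm120_blockscaled_v<Policy>) {
      // This branch is discarded for the SM120 policy, so is_last_tile is
      // never looked up on a scheduler that does not provide it.
      if (scheduler.is_last_tile(tile)) {
        ++hints;  // stands in for cutlass::arch::launch_dependent_grids()
      }
    }
  }
  return hints;
}

// Small usage example covering both configurations.
inline void demo_bypass() {
  assert((count_launch_hints<Sm90Policy>(Sm90StyleScheduler{4}, 4) == 1));
  assert((count_launch_hints<Sm120BlockScaledPolicy>(Sm100StyleScheduler{}, 4) == 0));
}
```

In this sketch, removing the if constexpr guard and instantiating with Sm100StyleScheduler should fail to compile because the member does not exist, which mirrors the error described above.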

HydraQYH · 2025-12-09T11:51:23Z

@Junkai-Wu I believe the current solution is only temporary, and in the future we should add is_last_tile to the SM100 TileScheduler. cc @hwu36 @IonThruster @d-k-b

include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp

Junkai-Wu · 2025-12-10T05:29:40Z

Agreed. @HydraQYH Could you make the corresponding change in this PR? I think a trivial const function that simply returns false should be OK.
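
A minimal sketch of what such a stub could look like follows. The types and the signature are assumptions made for illustration (StubbedScheduler, WorkTileInfo, and should_hint_dependent_launch are hypothetical); the real is_last_tile signature in CUTLASS may take different parameters.

```cpp
#include <cassert>

// Hypothetical work-tile descriptor, not the real CUTLASS type.
struct WorkTileInfo {
  int tile_idx = 0;
};

// Hypothetical scheduler with a trivial stub.
struct StubbedScheduler {
  // Never reports a last tile, so generic callers compile but never take
  // their "last tile" branch with this scheduler.
  constexpr bool is_last_tile(WorkTileInfo const&) const { return false; }
};

// Generic caller that needs is_last_tile to exist on the scheduler.
template <class Scheduler>
bool should_hint_dependent_launch(Scheduler const& scheduler,
                                  WorkTileInfo const& work_tile_info) {
  return scheduler.is_last_tile(work_tile_info);
}

// Small usage example.
inline void demo_stub() {
  assert(!should_hint_dependent_launch(StubbedScheduler{}, WorkTileInfo{7}));
}
```

A stub like this only makes the generic code compile; whether calling it is safe at run time for a given kernel is a separate question.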

HydraQYH · 2025-12-10T14:35:23Z

@Junkai-Wu Thank you for your suggestion, I will do it as soon as possible.

HydraQYH · 2025-12-10T14:46:54Z

@Junkai-Wu I rebased the code and added is_last_tile to the SM100 TileScheduler. I tested with the following commands:

cmake .. -DCUTLASS_NVCC_ARCHS=120a -DCUTLASS_BUILD_FOR_PROFILER_REGRESSIONS=ON
make VERBOSE=1 cutlass_profiler -j16

There are no compilation errors anymore, and this PR is ready for review.

include/cutlass/gemm/kernel/sm100_tile_scheduler.hpp

include/cutlass/gemm/kernel/sm100_tile_scheduler_stream_k.hpp

Junkai-Wu · 2025-12-11T00:51:37Z

@HydraQYH The changes look much cleaner now. Thanks for the quick action. I left some minor comments above.

Junkai-Wu · 2025-12-12T08:14:13Z

@HydraQYH I ran the internal pipeline with these changes and hit a timeout in this unit test: https://github.com/NVIDIA/cutlass/blob/main/test/unit/gemm/device/sm120_tensorop_gemm/CMakeLists.txt#L51

The issue disappeared when I added the macro back on the is_last_tile function. I've asked the responsible developer to help identify the issue. Feel free to investigate it as well if you can.

HydraQYH · 2025-12-13T15:37:12Z

@Junkai-Wu Sorry, I don't have an SM120 device, so all I can do is compile. However, I noticed that removing this macro in Pingpong sometimes causes the program to return immediately:

cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp

Lines 803 to 814 in d4e16f5

    
    #ifdef CUTLASS_ENABLE_GDC_FOR_SM90
    // It is possible to have work tiles start off invalid,
    // so we have to check that first.
    if (not work_tile_info.is_valid()) {
      // Hint on an early release of global memory resources.
      // The timing of calling this function only influences performance,
      // not functional correctness.
      cutlass::arch::launch_dependent_grids();
      return;
    }
    #endif

I think this aggressive early exit may be risky and could leave some semaphores unreleased. The other changes, by contrast, do not affect the program's execution. I've reverted the changes to the early-stop part. Could you try this?

Junkai-Wu · 2025-12-15T01:59:57Z

@HydraQYH After investigation, the timeout is caused by the sm120 kernel calling the sm90 scheduler. The sm120 kernel did not need to call is_last_tile before; after the removal of the macro it does call this function, which may cause a scheduling issue. Reverting to the previous implementation that uses IsBlockScaled, but with a proper name, would probably be the safe option.

HydraQYH · 2025-12-15T06:23:13Z

@Junkai-Wu Okay. I rebased the code and reverted to the previous implementation. I renamed is_blockscaled to IsBlockScaledDispatchPolicy, which I believe is reasonable because the template parameter is just a DispatchPolicy. I also fixed the typo in Cooperative. It's ready for review.

Junkai-Wu · 2025-12-16T09:29:25Z

@HydraQYH The internal pipeline still fails. After checking, I found that the is_last_tile call must be skipped for all sm120 kernels, not just the sm120 blockscaled ones. I refactored the change and reran the internal pipeline. I'll leave comments in this PR after the pipeline passes.

include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp

include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp

Junkai-Wu · 2025-12-17T09:36:42Z

include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp

    else if (warp_group_role == WarpGroupRole::Consumer0 || warp_group_role == WarpGroupRole::Consumer1) {
      cutlass::arch::warpgroup_reg_alloc<MmaRegisterRequirement>();

-      #ifdef CUTLASS_ENABLE_GDC_FOR_SM90


Add if constexpr (!IsSm120Family) condition here.
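
As a rough illustration of this suggestion, the early-stop path can be wrapped in a compile-time condition so that SM120-family kernels never take it. The sketch below is hypothetical: WorkTile, Exit, and consumer_prologue are invented names, IsSm120Family is passed as a plain bool template parameter because the thread does not show how it is derived, and the launch hint is only mentioned in a comment.

```cpp
#include <cassert>

// Hypothetical work-tile descriptor, not the real CUTLASS type.
struct WorkTile {
  bool valid;
  bool is_valid() const { return valid; }
};

enum class Exit { EarlyWithHint, ContinuedToMainLoop };

// Models the start of the consumer warp-group path.
template <bool IsSm120Family>
Exit consumer_prologue([[maybe_unused]] WorkTile const& work_tile_info) {
  if constexpr (!IsSm120Family) {
    // Work tiles can start off invalid, so check that first.
    if (!work_tile_info.is_valid()) {
      // The real kernel would call cutlass::arch::launch_dependent_grids()
      // here before returning.
      return Exit::EarlyWithHint;
    }
  }
  // SM120-family kernels always fall through to the regular path.
  return Exit::ContinuedToMainLoop;
}

// Small usage example covering the three relevant cases.
inline void demo_early_stop() {
  assert(consumer_prologue<false>(WorkTile{false}) == Exit::EarlyWithHint);
  assert(consumer_prologue<false>(WorkTile{true}) == Exit::ContinuedToMainLoop);
  assert(consumer_prologue<true>(WorkTile{false}) == Exit::ContinuedToMainLoop);
}
```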

include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp

Junkai-Wu · 2025-12-17T09:39:24Z

@HydraQYH The internal pipeline passes. I left suggested implementations in the review comments.

HydraQYH · 2025-12-17T15:19:33Z

@Junkai-Wu Thank you very much for your help. I have applied all of your suggestions.

HydraQYH mentioned this pull request Dec 9, 2025

Support PDL for SM90 Array TMA GEMM #2719

Merged

Junkai-Wu reviewed Dec 10, 2025

include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp

Junkai-Wu reviewed Dec 10, 2025

include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp

HydraQYH force-pushed the dev_fix_sm120_with_pdl_support branch from 42d17b3 to 4ece099 on December 10, 2025 14:43

Junkai-Wu reviewed Dec 11, 2025

include/cutlass/gemm/kernel/sm100_tile_scheduler.hpp

Junkai-Wu reviewed Dec 11, 2025

include/cutlass/gemm/kernel/sm100_tile_scheduler_stream_k.hpp

HydraQYH added 5 commits December 15, 2025 12:45

Delete unused #ifdef/#endif. Bypass sm120 case.

d969d79

Add todo.

246cb42

Fix pingpong.

5b6f529

Revert "Add todo."

7e25b06

This reverts commit 246cb42.

Refine name.

ae1f832

Refine name again.

HydraQYH force-pushed the dev_fix_sm120_with_pdl_support branch from 67a08ba to ae1f832 on December 15, 2025 06:13

Junkai-Wu reviewed Dec 17, 2025

HydraQYH and others added 3 commits December 17, 2025 22:35

Apply suggestions from code review

ed5c2e2

Skip `is_last_tile` for all sm120 kernels. Co-authored-by: Junkai-Wu

Skip early stop for sm120 kernel.

1ef69c3

Fix typo.

e9c4b30

Junkai-Wu approved these changes Dec 18, 2025

Junkai-Wu merged commit ebf3165 into NVIDIA:main Dec 18, 2025
