Skip to content

Conversation

@HydraQYH
Copy link
Contributor

@HydraQYH HydraQYH commented Dec 9, 2025

Follow: #2719. Some SM120 kernels will use SM90 Cooperative/Pingpong Mainloop with SM100 TileScheduler. SM100 TileScheduler do not have is_last_tile function. This can cause compilation errors when we remove unnecessary conditional compilation.
In this PR, If SM120 blockscaled is detected at compile time, then bypass launch_dependent_grids.

@HydraQYH
Copy link
Contributor Author

HydraQYH commented Dec 9, 2025

@Junkai-Wu I believe the current solution is only temporary, and in the future we should add is_last_tile to the SM100 TileScheduler. cc @hwu36 @IonThruster @d-k-b

@Junkai-Wu
Copy link
Collaborator

@Junkai-Wu I believe the current solution is only temporary, and in the future we should add is_last_tile to the SM100 TileScheduler. cc @hwu36 @IonThruster @d-k-b

Agreed. @HydraQYH Do you think if you can do the corresponding changes in this PR? I think just an empty const function with false return should be OK.

@HydraQYH
Copy link
Contributor Author

@Junkai-Wu I believe the current solution is only temporary, and in the future we should add is_last_tile to the SM100 TileScheduler. cc @hwu36 @IonThruster @d-k-b

Agreed. @HydraQYH Do you think if you can do the corresponding changes in this PR? I think just an empty const function with false return should be OK.

@Junkai-Wu Thank you for your suggestion, I will do it as soon as possible.

@HydraQYH HydraQYH force-pushed the dev_fix_sm120_with_pdl_support branch from 42d17b3 to 4ece099 Compare December 10, 2025 14:43
@HydraQYH
Copy link
Contributor Author

HydraQYH commented Dec 10, 2025

@Junkai-Wu I rebase code and just add is_last_tile for SM100 TileScheduler. After testing with following command:

cmake .. -DCUTLASS_NVCC_ARCHS=120a -DCUTLASS_BUILD_FOR_PROFILER_REGRESSIONS=ON
make VERBOSE=1 cutlass_profiler -j16

There are no compilation errors anymore. And this PR is ready for review.

@Junkai-Wu
Copy link
Collaborator

@HydraQYH The changes look much cleaner now. Thanks for the quick action. Leave minor comments above.

@Junkai-Wu
Copy link
Collaborator

@HydraQYH I ran the internal pipeline with these changes and got a timeout issue of this unit test: https://github.com/NVIDIA/cutlass/blob/main/test/unit/gemm/device/sm120_tensorop_gemm/CMakeLists.txt#L51

The issue disappeared when I added the macro back on the is_last_tile function. I've asked corresponding developer to help identify the issue. You can also investigate it if possible.

@HydraQYH
Copy link
Contributor Author

HydraQYH commented Dec 13, 2025

@Junkai-Wu Sorry, I don't have an SM120 device, so all I can do is compile. However, I noticed that removing this macro in Pingpong causes the program to return immediately sometime:

#ifdef CUTLASS_ENABLE_GDC_FOR_SM90
// It is possible to have work tiles start off invalid,
// so we have to check that first.
if (not work_tile_info.is_valid()) {
// Hint on an early release of global memory resources.
// The timing of calling this function only influences performance,
// not functional correctness.
cutlass::arch::launch_dependent_grids();
return;
}
#endif

I think this radical approach may be risky, and may result in some semaphores not being released yet. In contrast, other changes will not affect the program's execution. I've reverted the changes to the early stop part. Could you try this?

@Junkai-Wu
Copy link
Collaborator

@HydraQYH After investigation, the timeout issue is caused by sm120 kernel calling sm90 scheduler where sm120 kernel didn't need to call is_last_tile before and after the removal of the macro, it will call this function which may cause some schedule issue. Maybe revert to previously implementation which using IsBlockScaled but with a proper name would be a safe option.

@HydraQYH HydraQYH force-pushed the dev_fix_sm120_with_pdl_support branch from 67a08ba to ae1f832 Compare December 15, 2025 06:13
@HydraQYH
Copy link
Contributor Author

@Junkai-Wu Okay. I rebaseed the code and reverted to previously implementation. I changed is_blockscaled to IsBlockScaledDispatchPolicy, which I believe is reasonable because the template parameter is just a DispatchPolicy. I also fixed the typo in Cooperative. And it's ready for review.

@Junkai-Wu
Copy link
Collaborator

@HydraQYH The internal pipeline still fails. After checking, I found all sm120 kernels should not call the is_last_tile function, not just sm120 blockscaled kernels. I refactored the change and reran the internal pipeline. I'll leave comments in this PR after the pipeline passes.

else if (warp_group_role == WarpGroupRole::Consumer0 || warp_group_role == WarpGroupRole::Consumer1) {
cutlass::arch::warpgroup_reg_alloc<MmaRegisterRequirement>();
#ifdef CUTLASS_ENABLE_GDC_FOR_SM90
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add if constexpr (!IsSm120Family) condition here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@Junkai-Wu
Copy link
Collaborator

@HydraQYH The internal pipeline passes. Leave suggested implementations in the comments.

HydraQYH and others added 3 commits December 17, 2025 22:35
@HydraQYH
Copy link
Contributor Author

@Junkai-Wu Thank you very much for your help. I have picked all your suggestions.

@Junkai-Wu Junkai-Wu merged commit ebf3165 into NVIDIA:main Dec 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants