I am trying to understand the implementation in include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized_fp8.hpp.
I am confused about a few points in it. These questions may be very basic, but I have only just started with CUTLASS.
Why do we need the prologue iterations before the mainloop? https://github.com/NVIDIA/cutlass/blob/637b15906358191cb4238af419d408a65819d7ec/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized_fp8.hpp#L452C5-L453C61
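For context, my (possibly wrong) mental model of why a pipelined loop has a prologue at all is the toy host-side sketch below. It is generic software pipelining, not the actual CUTLASS code; `issue_async_load`, `wait_for_load`, `mma_on_stage`, `PIPE_DEPTH`, and `k_tile_count` are all names I made up.

```cpp
#include <algorithm>
#include <cstdio>

// Toy stand-ins for the real operations (all names here are made up):
void issue_async_load(int k_tile, int stage) { std::printf("load  k-tile %d -> smem stage %d\n", k_tile, stage); }
void wait_for_load(int stage)                { std::printf("wait  on smem stage %d\n", stage); }
void mma_on_stage(int stage)                 { std::printf("mma   on smem stage %d\n", stage); }

int main() {
  constexpr int PIPE_DEPTH   = 3;  // number of shared-memory buffers (pipeline stages)
  constexpr int k_tile_count = 8;  // K tiles to process

  // Prologue: pre-issue loads for the first PIPE_DEPTH tiles so the pipeline
  // is already full when the steady-state mainloop starts.
  int prologue = std::min(PIPE_DEPTH, k_tile_count);
  for (int k = 0; k < prologue; ++k) {
    issue_async_load(k, k % PIPE_DEPTH);
  }

  // Mainloop: compute on tile k while the load for tile k + PIPE_DEPTH is in flight.
  for (int k = 0; k < k_tile_count; ++k) {
    int stage = k % PIPE_DEPTH;
    wait_for_load(stage);
    mma_on_stage(stage);
    if (k + PIPE_DEPTH < k_tile_count) {
      issue_async_load(k + PIPE_DEPTH, stage);  // reuse the stage we just freed
    }
  }
  return 0;
}
```

Is the prologue in the linked code doing something like this (keeping a fixed number of operations in flight before entering the steady-state loop), or is there something more to it that is specific to FP8?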
Why is the FP8 GEMM implemented with warp specialization? Does using TMA for data loading necessarily mean we need warp specialization?
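To make that question concrete, my naive picture of "warp specialization" is one warp acting as a data-movement producer while the remaining warps compute, roughly like the toy kernel below. This is not the CUTLASS implementation: there is no real TMA and no multi-stage pipeline, `warp_specialized_toy` and the tile size are invented, and plain `__syncthreads()` stands in for whatever barrier machinery the real kernel uses.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Toy CUDA kernel: warp 0 stages one tile into shared memory ("producer"),
// the remaining warps compute on it ("consumers").
__global__ void warp_specialized_toy(const float* in, float* out, int n_tiles) {
  __shared__ float smem_tile[256];
  int warp_id = threadIdx.x / 32;

  for (int t = 0; t < n_tiles; ++t) {
    if (warp_id == 0) {
      // Producer warp: copy the next tile from global to shared memory.
      for (int i = threadIdx.x; i < 256; i += 32) {
        smem_tile[i] = in[t * 256 + i];
      }
    }
    __syncthreads();  // stand-in for a "tile is ready" signal to the consumers
    if (warp_id != 0) {
      // Consumer warps: do the math on the staged tile.
      for (int i = threadIdx.x - 32; i < 256; i += blockDim.x - 32) {
        out[t * 256 + i] = smem_tile[i] * 2.0f;
      }
    }
    __syncthreads();  // stand-in for "tile consumed, stage can be reused"
  }
}

int main() {
  const int n_tiles = 4, n = n_tiles * 256;
  float *in = nullptr, *out = nullptr;
  cudaMallocManaged(&in, n * sizeof(float));
  cudaMallocManaged(&out, n * sizeof(float));
  for (int i = 0; i < n; ++i) { in[i] = float(i); out[i] = 0.0f; }

  warp_specialized_toy<<<1, 128>>>(in, out, n_tiles);  // 1 producer warp + 3 consumer warps
  cudaDeviceSynchronize();

  std::printf("out[0]=%.1f out[%d]=%.1f\n", out[0], n - 1, out[n - 1]);
  cudaFree(in);
  cudaFree(out);
  return 0;
}
```

Is warp specialization in this kernel essentially an optimization of that producer/consumer split, or does using TMA actually require it?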
I'd appreciate any explanation, or pointers to relevant documentation.