I am trying to understand the implementation in include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized_fp8.hpp.
I am confused about a few points in it. These questions may be very basic, but I have only just started with CUTLASS.
Why do we need the prologue iterations before the mainloop? https://github.com/NVIDIA/cutlass/blob/637b15906358191cb4238af419d408a65819d7ec/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized_fp8.hpp#L452C5-L453C61
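For context, my (possibly wrong) mental model of why a pipelined loop has a prologue at all is the toy host-side sketch below. It is generic software pipelining, not the actual CUTLASS code; `issue_async_load`, `wait_for_load`, `mma_on_stage`, `PIPE_DEPTH`, and `k_tile_count` are all names I made up.

```cpp
#include <algorithm>
#include <cstdio>

// Toy stand-ins for the real operations (all names here are made up):
void issue_async_load(int k_tile, int stage) { std::printf("load  k-tile %d -> smem stage %d\n", k_tile, stage); }
void wait_for_load(int stage)                { std::printf("wait  on smem stage %d\n", stage); }
void mma_on_stage(int stage)                 { std::printf("mma   on smem stage %d\n", stage); }

int main() {
  constexpr int PIPE_DEPTH   = 3;  // number of shared-memory buffers (pipeline stages)
  constexpr int k_tile_count = 8;  // K tiles to process

  // Prologue: pre-issue loads for the first PIPE_DEPTH tiles so the pipeline
  // is already full when the steady-state mainloop starts.
  int prologue = std::min(PIPE_DEPTH, k_tile_count);
  for (int k = 0; k < prologue; ++k) {
    issue_async_load(k, k % PIPE_DEPTH);
  }

  // Mainloop: compute on tile k while the load for tile k + PIPE_DEPTH is in flight.
  for (int k = 0; k < k_tile_count; ++k) {
    int stage = k % PIPE_DEPTH;
    wait_for_load(stage);
    mma_on_stage(stage);
    if (k + PIPE_DEPTH < k_tile_count) {
      issue_async_load(k + PIPE_DEPTH, stage);  // reuse the stage we just freed
    }
  }
  return 0;
}
```

Is the prologue in the linked code doing something like this (keeping a fixed number of operations in flight before entering the steady-state loop), or is there something more to it that is specific to FP8?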
Why is the FP8 GEMM implemented with warp specialization? Does using TMA for data loading necessarily mean we need warp specialization?
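To make that question concrete, my naive picture of "warp specialization" is one warp acting as a data-movement producer while the remaining warps compute, roughly like the toy kernel below. This is not the CUTLASS implementation: there is no real TMA and no multi-stage pipeline, `warp_specialized_toy` and the tile size are invented, and plain `__syncthreads()` stands in for whatever barrier machinery the real kernel uses.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Toy CUDA kernel: warp 0 stages one tile into shared memory ("producer"),
// the remaining warps compute on it ("consumers").
__global__ void warp_specialized_toy(const float* in, float* out, int n_tiles) {
  __shared__ float smem_tile[256];
  int warp_id = threadIdx.x / 32;

  for (int t = 0; t < n_tiles; ++t) {
    if (warp_id == 0) {
      // Producer warp: copy the next tile from global to shared memory.
      for (int i = threadIdx.x; i < 256; i += 32) {
        smem_tile[i] = in[t * 256 + i];
      }
    }
    __syncthreads();  // stand-in for a "tile is ready" signal to the consumers
    if (warp_id != 0) {
      // Consumer warps: do the math on the staged tile.
      for (int i = threadIdx.x - 32; i < 256; i += blockDim.x - 32) {
        out[t * 256 + i] = smem_tile[i] * 2.0f;
      }
    }
    __syncthreads();  // stand-in for "tile consumed, stage can be reused"
  }
}

int main() {
  const int n_tiles = 4, n = n_tiles * 256;
  float *in = nullptr, *out = nullptr;
  cudaMallocManaged(&in, n * sizeof(float));
  cudaMallocManaged(&out, n * sizeof(float));
  for (int i = 0; i < n; ++i) { in[i] = float(i); out[i] = 0.0f; }

  warp_specialized_toy<<<1, 128>>>(in, out, n_tiles);  // 1 producer warp + 3 consumer warps
  cudaDeviceSynchronize();

  std::printf("out[0]=%.1f out[%d]=%.1f\n", out[0], n - 1, out[n - 1]);
  cudaFree(in);
  cudaFree(out);
  return 0;
}
```

Is warp specialization in this kernel essentially an optimization of that producer/consumer split, or does using TMA actually require it?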
I'd appreciate any explanation, or pointers to relevant documentation.