The question is just as in the title. For instance:

`auto tiled_mma = make_tiled_mma(SM80_16x8x16_F32F16F16F32_TN{}, Layout<Shape<_1, _1, _2>>{});`

Intuitively, the number of threads in each CTA will be doubled, but does each thread's register fragment then hold only a partial sum, reduced over half of the K extent? Thanks!
  
      
      
Answered by ccecka
Mar 28, 2024
      
    
That's right! The kernel would probably want to perform some kind of reduction in the epilogue or atomically update the global memory tile.
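To make the partial-sum point concrete, here is a minimal host-side sketch in plain C++. It is not CuTe code and does not model the real thread or register layout. The helper names (`partial_gemm`, `reduce_partials`, `split_k_gemm`) are hypothetical, and the assumption that each of the two atoms owns a contiguous half of K is a simplification for illustration only. It only shows that each accumulator holds a partial result until something like an epilogue reduction sums them.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model only: one Matrix of accumulators stands in for the register
// fragments of one MMA atom. A is M x K, B is K x N, C is M x N, all row-major.
using Matrix = std::vector<float>;

// Accumulate only over k in [k_begin, k_end). This is what a single atom
// would hold when the tiled MMA is split along K: a partial sum, not C.
Matrix partial_gemm(const Matrix& A, const Matrix& B,
                    std::size_t M, std::size_t N, std::size_t K,
                    std::size_t k_begin, std::size_t k_end) {
    assert(k_begin <= k_end && k_end <= K);
    Matrix acc(M * N, 0.0f);
    for (std::size_t m = 0; m < M; ++m)
        for (std::size_t n = 0; n < N; ++n)
            for (std::size_t k = k_begin; k < k_end; ++k)
                acc[m * N + n] += A[m * K + k] * B[k * N + n];
    return acc;
}

// Stand-in for the epilogue reduction: sum the partial accumulators elementwise.
Matrix reduce_partials(const std::vector<Matrix>& partials) {
    assert(!partials.empty());
    Matrix C(partials[0].size(), 0.0f);
    for (const Matrix& p : partials) {
        assert(p.size() == C.size());
        for (std::size_t i = 0; i < C.size(); ++i)
            C[i] += p[i];
    }
    return C;
}

// Split K into `splits` equal chunks (2 in the question), compute one partial
// accumulator per chunk, then reduce them into the final C.
Matrix split_k_gemm(const Matrix& A, const Matrix& B,
                    std::size_t M, std::size_t N, std::size_t K,
                    std::size_t splits) {
    assert(splits > 0 && K % splits == 0);
    const std::size_t chunk = K / splits;
    std::vector<Matrix> partials;
    for (std::size_t s = 0; s < splits; ++s)
        partials.push_back(partial_gemm(A, B, M, N, K, s * chunk, (s + 1) * chunk));
    return reduce_partials(partials);
}
```

In a real kernel the sum done by `reduce_partials` would happen across warps (for example through shared memory in the epilogue), or each warp would atomically add its partial tile into global memory, as suggested in the answer.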
  
                  
                
            
Answer selected by hyhieu
  
  
        
    