I am developing intuition for the heuristic used to compute the number of warps.
The `_layer_norm_fwd_fused` kernel uses `BLOCK_SIZE` as the stride of its accumulation for-loops over an input vector. But the `forward` method of the `LayerNorm` class sets `BLOCK_SIZE` to the next power of two of the input vector dimensionality, so each loop runs for exactly one iteration and there is no need for accumulation.
There is also no accumulation in the `_layer_norm_bwd_dx_fused` kernel, which takes the `BLOCK_SIZE` value from the context saved in the forward pass; this suggests the for-loops are not there to prompt the compiler. Because `BLOCK_SIZE_N` in the `_layer_norm_bwd_dx_fused` kernel cannot be less than the input vector dimensionality, the use of the for-loops in the `_layer_norm_fwd_fused` kernel can lead to potential bugs.
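To make the single-iteration claim concrete, here is a minimal pure-Python sketch of the loop structure (not the Triton kernel itself); `next_power_of_2` mirrors what `triton.next_power_of_2` computes, and the loop bounds mirror `for off in range(0, N, BLOCK_SIZE)`:

```python
def next_power_of_2(n: int) -> int:
    # Smallest power of two >= n.
    return 1 << (n - 1).bit_length()

def fwd_loop_iterations(N: int) -> int:
    # Mirrors the accumulation loop bounds in _layer_norm_fwd_fused
    # when BLOCK_SIZE is set by LayerNorm.forward.
    BLOCK_SIZE = next_power_of_2(N)
    return len(range(0, N, BLOCK_SIZE))

# BLOCK_SIZE >= N always holds, so the loop body runs exactly once.
for N in (1, 255, 256, 257, 8192):
    assert fwd_loop_iterations(N) == 1
```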
An instance of the `_layer_norm_{fwd_fused, bwd_dx_fused}` kernel processes one entire input vector and uses a number of warps chosen by the heuristic. Up to 8 warps are used, resulting in up to 256 threads. To get a number of warps other than 8, `BLOCK_SIZE` is divided by 256. This suggests that (i) 256 contiguous 16-bit elements are accessed by the 32 threads of a warp, and (ii) each thread accesses 8 such elements in a single 128-bit vectorized load/store transaction.
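For reference, the heuristic as I read it can be sketched in a few lines of plain Python (a reconstruction of the tutorial's launch logic, not a verbatim copy):

```python
def num_warps_heuristic(BLOCK_SIZE: int) -> int:
    # One warp (32 threads) per 256 elements, clamped to the range [1, 8].
    return min(max(BLOCK_SIZE // 256, 1), 8)

assert num_warps_heuristic(128) == 1    # below 256 elements: still 1 warp
assert num_warps_heuristic(1024) == 4   # 1024 / 256 = 4 warps = 128 threads
assert num_warps_heuristic(8192) == 8   # clamped at 8 warps = 256 threads
```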
In the `_layer_norm_bwd_dwdb` kernel, the accesses to partial weight and bias gradients are in contiguous row segments of 128 16-bit elements. One half of a warp would access a contiguous row segment with one 128-bit vectorized transaction per thread, and the other half of the warp would similarly access another, non-adjacent contiguous row segment. Note that the default number of warps appears to be used here, in contrast to the `_layer_norm_{fwd_fused, bwd_dx_fused}` kernels.
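The half-warp claim follows from simple byte arithmetic (assuming 16-bit data and 128-bit per-thread vectorized accesses, as above):

```python
ELEM_BITS = 16        # fp16/bf16 element width (assumed)
SEGMENT_ELEMS = 128   # contiguous row segment in _layer_norm_bwd_dwdb
VEC_BITS = 128        # one vectorized load/store per thread

elems_per_thread = VEC_BITS // ELEM_BITS            # 8 elements per thread
threads_per_segment = SEGMENT_ELEMS // elems_per_thread  # 16 threads

# 16 threads = half of a 32-thread warp covers one 128-element segment,
# leaving the other half-warp for a second, non-adjacent segment.
assert elems_per_thread == 8
assert threads_per_segment == 16
```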
Based on this analysis, the heuristic for computing the number of warps uses the following criteria:

- one 128-bit vectorized load/store transaction per thread per data block,
- threads in a warp access one contiguous segment, or two non-adjacent contiguous segments, and
- a thread block preferably has up to 256 threads; 256 threads may be suitable for high occupancy across NVIDIA architectures.
I also assume that the `.sum` and `+=` accumulations are automatically optimized by the compiler into parallel tree reductions with O(log N) step complexity.
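The shape of such a reduction can be pictured with a small pure-Python sketch (an illustration of the O(log N) step count, not Triton's actual code generation):

```python
import math

def tree_sum(xs):
    # Each step halves the number of partial sums, so the number of
    # steps is ceil(log2(N)) even though total work stays O(N).
    vals = list(xs)
    steps = 0
    while len(vals) > 1:
        if len(vals) % 2:          # pad odd-length rounds with the identity
            vals.append(0)
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
        steps += 1
    return vals[0], steps

total, steps = tree_sum(range(1, 9))   # 1 + 2 + ... + 8
assert total == 36
assert steps == math.ceil(math.log2(8))  # 3 steps for 8 elements
```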
The purpose of the for-loops in the `_layer_norm_fwd_fused` kernel remains unclear. Any comments regarding the heuristic are also appreciated. Thank you.