https://github.com/Infatoshi/cuda-course/blob/master/08_Triton/02_softmax.cu
Line 12: float max_val = input[offset + tid];
In the kernel softmax_cuda (line 12), each thread initializes max_val with its own element, input[offset + tid]. This leads to incorrect results because each thread starts its comparison from its own value rather than from a value common to all threads, so max_val is not guaranteed to be the global maximum for the entire batch.
Problem:
max_val is computed from a different starting value in each thread, which causes incorrect softmax calculations.
Only the thread whose own element is the maximum computes max_val correctly; the other threads compute incorrect results.
Solution:
float max_val = input[offset];
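
For context, here is a minimal sketch of how the kernel might look with this change applied. Only the max_val line comes from this report. The kernel signature, the loop bounds (the max scan is assumed to start at index 1), the one-block-per-row launch, and the softmax_host helper are assumptions made for illustration and are not copied from the repository.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Sketch only. Assumed layout: input and output are row-major B x N,
// launched with B blocks of N threads (one thread per element).
__global__ void softmax_cuda(const float* input, float* output, int B, int N) {
    int tid = threadIdx.x;
    int bid = blockIdx.x;
    if (tid >= N || bid >= B) return;

    int offset = bid * N;

    // Proposed fix: seed with the first element of the row, so every
    // thread scans the same set of values and reaches the same maximum.
    float max_val = input[offset];
    for (int i = 1; i < N; ++i) {
        max_val = fmaxf(max_val, input[offset + i]);
    }

    // Sum of shifted exponentials over the whole row.
    float sum = 0.0f;
    for (int i = 0; i < N; ++i) {
        sum += expf(input[offset + i] - max_val);
    }

    // Each thread writes the softmax value for its own element.
    output[offset + tid] = expf(input[offset + tid] - max_val) / sum;
}

// Hypothetical host helper (not part of the repository): copies the data
// to the device, runs the kernel, and returns the result.
// Assumes N <= 1024 so one block can cover a row; error checks omitted.
std::vector<float> softmax_host(const std::vector<float>& in, int B, int N) {
    size_t bytes = in.size() * sizeof(float);
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);

    softmax_cuda<<<B, N>>>(d_in, d_out, B, N);

    std::vector<float> out(in.size());
    // This blocking copy waits for the kernel on the default stream.
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

A quick sanity check is to run this on a row whose largest element is at index 0, for example {3, 1, 2}, and confirm that each row of the output sums to 1 and matches a CPU reference.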