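I encountered NaNs when training wan-2.1: a tensor of shape (1536, 1560) contained NaN values. On investigation, I found that several Triton kernels in coat perform load and store operations without boundary checks (masks). When a tensor dimension is not a multiple of the kernel's block size, these unchecked accesses read and write memory outside the tensor, which corrupts data and produces NaN outputs.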
Affected kernels: transpose, division, division_transpose, ...
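The failure mode can be reproduced without a GPU. Below is a minimal pure-Python sketch of a tiled transpose that mimics a Triton-style 2-D grid. It is not coat's code: `blocked_transpose`, `make_src`, the block sizes, and the use of NaN to stand in for uninitialized neighbouring memory are all hypothetical. With `masked=False`, out-of-range lanes are still loaded and stored, like `tl.load`/`tl.store` called without `mask=`. Because offsets are computed row-major, a column index past the edge aliases into the next row, so garbage overwrites valid output. For the reported shape, a hypothetical 128-wide tile divides 1536 evenly (12 x 128) but leaves a 24-column partial tile for 1560, which is exactly the case where unchecked accesses go out of bounds.

```python
import math


def blocked_transpose(src, m, n, block_m, block_n, masked):
    """Transpose an m x n row-major matrix held in src[:m * n] (hypothetical helper).

    Each (pid_m, pid_n) "program" handles a block_m x block_n tile. With
    masked=False, lanes past the matrix edge are loaded and stored anyway,
    like tl.load/tl.store without a mask. dst[:m * n] holds the n x m result;
    the tail of dst stands in for memory owned by other tensors.
    """
    m_pad = math.ceil(m / block_m) * block_m
    n_pad = math.ceil(n / block_n) * block_n
    dst = [0.0] * (m * n + m_pad * n_pad)
    for pid_m in range(m_pad // block_m):
        for pid_n in range(n_pad // block_n):
            for i in range(pid_m * block_m, (pid_m + 1) * block_m):
                for j in range(pid_n * block_n, (pid_n + 1) * block_n):
                    if masked and not (i < m and j < n):
                        continue  # boundary check: skip out-of-range lanes
                    # Unchecked row-major offsets: i >= m or j >= n silently
                    # alias other rows or run past the end of the tensor.
                    dst[j * m + i] = src[i * n + j]
    return dst


def make_src(m, n, block_m, block_n):
    """Valid data followed by NaN 'garbage' standing in for neighbouring memory."""
    m_pad = math.ceil(m / block_m) * block_m
    n_pad = math.ceil(n / block_n) * block_n
    return [float(k) for k in range(m * n)] + [math.nan] * (m_pad * n_pad)


# Small shapes that, like (1536, 1560), are not multiples of the tile size.
m, n, bm, bn = 5, 6, 4, 4
src = make_src(m, n, bm, bn)
good = blocked_transpose(src, m, n, bm, bn, masked=True)[: m * n]
bad = blocked_transpose(src, m, n, bm, bn, masked=False)[: m * n]
print(any(math.isnan(x) for x in good))  # False
print(any(math.isnan(x) for x in bad))   # True
```

In the unmasked run, tiles covering rows 5 to 7 read from the NaN region past the input and write into valid output slots, so the result ends up with NaNs even though the real input is finite. In a Triton kernel the corresponding fix is to build a mask such as `(offs_m[:, None] < M) & (offs_n[None, :] < N)` and pass it to both `tl.load` (with an `other=` fill value) and `tl.store`.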