-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Description
Which component has the problem?
CuTe DSL
Bug Report
Describe the bug
tcgen05.ld.red operations emitted via CUTLASS DSL's cute.copy with
TmemLoadRedOp are lowered to llvm.inline_asm WITHOUT has_side_effects=True.
This causes LLVM's GVN/CSE pass to merge multiple tcgen05.ld.red instructions
that share the same TMEM source address, even though the TMEM contents may
change between reads (e.g. due to intervening MMA operations writing new
accumulator values).
In contrast, regular tcgen05.ld (non-reduction) operations are lowered to the
native nvvm.tcgen05_ld intrinsic, which carries proper memory effects and is
not subject to CSE.
Steps/Code to reproduce bug
Requires a GB300 and nvidia-cutlass-dsl >= 4.4.1.
The reproduction is a standalone Blackwell GEMM kernel (based on the CuTeDSL
tutorial) that reads the TMEM accumulator twice in the epilogue. One variant
uses tcgen05.ld.red.max, the other uses regular tcgen05.ld:
python repro_cse_ldred.py
See attached repro_cse_ldred.py for the full self-contained script (~280 lines).
The key section is the epilogue, which does two reads from the same TMEM
partition:
# In the epilogue loop:
for i in cutlass.range(cute.size(tDtC, mode=[2])):
# First ld.red read
cute.copy(tmem_tiled_copy, tDtC[None, None, i], [tCrAcc, tCrRedval])
tCrC.store(tCrAcc.load().to(io_dtype))
cute.autovec_copy(tCrC, tDgC[None, None, i])
# Second ld.red read from SAME source -- should produce another set
# of ld.red instructions, but LLVM CSE merges them with the first.
cute.copy(tmem_tiled_copy, tDtC[None, None, i], [tCrAcc, tCrRedval])
tCrC.store(tCrAcc.load().to(io_dtype))
cute.autovec_copy(tCrC, tDgC[None, None, i])
Output:
[1] Compiling with tcgen05.ld.red.max (LdRed32x32bOp)...
tcgen05.ld.red instructions: 1 <-- should be 2
[2] Compiling with tcgen05.ld (Ld32x32bOp, for comparison)...
tcgen05.ld.sync instructions: 2 <-- correct
Expected behavior
Both tcgen05.ld.red reads should be preserved in the emitted PTX (2
instructions). The second read at a different program point should NOT be
merged with the first.
Observed: only 1 tcgen05.ld.red instruction emitted (the second is eliminated
by LLVM CSE/GVN).
For comparison, regular tcgen05.ld correctly emits 2 instructions for the
same two-read pattern.
Environment details
- GPU: NVIDIA GB300 (SM 103a)
- nvidia-cutlass-dsl==4.4.1
- CUDA 13 / Driver 580.99
- Python 3.11