Skip to content

[BUG] tcgen05 ld.red CSEd incorrectly #3090

@skrider

Description

@skrider

Which component has the problem?

CuTe DSL

Bug Report

Describe the bug

tcgen05.ld.red operations emitted via CUTLASS DSL's cute.copy with
TmemLoadRedOp are lowered to llvm.inline_asm WITHOUT has_side_effects=True.
This causes LLVM's GVN/CSE pass to merge multiple tcgen05.ld.red instructions
that share the same TMEM source address, even though the TMEM contents may
change between reads (e.g. due to intervening MMA operations writing new
accumulator values).

In contrast, regular tcgen05.ld (non-reduction) operations are lowered to the
native nvvm.tcgen05_ld intrinsic, which carries proper memory effects and is
not subject to CSE.

Steps/Code to reproduce bug

Requires a GB300 and nvidia-cutlass-dsl >= 4.4.1.

The reproduction is a standalone Blackwell GEMM kernel (based on the CuTeDSL
tutorial) that reads the TMEM accumulator twice in the epilogue. One variant
uses tcgen05.ld.red.max, the other uses regular tcgen05.ld:

python repro_cse_ldred.py

See attached repro_cse_ldred.py for the full self-contained script (~280 lines).
The key section is the epilogue, which does two reads from the same TMEM
partition:

# In the epilogue loop:
for i in cutlass.range(cute.size(tDtC, mode=[2])):
    # First ld.red read
    cute.copy(tmem_tiled_copy, tDtC[None, None, i], [tCrAcc, tCrRedval])
    tCrC.store(tCrAcc.load().to(io_dtype))
    cute.autovec_copy(tCrC, tDgC[None, None, i])

    # Second ld.red read from SAME source -- should produce another set
    # of ld.red instructions, but LLVM CSE merges them with the first.
    cute.copy(tmem_tiled_copy, tDtC[None, None, i], [tCrAcc, tCrRedval])
    tCrC.store(tCrAcc.load().to(io_dtype))
    cute.autovec_copy(tCrC, tDgC[None, None, i])

Output:

[1] Compiling with tcgen05.ld.red.max (LdRed32x32bOp)...
    tcgen05.ld.red instructions: 1       <-- should be 2

[2] Compiling with tcgen05.ld (Ld32x32bOp, for comparison)...
    tcgen05.ld.sync instructions: 2      <-- correct

Expected behavior

Both tcgen05.ld.red reads should be preserved in the emitted PTX (2
instructions). The second read at a different program point should NOT be
merged with the first.

Observed: only 1 tcgen05.ld.red instruction emitted (the second is eliminated
by LLVM CSE/GVN).

For comparison, regular tcgen05.ld correctly emits 2 instructions for the
same two-read pattern.

Environment details

  • GPU: NVIDIA GB300 (SM 103a)
  • nvidia-cutlass-dsl==4.4.1
  • CUDA 13 / Driver 580.99
  • Python 3.11

repro_cse_ldred-y8xqiayzgpd3pnoc1say7c1adw.py

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions