[BUG] tcgen05 ld.red CSEd incorrectly

### Which component has the problem?

CuTe DSL

### Bug Report

Describe the bug
-----------------
tcgen05.ld.red operations emitted via CUTLASS DSL's cute.copy with
TmemLoadRedOp are lowered to llvm.inline_asm WITHOUT has_side_effects=True.
This causes LLVM's GVN/CSE pass to merge multiple tcgen05.ld.red instructions
that share the same TMEM source address, even though the TMEM contents may
change between reads (e.g. due to intervening MMA operations writing new
accumulator values).

In contrast, regular tcgen05.ld (non-reduction) operations are lowered to the
native nvvm.tcgen05_ld intrinsic, which carries proper memory effects and is
not subject to CSE.




Steps/Code to reproduce bug
----------------------------
Requires a GB300 and nvidia-cutlass-dsl >= 4.4.1.

The reproduction is a standalone Blackwell GEMM kernel (based on the CuTeDSL
tutorial) that reads the TMEM accumulator twice in the epilogue. One variant
uses tcgen05.ld.red.max, the other uses regular tcgen05.ld:

    python repro_cse_ldred.py

See attached repro_cse_ldred.py for the full self-contained script (~280 lines).
The key section is the epilogue, which does two reads from the same TMEM
partition:

    # In the epilogue loop:
    for i in cutlass.range(cute.size(tDtC, mode=[2])):
        # First ld.red read
        cute.copy(tmem_tiled_copy, tDtC[None, None, i], [tCrAcc, tCrRedval])
        tCrC.store(tCrAcc.load().to(io_dtype))
        cute.autovec_copy(tCrC, tDgC[None, None, i])

        # Second ld.red read from SAME source -- should produce another set
        # of ld.red instructions, but LLVM CSE merges them with the first.
        cute.copy(tmem_tiled_copy, tDtC[None, None, i], [tCrAcc, tCrRedval])
        tCrC.store(tCrAcc.load().to(io_dtype))
        cute.autovec_copy(tCrC, tDgC[None, None, i])

Output:

    [1] Compiling with tcgen05.ld.red.max (LdRed32x32bOp)...
        tcgen05.ld.red instructions: 1       <-- should be 2

    [2] Compiling with tcgen05.ld (Ld32x32bOp, for comparison)...
        tcgen05.ld.sync instructions: 2      <-- correct


Expected behavior
-----------------
Both tcgen05.ld.red reads should be preserved in the emitted PTX (2
instructions). The second read at a different program point should NOT be
merged with the first.

Observed: only 1 tcgen05.ld.red instruction emitted (the second is eliminated
by LLVM CSE/GVN).

For comparison, regular tcgen05.ld correctly emits 2 instructions for the
same two-read pattern.

Environment details
-------------------
- GPU: NVIDIA GB300 (SM 103a)
- nvidia-cutlass-dsl==4.4.1
- CUDA 13 / Driver 580.99
- Python 3.11

[repro_cse_ldred-y8xqiayzgpd3pnoc1say7c1adw.py](https://github.com/user-attachments/files/25788324/repro_cse_ldred-y8xqiayzgpd3pnoc1say7c1adw.py)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] tcgen05 ld.red CSEd incorrectly #3090

Which component has the problem?

Bug Report

Describe the bug

Steps/Code to reproduce bug

Expected behavior

Environment details

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[BUG] tcgen05 ld.red CSEd incorrectly #3090

Description

Which component has the problem?

Bug Report

Describe the bug

Steps/Code to reproduce bug

Expected behavior

Environment details

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions