Add missing NVGPUToLLVMPass dependency #9398

Merged
ThomasRaoux merged 2 commits into triton-lang:main from
bmyerz0:dev/bmyerz0/nvgppputollvmpass-cmake
Feb 7, 2026

Conversation

Contributor

@bmyerz0 bmyerz0 commented Feb 7, 2026

CMake builds can sometimes fail with the message below, because NVGPUToLLVMPass has an undeclared dependency on TritonGPUOpInterfacesIncGen, which generates OpInterfaces.h.inc.

FAILED: Compilers/triton/third_party/nvidia/lib/NVGPUToLLVM/CMakeFiles/NVGPUToLLVM.dir/NVGPUToLLVMPass.cpp.o 

In file included from /__w/1/s/src/Compilers/triton/third_party/nvidia/lib/NVGPUToLLVM/NVGPUToLLVMPass.cpp:11:
In file included from /__w/1/s/src/Compilers/triton/third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/Utility.h:8:
In file included from /__w/1/s/src/Compilers/triton/include/triton/Conversion/TritonGPUToLLVM/Utility.h:7:
In file included from /__w/1/s/src/Compilers/triton/include/triton/Conversion/TritonGPUToLLVM/TargetInfoBase.h:4:
In file included from /__w/1/s/src/Compilers/triton/include/triton/Conversion/MLIRTypes.h:5:
In file included from /__w/1/s/src/Compilers/triton/include/triton/Dialect/TritonGPU/IR/Dialect.h:11:
In file included from /__w/1/s/src/Compilers/triton/include/triton/Dialect/TritonGPU/IR/Attributes.h:6:
/__w/1/s/src/Compilers/triton/include/triton/Dialect/TritonGPU/IR/TritonGPUInterfaces.h:10:10: fatal error: 'triton/Dialect/TritonGPU/IR/OpInterfaces.h.inc' file not found
   10 | #include "triton/Dialect/TritonGPU/IR/OpInterfaces.h.inc"
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
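The fix declares the missing TableGen dependency in CMake so the generated header exists before the library's sources compile. A sketch of what such a declaration looks like (illustrative only; the exact target layout and source list in the tree may differ):

```cmake
# third_party/nvidia/lib/NVGPUToLLVM/CMakeLists.txt -- illustrative sketch
add_triton_library(NVGPUToLLVM
  NVGPUToLLVMPass.cpp

  DEPENDS
  # Ensure tablegen'd headers such as OpInterfaces.h.inc are generated
  # before this library's sources are compiled.
  TritonGPUOpInterfacesIncGen
)
```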

New contributor declaration

  • I am not making a trivial change, such as fixing a typo in a comment.

  • I have written a PR description following these
    rules.

  • I have run pre-commit run --from-ref origin/main --to-ref HEAD.

  • Select one of the following.

    • I have added tests.
      • /test for lit tests
      • /unittest for C++ tests
      • /python/test for end-to-end tests
    • This PR does not need a test because it is fixing a CMakeLists.txt bug
  • Select one of the following.

    • I have not added any lit tests.
    • The lit tests I have added follow these best practices,
      including the "tests should be minimal" section. (Usually running Python code
      and using the instructions it generates is not minimal.)

@bmyerz0 bmyerz0 marked this pull request as ready for review February 7, 2026 08:47
@bmyerz0 bmyerz0 requested a review from ptillet as a code owner February 7, 2026 08:47
@ThomasRaoux ThomasRaoux merged commit 650b9d4 into triton-lang:main Feb 7, 2026
9 checks passed
adstraw added a commit to adstraw/triton that referenced this pull request Apr 16, 2026
* [PROTON] Metadata profiling with periodic flushing (#9236)

* Fix infinite rewrite loop in latest LLVM (#9249)

* [Kernels] Enable high occupancy persistent matmul (#9248)

Having independent work within a single SM allows the warp scheduler to
hide some of the bubbles in our pipelines.

This brings persistent bf16 x mxfp4 MoE from 2290 GBps -> 2640 GBps on
h200, or a 15% improvement.

* [CONSAN] Optimize compilation time (#9240)

Reduce compilation time with consan enabled by not emitting alias
matrices when the buffers trivially don't alias. This should also have a
positive impact on execution speed:
~15% compilation time reduction in tutorials/01-attention-forward.py,
and potentially an even bigger impact in cases where both smem and tmem
buffers never alias.

* Explicitly convert numpy array into scalar because it is required by numpy 2.4.0 (#9172)

Fixes #9171

Signed-off-by: Gregory Shimansky <gregory.shimansky@intel.com>

* [feat] add sentinel when async compile (#9251)

* [triton_kernels] Fix device descriptor allocator to keep a pool (#9259)

When used in a CUDA graph, the per stream tensor allocator could leave
dangling pointers in the CUDA graph. Suppose during graph capture, an 8
KB tensor is allocated as Triton scratch memory. It is recorded in the
CUDA graph. However, if the allocator needs to resize and allocates 16
KB, the previous tensor is decref'd and gc'd by Python. When replaying
the CUDA graph, the first 8 KB tensor that was captured is now dangling.
Fix this by keeping old allocs alive.
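A minimal sketch of the pooling idea (hypothetical names, not the actual driver code): instead of freeing the old buffer on resize, keep it alive in a pool so any pointer captured in a CUDA graph stays valid on replay.

```python
class ScratchAllocator:
    """Grow-only scratch allocator that never frees old buffers.

    Buffers captured in a CUDA graph must stay alive across replays,
    so resizing moves the previous allocation into `self.pool`
    instead of letting Python garbage-collect it.
    """

    def __init__(self, alloc_fn):
        self.alloc_fn = alloc_fn  # e.g. a device tensor allocator
        self.pool = []            # keeps old buffers (and their pointers) alive
        self.current = None

    def get(self, nbytes):
        if self.current is None or len(self.current) < nbytes:
            if self.current is not None:
                self.pool.append(self.current)  # do NOT free: a graph may replay it
            self.current = self.alloc_fn(nbytes)
        return self.current


alloc = ScratchAllocator(bytearray)  # bytearray stands in for a device tensor
a = alloc.get(8 * 1024)    # 8 KB buffer, "captured" by the graph
b = alloc.get(16 * 1024)   # resize: the old 8 KB buffer stays alive in the pool
assert alloc.pool[0] is a  # the captured buffer was not garbage collected
```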

* [AMD] Add clampf lowering via v_med3 (#9256)

`v_med3_*` selects the median of three inputs, or returns the minimum if
any of them is NaN.
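The mapping from clamp to a median is direct: for lo <= hi, clamp(x, lo, hi) equals the median of (x, lo, hi), which is what lets a single median-of-three instruction replace the min/max pair. A quick illustration (the hardware NaN behavior described above is not modeled here):

```python
def med3(a, b, c):
    # Median of three values: sort and take the middle one.
    return sorted((a, b, c))[1]

def clamp(x, lo, hi):
    return max(lo, min(x, hi))

# clamp(x, lo, hi) == med3(x, lo, hi) whenever lo <= hi
for x in (-5.0, 0.0, 0.3, 1.0, 7.0):
    assert med3(x, 0.0, 1.0) == clamp(x, 0.0, 1.0)
```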

* [PROTON] Further reduce unnecessary locks (#9257)

* [AMD] Added i8xi8xi32 v3 to wmma database (#9267)

* [AMD][gfx1250] Added missing f64.16x16x4.f64 to wmma database (#9271)

Added support for f64 case for GFX1250 arch.

Note:

- we are skipping `llvm.amdgcn.wmma.f32.32x16x128.f4` support because
there is no fp4 builtin type in Triton.
- `llvm.amdgcn.wmma.f32.16x16x128.f8f6f4` can be added to the WMMA
database as well. This requires refactoring; currently it is hard-coded
and bypasses the database.

* [ci] Pin pandas < 3.0 (#9273)

We're seeing test failures in proton, and it looks like a pandas update
is the cause. Let's pin the version for now.

* [BACKEND] Add support for out of tree Triton Dialect Plugins (#8523)

* Fix dotCanBeProperlyAsync when wgmma is not yielded by loop (#9274)

* [Backend][AMD] Change membarFilter to take bufferIDs into account (#9265)

When a local memory location is deallocated and later reused, a
`local_load` from the old allocation can still be in flight when a
subsequent `async_copy` writes into the reallocated memory; the
membarFilter previously filtered out the barrier that would prevent this
data hazard. We need to insert a `ttg.barrier local` to make sure the
local load has finished fetching the data.
Add Allocation to membarFilter to do that: only filter out a barrier
when we are sure that the local_load and the async op access the same
bufferID. If they access different bufferIDs, we shouldn't filter the
barrier.

* [Kernels] Don't flatten persistent hopper mixed precision matmul (#9279)

For some reason, overlapping the epilogue with the prologue of the next
tile is actually slower here. Disabling it gives a 150 GBps speedup on
h200, from 2650 GBps -> 2800 GBps.

My hypothesis is that because we use occupancy here, it allows the other
block in the SM to better utilise the tensor cores while we process the
epilogue. Similar to a ping-pong schedule but done by the warp
scheduler.

* [AMD][NFC] used IsFPClass MLIR op for checkIsNan (#9276)

The PR removes the explicit intrinsic call and replaces it with a
dedicated MLIR op.

* Use variadic argument pre-compiled cuda launcher (#6788)

* create new variadic launcher in driver.c  
* remove C string / string substitution logic to create a cuda kernel
from driver.py
* add logic to parse arguments & remove constexpr / flatten tuples in
the new launcher

The launch-overhead measurements using the scripts in the comments below
show no regressions.

* [NVIDIA,Membar] Resurrect NVIDIA::canSkipBarSync (#9246)

I was debugging Membar, and was confused why `triton-opt
--test-print-membar` gives different results than the normal compiler
pass. It turns out this `canSkipBarSync` filter is enabled in testing
mode but NOT in real compilation. It was removed during Blackwell
enablement, presumably because it's making assumptions about the kernel
structure.

This scales it back to something more conservative, and re-enables it in
normal compilation. With this I see an 18 GBps improvement in persistent
bf16 x mxfp4 MoE.

* [Docs] Clarify divisibility reset logic for contiguous dimensions in AxisInfo (#9266)

## What does this PR do?
This PR improves the code documentation in `AxisInfo.cpp`, specifically
within `MulIOp` visitor.

## Why is it important?
The logic `lhsDivisibility = 1` when `contiguity > 1` is a critical
safety guardrail. Without explicit explanation, it looks like an
arbitrary pessimization.

This change clarifies that:
1. Contiguous sequences (e.g., `range(0, N)`) inherently have a GCD of 1
across the dimension.
2. Propagating the base value's divisibility would lead to incorrect
alignment assumptions in downstream passes (like `Coalesce` or
`LoadStoreOp` vectorization), potentially causing illegal memory
accesses or miscompiled code.

This documentation helps future contributors understand the interaction
between `Contiguity` and `Divisibility` without needing to derive the
number theory from scratch.
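The guardrail is easy to verify numerically (a sketch, not the actual AxisInfo code): the GCD across a contiguous sequence's elements collapses to 1 regardless of how well aligned the base value is, so the base's divisibility cannot be propagated through the multiplication.

```python
from math import gcd
from functools import reduce

# A contiguous sequence like range(0, N) contains consecutive integers,
# so the GCD across the dimension is 1 even if the base is well aligned.
base = 128                               # divisibility of the start value
contig = [base + i for i in range(8)]    # contiguous: 128, 129, ..., 135
assert reduce(gcd, contig) == 1          # NOT 128: propagating 128 is unsound

# A strided (non-contiguous) sequence does keep the base's divisibility.
strided = [base * i for i in range(1, 9)]
assert reduce(gcd, strided) == base
```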

* Fix infinite loop in dotCanBeProperlyAsync (#9282)

The `checkOperand` traversal can run forever if either:
1. A block argument participates in a cycle containing only permitted
instructions.
2. A block argument is defined outside of `forOp`, in which case we
never advance transitiveOperand.

To fix (1), track the set of visited block arguments. If we visit the
same block argument again, that means that we are in a cycle originating
in the init value of the iter arg, which is outside the loop.

To fix (2), check for values defined outside the loop as we iterate.
This way, we know that if we are evaluating a block argument, it must be
an iter arg to the loop.
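The two fixes can be sketched in Python pseudostructure (hypothetical helper names, not the actual `checkOperand` code): tracking visited block arguments turns the potentially infinite walk into a terminating one, and checking for out-of-loop definitions stops the walk at external values.

```python
def trace_operand(value, defined_in_loop, incoming_of):
    """Walk a def-use chain backwards until leaving the loop.

    defined_in_loop(v) -> True if v is produced inside the loop
    incoming_of(v)     -> the value feeding block argument v from the
                          previous iteration (may form a cycle)
    Returns the first value defined outside the loop, or None when the
    chain is a pure iter-arg cycle (whose only external input is the
    iter arg's init value).
    """
    visited = set()
    while True:
        # Fix (2): a value defined outside the loop ends the walk.
        if not defined_in_loop(value):
            return value
        # Fix (1): revisiting a block argument means we closed a cycle.
        if value in visited:
            return None
        visited.add(value)
        value = incoming_of(value)

# A two-value cycle: %a <- %b <- %a, both defined inside the loop.
nxt = {"a": "b", "b": "a"}
assert trace_operand("a", lambda v: v in nxt, nxt.__getitem__) is None
```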

* [triton_kernels] fix test case for distributed routing kernels (#9258)

* [AMD][gfx1250] Add Assumptions and Fix Predicate in MXGEMM Kernel (#9285)

This PR
- added assumptions to the loop boundary
- fixed the predicate to eliminate readfirstlane instrs
- moved `static_profile` and `composition` to a shared location

* [AMD] Don't use s_waitcnt to lower global barrier for now (#9287)

Atomics tests are failing a small % of time because of an invalid
transformation in the backend.
This will be reverted once the backend is fixed and LLVM is bumped.

* Adding Triton community meetup notes for 20260106 (#9288)

Topics and discussions covered:
- Update on triton-shared (Haishan Zhu and Nhat Nguyen, Meta)
- Update on the plugin system infrastructure - what's upstream today and
roadmap (Puyan Lotfi and Corbin Robeck, Meta)
- Status of Triton Plugin repository and example: loop unfolding (Simon
Waters, kernelize.ai)

* [AMD][gfx1250] Enable some TDM features (#9283)

This pull request adds the basic plumbing to support
descriptor load/store. More changes are coming.

For now this allows running:

python3 -m pytest -s
test/unit/language/test_tensor_descriptor.py::test_make_tensor_descriptor_matmul[128-128-16-1-1]

* [Consan] Reduce number of configs in coverage tests (#9294)

* Revert "[AMD][gfx1250] Added missing f64.16x16x4.f64 to wmma database" (#9291)

This reverts https://github.com/triton-lang/triton/commit/53b0eafd76debe074965a5d751dd21c593097eb2.

* [AMD] Add PrepareIfCombining pass to enable scf.if combining (#9253)

Add a new pass that moves operations out from between scf.if pairs that
share the same condition. This enables the canonicalizer to combine
adjacent if operations.

Also, remove moveUpTranspose optimization from the ReorderInstructions
pass as we no longer need it with this new pass.

* [Backend] Separate out additive kReg smem padding contribution calculation (#9286)

LLVM was struggling to use immediate values in the padded offset
calculation. This change makes things easier for LLVM, and we get a nice
reduction in register usage. For non-padded pathways, there should be no
difference in the LLVM IR generated.
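The additive split can be illustrated with the classic padded-offset formula (a generic sketch of the idea, not the actual lowering): when the non-register part of the offset is a multiple of the padding interval, the padded offset decomposes into two independent additive pieces, so the constant piece can be folded into an immediate.

```python
def padded_offset(offset, interval, pad):
    # Classic shared-memory padding: insert `pad` extra elements
    # after every `interval` elements to avoid bank conflicts.
    return offset + (offset // interval) * pad

# If offset = const_part + reg_part and const_part is a multiple of
# `interval`, the padded offset splits into two additive terms:
interval, pad = 64, 8
const_part, reg_part = 128, 37  # const_part % interval == 0
lhs = padded_offset(const_part + reg_part, interval, pad)
rhs = padded_offset(const_part, interval, pad) + padded_offset(reg_part, interval, pad)
assert lhs == rhs  # 181 == 144 + 37
```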

* [ANALYSIS] Enhance divisibility handling in AxisInfo for addition and subtraction operations (#9297)

Unfortunately it's probably difficult to derive a very generic rule for
all ops in Triton, though these `affine` ops share many common
characteristics.
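For addition and subtraction the standard divisibility rule is the GCD of the operands' divisibilities; a quick numeric check of that rule (a sketch, not the AxisInfo implementation):

```python
from math import gcd

def add_divisibility(da, db):
    # If a is a multiple of da and b is a multiple of db, then a +/- b
    # is always a multiple of gcd(da, db), and nothing larger can be
    # assumed in general.
    return gcd(da, db)

assert add_divisibility(16, 8) == 8    # 16k + 8m is always a multiple of 8
assert add_divisibility(16, 16) == 16
assert add_divisibility(16, 3) == 1

# Spot-check against concrete values.
for a in range(0, 64, 16):
    for b in range(0, 64, 8):
        assert (a + b) % add_divisibility(16, 8) == 0
```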

* [Nvidia] Enable TMA im2col mode -- Tensor Descriptor (#9225)

# Summary
This is the second PR in a series that enables TMA im2col mode (in
addition to the existing tiled mode) for NVIDIA GPUs. The goal of the
series is to support TMA im2col mode in Gluon DSL.

- First PR: https://github.com/triton-lang/triton/pull/9202
- -> Second PR: https://github.com/triton-lang/triton/pull/9225

PTX ISA documentation for TMA im2col mode:
https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-im2col-mode
TMA tensor descriptor documentation:
https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html

# Summary of changes
- Add general `TT_TensorDescInterface`
- Add `TTNG_TensorDescIm2ColType` for Nvidia TMA im2col mode
- Add lit tests


# Note
The additional im2col parameters `elementStrides`,
`pixelBoxLowerCorner`, `pixelBoxUpperCorner`, `channelsPerPixel`, and
`pixelsPerColumn` will be passed to the tensor descriptor from the host
at runtime. They do not need to be in the kernel IR.

In im2col mode, `blockType` must be rank-2: [`channelsPerPixel` x
`pixelsPerColumn`]

---------

Co-authored-by: Thomas Raoux <thomas.raoux@openai.com>

* [AMD][gluon][gfx1250] Support tensor async scatter (#9299)

Supports tensor async scatter using TDM. Gluon only at the moment.

* [GLUON] Add get_view() to Gluon layouts (#9270)

Adds new method `get_view(shape, use_hw_view)` to all Gluon layout
classes. This can be called from outside a kernel context, useful in
development for inspecting / debugging layouts.

Exports new `gluon_ir.get_layout_view` function that creates a temporary
MLIR context and converts to a LinearLayout.

Returns the same output as the `triton-tensor-layout` CLI tool, but
without needing to convert to TTG IR. Tested by asserting the output is
the same as the CLI tool's.

## Example:

```python
layout = gl.BlockedLayout([2, 1], [1, 4], [1, 2], [1, 0])
print(layout.get_view([2, 8]))

[[T0:0, T1:0, T2:0, T3:0, T4:0, T5:0, T6:0, T7:0]
[ T0:1, T1:1, T2:1, T3:1, T4:1, T5:1, T6:1, T7:1]]

layout = gl.NVMMASharedLayout.get_default_for([16, 16], gl.float16)
print(layout.get_view([16, 16]))

[[( 0: 0),( 0: 1),( 0: 2),( 0: 3),( 0: 4),( 0: 5),( 0: 6),( 0: 7),( 0: 8),( 0: 9),( 0:10),( 0:11),( 0:12),( 0:13),( 0:14),( 0:15)]
[ ( 1: 0),( 1: 1),( 1: 2),( 1: 3),( 1: 4),( 1: 5),( 1: 6),( 1: 7),( 1: 8),( 1: 9),( 1:10),( 1:11),( 1:12),( 1:13),( 1:14),( 1:15)]
[ ( 2: 0),( 2: 1),( 2: 2),( 2: 3),( 2: 4),( 2: 5),( 2: 6),( 2: 7),( 2: 8),( 2: 9),( 2:10),( 2:11),( 2:12),( 2:13),( 2:14),( 2:15)]
[ ( 3: 0),( 3: 1),( 3: 2),( 3: 3),( 3: 4),( 3: 5),( 3: 6),( 3: 7),( 3: 8),( 3: 9),( 3:10),( 3:11),( 3:12),( 3:13),( 3:14),( 3:15)]
[ ( 4: 8),( 4: 9),( 4:10),( 4:11),( 4:12),( 4:13),( 4:14),( 4:15),( 4: 0),( 4: 1),( 4: 2),( 4: 3),( 4: 4),( 4: 5),( 4: 6),( 4: 7)]
[ ( 5: 8),( 5: 9),( 5:10),( 5:11),( 5:12),( 5:13),( 5:14),( 5:15),( 5: 0),( 5: 1),( 5: 2),( 5: 3),( 5: 4),( 5: 5),( 5: 6),( 5: 7)]
[ ( 6: 8),( 6: 9),( 6:10),( 6:11),( 6:12),( 6:13),( 6:14),( 6:15),( 6: 0),( 6: 1),( 6: 2),( 6: 3),( 6: 4),( 6: 5),( 6: 6),( 6: 7)]
[ ( 7: 8),( 7: 9),( 7:10),( 7:11),( 7:12),( 7:13),( 7:14),( 7:15),( 7: 0),( 7: 1),( 7: 2),( 7: 3),( 7: 4),( 7: 5),( 7: 6),( 7: 7)]
[ ( 8: 0),( 8: 1),( 8: 2),( 8: 3),( 8: 4),( 8: 5),( 8: 6),( 8: 7),( 8: 8),( 8: 9),( 8:10),( 8:11),( 8:12),( 8:13),( 8:14),( 8:15)]
[ ( 9: 0),( 9: 1),( 9: 2),( 9: 3),( 9: 4),( 9: 5),( 9: 6),( 9: 7),( 9: 8),( 9: 9),( 9:10),( 9:11),( 9:12),( 9:13),( 9:14),( 9:15)]
[ (10: 0),(10: 1),(10: 2),(10: 3),(10: 4),(10: 5),(10: 6),(10: 7),(10: 8),(10: 9),(10:10),(10:11),(10:12),(10:13),(10:14),(10:15)]
[ (11: 0),(11: 1),(11: 2),(11: 3),(11: 4),(11: 5),(11: 6),(11: 7),(11: 8),(11: 9),(11:10),(11:11),(11:12),(11:13),(11:14),(11:15)]
[ (12: 8),(12: 9),(12:10),(12:11),(12:12),(12:13),(12:14),(12:15),(12: 0),(12: 1),(12: 2),(12: 3),(12: 4),(12: 5),(12: 6),(12: 7)]
[ (13: 8),(13: 9),(13:10),(13:11),(13:12),(13:13),(13:14),(13:15),(13: 0),(13: 1),(13: 2),(13: 3),(13: 4),(13: 5),(13: 6),(13: 7)]
[ (14: 8),(14: 9),(14:10),(14:11),(14:12),(14:13),(14:14),(14:15),(14: 0),(14: 1),(14: 2),(14: 3),(14: 4),(14: 5),(14: 6),(14: 7)]
[ (15: 8),(15: 9),(15:10),(15:11),(15:12),(15:13),(15:14),(15:15),(15: 0),(15: 1),(15: 2),(15: 3),(15: 4),(15: 5),(15: 6),(15: 7)]]
```

* [Nvidia] Enable TMA im2col mode - Fix tma load op (#9303)

# Summary
This is the third PR in a series that enables TMA im2col mode (in
addition to the existing tiled mode) for NVIDIA GPUs. The goal of the
series is to support TMA im2col mode in Gluon DSL.

- First PR: https://github.com/triton-lang/triton/pull/9202
- Second PR: https://github.com/triton-lang/triton/pull/9225
- -> Third PR: https://github.com/triton-lang/triton/pull/9303

PTX ISA documentation for TMA im2col mode:
https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-im2col-mode
TMA tensor descriptor documentation:
https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html

# Summary of changes
This PR fixes the AsyncTMACopyGlobalToLocalOp:
- Delete the TTNG_TensorModeAttr, since we can infer the descriptor mode
from the desc type
- Remove TTNG_TensorModeAttr from TTNG_AsyncTMACopyGlobalToLocalOp
- Fix the op verification and lit test accordingly


* [BACKEND][AMD] Add option to swap MIR (#8711)

We use TRITON_SWAP_MIR to specify the path to an externally provided MIR
file to be used for compilation to ASM in the LLVM backend, overriding
the one generated by the regular pipeline.

This can be used to apply custom transformations on MIR and use the
result to generate the binary.

* [BACKEND][AMD] Use addOccurrence for LLVM options to properly disable schedulers (#9311)

Use addOccurrence instead of setValue when setting LLVM command-line
options like enable-misched=false. This is necessary because LLVM's
scheduler passes check getNumOccurrences() to determine if the option
was explicitly set on the command line (see
llvm/lib/CodeGen/MachineScheduler.cpp).

Also add test to verify the MIR swap pipeline starts before
machine-scheduler and that both machine-scheduler and post-RA scheduler
are disabled.

* [PROTON][BLACKWELL] Enable low-overhead hardware trace (#9307)

* Increment index in plugin example (#9315)

In the Triton Plugin example, the plugin pass enumerator does not
increment the index into `passNames`. If this example were used
elsewhere with multiple passes, this would overwrite the first returned
pass name. It's a minor tweak and this is just an example, but it seems
best to fix this prior to this getting heavy use.

* [triton] Add `tl.cat(can_reorder=False)` implementation (#9312)

This resurrects the old PR that replaced the implementation entirely. I
also fixed `tl.cat` to be equivalent in semantics to `torch.cat`.

* [BACKEND][AMD] Add ScopedNoAliasAAWrapperPass to MIR swap pipeline (#9309)

* Fix tolerance for float8 x mx combinations in matmul tests (#9316)

Increases tolerance from 3e-2 to 2e-1 for float8 x mx combinations to
account for compounded quantization errors.

* [triton_kernels] Unfuse fma in reduce kernel for numeric stability (#9320)

* [Nvidia] Enable TMA im2col mode - driver support (#9305)

# Summary
This is the fourth PR in a series that enables TMA im2col mode (in
addition to the existing tiled mode) for NVIDIA GPUs. The goal of the
series is to support TMA im2col mode in Gluon DSL.

- First PR: https://github.com/triton-lang/triton/pull/9202
- Second PR: https://github.com/triton-lang/triton/pull/9225
- Third PR: https://github.com/triton-lang/triton/pull/9303
- -> Fourth PR: https://github.com/triton-lang/triton/pull/9305

PTX ISA documentation for TMA im2col mode:
https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-im2col-mode
TMA tensor descriptor documentation:
https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html

# Summary of changes
This PR adds the driver function for creating the tensor descriptor for
TMA im2col mode. The driver function can be found in the doc:
https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html


* [PROTON] Fix default value for TRITON_ENABLE_HW_TRACE in CuptiProfiler (#9324)

* [AMD][gfx1250] Update f16 GEMM examples (#9326)

* Factored out common utilities
* Added support for L2 prefetch

Co-authored-by: Alexander Weinrauch <Alexander.Weinrauch@amd.com>
Co-authored-by: Jungwook Park <jungwook.park@amd.com>
Co-authored-by: Xiaohu Guo <Xiaohu.Guo@amd.com>

* [Backend] Bump LLVM to 2eb709b95d8f (#9264)

Bumps LLVM to `2eb709b95d8f` to consume 2 fixes related to `gfx1250`:

- https://github.com/llvm/llvm-project/pull/176206
- https://github.com/llvm/llvm-project/pull/176355

---------

Co-authored-by: neildhar <neildhar@users.noreply.github.com>
Co-authored-by: Lei Zhang <antiagainst@gmail.com>

* [triton_kernels] Add unpadded batch size handling to reduce (#9332)

* [AMD] Use v_perm instruction for convert_layout acceleration (#9014)

This PR introduces an AMD-specific ttg->llvm pattern which uses v_perm
instructions instead of combinations of shifts and logical operations.

Limitations of this pattern:

- Applied only for 8 bit data types;
- Conversion required to be bijective;
- No permutation across threads in workgroup.

---------

Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>

* [Kernels] Re-enable 4-warp persistent kernel (#9331)

Looks like some of the interceding changes have removed the code
sequence that was causing issues with ptxas. I also had to disable
`warp_specialize`, as I'm getting an invalid IR error from inside the
pass.

* Partial revert of [AMD][NFC] cleanup llvm instrinsic calls; replaced with rocdl ops (#9334)

`getClusterCTAId` should return the local CTA id in the cluster.
https://github.com/triton-lang/triton/pull/9222 replaced it with the
overall cluster id which is wrong.

I could not find a ROCDL Op exposing the workgroup id inside the cluster
so this PR reverts it back to the intrinsic.

* Revert "Fix tolerance for float8 x mx combinations in matmul tests (#… (#9338)

…9316)"

This reverts commit bcbcabdd0cff6539c7168299075992b2a23ff38e.

* [AMD] NFC: Commonise some checks into functions. (#9335)

This makes feature checks a bit tidier

* [FRONTEND] Add option to round f32 to tf32 in descriptor (#9295)

Add an option to round float32 inputs to tf32 on the fly.

* [AMD][Gluon] Warp-pipeline fixes and priority hint (#9301)

These changes sync internal warp-pipeline fixes into upstream to align
correctness (warp size), configurability (priority hints), and safety
(dependency analysis / barrier placement) across Gluon and LLVM lowering
paths.

- Correct warp size handling in the warp-pipeline per the given AMDGPU
HW arch.
- Introduce optional priority hints on warp-pipeline stages, enabling
more explicit scheduling intent for stage ordering/perf tuning.
- Fix a bug in warp-pipeline dependency analysis.

* [AMD][gfx1250] Skip dot tests for bf16x3 and bf16x6 types (#9343)

* Apply rounding to TF32 in MoE (#9296)

* [AMD][BACKEND] Fix TDM shape adjustment to factor in the CGA offset (#9341)

TDM needs to adjust the shape in the tensor descriptor to account for
the tile offset when computing OOB. However, we did not take the CGA
offset into account.

I adjusted the test to overallocate the buffers so the test can catch
out of bounds writes and we do not rely on a random segfault to detect
the OOB write.

* [AMD][gluon][gfx1250] Add tensor async gather support using TDM (#9313)

Implements tensor async_gather using TDM in a similar fashion to
https://github.com/triton-lang/triton/pull/9299 on Gluon.

* [AMD] Replace ReorderInstructions with MoveUpPrologueLoads (#9328)

After the recent changes, the ReorderInstructions pass had only one
optimization left: moving prologue loads early for prefetching. Add a
new pass for that optimization, refactor the implementation, and add
more tests.

* [AMD] Support TDM AsyncWait in UpdateAsyncWaitCount (#9352)

Since TDM scatter and gather might produce more than one
intrinsic/assembly instruction, we need to compute the correct waitcnt
based on the number of intrinsics instead of the number of TTGIR ops.
This piggy-backs on the analysis pass we already use for
`ttg.async_copy_global_to_local`.
This allows us to use the number of TTGIR TDM ops in Gluon and also
works with the token-based approach from the pipeliner.
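
The counting change can be sketched like this (op names and intrinsic counts are made up for illustration; the real analysis walks TTGIR ops):

```python
# Hypothetical intrinsic counts per TTGIR op kind: a TDM gather may expand
# to several machine instructions while a plain async copy expands to one.
INTRINSICS_PER_OP = {"async_copy_global_to_local": 1, "tdm_gather": 4}

def waitcnt_for(ops_in_flight):
    """Count outstanding machine-level memory instructions, not TTGIR ops."""
    return sum(INTRINSICS_PER_OP[op] for op in ops_in_flight)
```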

* [AMD][BACKEND] Support multi-CTA for `AMDWmmaEncodingAttr` (#9340)

Adjusts the verifier of DotOperandEncoding to allow for different
CGALayouts between the accumulator and the operands. The dot verifier
ensures that the accumulator does not broadcast and that A broadcasts
along dim1 and B along dim0.

If this turns out to be too restrictive, e.g. if the case comes up where
we want to do duplicated work on CTAs, we can relax this later.

* [Gluon] Refactor multi-CTA support of PaddedSharedLayouts (#9336)

Refactors PaddedSharedLayouts to make them work nicer in multi-CTA.

- Instead of `block_bases` it now holds the `CGALayout` in Gluon.
- `with_identity_for` has to use the `shape_per_cta` to properly build
the identity `linear_component`
- `get_padded_shared_layout` should multiply the CTA and CGA layout
- Adjust the verifier to broadcast in the `block` dimension. We still do
not support broadcasts in the `offset` dimension

* [AMD] Introduce PartitionedSharedEncodingAttr (#9314)

This PR introduces PartitionSharedEncodingAttr.

An encoding for tensors whose elements are partitioned across multiple
separate shared memory allocations. This reduces shared memory partition
conflicts by splitting a tensor along a specific dimension into separate
allocations.

Parameters:
- numPartitions: Number of distinct memory partitions (and separate
  buffers). Buffers in different partitions MUST be placed in different
  physical shared memory slots.
- numGroups: Number of groups. Each group contains numPartitions
  consecutive pieces of the tensor.
- partitionDim: The dimension along which the tensor is partitioned.
- partitionLayout: The shared memory layout used within each piece.

The total number of logical pieces is numPartitions * numGroups.
Pieces are organized as:
[Group 0: pieces 0..numPartitions-1]
[Group 1: pieces numPartitions..2*numPartitions-1]
...

## Memory Allocation

The allocator creates numPartitions buffers (NOT numLogicalPieces
buffers). Each buffer contains all pieces from all groups that belong
to that partition, concatenated together. Buffer size = pieceSize *
numGroups.

For example, with numPartitions=2, numGroups=4 and partitionDim=0 on a
[128, 32] tensor:
- Total 8 logical pieces, each of size [16, 32] (pieceSize = 16*32
  elements)
- Piece layout: [0, 1, 2, 3, 4, 5, 6, 7]
- Partitions:    0  1  0  1  0  1  0  1
- Groups:       |Grp0||Grp1||Grp2||Grp3|

Physical allocation (2 buffers, each containing 4 pieces):
- Buffer 0 (Partition 0): [Piece0 | Piece2 | Piece4 | Piece6]
- Buffer 1 (Partition 1): [Piece1 | Piece3 | Piece5 | Piece7]

Physical allocation guarantee: Buffers in different partitions MUST
reside in distinct physical shared memory partitions.

TODO: Implement lowering of operations with PartitionSharedEncodingAttr.
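
The piece-to-buffer mapping described above can be sketched in a few lines of Python (illustrative only, not the allocator's actual code):

```python
def piece_location(piece, num_partitions):
    """Map a logical piece index to (partition buffer, slot inside buffer).
    Pieces alternate across partitions; the group index is the slot."""
    return piece % num_partitions, piece // num_partitions

def buffers(num_partitions, num_groups):
    """Recreate the physical allocation: one buffer per partition, each
    holding numGroups pieces concatenated in group order."""
    bufs = [[] for _ in range(num_partitions)]
    for piece in range(num_partitions * num_groups):
        part, _slot = piece_location(piece, num_partitions)
        bufs[part].append(piece)
    return bufs
```

With `numPartitions=2, numGroups=4` this reproduces the piece layout from the example above.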

---------

Co-authored-by: Ognjen Plavsic <plognjen@amd.com>

* [AMD][GLUON] Add multi-CTA GEMM example for `gfx1250` (#9342)

Adds a multi-CTA GEMM example for `gfx1250` where we multicast A and B
sub-tiles which are shared by CTAs of the cluster. `ctas_per_cga` is
used to control the cluster size and how we partition the accumulator
across CTAs.

 This PR requires the following PRs to pass its correctness tests:
- https://github.com/triton-lang/triton/pull/9341
- https://github.com/triton-lang/triton/pull/9340
- https://github.com/triton-lang/triton/pull/9336
- https://github.com/triton-lang/triton/pull/9334

* [AMD] Support 8-Warp Pingpong and Refactor MXGEMM Kernel on GFX1250 (#9356)

This PR:
- Refactors the MXGEMM kernel to support various schedules
- Supports 8-warp scheduling and 8-warp pingpong scheduling

---------

Co-authored-by: Lei Zhang <antiagainst@gmail.com>

* [TRITON_KERNELS] some more tweaks (#9350)

* [AMD][GLUON] Support AutoLayout for `offsets` and `mask` in `buffer_store` (#9353)

`buffer_store` can infer the layout of `mask` and `offsets` based on the
layout of `stored_value`. Currently we only infer from `offsets` to
`mask`, which is unnecessarily restrictive.

* [docs] Add topk operation to language documentation (#9345)

## Summary
- Add `topk` to Scan/Sort Ops section in `triton.language.rst`
- Add docstring to `topk` function describing parameters, return value,
and usage example

## Motivation
Fixes #9278 

The `topk` operation is implemented and exported in `triton.language`
but was not mentioned in the API documentation at
https://triton-lang.org/main/python-api/triton.language.html

## Changes
1. **docs/python-api/triton.language.rst**: Added `topk` entry under
Scan/Sort Ops section
2. **python/triton/language/standard.py**: Added comprehensive docstring
to `topk` function including:
   - Description of functionality
   - Parameter documentation
   - Return type
   - Usage example
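
For reference, the semantics being documented can be modelled on the CPU as follows (an illustrative model only; exact tie-breaking and the real `tl.topk` signature may differ):

```python
def topk_reference(values, k, descending=True):
    """CPU model of topk semantics: the k largest (or smallest) values
    along the last axis, together with their indices."""
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=descending)[:k]
    return [values[i] for i in order], order
```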

## Testing
Documentation builds correctly with the new entry.

* [AMD][BACKEND] Support padding in TDM store if interval equals the inner dimension (#9360)

TDM store does not support padding. However, we can adjust the tile dim
to include padding and shrink the tensor shape to drop stores of
padding bytes.

This only works for a single padding interval which equals the inner
dimension. Other padding configs are not supported.

* [AMD][gfx1250] Add clamp operand to WMMA int intrinsic (#9359)

* reland upgrade ptxas for blackwell (#9363)

* [AMD][gfx1250] Roll up some small changes (#9365)

* Register gfx1250 to various target capability queries
* Tighten some error checking logic
* Adjust tests for gfx1250

* Revert "[AMD] Introduce PartitionedSharedEncodingAttr" (#9367)

Reverts triton-lang/triton#9314 causing functional regression in fence
insertion pass

* [PROTON] Vector metrics (#9329)

* Fix static srcMap bug causing multi-arch FP8 compilation failures (#9364)

Compiling FP8 kernels for multiple GPU architectures in the same process
fails with "failed to legalize operation 'tt.fp_to_fp'".

The bug is in ElementwiseOpToLLVM.cpp's
`FpToFpOpConversion::getConversionFunc()`. The srcMap that maps
(srcType, dstType, roundingMode) to converter functions was declared
`static`, causing it to be initialized once with the first
architecture's isaFamily. Subsequent architectures would reuse the wrong
converters.

Example failure scenario:
1. Compile FP8 kernel for gfx950 (CDNA4) first
   - static srcMap initialized with CDNA4 converters (e.g.,
     Fp8E4M3FNUZ_to_Fp16)
2. Compile same kernel for gfx942 (CDNA3)
   - srcMap already initialized, returns CDNA4 converters
   - CDNA3 lacks CDNA4-specific instructions, compilation fails

The performance impact of removing static is negligible, since the map
has only ~25 entries and is built per-conversion (not per-instruction).
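
The failure mode maps onto this Python sketch (names are illustrative; the real table lives in ElementwiseOpToLLVM.cpp):

```python
_SRC_MAP = {}  # module-level state, analogous to the C++ function-local `static`

def get_converter_buggy(isa_family):
    # Bug analogue: the table is built only on the first call, for the first
    # ISA seen, and then silently reused for every later ISA.
    if not _SRC_MAP:
        _SRC_MAP["fp8_to_fp16"] = "converter_for_" + isa_family
    return _SRC_MAP["fp8_to_fp16"]

def get_converter_fixed(isa_family):
    # Fix analogue: rebuild the (small, ~25-entry) table per conversion.
    src_map = {"fp8_to_fp16": "converter_for_" + isa_family}
    return src_map["fp8_to_fp16"]
```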


# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a
comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [x] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [ ] This PR does not need a test because `FILL THIS IN`.

- Select one of the following.
  - [x] I have not added any `lit` tests.
- [ ] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code
    and using the instructions it generates is not minimal.)

* [TRITON_KERNELS] added global scale to reduction (#9372)

* [AMD][gfx1250][gluon] TDM: Fix OOB handling for Scatter/Gather (#9371)

Fixes the incorrect offset handling when the tile size `N` is not a
multiple of `BLOCK_N` and the last block of a TDM Scatter/Gather is
therefore a partial block.

This happened because unlike in `fillTDMDescriptor` for standard TDM,
the tensor shapes weren't updated based on offsets within
`fillTDMDescriptorForGatherScatter` for gather/scatter.
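
The general idea of the fix can be sketched like this (a hypothetical helper; the real shape update happens inside `fillTDMDescriptorForGatherScatter`):

```python
def block_extent(N, BLOCK_N, block_idx):
    """Extent of one block along a dimension of size N: the full BLOCK_N
    for every block except possibly the last, which is clamped so the
    transfer stays in bounds."""
    return max(0, min(BLOCK_N, N - block_idx * BLOCK_N))
```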

* [AMD][GLUON] Add 4 and 8 warps stream-k gluon kernels for gfx1250 (#9370)

- add 8 warp pingpong stream-k kernel implementation
- persistent loop prefetch to enable overlap between prologue and
epilogue
- implement 4 warp stream-k specific partial tile processing with atomic
based spinning locks

* Consider rematerialisation cost when hoisting over ext (#9194)

Avoid hoisting converts if the cost of rematerialising the slice and the
new convert is higher than the original convert. To do this, factor out
the cost computation from `backwardRematerialization` into
`isRematBeneficial` and call it from `hoistConvertOnTopOfExtOrBroadcast`.

The added test demonstrates a simple example where the hoisted convert
is more expensive than the original.

It is also possible to extend this to the other hoisting functions,
although it requires slightly more care (e.g. a convert hoisted into a
conditional should probably carry less cost).
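
The decision rule amounts to something like this (costs are abstract units here; `isRematBeneficial` in the pass derives them from the actual IR):

```python
def should_hoist(cost_original_convert, cost_remat_slice, cost_hoisted_convert):
    """Hoist a convert over an ext/broadcast only if rematerialising the
    slice plus the hoisted convert is no costlier than the convert removed."""
    return cost_remat_slice + cost_hoisted_convert <= cost_original_convert
```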

* Fix assert in `_p_matmul` (#9376)

The current check prevents calling the persistent matmul on MXFP8,
thought per the comment it is intended to block MXFP4 weights only.

* [Backend] Bump to llvm/llvm-project@ac5dc54d5091 (#9333)

* Reapply "[AMD] Introduce PartitionedSharedEncodingAttr" (#9367) (#9374)

The initial patch was reverted because ProxyFenceInsertion.cpp still
used getBufferIds instead of getAllBufferIdsWithAliases. Previously,
getBufferIds returned buffer IDs for both a value and its aliases. With
partitioned tensor support, a single value can now hold multiple
buffers, so the API was split: getBufferIds returns only the buffer IDs
for a value, while getAllBufferIdsWithAliases includes aliased buffer
IDs as well.

Co-authored-by: Ognjen Plavsic <plognjen@amd.com>

* [AMD] Support PaddedSharedLayout in TDM Gather on GFX1250 (#9369)

This PR supports `PaddedSharedLayout` in TDM Gather for limited cases.

* [AMD][gfx1250] test_dot fix for small K (#9358)

* [ConSan] First pass at improving ConSan compile times (#9366)

When compiling `01-attention-forward.py` with consan enabled, before:

```
# times.ir_initialization=790100      0.79 sec
# stage='ttgir' duration=182893       0.18 sec
# stage='llir' duration=99643940      99.6 sec
# stage='llvmir' duration=27658211    27.7 sec
# stage='ptx' duration=11120513       11.1 sec
# stage='cubin' duration=1149355658   19.15 min
```

After:

```
# times.ir_initialization=796533      0.79 sec
# stage='ttgir' duration=184720       0.18 sec
# stage='llir' duration=16957192      17.0 sec
# stage='llvmir' duration=3735579     3.74 sec
# stage='ptx' duration=1972309        1.97 sec
# stage='cubin' duration=34357675     34.36 sec
```

This PR does quite a number of things at once:

* Custom CanonicalizeLLVMIR pass that adds a pattern for `select
%false|%true, %a, %b` since LLVM dialect is missing this (and is opposed
to adding it)
* Cache global constants to avoid creating many copies of the same
string when lowering asserts
* Fix warp specialize lowering to handle function calls and deduplicate
barrier lowering code between NVIDIA and AMD backends. To support
function calls, non-kernel functions are rewritten to accept a barrier
handle argument that is passed down from the call site
* Rewrite `createMultiColumnMask` to generate a constant tensor rather
than computing it from a bunch of `make_range` and masking. This single
function was generating gigabytes of IR
* Pick warp-local layouts in consan instrumentation. Previously, consan
used thread-local layouts where every thread has a copy of the tensor.
This was to avoid using shared memory. We can switch to warp-local
layouts where each warp has a copy of the tensor distributed across its
threads to reduce the generated IR (and register usage) by a factor 32,
plus some extra IR needed for shuffles.
* To support warp-local layouts, I added a `uniform` flag to
`tt.assert` and replaced `tti.experiment_assert_in_thread` with a
`tt.reduce` + `tt.assert uniform`. Uniform just means only the first
thread in the warp group will trigger the assert since the condition is
uniform.
* I added an `always_use_warp_shuffle` function-level flag to force
`convert_layout` lowering to use warp shuffles even when the performance
heuristic picks shared memory, avoiding shared memory for the layout
conversions inside consan helpers
* Changed the lowering of `arith.constant` with a non-splat dense
elements attribute to generate a constant global array from which each
thread loads
* Moved the global stores in the main function into a helper function to
reduce bloat by deduplicating. This also enables separate compilation
later.
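
The global-constant caching item, for instance, boils down to interning (a hedged sketch; the real code deduplicates LLVM global string constants):

```python
class ConstantPool:
    """Intern global constants so each distinct value (e.g. an assert
    message string) is emitted once instead of once per use site."""
    def __init__(self):
        self._pool = {}

    def intern(self, value):
        # Return the existing symbol for this value, or mint a new one.
        return self._pool.setdefault(value, "@global_%d" % len(self._pool))
```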

* [AMD][BACKEND] Fix shared layout order for AsyncCopy on GFX9 (#9373)

The heuristic used `getElementsPerThread` instead of
`getContigPerThread` to retrieve the contiguity. It also has to clamp
the `vecSize` to a width supported by the hardware to see if we need to
carry over the `regOrder`.

Before this PR, for the added lit tests we selected an order for the
shared layout which does not allow for coalesced direct-to-lds writes.

* [AMD][GFX1250] Roll up some minor changes (#9380)

* Add pattern to optimize tanh
* Fix asserts for WMMA
* Fix some TDM field updates

* [AMD][BACKEND] Fix RangeAnalysis tripCount calculation (#9383)

We currently check whether the `optional<bool>` holds a value instead
of checking the actual held value to decide if we take the upper or the
lower limit. If the optional is empty we take the minimum to preserve
the behavior from before this PR.
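
A Python model of the bug and the fix (illustrative only; the real code works on a C++ `optional<bool>`):

```python
def trip_limit(upper, lo, hi):
    """`upper` models an optional<bool>. The bug tested whether the optional
    held a value at all; the fix tests the held value, taking the minimum
    when the optional is empty to keep the old behaviour."""
    if upper is None:
        return lo
    return hi if upper else lo
```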

Note that the first adjusted lit test only tests whether we fold an
`scf.if` op after the `scf.for` loop, so the changes in the loop are not
removing the actual test.

* [AMD][GLUON] Add CGA Layout to wmma scale (#9381)

Attach CGA Layout for scale's linear layout. This is needed when we use
wmma scaled in multi-cta kernels.

* [AMD][gluon] Add gfx1250 warp‑pipeline f16 GEMM example (#9382)

Provide a warp‑pipeline TDM GEMM kernel with shared‑memory buffering.

* add a second gb200 node to CI (#9384)

* [BACKEND] Improve and simplify ReduceOp's lowering (#9219)

Stacked PRs:
 * #9327
 * #9318
 * #9317
 * #9221
 * #9220
 * __->__#9219


--- --- ---

### [BACKEND] Improve and simplify ReduceOp's lowering


We implement a LinearLayout-based `ReduceOp` lowering. This has a
number of benefits:

- The logic is noticeably simpler as we barely have to implement
anything. ConvertLayout and some LL helpers do all the heavy lifting
- We get shmem swizzling for free
- We sometimes save a shmem round-trip (before we did it
unconditionally)
- It is now clear that we have a `tmpLl` variable we can carefully
choose (we'll do so in a future PR)
- It opens the door to returning an arbitrary layout (fusing a
`convert_layout` into this op)
- It is now really simple to generalise this op to perform cross-cluster
reductions, provided that `convert_layout` supports them.
- We fix some latent issues the previous implementation had when run on
arbitrary linear layouts. We add a funky regression test that used to
fail and now passes.
- All this while being LOC-neutral!

In future PRs we will improve the choice of `tmpLl` to avoid the last
`convert_layout` in many cases, and we will pack the inputs in shmem to
be able to vectorize the load/stores for full reductions with multiple
inputs.

This PR was the result of quite a long (but rather successful)
vibe-coding session together with `gpt-5.2-codex`. I found it
particularly useful to be able to emit a ConvertLayout within this
lowering rather than having to call the lowering of the function
manually. This simplifies the code quite a bit and I would have
struggled to convince MLIR to do so myself.

* [BACKEND] Perform tree reductions on in-thread values (#9220)

Stacked PRs:
 * #9327
 * #9318
 * #9317
 * #9221
 * __->__#9220


--- --- ---

### [BACKEND] Perform tree reductions on in-thread values


We generate ternary trees for suitable integer ops and binary trees for
everything else.

We manually generate `{add,mul}.{f16,f32}x2` ops. This brings a speed-up
to some gluon attention kernels.
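
The shape of the reduction can be sketched as follows (a model of the tree structure only, not the emitted PTX; the `{add,mul}.{f16,f32}x2` packing is a separate, manual step):

```python
def tree_reduce(values, op, arity=2):
    """Reduce in a balanced tree (binary or ternary) instead of a serial
    chain, shortening the dependency chain seen by the scheduler."""
    level = list(values)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), arity):
            group = level[i:i + arity]
            acc = group[0]
            for v in group[1:]:
                acc = op(acc, v)
            nxt.append(acc)
        level = nxt
    return level[0]
```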

* [Nvidia] Enable TMA im2col mode - LLVM lowering (#9322)

# Summary
This is the fifth PR in a series that enables TMA im2col mode (in
addition to the existing tiled mode) for NVIDIA GPUs. The goal of the
series is to support TMA im2col mode in Gluon DSL.

- First PR: https://github.com/triton-lang/triton/pull/9202
- Second PR: https://github.com/triton-lang/triton/pull/9225
- Third PR: https://github.com/triton-lang/triton/pull/9303
- Fourth PR: https://github.com/triton-lang/triton/pull/9305
- -> Fifth PR: https://github.com/triton-lang/triton/pull/9322

PTX ISA documentation for TMA im2col mode:
https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-im2col-mode
TMA tensor descriptor documentation:
https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html

# Summary of Changes

Added LLVM lowering logic for `AsyncTMACopyGlobalToLocalOpConversion` to
support im2col mode.

## Im2col Mode Constraints

### pixelsPerColumn (non-contiguous dimension)
- **Maximum size**: 1024 elements
- **Corresponds to**: Spatial dimensions (N, D, H, W)
- **Block shape**: Restricted to match `shapePerCTA` (no splitting)
- **Rationale**: Avoids generating multiple TMA messages along spatial
dimensions, eliminating complex offset calculations that would depend on
input tensor shape and padding
- **Note**: 1024 is sufficient for most practical use cases

### channelsPerPixel (contiguous dimension)
- **Maximum size**: 256 elements, or swizzle byte size if swizzle is
enabled
- **Multiple messages**: Supported when channel dimension exceeds block
size
- **Offset application**: Only coord[0] (channel coordinate in PTX
order) receives non-zero offsets

## Key Implementation Details

1. **Offset application**: For im2col mode, only the channel dimension
receives non-zero offsets; spatial dimension offsets are always 0
(verified by assertion)

2. **Im2col offsets reversal**: Spatial offsets (e.g., `off_w`, `off_h`)
are reversed to match PTX/CUDA innermost-to-outermost ordering,
consistent with coordinate handling

3. **Alignment with tiled mode**: These constraints align with tiled
mode behavior used for GEMM operations




* [FPSAN] Introducing FpSan - floating point sanitizer (#9337)

FpSanitizer ensures that arithmetically equivalent kernels produce the
exact same results by replacing floating‑point ops with integer ones.
For every binary op we cast operands to int, perform the equivalent
integer op, then cast back to float. Unary ops are replaced with
identity. For dot ops we cast operands to int and perform a matmul in
registers.
The resulting kernel is of course numerically incorrect, but it should
be independent of operation order. If two implementations match under
fpsan, they are very likely algorithmically equivalent even if FP
results differ due to ordering/reassociation/etc.
Note about tensor memory: tcgen05_mma is lowered by loading operands
into registers and manually emulating MMA. Because tcgen05_mma can be
issued from a partition with 1 warp group while tmem load/stores require
4 or 8 warps, we replace all tmem uses with global scratch memory.
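
The bit-level replacement for a binary op can be modelled like this (a sketch of the idea; the actual pass rewrites MLIR ops):

```python
import functools
import struct

def _bits(x):
    return struct.unpack("<I", struct.pack("<f", x))[0]

def _as_float(b):
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]

def fpsan_add(a, b):
    """Model of the sanitizer's rewrite of a float add: operate on the raw
    bits as integers, then cast back. Numerically wrong on purpose, but
    exactly associative and commutative, hence order-independent."""
    return _as_float(_bits(a) + _bits(b))
```

Unlike real FP addition, reducing a list with `fpsan_add` gives the same bits in any order.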

* [KERNELS] enable swap_xw on blackwell for non-mx matmuls (#9390)

Helps significantly for ragged matmuls where the slice size is small;
otherwise, Blackwell will compile to mma.sync.

* [BACKEND] Implement support for cross-CTA tt.reduce (#9221)

Stacked PRs:
 * #9327
 * #9318
 * #9317
 * __->__#9221


--- --- ---

### [BACKEND] Implement support for cross-CTA tt.reduce


The title of this PR is a bit of a lie. Even though the lowering is now
implemented to support cross-CTA reductions, it depends on
`convert_layout` supporting them, and it doesn't currently support
LinearLayouts. We should generalise this one first and then enable it
here. We should also emit the correct cross-CTA barrier from
`targetInfo` in the case of cross-CTA memory reuse.

In this PR, we take the chance to also generalise the lowering to avoid
convert layouts whenever possible.

* [UT] Generalize Proton tests by providing `device` fixture (#9351)

The same approach as with other unit tests.

---------

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Co-authored-by: Keren Zhou <kerenzhou@openai.com>

* [Gluon] Add Cluster Launch Control (CLC) support for Blackwell GPUs (#9361)

This adds support for NVIDIA's Cluster Launch Control (CLC) feature on
Blackwell (SM100+) GPUs, enabling dynamic work distribution for
persistent kernels.

CLC allows running workers to cancel not-yet-launched clusters and take
over their work, improving load balancing when SM availability varies.

New Gluon API (triton.experimental.gluon.language.nvidia.blackwell.clc):
- try_cancel(result, mbar): Issue async CLC request to cancel a pending
cluster
- is_canceled(result): Check if cancellation succeeded (returns
non-zero)
- get_first_ctaid(result, dim): Get the canceled cluster's first CTA ID

MLIR ops added:
- ttng.clc_try_cancel: Lowers to clusterlaunchcontrol.try_cancel.async
PTX
- ttng.clc_is_canceled: Lowers to
clusterlaunchcontrol.query_cancel.is_canceled
- ttng.clc_get_first_ctaid: Lowers to
clusterlaunchcontrol.query_cancel.get_first_ctaid

All ops include SM100+ compute capability checks and emit errors on
older GPUs.

Tutorial included demonstrating CLC matmul achieving 92.5% of cuBLAS
performance on 8192x8192x8192 FP16 matrices.


---------

Co-authored-by: Peter Bell <peterbell10@openai.com>

* [ConSan] Use `-Ofc mid` for ConSan compilation  (#9394)

I tested a variety of `ptxas` flags and found the following (using
`01-attention-forward.py`)

```
         compile (sec)    execution (sec)
-O0         19               81
-O1         49               22
-O3         52               21
-Ofc max    19               82 
-Ofc mid    21               9
-Ofc min    21               9
-Ofc 0      19               21
```

Shockingly, `-Ofc mid|min` yield best total compilation+execution times
for consan.

* [TritonGPU] Run Gluon canonicalizer in LLVM lowering when ConSan is on (#9396)

Regular canonicalizer will rewrite layout conversions

* Add missing NVGPUToLLVMPass dependency (#9398)

Cmake can sometimes fail with the message below. NVGPUToLLVMPass has a
dependency on TritonGPUOpInterfacesIncGen.

```
FAILED: Compilers/triton/third_party/nvidia/lib/NVGPUToLLVM/CMakeFiles/NVGPUToLLVM.dir/NVGPUToLLVMPass.cpp.o 

In file included from /__w/1/s/src/Compilers/triton/third_party/nvidia/lib/NVGPUToLLVM/NVGPUToLLVMPass.cpp:11:
In file included from /__w/1/s/src/Compilers/triton/third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/Utility.h:8:
In file included from /__w/1/s/src/Compilers/triton/include/triton/Conversion/TritonGPUToLLVM/Utility.h:7:
In file included from /__w/1/s/src/Compilers/triton/include/triton/Conversion/TritonGPUToLLVM/TargetInfoBase.h:4:
In file included from /__w/1/s/src/Compilers/triton/include/triton/Conversion/MLIRTypes.h:5:
In file included from /__w/1/s/src/Compilers/triton/include/triton/Dialect/TritonGPU/IR/Dialect.h:11:
In file included from /__w/1/s/src/Compilers/triton/include/triton/Dialect/TritonGPU/IR/Attributes.h:6:
/__w/1/s/src/Compilers/triton/include/triton/Dialect/TritonGPU/IR/TritonGPUInterfaces.h:10:10: fatal error: 'triton/Dialect/TritonGPU/IR/OpInterfaces.h.inc' file not found
   10 | #include "triton/Dialect/TritonGPU/IR/OpInterfaces.h.inc"
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```

* [Nvidia] Enable TMA im2col mode - Gluon API (#9391)

# Summary
This is the sixth PR in a series that enables TMA im2col mode (in
addition to the existing tiled mode) for NVIDIA GPUs. The goal of the
series is to support TMA im2col mode in Gluon DSL.

- First PR: https://github.com/triton-lang/triton/pull/9202
- Second PR: https://github.com/triton-lang/triton/pull/9225
- Third PR: https://github.com/triton-lang/triton/pull/9303
- Fourth PR: https://github.com/triton-lang/triton/pull/9305
- Fifth PR: https://github.com/triton-lang/triton/pull/9322
- -> Sixth PR: https://github.com/triton-lang/triton/pull/9391

PTX ISA documentation for TMA im2col mode:
https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-im2col-mode
TMA tensor descriptor documentation:
https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html

# Summary of Changes

Added Gluon DSL frontend support for TMA im2col mode, completing the
end-to-end path from Python API to runtime descriptor creation.

- Extended `TensorDescriptor` with im2col fields (`mode`,
`elementStrides`, `pixelBoxLowerCorner`, `pixelBoxUpperCorner`) and
parameter validation
- Refactored `tensor_descriptor_type` into a base class with
`tensor_descriptor_type` (tiled) and `tensor_descriptor_im2col_type`
(im2col) subclasses
- Added `offsets` parameter to `async_copy_global_to_shared` for im2col
spatial offsets
- Updated specialization (`specialize.cc`) to emit
`tensordesc_im2col<..., input_rank=N>` signatures and type parsing
(`str_to_ty`) to reconstruct the correct types
- Updated driver to parse im2col signatures and call
`fill_tma_descriptor_im2col` at runtime
- Changed `amendFuncOp` to use `TensorDescInterface` instead of
`TensorDescType` so im2col descriptors also receive `tt.nv_tma_desc`
- Added end-to-end `test_tma_im2col` Gluon test


---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: peterbell10 <peterbell10@live.co.uk>

* [BACKEND] Support generic multi-cta convert_layouts (#9317)

Stacked PRs:
 * #9327
 * #9318
 * __->__#9317


--- --- ---

### [BACKEND] Support generic multi-cta convert_layouts


We generalise the swizzling algorithm to work with blocks and generalise
most of the memory lowerings to support layouts with blocks.

We remove the legacy lowering.

The generic swizzling algorithm for blocks might be fine, but we didn't
try to be super clever. There might be some perf left on the table. We
can look into this at a later point if it becomes relevant.

We also activate multi-CTA reductions in the process and test both
features there.

TODO: Add some funky tests that just test the `convert_layout`, not the
`convert_layout` within the reduction.
TODO: Check how to perform multiCTA barriers in AMD and perhaps merge
cluster barriers into ttg.barrier, predicate broadcasting blocks, etc.

* [Membar] Membar pass for clusters (#9318)

Stacked PRs:
 * #9327
 * __->__#9318


--- --- ---

### [Membar] Membar pass for clusters


The main invariant here is that:

Membar for CTAs only synchronises CTAs when their buffers did not
alias in the ttgir, but they alias after the Allocation pass

In other words, in Gluon, the user is in charge of manually
synchronising the buffers they declare.
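The invariant can be stated as a tiny predicate. The field names below (`vreg` for the buffer's ttgir identity, `offset`/`size` for its post-Allocation interval) are hypothetical; this is a minimal model of the rule, not the pass's actual data structures.

```python
def needs_cluster_sync(a: dict, b: dict) -> bool:
    """Sketch of the cluster-membar invariant: synchronise CTAs only
    when two buffers were distinct in ttgir ('vreg' differs) but their
    allocated shared-memory intervals overlap after the Allocation pass.
    """
    distinct_in_ir = a["vreg"] != b["vreg"]
    overlap = (a["offset"] < b["offset"] + b["size"]
               and b["offset"] < a["offset"] + a["size"])
    return distinct_in_ir and overlap
```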

For now, we always emit a full cluster barrier. We can improve this in
the future by emitting `mbarrier`s that just synchronise subsets of the
CTAs. For that we would need to be a bit more clever, as we would need
to allocate some `mbarrier`s, but the Allocation pass has already run...

We add a number of test cases with comments on which of them are
expected and which can be improved.

* [Membar] Fix non-trivial function smem offsets (#9327)

Codex rightly identified that we were not considering the offsets of
functions in our membar analysis at
https://github.com/triton-lang/triton/pull/9318#discussion_r2740883625

Codex then went on and fixed it and added a regression test.
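The bug can be sketched as an interval-overlap check that forgets the per-function base offset. The field names here (`func_base`, `offset`, `size`) are illustrative, not the analysis's real representation; the point is that local offsets must be shifted by the function's shared-memory base before comparing.

```python
def buffers_alias(buf_a: dict, buf_b: dict) -> bool:
    """Check whether two allocated buffers overlap in shared memory,
    accounting for the base offset of the function each lives in.
    Without adding 'func_base', two buffers at local offset 0 in
    different functions would falsely appear to alias.
    """
    a0 = buf_a["func_base"] + buf_a["offset"]
    b0 = buf_b["func_base"] + buf_b["offset"]
    return a0 < b0 + buf_b["size"] and b0 < a0 + buf_a["size"]
```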

* Add descending flag to topk (#9355)

* Remove TritonIR dependence on TritonGPUIR (#9392)

* [AMD] Update gfx1250 MXFP FA example kernel (#9414)

Include some recent optimizations for the MXFP FA kernel.

* [FPSAN] Fix fpsan crash with warp specialization + tmem (#9415)

FPSan replaces tensor memory with global scratch, and was missing
correct handling of passing global memory pointers to the
warp_specialize op.
Also, use the -Ofc mid ptx compilation mode for fpsan compilation.

* [AMD][NFC][BACKEND] Use ROCDL ops for (cluster) async load (#9410)

Replaces intrinsics for (cluster) async loads with ROCDL Ops.

* [AMD] Fix gpu::BarrierOp lowering for gfx1250 (#9416)

Update gpu::BarrierOp lowering for gfx1250.

* [AMD] Broadcasting in v_perm based conversion (#9354)

This PR enables support for broadcasted layouts in the AMD-specific
convert_layout lowering:
- Generalizes code that previously assumed the number of src and dst
values is equal.

---------

Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>

* [NVIDIA] Fully restore canSkipBarSync (#9281)

* [CI] Remove unused commands (#9419)

Calling `df` twice should not have any impact; this seems like an
artifact from somewhere else that should be removed?

* [CI] Remove use of `LLVM_ENABLE_TERMINFO` (#9423)

This is a follow up to #9419 from looking at `llvm-build.yml`. When
building LLVM, setting `LLVM_ENABLE_TERMINFO=OFF` results in a CMake
warning:

```
CMake Warning:
  Manually-specified variables were not used by the project:

    LLVM_ENABLE_TERMINFO
```

I searched for uses of this in LLVM and there are none today. The latest
reference I found in GitHub issues/PRs was from 2024. That makes me
relatively confident this is no longer used so I'm proposing we remove
it.

* [Cluster Membar] Add a cross-cta barrier at the end of the kernel (#9413)

If we have an outstanding read or write (I think it'll always be a
read) and we didn't have a TMEM deallocation, we should add a
cluster-level barrier for correctness.

* [AMD][Gluon] Implement 8 wave F16 FA PingPong kernel. (#9427)

Ping-pong is another way to implement a performant attention kernel,
allowing co-execution between two waves in the same SIMD. In this PR we
introduce an 8-wave PP/warp-pipelined kernel and related num_warp
changes. We also cleaned up the way to invoke/select different variants
of the attention kernel.

Signed-off-by: Stanley Winata <stanley.winata@amd.com>

* [AMD][gfx1250] Adjust (BLOCK) M/N in f16 gemm examples (#9421)

Co-authored-by: Alexander Weinrauch <Alexander.Weinrauch@amd.com>

* [Tutorial] Fix a ty…
