Add missing NVGPUToLLVMPass dependency #9398

Merged
ThomasRaoux merged 2 commits into triton-lang:main from
bmyerz0:dev/bmyerz0/nvgppputollvmpass-cmake
Feb 7, 2026

Conversation

Contributor

@bmyerz0 bmyerz0 commented Feb 7, 2026

CMake builds can sometimes fail with the message below, because NVGPUToLLVMPass has an undeclared dependency on TritonGPUOpInterfacesIncGen, which generates OpInterfaces.h.inc.

FAILED: Compilers/triton/third_party/nvidia/lib/NVGPUToLLVM/CMakeFiles/NVGPUToLLVM.dir/NVGPUToLLVMPass.cpp.o 

In file included from /__w/1/s/src/Compilers/triton/third_party/nvidia/lib/NVGPUToLLVM/NVGPUToLLVMPass.cpp:11:
In file included from /__w/1/s/src/Compilers/triton/third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/Utility.h:8:
In file included from /__w/1/s/src/Compilers/triton/include/triton/Conversion/TritonGPUToLLVM/Utility.h:7:
In file included from /__w/1/s/src/Compilers/triton/include/triton/Conversion/TritonGPUToLLVM/TargetInfoBase.h:4:
In file included from /__w/1/s/src/Compilers/triton/include/triton/Conversion/MLIRTypes.h:5:
In file included from /__w/1/s/src/Compilers/triton/include/triton/Dialect/TritonGPU/IR/Dialect.h:11:
In file included from /__w/1/s/src/Compilers/triton/include/triton/Dialect/TritonGPU/IR/Attributes.h:6:
/__w/1/s/src/Compilers/triton/include/triton/Dialect/TritonGPU/IR/TritonGPUInterfaces.h:10:10: fatal error: 'triton/Dialect/TritonGPU/IR/OpInterfaces.h.inc' file not found
   10 | #include "triton/Dialect/TritonGPU/IR/OpInterfaces.h.inc"
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
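The fix declares the missing TableGen dependency in CMake so the generated header exists before the library's sources compile. A sketch of what such a declaration looks like (illustrative only; the exact target layout and source list in the tree may differ):

```cmake
# third_party/nvidia/lib/NVGPUToLLVM/CMakeLists.txt -- illustrative sketch
add_triton_library(NVGPUToLLVM
  NVGPUToLLVMPass.cpp

  DEPENDS
  # Ensure tablegen'd headers such as OpInterfaces.h.inc are generated
  # before this library's sources are compiled.
  TritonGPUOpInterfacesIncGen
)
```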

New contributor declaration

  • I am not making a trivial change, such as fixing a typo in a comment.

  • I have written a PR description following these
    rules.

  • I have run pre-commit run --from-ref origin/main --to-ref HEAD.

  • Select one of the following.

    • I have added tests.
      • /test for lit tests
      • /unittest for C++ tests
      • /python/test for end-to-end tests
    • This PR does not need a test because it is fixing a CMakeLists.txt bug
  • Select one of the following.

    • I have not added any lit tests.
    • The lit tests I have added follow these best practices,
      including the "tests should be minimal" section. (Usually running Python code
      and using the instructions it generates is not minimal.)

@bmyerz0 bmyerz0 marked this pull request as ready for review February 7, 2026 08:47
@bmyerz0 bmyerz0 requested a review from ptillet as a code owner February 7, 2026 08:47
@ThomasRaoux ThomasRaoux merged commit 650b9d4 into triton-lang:main Feb 7, 2026
9 checks passed
adstraw added a commit to adstraw/triton that referenced this pull request Apr 16, 2026
* [PROTON] Metadata profiling with periodic flushing (#9236)

* Fix infinite rewrite loop in latest LLVM (#9249)

* [Kernels] Enable high occupancy persistent matmul (#9248)

Having independent work within a single SM allows the warp scheduler to
hide some of the bubbles in our pipelines.

This brings persistent bf16 x mxfp4 MoE from 2290 GBps -> 2640 GBps on
h200, or a 15% improvement.

* [CONSAN] Optimize compilation time (#9240)

Reduce compilation time with consan enabled by not emitting alias
matrices when the buffers trivially don't alias. This should also have a
positive impact on execution speed:
~15% compilation time reduction in tutorials/01-attention-forward.py,
and potentially an even bigger impact in cases where both smem and tmem
buffers never alias.

* Explicitly convert numpy array into scalar because it is required by numpy 2.4.0 (#9172)

Fixes #9171

Signed-off-by: Gregory Shimansky <gregory.shimansky@intel.com>

* [feat] add sentinel when async compile (#9251)

* [triton_kernels] Fix device descriptor allocator to keep a pool (#9259)

When used in a CUDA graph, the per stream tensor allocator could leave
dangling pointers in the CUDA graph. Suppose during graph capture, an 8
KB tensor is allocated as Triton scratch memory. It is recorded in the
CUDA graph. However, if the allocator needs to resize and allocates 16
KB, the previous tensor is decref'd and gc'd by Python. When replaying
the CUDA graph, the first 8 KB tensor that was captured is now dangling.
Fix this by keeping old allocs alive.
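A minimal sketch of the pooling idea (hypothetical names, not the actual driver code): instead of freeing the old buffer on resize, keep it alive in a pool so any pointer captured in a CUDA graph stays valid on replay.

```python
class ScratchAllocator:
    """Grow-only scratch allocator that never frees old buffers.

    Buffers captured in a CUDA graph must stay alive across replays,
    so resizing moves the previous allocation into `self.pool`
    instead of letting Python garbage-collect it.
    """

    def __init__(self, alloc_fn):
        self.alloc_fn = alloc_fn  # e.g. a device tensor allocator
        self.pool = []            # keeps old buffers (and their pointers) alive
        self.current = None

    def get(self, nbytes):
        if self.current is None or len(self.current) < nbytes:
            if self.current is not None:
                self.pool.append(self.current)  # do NOT free: a graph may replay it
            self.current = self.alloc_fn(nbytes)
        return self.current


alloc = ScratchAllocator(bytearray)  # bytearray stands in for a device tensor
a = alloc.get(8 * 1024)    # 8 KB buffer, "captured" by the graph
b = alloc.get(16 * 1024)   # resize: the old 8 KB buffer stays alive in the pool
assert alloc.pool[0] is a  # the captured buffer was not garbage collected
```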

* [AMD] Add clampf lowering via v_med3 (#9256)

`v_med3_*` selects the median of three inputs, or returns the minimum if
any of them is NaN.
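The mapping from clamp to a median is direct: for lo <= hi, clamp(x, lo, hi) equals the median of (x, lo, hi), which is what lets a single median-of-three instruction replace the min/max pair. A quick illustration (the hardware NaN behavior described above is not modeled here):

```python
def med3(a, b, c):
    # Median of three values: sort and take the middle one.
    return sorted((a, b, c))[1]

def clamp(x, lo, hi):
    return max(lo, min(x, hi))

# clamp(x, lo, hi) == med3(x, lo, hi) whenever lo <= hi
for x in (-5.0, 0.0, 0.3, 1.0, 7.0):
    assert med3(x, 0.0, 1.0) == clamp(x, 0.0, 1.0)
```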

* [PROTON] Further reduce unnecessary locks (#9257)

* [AMD] Added i8xi8xi32 v3 to wmma database (#9267)

* [AMD][gfx1250] Added missing f64.16x16x4.f64 to wmma database (#9271)

Added support for f64 case for GFX1250 arch.

Note:

- we are skipping `llvm.amdgcn.wmma.f32.32x16x128.f4` support because
there is no fp4 builtin type in Triton.
- `llvm.amdgcn.wmma.f32.16x16x128.f8f6f4` can be added to the WMMA
database as well. This requires refactoring; currently it is hard-coded
and bypasses the database.

* [ci] Pin pandas < 3.0 (#9273)

We're seeing test failures in proton, and it looks like a pandas update
is the cause. Let's pin the version for now.

* [BACKEND] Add support for out of tree Triton Dialect Plugins (#8523)

* Fix dotCanBeProperlyAsync when wgmma is not yielded by loop (#9274)

* [Backend][AMD] Change membarFilter to take bufferIDs into account (#9265)

When a local memory location is deallocated and later reused, a
`local_load` from the old allocation can still be in flight when a
subsequent `async_copy` writes into the reallocated memory; the
membarFilter previously filtered out the barrier that would prevent this
data hazard. We need to insert a `ttg.barrier local` to make sure the
local load has finished fetching the data.
Add Allocation to membarFilter to do that: only filter out a barrier
when we are sure that the local_load and the async op access the same
bufferID. If they access different bufferIDs, we shouldn't filter the
barrier.

* [Kernels] Don't flatten persistent hopper mixed precision matmul (#9279)

For some reason, overlapping the epilogue with the prologue of the next
tile is actually slower here. Disabling it gives a 150 GBps speedup on
h200, from 2650 GBps -> 2800 GBps.

My hypothesis is that because we use occupancy here, it allows the other
block in the SM to better utilise the tensor cores while we process the
epilogue. Similar to a ping-pong schedule but done by the warp
scheduler.

* [AMD][NFC] used IsFPClass MLIR op for checkIsNan (#9276)

The PR removes the explicit intrinsic call and replaces it with a
dedicated MLIR op.

* Use variadic argument pre-compiled cuda launcher (#6788)

* create new variadic launcher in driver.c  
* remove C string / string substitution logic to create a cuda kernel
from driver.py
* add logic to parse arguments & remove constexpr / flatten tuples in
the new launcher

The launch-overhead measurements using the scripts in the comments below
show no regressions.

* [NVIDIA,Membar] Resurrect NVIDIA::canSkipBarSync (#9246)

I was debugging Membar, and was confused why `triton-opt
--test-print-membar` gives different results than the normal compiler
pass. It turns out this `canSkipBarSync` filter is enabled in testing
mode but NOT in real compilation. It was removed during Blackwell
enablement, presumably because it's making assumptions about the kernel
structure.

This scales it back to something more conservative, and re-enables it in
normal compilation. With this I see an 18 GBps improvement in persistent
bf16 x mxfp4 MoE.

* [Docs] Clarify divisibility reset logic for contiguous dimensions in AxisInfo (#9266)

## What does this PR do?
This PR improves the code documentation in `AxisInfo.cpp`, specifically
within `MulIOp` visitor.

## Why is it important?
The logic `lhsDivisibility = 1` when `contiguity > 1` is a critical
safety guardrail. Without explicit explanation, it looks like an
arbitrary pessimization.

This change clarifies that:
1. Contiguous sequences (e.g., `range(0, N)`) inherently have a GCD of 1
across the dimension.
2. Propagating the base value's divisibility would lead to incorrect
alignment assumptions in downstream passes (like `Coalesce` or
`LoadStoreOp` vectorization), potentially causing illegal memory
accesses or miscompiled code.

This documentation helps future contributors understand the interaction
between `Contiguity` and `Divisibility` without needing to derive the
number theory from scratch.
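The guardrail is easy to verify numerically (a sketch, not the actual AxisInfo code): the GCD across a contiguous sequence's elements collapses to 1 regardless of how well aligned the base value is, so the base's divisibility cannot be propagated through the multiplication.

```python
from math import gcd
from functools import reduce

# A contiguous sequence like range(0, N) contains consecutive integers,
# so the GCD across the dimension is 1 even if the base is well aligned.
base = 128                               # divisibility of the start value
contig = [base + i for i in range(8)]    # contiguous: 128, 129, ..., 135
assert reduce(gcd, contig) == 1          # NOT 128: propagating 128 is unsound

# A strided (non-contiguous) sequence does keep the base's divisibility.
strided = [base * i for i in range(1, 9)]
assert reduce(gcd, strided) == base
```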

* Fix infinite loop in dotCanBeProperlyAsync (#9282)

The `checkOperand` traversal can run forever if either:
1. A block argument participates in a cycle containing only permitted
instructions.
2. A block argument is defined outside of `forOp`, in which case we
never advance transitiveOperand.

To fix (1), track the set of visited block arguments. If we visit the
same block argument again, that means that we are in a cycle originating
in the init value of the iter arg, which is outside the loop.

To fix (2), check for values defined outside the loop as we iterate.
This way, we know that if we are evaluating a block argument, it must be
an iter arg to the loop.
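The two fixes can be sketched in Python pseudostructure (hypothetical helper names, not the actual `checkOperand` code): tracking visited block arguments turns the potentially infinite walk into a terminating one, and checking for out-of-loop definitions stops the walk at external values.

```python
def trace_operand(value, defined_in_loop, incoming_of):
    """Walk a def-use chain backwards until leaving the loop.

    defined_in_loop(v) -> True if v is produced inside the loop
    incoming_of(v)     -> the value feeding block argument v from the
                          previous iteration (may form a cycle)
    Returns the first value defined outside the loop, or None when the
    chain is a pure iter-arg cycle (whose only external input is the
    iter arg's init value).
    """
    visited = set()
    while True:
        # Fix (2): a value defined outside the loop ends the walk.
        if not defined_in_loop(value):
            return value
        # Fix (1): revisiting a block argument means we closed a cycle.
        if value in visited:
            return None
        visited.add(value)
        value = incoming_of(value)

# A two-value cycle: %a <- %b <- %a, both defined inside the loop.
nxt = {"a": "b", "b": "a"}
assert trace_operand("a", lambda v: v in nxt, nxt.__getitem__) is None
```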

* [triton_kernels] fix test case for distributed routing kernels (#9258)

* [AMD][gfx1250] Add Assumptions and Fix Predicate in MXGEMM Kernel (#9285)

This PR
- added assumptions to the loop boundary
- fixed the predicate to eliminate readfirstlane instrs
- moved `static_profile` and `composition` to a shared location

* [AMD] Don't use s_waitcnt to lower global barrier for now (#9287)

Atomics tests are failing a small % of time because of an invalid
transformation in the backend.
This will be reverted once the backend is fixed and LLVM is bumped.

* Adding Triton community meetup notes for 20260106 (#9288)

Topics and discussions covered:
- Update on triton-shared (Haishan Zhu and Nhat Nguyen, Meta)
- Update on the plugin system infrastructure - what's upstream today and
roadmap (Puyan Lotfi and Corbin Robeck, Meta)
- Status of Triton Plugin repository and example: loop unfolding (Simon
Waters, kernelize.ai)

* [AMD][gfx1250] Enable some TDM features (#9283)

This pull request adds the basic plumbing to support
descriptor load/store. More changes are coming.

For now this allows running:

python3 -m pytest -s
test/unit/language/test_tensor_descriptor.py::test_make_tensor_descriptor_matmul[128-128-16-1-1]

* [Consan] Reduce number of configs in coverage tests (#9294)

* Revert "[AMD][gfx1250] Added missing f64.16x16x4.f64 to wmma database" (#9291)

This reverts https://github.com/triton-lang/triton/commit/53b0eafd76debe074965a5d751dd21c593097eb2.

* [AMD] Add PrepareIfCombining pass to enable scf.if combining (#9253)

Add a new pass that moves operations out from between scf.if pairs that
share the same condition. This enables the canonicalizer to combine
adjacent if operations.

Also, remove moveUpTranspose optimization from the ReorderInstructions
pass as we no longer need it with this new pass.

* [Backend] Separate out additive kReg smem padding contribution calculation (#9286)

LLVM was struggling to use immediate values in the padded offset
calculation. This change makes things easier for LLVM, and we get a nice
reduction in register usage. For non-padded pathways, there should be no
difference in the LLVM IR generated.
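The additive split can be illustrated with the classic padded-offset formula (a generic sketch of the idea, not the actual lowering): when the non-register part of the offset is a multiple of the padding interval, the padded offset decomposes into two independent additive pieces, so the constant piece can be folded into an immediate.

```python
def padded_offset(offset, interval, pad):
    # Classic shared-memory padding: insert `pad` extra elements
    # after every `interval` elements to avoid bank conflicts.
    return offset + (offset // interval) * pad

# If offset = const_part + reg_part and const_part is a multiple of
# `interval`, the padded offset splits into two additive terms:
interval, pad = 64, 8
const_part, reg_part = 128, 37  # const_part % interval == 0
lhs = padded_offset(const_part + reg_part, interval, pad)
rhs = padded_offset(const_part, interval, pad) + padded_offset(reg_part, interval, pad)
assert lhs == rhs  # 181 == 144 + 37
```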

* [ANALYSIS] Enhance divisibility handling in AxisInfo for addition and subtraction operations (#9297)

Unfortunately it's probably difficult to derive a very generic rule for
all ops in Triton, though these `affine` ops share many common
characteristics.
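For addition and subtraction the standard divisibility rule is the GCD of the operands' divisibilities; a quick numeric check of that rule (a sketch, not the AxisInfo implementation):

```python
from math import gcd

def add_divisibility(da, db):
    # If a is a multiple of da and b is a multiple of db, then a +/- b
    # is always a multiple of gcd(da, db), and nothing larger can be
    # assumed in general.
    return gcd(da, db)

assert add_divisibility(16, 8) == 8    # 16k + 8m is always a multiple of 8
assert add_divisibility(16, 16) == 16
assert add_divisibility(16, 3) == 1

# Spot-check against concrete values.
for a in range(0, 64, 16):
    for b in range(0, 64, 8):
        assert (a + b) % add_divisibility(16, 8) == 0
```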

* [Nvidia] Enable TMA im2col mode -- Tensor Descriptor (#9225)

# Summary
This is the second PR in a series that enables TMA im2col mode (in
addition to the existing tiled mode) for NVIDIA GPUs. The goal of the
series is to support TMA im2col mode in Gluon DSL.

- First PR: https://github.com/triton-lang/triton/pull/9202
- -> Second PR: https://github.com/triton-lang/triton/pull/9225

PTX ISA documentation for TMA im2col mode:
https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-im2col-mode
TMA tensor descriptor documentation:
https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html

# Summary of changes
- Add general `TT_TensorDescInterface`
- Add `TTNG_TensorDescIm2ColType` for Nvidia TMA im2col mode
- Add lit tests


# Note
The additional im2col parameters `elementStrides`,
`pixelBoxLowerCorner`, `pixelBoxUpperCorner`, `channelsPerPixel`, and
`pixelsPerColumn` will be passed to the tensor descriptor from the host
at runtime. They do not need to be in the kernel IR.

In im2col mode, `blockType` must be rank-2: [`channelsPerPixel` x
`pixelsPerColumn`]

---------

Co-authored-by: Thomas Raoux <thomas.raoux@openai.com>

* [AMD][gluon][gfx1250] Support tensor async scatter (#9299)

Supports tensor async scatter using TDM. Gluon only at the moment.

* [GLUON] Add get_view() to Gluon layouts (#9270)

Adds new method `get_view(shape, use_hw_view)` to all Gluon layout
classes. This can be called from outside a kernel context, useful in
development for inspecting / debugging layouts.

Exports new `gluon_ir.get_layout_view` function that creates a temporary
MLIR context and converts to a LinearLayout.

Returns the same output as the `triton-tensor-layout` CLI tool, but
without needing to convert to TTG IR. Tested by asserting the output is
the same as the CLI tool's.

## Example:

```python
layout = gl.BlockedLayout([2, 1], [1, 4], [1, 2], [1, 0])
print(layout.get_view([2, 8]))

[[T0:0, T1:0, T2:0, T3:0, T4:0, T5:0, T6:0, T7:0]
[ T0:1, T1:1, T2:1, T3:1, T4:1, T5:1, T6:1, T7:1]]

layout = gl.NVMMASharedLayout.get_default_for([16, 16], gl.float16)
print(layout.get_view([16, 16]))

[[( 0: 0),( 0: 1),( 0: 2),( 0: 3),( 0: 4),( 0: 5),( 0: 6),( 0: 7),( 0: 8),( 0: 9),( 0:10),( 0:11),( 0:12),( 0:13),( 0:14),( 0:15)]
[ ( 1: 0),( 1: 1),( 1: 2),( 1: 3),( 1: 4),( 1: 5),( 1: 6),( 1: 7),( 1: 8),( 1: 9),( 1:10),( 1:11),( 1:12),( 1:13),( 1:14),( 1:15)]
[ ( 2: 0),( 2: 1),( 2: 2),( 2: 3),( 2: 4),( 2: 5),( 2: 6),( 2: 7),( 2: 8),( 2: 9),( 2:10),( 2:11),( 2:12),( 2:13),( 2:14),( 2:15)]
[ ( 3: 0),( 3: 1),( 3: 2),( 3: 3),( 3: 4),( 3: 5),( 3: 6),( 3: 7),( 3: 8),( 3: 9),( 3:10),( 3:11),( 3:12),( 3:13),( 3:14),( 3:15)]
[ ( 4: 8),( 4: 9),( 4:10),( 4:11),( 4:12),( 4:13),( 4:14),( 4:15),( 4: 0),( 4: 1),( 4: 2),( 4: 3),( 4: 4),( 4: 5),( 4: 6),( 4: 7)]
[ ( 5: 8),( 5: 9),( 5:10),( 5:11),( 5:12),( 5:13),( 5:14),( 5:15),( 5: 0),( 5: 1),( 5: 2),( 5: 3),( 5: 4),( 5: 5),( 5: 6),( 5: 7)]
[ ( 6: 8),( 6: 9),( 6:10),( 6:11),( 6:12),( 6:13),( 6:14),( 6:15),( 6: 0),( 6: 1),( 6: 2),( 6: 3),( 6: 4),( 6: 5),( 6: 6),( 6: 7)]
[ ( 7: 8),( 7: 9),( 7:10),( 7:11),( 7:12),( 7:13),( 7:14),( 7:15),( 7: 0),( 7: 1),( 7: 2),( 7: 3),( 7: 4),( 7: 5),( 7: 6),( 7: 7)]
[ ( 8: 0),( 8: 1),( 8: 2),( 8: 3),( 8: 4),( 8: 5),( 8: 6),( 8: 7),( 8: 8),( 8: 9),( 8:10),( 8:11),( 8:12),( 8:13),( 8:14),( 8:15)]
[ ( 9: 0),( 9: 1),( 9: 2),( 9: 3),( 9: 4),( 9: 5),( 9: 6),( 9: 7),( 9: 8),( 9: 9),( 9:10),( 9:11),( 9:12),( 9:13),( 9:14),( 9:15)]
[ (10: 0),(10: 1),(10: 2),(10: 3),(10: 4),(10: 5),(10: 6),(10: 7),(10: 8),(10: 9),(10:10),(10:11),(10:12),(10:13),(10:14),(10:15)]
[ (11: 0),(11: 1),(11: 2),(11: 3),(11: 4),(11: 5),(11: 6),(11: 7),(11: 8),(11: 9),(11:10),(11:11),(11:12),(11:13),(11:14),(11:15)]
[ (12: 8),(12: 9),(12:10),(12:11),(12:12),(12:13),(12:14),(12:15),(12: 0),(12: 1),(12: 2),(12: 3),(12: 4),(12: 5),(12: 6),(12: 7)]
[ (13: 8),(13: 9),(13:10),(13:11),(13:12),(13:13),(13:14),(13:15),(13: 0),(13: 1),(13: 2),(13: 3),(13: 4),(13: 5),(13: 6),(13: 7)]
[ (14: 8),(14: 9),(14:10),(14:11),(14:12),(14:13),(14:14),(14:15),(14: 0),(14: 1),(14: 2),(14: 3),(14: 4),(14: 5),(14: 6),(14: 7)]
[ (15: 8),(15: 9),(15:10),(15:11),(15:12),(15:13),(15:14),(15:15),(15: 0),(15: 1),(15: 2),(15: 3),(15: 4),(15: 5),(15: 6),(15: 7)]]
```

* [Nvidia] Enable TMA im2col mode - Fix tma load op (#9303)

# Summary
This is the third PR in a series that enables TMA im2col mode (in
addition to the existing tiled mode) for NVIDIA GPUs. The goal of the
series is to support TMA im2col mode in Gluon DSL.

- First PR: https://github.com/triton-lang/triton/pull/9202
- Second PR: https://github.com/triton-lang/triton/pull/9225
- -> Third PR: https://github.com/triton-lang/triton/pull/9303

PTX ISA documentation for TMA im2col mode:
https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-im2col-mode
TMA tensor descriptor documentation:
https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html

# Summary of changes
This PR fixes the AsyncTMACopyGlobalToLocalOp:
- Delete the TTNG_TensorModeAttr, since we can infer the descriptor mode
from the desc type
- Remove TTNG_TensorModeAttr from TTNG_AsyncTMACopyGlobalToLocalOp
- Fix the op verification and lit test accordingly


* [BACKEND][AMD] Add option to swap MIR (#8711)

We use TRITON_SWAP_MIR to specify the path to an externally provided MIR
file to be used for compilation to ASM in the LLVM backend, overriding
the one generated by the regular pipeline.

This can be used to apply custom transformations on MIR and use the
result to generate the binary.

* [BACKEND][AMD] Use addOccurrence for LLVM options to properly disable schedulers (#9311)

Use addOccurrence instead of setValue when setting LLVM command-line
options like enable-misched=false. This is necessary because LLVM's
scheduler passes check getNumOccurrences() to determine if the option
was explicitly set on the command line (see
llvm/lib/CodeGen/MachineScheduler.cpp).

Also add test to verify the MIR swap pipeline starts before
machine-scheduler and that both machine-scheduler and post-RA scheduler
are disabled.

* [PROTON][BLACKWELL] Enable low-overhead hardware trace (#9307)

* Increment index in plugin example (#9315)

In the Triton Plugin example, the plugin pass enumerator does not
increment the index into `passNames`. If this example were used
elsewhere with multiple passes, this would overwrite the first returned
pass name. It's a minor tweak and this is just an example, but it seems
best to fix this prior to this getting heavy use.

* [triton] Add `tl.cat(can_reorder=False)` implementation (#9312)

This resurrects the old PR that replaced the implementation entirely. I
also fixed `tl.cat` to be equivalent in semantics to `torch.cat`.

* [BACKEND][AMD] Add ScopedNoAliasAAWrapperPass to MIR swap pipeline (#9309)

* Fix tolerance for float8 x mx combinations in matmul tests (#9316)

Increases tolerance from 3e-2 to 2e-1 for float8 x mx combinations to
account for compounded quantization errors.

* [triton_kernels] Unfuse fma in reduce kernel for numeric stability (#9320)

* [Nvidia] Enable TMA im2col mode - driver support (#9305)

# Summary
This is the fourth PR in a series that enables TMA im2col mode (in
addition to the existing tiled mode) for NVIDIA GPUs. The goal of the
series is to support TMA im2col mode in Gluon DSL.

- First PR: https://github.com/triton-lang/triton/pull/9202
- Second PR: https://github.com/triton-lang/triton/pull/9225
- Third PR: https://github.com/triton-lang/triton/pull/9303
- -> Fourth PR: https://github.com/triton-lang/triton/pull/9305

PTX ISA documentation for TMA im2col mode:
https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-im2col-mode
TMA tensor descriptor documentation:
https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html

# Summary of changes
This PR adds the driver function for creating the tensor descriptor for
TMA im2col mode. The driver function can be found in the doc:
https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html


* [PROTON] Fix default value for TRITON_ENABLE_HW_TRACE in CuptiProfiler (#9324)

* [AMD][gfx1250] Update f16 GEMM examples (#9326)

* Factored out common utilities
* Added support for L2 prefetch

Co-authored-by: Alexander Weinrauch <Alexander.Weinrauch@amd.com>
Co-authored-by: Jungwook Park <jungwook.park@amd.com>
Co-authored-by: Xiaohu Guo <Xiaohu.Guo@amd.com>

* [Backend] Bump LLVM to 2eb709b95d8f (#9264)

Bumps LLVM to `2eb709b95d8f` to consume 2 fixes related to `gfx1250`:

- https://github.com/llvm/llvm-project/pull/176206
- https://github.com/llvm/llvm-project/pull/176355

---------

Co-authored-by: neildhar <neildhar@users.noreply.github.com>
Co-authored-by: Lei Zhang <antiagainst@gmail.com>

* [triton_kernels] Add unpadded batch size handling to reduce (#9332)

* [AMD] Use v_perm instruction for convert_layout acceleration (#9014)

This PR introduces an AMD-specific ttg->llvm pattern which uses v_perm
instructions instead of combinations of shifts and logical operations.

Limitations of this pattern:

- Applied only for 8 bit data types;
- Conversion required to be bijective;
- No permutation across threads in workgroup.

---------

Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>

* [Kernels] Re-enable 4-warp persistent kernel (#9331)

Looks like some of the interceding changes have removed the code
sequence that was causing issues with ptxas. I also had to disable
`warp_specialize`, as I'm getting an invalid IR error from inside the
pass.

* Partial revert of [AMD][NFC] cleanup llvm instrinsic calls; replaced with rocdl ops (#9334)

`getClusterCTAId` should return the local CTA id in the cluster.
https://github.com/triton-lang/triton/pull/9222 replaced it with the
overall cluster id which is wrong.

I could not find a ROCDL Op exposing the workgroup id inside the cluster
so this PR reverts it back to the intrinsic.

* Revert "Fix tolerance for float8 x mx combinations in matmul tests (#… (#9338)

…9316)"

This reverts commit bcbcabdd0cff6539c7168299075992b2a23ff38e.

* [AMD] NFC: Commonise some checks into functions. (#9335)

This makes feature checks a bit tidier

* [FRONTEND] Add option to round f32 to tf32 in descriptor (#9295)

Add an option to round float32 inputs to tf32 on the fly.

* [AMD][Gluon] Warp-pipeline fixes and priority hint (#9301)

These changes sync internal warp-pipeline fixes into upstream to align
correctness (warp size), configurability (priority hints), and safety
(dependency analysis / barrier placement) across Gluon and LLVM lowering
paths.

- Correct warp size handling in the warp-pipeline per the given AMDGPU
HW arch.
- Introduce optional priority hints on warp-pipeline stages, enabling
more explicit scheduling intent for stage ordering/perf tuning.
- Fix a bug in warp-pipeline dependency analysis.

* [AMD][gfx1250] Skip dot tests for bf16x3 and bf16x6 types (#9343)

* Apply rounding to TF32 in MoE (#9296)

* [AMD][BACKEND] Fix TDM shape adjustment to factor in the CGA offset (#9341)

TDM needs to adjust the shape in the tensor descriptor to account for
the tile offset when computing OOB. However, we did not take the CGA
offset into account.

I adjusted the test to overallocate the buffers so the test can catch
out of bounds writes and we do not rely on a random segfault to detect
the OOB write.

* [AMD][gluon][gfx1250] Add tensor async gather support using TDM (#9313)

Implements tensor async_gather using TDM in a similar fashion to
https://github.com/triton-lang/triton/pull/9299 on Gluon.

* [AMD] Replace ReorderInstructions with MoveUpPrologueLoads (#9328)

After the recent changes, the ReorderInstructions pass had only one
optimization left: moving prologue loads early for prefetching. Add a
new pass for that optimization, refactor the implementation, and add
more tests.

* [AMD] Support TDM AsyncWait in UpdateAsyncWaitCount (#9352)

Since TDM scatter and gather might produce more than one
intrinsic/assembly instruction, we need to compute the correct waitcnt
based on the number of intrinsics instead of the number of TTGIR ops.
This piggy-backs on the analysis pass we already use for
`ttg.async_copy_global_to_local`.
This allows us to use the number of TTGIR TDM ops in Gluon and also
works with the token-based approach from the pipeliner.
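
The counting change can be sketched like this (op names and intrinsic counts are made up for illustration; the real analysis walks TTGIR ops):

```python
# Hypothetical intrinsic counts per TTGIR op kind: a TDM gather may expand
# to several machine instructions while a plain async copy expands to one.
INTRINSICS_PER_OP = {"async_copy_global_to_local": 1, "tdm_gather": 4}

def waitcnt_for(ops_in_flight):
    """Count outstanding machine-level memory instructions, not TTGIR ops."""
    return sum(INTRINSICS_PER_OP[op] for op in ops_in_flight)
```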

* [AMD][BACKEND] Support multi-CTA for `AMDWmmaEncodingAttr` (#9340)

Adjusts the verifier of DotOperandEncoding to allow for different
CGALayouts between the accumulator and the operands. The dot verifier
ensures that the accumulator does not broadcast and that A broadcasts
along dim1 and B along dim0.

If this turns out to be too restrictive, e.g. if the case comes up where
we want to do duplicated work on CTAs, we can relax this later.

* [Gluon] Refactor multi-CTA support of PaddedSharedLayouts (#9336)

Refactors PaddedSharedLayouts to make them work nicer in multi-CTA.

- Instead of `block_bases` it now holds the `CGALayout` in Gluon.
- `with_identity_for` has to use the `shape_per_cta` to properly build
the identity `linear_component`
- `get_padded_shared_layout` should multiply the CTA and CGA layout
- Adjust the verifier to broadcast in the `block` dimension. We still do
not support broadcasts in the `offset` dimension

* [AMD] Introduce PartitionedSharedEncodingAttr (#9314)

This PR introduces PartitionSharedEncodingAttr.

An encoding for tensors whose elements are partitioned across multiple
separate shared memory allocations. This reduces shared memory partition
conflicts by splitting a tensor along a specific dimension into separate
allocations.

Parameters:
- numPartitions: Number of distinct memory partitions (and separate
  buffers). Buffers in different partitions MUST be placed in different
  physical shared memory slots.
- numGroups: Number of groups. Each group contains numPartitions
  consecutive pieces of the tensor.
- partitionDim: The dimension along which the tensor is partitioned.
- partitionLayout: The shared memory layout used within each piece.

The total number of logical pieces is numPartitions * numGroups.
Pieces are organized as:
[Group 0: pieces 0..numPartitions-1]
[Group 1: pieces numPartitions..2*numPartitions-1]
...

## Memory Allocation

The allocator creates numPartitions buffers (NOT numLogicalPieces
buffers). Each buffer contains all pieces from all groups that belong
to that partition, concatenated together. Buffer size = pieceSize *
numGroups.

For example, with numPartitions=2, numGroups=4 and partitionDim=0 on a
[128, 32] tensor:
- Total 8 logical pieces, each of size [16, 32] (pieceSize = 16*32
  elements)
- Piece layout: [0, 1, 2, 3, 4, 5, 6, 7]
- Partitions:    0  1  0  1  0  1  0  1
- Groups:       |Grp0||Grp1||Grp2||Grp3|

Physical allocation (2 buffers, each containing 4 pieces):
- Buffer 0 (Partition 0): [Piece0 | Piece2 | Piece4 | Piece6]
- Buffer 1 (Partition 1): [Piece1 | Piece3 | Piece5 | Piece7]

Physical allocation guarantee: Buffers in different partitions MUST
reside in distinct physical shared memory partitions.

TODO: Implement lowering of operations with PartitionSharedEncodingAttr.
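
The piece-to-buffer mapping described above can be sketched in a few lines of Python (illustrative only, not the allocator's actual code):

```python
def piece_location(piece, num_partitions):
    """Map a logical piece index to (partition buffer, slot inside buffer).
    Pieces alternate across partitions; the group index is the slot."""
    return piece % num_partitions, piece // num_partitions

def buffers(num_partitions, num_groups):
    """Recreate the physical allocation: one buffer per partition, each
    holding numGroups pieces concatenated in group order."""
    bufs = [[] for _ in range(num_partitions)]
    for piece in range(num_partitions * num_groups):
        part, _slot = piece_location(piece, num_partitions)
        bufs[part].append(piece)
    return bufs
```

With `numPartitions=2, numGroups=4` this reproduces the piece layout from the example above.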

---------

Co-authored-by: Ognjen Plavsic <plognjen@amd.com>

* [AMD][GLUON] Add multi-CTA GEMM example for `gfx1250` (#9342)

Adds a multi-CTA GEMM example for `gfx1250` where we multicast A and B
sub-tiles which are shared by CTAs of the cluster. `ctas_per_cga` is
used to control the cluster size and how we partition the accumulator
across CTAs.

 This PR requires the following PRs to pass its correctness tests:
- https://github.com/triton-lang/triton/pull/9341
- https://github.com/triton-lang/triton/pull/9340
- https://github.com/triton-lang/triton/pull/9336
- https://github.com/triton-lang/triton/pull/9334

* [AMD] Support 8-Warp Pingpong and Refactor MXGEMM Kernel on GFX1250 (#9356)

This PR:
- Refactors the MXGEMM kernel to support various schedules
- Supports 8-warp scheduling and 8-warp pingpong scheduling

---------

Co-authored-by: Lei Zhang <antiagainst@gmail.com>

* [TRITON_KERNELS] some more tweaks (#9350)

* [AMD][GLUON] Support AutoLayout for `offsets` and `mask` in `buffer_store` (#9353)

`buffer_store` can infer the layout of `mask` and `offsets` based on the
layout of `stored_value`. Currently we only infer from `offsets` to
`mask`, which is unnecessarily restrictive.

* [docs] Add topk operation to language documentation (#9345)

## Summary
- Add `topk` to Scan/Sort Ops section in `triton.language.rst`
- Add docstring to `topk` function describing parameters, return value,
and usage example

## Motivation
Fixes #9278 

The `topk` operation is implemented and exported in `triton.language`
but was not mentioned in the API documentation at
https://triton-lang.org/main/python-api/triton.language.html

## Changes
1. **docs/python-api/triton.language.rst**: Added `topk` entry under
Scan/Sort Ops section
2. **python/triton/language/standard.py**: Added comprehensive docstring
to `topk` function including:
   - Description of functionality
   - Parameter documentation
   - Return type
   - Usage example
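
For reference, the semantics being documented can be modelled on the CPU as follows (an illustrative model only; exact tie-breaking and the real `tl.topk` signature may differ):

```python
def topk_reference(values, k, descending=True):
    """CPU model of topk semantics: the k largest (or smallest) values
    along the last axis, together with their indices."""
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=descending)[:k]
    return [values[i] for i in order], order
```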

## Testing
Documentation builds correctly with the new entry.

* [AMD][BACKEND] Support padding in TDM store if interval equals the inner dimension (#9360)

TDM store does not support padding. However, we can adjust the tile dim
to include padding and shrink the tensor shape to drop stores of
padding bytes.

This only works for a single padding interval which equals the inner
dimension. Other padding configs are not supported.

* [AMD][gfx1250] Add clamp operand to WMMA int intrinsic (#9359)

* reland upgrade ptxas for blackwell (#9363)

* [AMD][gfx1250] Roll up some small changes (#9365)

* Register gfx1250 to various target capability queries
* Tighten some error checking logic
* Adjust tests for gfx1250

* Revert "[AMD] Introduce PartitionedSharedEncodingAttr" (#9367)

Reverts triton-lang/triton#9314 causing functional regression in fence
insertion pass

* [PROTON] Vector metrics (#9329)

* Fix static srcMap bug causing multi-arch FP8 compilation failures (#9364)

Compiling FP8 kernels for multiple GPU architectures in the same process
fails with "failed to legalize operation 'tt.fp_to_fp'".

The bug is in ElementwiseOpToLLVM.cpp's
`FpToFpOpConversion::getConversionFunc()`. The srcMap that maps
(srcType, dstType, roundingMode) to converter functions was declared
`static`, causing it to be initialized once with the first
architecture's isaFamily. Subsequent architectures would reuse the wrong
converters.

Example failure scenario:
1. Compile FP8 kernel for gfx950 (CDNA4) first
   - static srcMap initialized with CDNA4 converters (e.g.,
     Fp8E4M3FNUZ_to_Fp16)
2. Compile same kernel for gfx942 (CDNA3)
   - srcMap already initialized, returns CDNA4 converters
   - CDNA3 lacks CDNA4-specific instructions, compilation fails

The performance impact of removing static is negligible, since the map
has only ~25 entries and is built per-conversion (not per-instruction).
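
The failure mode maps onto this Python sketch (names are illustrative; the real table lives in ElementwiseOpToLLVM.cpp):

```python
_SRC_MAP = {}  # module-level state, analogous to the C++ function-local `static`

def get_converter_buggy(isa_family):
    # Bug analogue: the table is built only on the first call, for the first
    # ISA seen, and then silently reused for every later ISA.
    if not _SRC_MAP:
        _SRC_MAP["fp8_to_fp16"] = "converter_for_" + isa_family
    return _SRC_MAP["fp8_to_fp16"]

def get_converter_fixed(isa_family):
    # Fix analogue: rebuild the (small, ~25-entry) table per conversion.
    src_map = {"fp8_to_fp16": "converter_for_" + isa_family}
    return src_map["fp8_to_fp16"]
```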


# New contributor declaration
- [x] I am not making a trivial change, such as fixing a typo in a
comment.

- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).

- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.

- Select one of the following.
  - [x] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [ ] This PR does not need a test because `FILL THIS IN`.

- Select one of the following.
  - [x] I have not added any `lit` tests.
- [ ] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices),
including the "tests should be minimal" section. (Usually running Python
code
    and using the instructions it generates is not minimal.)

* [TRITON_KERNELS] added global scale to reduction (#9372)

* [AMD][gfx1250][gluon] TDM: Fix OOB handling for Scatter/Gather (#9371)

Fixes the incorrect offset handling when the tile size `N` is not a
multiple of `BLOCK_N` and the last block of a TDM Scatter/Gather is
therefore a partial block.

This happened because unlike in `fillTDMDescriptor` for standard TDM,
the tensor shapes weren't updated based on offsets within
`fillTDMDescriptorForGatherScatter` for gather/scatter.
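
The general idea of the fix can be sketched like this (a hypothetical helper; the real shape update happens inside `fillTDMDescriptorForGatherScatter`):

```python
def block_extent(N, BLOCK_N, block_idx):
    """Extent of one block along a dimension of size N: the full BLOCK_N
    for every block except possibly the last, which is clamped so the
    transfer stays in bounds."""
    return max(0, min(BLOCK_N, N - block_idx * BLOCK_N))
```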

* [AMD][GLUON] Add 4 and 8 warps stream-k gluon kernels for gfx1250 (#9370)

- add 8 warp pingpong stream-k kernel implementation
- persistent loop prefetch to enable overlap between prologue and
epilogue
- implement 4 warp stream-k specific partial tile processing with atomic
based spinning locks

* Consider rematerialisation cost when hoisting over ext (#9194)

Avoid hoisting converts if the cost of rematerialising the slice and the
new convert is higher than the original convert. To do this, factor out
the cost computation from `backwardRematerialization` into
`isRematBeneficial` and call it from `hoistConvertOnTopOfExtOrBroadcast`.

The added test demonstrates a simple example where the hoisted convert
is more expensive than the original.

It is also possible to extend this to the other hoisting functions,
although it requires slightly more care (e.g. a convert hoisted into a
conditional should probably carry less cost).
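
The decision rule amounts to something like this (costs are abstract units here; `isRematBeneficial` in the pass derives them from the actual IR):

```python
def should_hoist(cost_original_convert, cost_remat_slice, cost_hoisted_convert):
    """Hoist a convert over an ext/broadcast only if rematerialising the
    slice plus the hoisted convert is no costlier than the convert removed."""
    return cost_remat_slice + cost_hoisted_convert <= cost_original_convert
```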

* Fix assert in `_p_matmul` (#9376)

The current check prevents calling the persistent matmul on MXFP8,
thought per the comment it is intended to block MXFP4 weights only.

* [Backend] Bump to llvm/llvm-project@ac5dc54d5091 (#9333)

* Reapply "[AMD] Introduce PartitionedSharedEncodingAttr" (#9367) (#9374)

The initial patch was reverted because ProxyFenceInsertion.cpp still
used getBufferIds instead of getAllBufferIdsWithAliases. Previously,
getBufferIds returned buffer IDs for both a value and its aliases. With
partitioned tensor support, a single value can now hold multiple
buffers, so the API was split: getBufferIds returns only the buffer IDs
for a value, while getAllBufferIdsWithAliases includes aliased buffer
IDs as well.

Co-authored-by: Ognjen Plavsic <plognjen@amd.com>

* [AMD] Support PaddedSharedLayout in TDM Gather on GFX1250 (#9369)

This PR supports `PaddedSharedLayout` in TDM Gather for limited cases.

* [AMD][gfx1250] test_dot fix for small K (#9358)

* [ConSan] First pass at improving ConSan compile times (#9366)

When compiling `01-attention-forward.py` with consan enabled, before:

```
# times.ir_initialization=790100      0.79 sec
# stage='ttgir' duration=182893       0.18 sec
# stage='llir' duration=99643940      99.6 sec
# stage='llvmir' duration=27658211    27.7 sec
# stage='ptx' duration=11120513       11.1 sec
# stage='cubin' duration=1149355658   19.15 min
```

After:

```
# times.ir_initialization=796533      0.79 sec
# stage='ttgir' duration=184720       0.18 sec
# stage='llir' duration=16957192      17.0 sec
# stage='llvmir' duration=3735579     3.74 sec
# stage='ptx' duration=1972309        1.97 sec
# stage='cubin' duration=34357675     34.36 sec
```

This PR does quite a number of things at once:

* Custom CanonicalizeLLVMIR pass that adds a pattern for `select
%false|%true, %a, %b` since LLVM dialect is missing this (and is opposed
to adding it)
* Cache global constants to avoid creating many copies of the same
string when lowering asserts
* Fix warp specialize lowering to handle function calls and deduplicate
barrier lowering code between NVIDIA and AMD backends. To support
function calls, non-kernel functions are rewritten to accept a barrier
handle argument that is passed down from the call site
* Rewrite `createMultiColumnMask` to generate a constant tensor rather
than computing it from a bunch of `make_range` and masking. This single
function was generating gigabytes of IR
* Pick warp-local layouts in consan instrumentation. Previously, consan
used thread-local layouts where every thread has a copy of the tensor.
This was to avoid using shared memory. We can switch to warp-local
layouts where each warp has a copy of the tensor distributed across its
threads to reduce the generated IR (and register usage) by a factor 32,
plus some extra IR needed for shuffles.
* To support warp-local layouts, I added a `uniform` flag to
`tt.assert` and replaced `tti.experiment_assert_in_thread` with a
`tt.reduce` + `tt.assert uniform`. Uniform just means only the first
thread in the warp group will trigger the assert since the condition is
uniform.
* I added an `always_use_warp_shuffle` function-level flag to force
`convert_layout` lowering to use warp shuffles even when the performance
heuristic picks shared memory, avoiding shared memory for the layout
conversions inside consan helpers
* Changed the lowering of `arith.constant` with a non-splat dense
elements attribute to generate a constant global array from which each
thread loads
* Moved the global stores in the main function into a helper function to
reduce bloat by deduplicating. This also enables separate compilation
later.
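
The global-constant caching item, for instance, boils down to interning (a hedged sketch; the real code deduplicates LLVM global string constants):

```python
class ConstantPool:
    """Intern global constants so each distinct value (e.g. an assert
    message string) is emitted once instead of once per use site."""
    def __init__(self):
        self._pool = {}

    def intern(self, value):
        # Return the existing symbol for this value, or mint a new one.
        return self._pool.setdefault(value, "@global_%d" % len(self._pool))
```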

* [AMD][BACKEND] Fix shared layout order for AsyncCopy on GFX9 (#9373)

The heuristic used `getElementsPerThread` instead of
`getContigPerThread` to retrieve the contiguity. It also has to clamp
the `vecSize` to a width supported by the hardware to see if we need to
carry over the `regOrder`.

Before this PR, for the added lit tests we selected an order for the
shared layout which does not allow for coalesced direct-to-lds writes.

* [AMD][GFX1250] Roll up some minor changes (#9380)

* Add pattern to optimize tanh
* Fix asserts for WMMA
* Fix some TDM field updates

* [AMD][BACKEND] Fix RangeAnalysis tripCount calculation (#9383)

We currently check whether the `optional<bool>` holds a value instead
of checking the actual held value to decide if we take the upper or the
lower limit. If the optional is empty we take the minimum to preserve
the behavior from before this PR.
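
A Python model of the bug and the fix (illustrative only; the real code works on a C++ `optional<bool>`):

```python
def trip_limit(upper, lo, hi):
    """`upper` models an optional<bool>. The bug tested whether the optional
    held a value at all; the fix tests the held value, taking the minimum
    when the optional is empty to keep the old behaviour."""
    if upper is None:
        return lo
    return hi if upper else lo
```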

Note that the first adjusted lit test only tests whether we fold an
`scf.if` op after the `scf.for` loop, so the changes in the loop are not
removing the actual test.

* [AMD][GLUON] Add CGA Layout to wmma scale (#9381)

Attach CGA Layout for scale's linear layout. This is needed when we use
wmma scaled in multi-cta kernels.

* [AMD][gluon] Add gfx1250 warp‑pipeline f16 GEMM example (#9382)

Provide a warp‑pipeline TDM GEMM kernel with shared‑memory buffering.

* add a second gb200 node to CI (#9384)

* [BACKEND] Improve and simplify ReduceOp's lowering (#9219)

Stacked PRs:
 * #9327
 * #9318
 * #9317
 * #9221
 * #9220
 * __->__#9219


--- --- ---

### [BACKEND] Improve and simplify ReduceOp's lowering


We implement a LinearLayout-based `ReduceOp` lowering. This has a
number of benefits:

- The logic is noticeably simpler as we barely have to implement
anything. ConvertLayout and some LL helpers do all the heavy lifting
- We get shmem swizzling for free
- We sometimes save a shmem round-trip (before we did it
unconditionally)
- It is now clear that we have a `tmpLl` variable we can carefully
choose (we'll do so in a future PR)
- It opens the door to returning an arbitrary layout (fusing a
`convert_layout` into this op)
- It is now really simple to generalise this op to perform cross-cluster
reductions, provided that `convert_layout` supports them.
- We fix some latent issues the previous implementation had when run on
arbitrary linear layouts. We add a funky regression test that used to
fail and now passes.
- All this while being LOC-neutral!

In future PRs we will improve the choice of `tmpLl` to avoid the last
`convert_layout` in many cases, and we will pack the inputs in shmem to
be able to vectorize the load/stores for full reductions with multiple
inputs.

This PR was the result of quite a long (but rather successful)
vibe-coding session together with `gpt-5.2-codex`. I found it
particularly useful to be able to emit a ConvertLayout within this
lowering rather than having to call the lowering of the function
manually. This simplifies the code quite a bit and I would have
struggled to convince MLIR to do so myself.

* [BACKEND] Perform tree reductions on in-thread values (#9220)

Stacked PRs:
 * #9327
 * #9318
 * #9317
 * #9221
 * __->__#9220


--- --- ---

### [BACKEND] Perform tree reductions on in-thread values


We generate ternary trees for suitable integer ops and binary trees for
everything else.

We manually generate `{add,mul}.{f16,f32}x2` ops. This brings a speed-up
to some gluon attention kernels.
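
The shape of the reduction can be sketched as follows (a model of the tree structure only, not the emitted PTX; the `{add,mul}.{f16,f32}x2` packing is a separate, manual step):

```python
def tree_reduce(values, op, arity=2):
    """Reduce in a balanced tree (binary or ternary) instead of a serial
    chain, shortening the dependency chain seen by the scheduler."""
    level = list(values)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), arity):
            group = level[i:i + arity]
            acc = group[0]
            for v in group[1:]:
                acc = op(acc, v)
            nxt.append(acc)
        level = nxt
    return level[0]
```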

* [Nvidia] Enable TMA im2col mode - LLVM lowering (#9322)

# Summary
This is the fifth PR in a series that enables TMA im2col mode (in
addition to the existing tiled mode) for NVIDIA GPUs. The goal of the
series is to support TMA im2col mode in Gluon DSL.

- First PR: https://github.com/triton-lang/triton/pull/9202
- Second PR: https://github.com/triton-lang/triton/pull/9225
- Third PR: https://github.com/triton-lang/triton/pull/9303
- Fourth PR: https://github.com/triton-lang/triton/pull/9305
- -> Fifth PR: https://github.com/triton-lang/triton/pull/9322

PTX ISA documentation for TMA im2col mode:
https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-im2col-mode
TMA tensor descriptor documentation:
https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html

# Summary of Changes

Added LLVM lowering logic for `AsyncTMACopyGlobalToLocalOpConversion` to
support im2col mode.

## Im2col Mode Constraints

### pixelsPerColumn (non-contiguous dimension)
- **Maximum size**: 1024 elements
- **Corresponds to**: Spatial dimensions (N, D, H, W)
- **Block shape**: Restricted to match `shapePerCTA` (no splitting)
- **Rationale**: Avoids generating multiple TMA messages along spatial
dimensions, eliminating complex offset calculations that would depend on
input tensor shape and padding
- **Note**: 1024 is sufficient for most practical use cases

### channelsPerPixel (contiguous dimension)
- **Maximum size**: 256 elements, or swizzle byte size if swizzle is
enabled
- **Multiple messages**: Supported when channel dimension exceeds block
size
- **Offset application**: Only coord[0] (channel coordinate in PTX
order) receives non-zero offsets

## Key Implementation Details

1. **Offset application**: For im2col mode, only the channel dimension
receives non-zero offsets; spatial dimension offsets are always 0
(verified by assertion)

2. **Im2col offsets reversal**: Spatial offsets (e.g., `off_w`, `off_h`)
are reversed to match PTX/CUDA innermost-to-outermost ordering,
consistent with coordinate handling

3. **Alignment with tiled mode**: These constraints align with tiled
mode behavior used for GEMM operations




* [FPSAN] Introducing FpSan - floating point sanitizer (#9337)

FpSanitizer ensures that arithmetically equivalent kernels produce the
exact same results by replacing floating‑point ops with integer ones.
For every binary op we cast operands to int, perform the equivalent
integer op, then cast back to float. Unary ops are replaced with
identity. For dot ops we cast operands to int and perform a matmul in
registers.
The resulting kernel is of course numerically incorrect, but it should
be independent of operation order. If two implementations match under
fpsan, they are very likely algorithmically equivalent even if FP
results differ due to ordering/reassociation/etc.
Note about tensor memory: tcgen05_mma is lowered by loading operands
into registers and manually emulating MMA. Because tcgen05_mma can be
issued from a partition with 1 warp group while tmem load/stores require
4 or 8 warps, we replace all tmem uses with global scratch memory.
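
The bit-level replacement for a binary op can be modelled like this (a sketch of the idea; the actual pass rewrites MLIR ops):

```python
import functools
import struct

def _bits(x):
    return struct.unpack("<I", struct.pack("<f", x))[0]

def _as_float(b):
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]

def fpsan_add(a, b):
    """Model of the sanitizer's rewrite of a float add: operate on the raw
    bits as integers, then cast back. Numerically wrong on purpose, but
    exactly associative and commutative, hence order-independent."""
    return _as_float(_bits(a) + _bits(b))
```

Unlike real FP addition, reducing a list with `fpsan_add` gives the same bits in any order.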

* [KERNELS] enable swap_xw on blackwell for non-mx matmuls (#9390)

Helps significantly for ragged matmuls where the slice size is small;
otherwise, Blackwell will compile to mma.sync.

* [BACKEND] Implement support for cross-CTA tt.reduce (#9221)

Stacked PRs:
 * #9327
 * #9318
 * #9317
 * __->__#9221


--- --- ---

### [BACKEND] Implement support for cross-CTA tt.reduce


The title of this PR is a bit of a lie. Even though the lowering is now
implemented to support cross-CTA reductions, it depends on
`convert_layout` supporting them, and it doesn't currently support
LinearLayouts. We should generalise this one first and then enable it
here. We should also emit the correct cross-CTA barrier from
`targetInfo` in the case of cross-CTA memory reuse.

In this PR, we take the chance to also generalise the lowering to avoid
convert layouts whenever possible.

* [UT] Generalize Proton tests by providing `device` fixture (#9351)

The same approach as with other unit tests.

---------

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Co-authored-by: Keren Zhou <kerenzhou@openai.com>

* [Gluon] Add Cluster Launch Control (CLC) support for Blackwell GPUs (#9361)

This adds support for NVIDIA's Cluster Launch Control (CLC) feature on
Blackwell (SM100+) GPUs, enabling dynamic work distribution for
persistent kernels.

CLC allows running workers to cancel not-yet-launched clusters and take
over their work, improving load balancing when SM availability varies.

New Gluon API (triton.experimental.gluon.language.nvidia.blackwell.clc):
- try_cancel(result, mbar): Issue async CLC request to cancel a pending
cluster
- is_canceled(result): Check if cancellation succeeded (returns
non-zero)
- get_first_ctaid(result, dim): Get the canceled cluster's first CTA ID

MLIR ops added:
- ttng.clc_try_cancel: Lowers to clusterlaunchcontrol.try_cancel.async
PTX
- ttng.clc_is_canceled: Lowers to
clusterlaunchcontrol.query_cancel.is_canceled
- ttng.clc_get_first_ctaid: Lowers to
clusterlaunchcontrol.query_cancel.get_first_ctaid

All ops include SM100+ compute capability checks and emit errors on
older GPUs.

Tutorial included demonstrating CLC matmul achieving 92.5% of cuBLAS
performance on 8192x8192x8192 FP16 matrices.


---------

Co-authored-by: Peter Bell <peterbell10@openai.com>

* [ConSan] Use `-Ofc mid` for ConSan compilation  (#9394)

I tested a variety of `ptxas` flags and found the following (using
`01-attention-forward.py`)

```
         compile (sec)    execution (sec)
-O0         19               81
-O1         49               22
-O3         52               21
-Ofc max    19               82 
-Ofc mid    21               9
-Ofc min    21               9
-Ofc 0      19               21
```

Shockingly, `-Ofc mid|min` yield best total compilation+execution times
for consan.

* [TritonGPU] Run Gluon canonicalizer in LLVM lowering when ConSan is on (#9396)

Regular canonicalizer will rewrite layout conversions

* Add missing NVGPUToLLVMPass dependency (#9398)

Cmake can sometimes fail with the message below. NVGPUToLLVMPass has a
dependency on TritonGPUOpInterfacesIncGen.

```
FAILED: Compilers/triton/third_party/nvidia/lib/NVGPUToLLVM/CMakeFiles/NVGPUToLLVM.dir/NVGPUToLLVMPass.cpp.o 

In file included from /__w/1/s/src/Compilers/triton/third_party/nvidia/lib/NVGPUToLLVM/NVGPUToLLVMPass.cpp:11:
In file included from /__w/1/s/src/Compilers/triton/third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/Utility.h:8:
In file included from /__w/1/s/src/Compilers/triton/include/triton/Conversion/TritonGPUToLLVM/Utility.h:7:
In file included from /__w/1/s/src/Compilers/triton/include/triton/Conversion/TritonGPUToLLVM/TargetInfoBase.h:4:
In file included from /__w/1/s/src/Compilers/triton/include/triton/Conversion/MLIRTypes.h:5:
In file included from /__w/1/s/src/Compilers/triton/include/triton/Dialect/TritonGPU/IR/Dialect.h:11:
In file included from /__w/1/s/src/Compilers/triton/include/triton/Dialect/TritonGPU/IR/Attributes.h:6:
/__w/1/s/src/Compilers/triton/include/triton/Dialect/TritonGPU/IR/TritonGPUInterfaces.h:10:10: fatal error: 'triton/Dialect/TritonGPU/IR/OpInterfaces.h.inc' file not found
   10 | #include "triton/Dialect/TritonGPU/IR/OpInterfaces.h.inc"
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```

* [Nvidia] Enable TMA im2col mode - Gluon API (#9391)

# Summary
This is the sixth PR in a series that enables TMA im2col mode (in
addition to the existing tiled mode) for NVIDIA GPUs. The goal of the
series is to support TMA im2col mode in Gluon DSL.

- First PR: https://github.com/triton-lang/triton/pull/9202
- Second PR: https://github.com/triton-lang/triton/pull/9225
- Third PR: https://github.com/triton-lang/triton/pull/9303
- Fourth PR: https://github.com/triton-lang/triton/pull/9305
- Fifth PR: https://github.com/triton-lang/triton/pull/9322
- -> Sixth PR: https://github.com/triton-lang/triton/pull/9391

PTX ISA documentation for TMA im2col mode:
https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-im2col-mode
TMA tensor descriptor documentation:
https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html

# Summary of Changes

Added Gluon DSL frontend support for TMA im2col mode, completing the
end-to-end path from Python API to runtime descriptor creation.

- Extended `TensorDescriptor` with im2col fields (`mode`,
`elementStrides`, `pixelBoxLowerCorner`, `pixelBoxUpperCorner`) and
parameter validation
- Refactored `tensor_descriptor_type` into a base class with
`tensor_descriptor_type` (tiled) and `tensor_descriptor_im2col_type`
(im2col) subclasses
- Added `offsets` parameter to `async_copy_global_to_shared` for im2col
spatial offsets
- Updated specialization (`specialize.cc`) to emit
`tensordesc_im2col<..., input_rank=N>` signatures and type parsing
(`str_to_ty`) to reconstruct the correct types
- Updated driver to parse im2col signatures and call
`fill_tma_descriptor_im2col` at runtime
- Changed `amendFuncOp` to use `TensorDescInterface` instead of
`TensorDescType` so im2col descriptors also receive `tt.nv_tma_desc`
- Added end-to-end `test_tma_im2col` Gluon test


---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: peterbell10 <peterbell10@live.co.uk>

* [BACKEND] Support generic multi-cta convert_layouts (#9317)

Stacked PRs:
 * #9327
 * #9318
 * __->__#9317


--- --- ---

### [BACKEND] Support generic multi-cta convert_layouts


We generalise the swizzling algorithm to work with blocks and generalise
most of the memory lowerings to support layouts with blocks.

We remove the legacy lowering.

The generic swizzling algorithm for blocks might be fine, but we didn't
try to be super clever. There might be some perf left on the table. We
can look into this at a later point if it becomes relevant.

We also activate multi-CTA reductions in the process and test both
features there.

TODO: Add some funky tests that just test the `convert_layout`, not the
`convert_layout` within the reduction.
TODO: Check how to perform multiCTA barriers in AMD and perhaps merge
cluster barriers into ttg.barrier, predicate broadcasting blocks, etc.

* [Membar] Membar pass for clusters (#9318)

Stacked PRs:
 * #9327
 * __->__#9318


--- --- ---

### [Membar] Membar pass for clusters


The main invariant here is that:

Membar for CTAs only synchronises CTAs when their buffers did not
alias in the ttgir, but they alias after the Allocation pass

In other words, in Gluon, the user is in charge of manually
synchronising the buffers they declare.
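The invariant can be stated as a tiny predicate. The field names below (`vreg` for the buffer's ttgir identity, `offset`/`size` for its post-Allocation interval) are hypothetical; this is a minimal model of the rule, not the pass's actual data structures.

```python
def needs_cluster_sync(a: dict, b: dict) -> bool:
    """Sketch of the cluster-membar invariant: synchronise CTAs only
    when two buffers were distinct in ttgir ('vreg' differs) but their
    allocated shared-memory intervals overlap after the Allocation pass.
    """
    distinct_in_ir = a["vreg"] != b["vreg"]
    overlap = (a["offset"] < b["offset"] + b["size"]
               and b["offset"] < a["offset"] + a["size"])
    return distinct_in_ir and overlap
```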

For now, we always emit a full cluster barrier. We can improve this in
the future by emitting `mbarrier`s that just synchronise subsets of the
CTAs. For that we would need to be a bit more clever, as we would need
to allocate some `mbarrier`s, but the Allocation pass has already run...

We add a number of test cases with comments on which of them are
expected and which can be improved.

* [Membar] Fix non-trivial function smem offsets (#9327)

Codex rightly identified that we were not considering the offsets of
functions in our membar analysis at
https://github.com/triton-lang/triton/pull/9318#discussion_r2740883625

Codex then went on and fixed it and added a regression test.
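The bug can be sketched as an interval-overlap check that forgets the per-function base offset. The field names here (`func_base`, `offset`, `size`) are illustrative, not the analysis's real representation; the point is that local offsets must be shifted by the function's shared-memory base before comparing.

```python
def buffers_alias(buf_a: dict, buf_b: dict) -> bool:
    """Check whether two allocated buffers overlap in shared memory,
    accounting for the base offset of the function each lives in.
    Without adding 'func_base', two buffers at local offset 0 in
    different functions would falsely appear to alias.
    """
    a0 = buf_a["func_base"] + buf_a["offset"]
    b0 = buf_b["func_base"] + buf_b["offset"]
    return a0 < b0 + buf_b["size"] and b0 < a0 + buf_a["size"]
```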

* Add descending flag to topk (#9355)

* Remove TritonIR dependence on TritonGPUIR (#9392)

* [AMD] Update gfx1250 MXFP FA example kernel (#9414)

Include some recent optimizations for the MXFP FA kernel.

* [FPSAN] Fix fpsan crash with warp specialization + tmem (#9415)

FPSan replaces tensor memory with global scratch, and was missing
correct handling of passing global memory pointers to the
warp_specialize op.
Also, use the -Ofc mid ptx compilation mode for fpsan compilation.

* [AMD][NFC][BACKEND] Use ROCDL ops for (cluster) async load (#9410)

Replaces intrinsics for (cluster) async loads with ROCDL Ops.

* [AMD] Fix gpu::BarrierOp lowering for gfx1250 (#9416)

Update gpu::BarrierOp lowering for gfx1250.

* [AMD] Broadcasting in v_perm based conversion (#9354)

This PR enables support for broadcasted layouts in the AMD-specific
convert_layout lowering:
- Generalizes code that previously assumed the number of src and dst
values is equal.

---------

Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>

* [NVIDIA] Fully restore canSkipBarSync (#9281)

* [CI] Remove unused commands (#9419)

Calling `df` twice should not have any impact; this seems like an
artifact from somewhere else that should be removed?

* [CI] Remove use of `LLVM_ENABLE_TERMINFO` (#9423)

This is a follow up to #9419 from looking at `llvm-build.yml`. When
building LLVM, setting `LLVM_ENABLE_TERMINFO=OFF` results in a CMake
warning:

```
CMake Warning:
  Manually-specified variables were not used by the project:

    LLVM_ENABLE_TERMINFO
```

I searched for uses of this in LLVM and there are none today. The latest
reference I found in GitHub issues/PRs was from 2024. That makes me
relatively confident this is no longer used so I'm proposing we remove
it.

* [Cluster Membar] Add a cross-cta barrier at the end of the kernel (#9413)

If we have an outstanding read or write (I think it'll always be a
read) and we didn't have a TMEM deallocation, we should add a
cluster-level barrier for correctness.

* [AMD][Gluon] Implement 8 wave F16 FA PingPong kernel. (#9427)

Ping-pong is another way to implement a performant attention kernel,
allowing co-execution between two waves in the same SIMD. In this PR we
introduce an 8-wave PP/warp-pipelined kernel and related num_warp
changes. We also cleaned up the way to invoke/select different variants
of the attention kernel.

Signed-off-by: Stanley Winata <stanley.winata@amd.com>

* [AMD][gfx1250] Adjust (BLOCK) M/N in f16 gemm examples (#9421)

Co-authored-by: Alexander Weinrauch <Alexander.Weinrauch@amd.com>

* [Tutorial] Fix a ty…
