Commit da2e93b

Authored by masahi, Jokeren, neildhar, peterbell10, and pawelszczerbuk
OSS main merge (triton-lang#17)
* [PROTON] Metadata profiling with periodic flushing (#9236)

* Fix infinite rewrite loop in latest LLVM (#9249)

* [Kernels] Enable high occupancy persistent matmul (#9248)

  Having independent work within a single SM allows the warp scheduler to hide some of the bubbles in our pipelines. This brings persistent bf16 x mxfp4 MoE from 2290 GBps to 2640 GBps on H200, a 15% improvement.

* [CONSAN] Optimize compilation time (#9240)

  Reduce compilation time with consan enabled by not emitting alias matrices when the buffers trivially don't alias. This should also have a positive impact on execution speed. We see a ~15% compilation-time reduction in tutorials/01-attention-forward.py; the impact may be even bigger in cases where smem and tmem buffers never alias.

* Explicitly convert numpy array into scalar because it is required by numpy 2.4.0 (#9172)

  # New contributor declaration
  - [x] I am not making a trivial change, such as fixing a typo in a comment.
  - [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how).
  - [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
  - Select one of the following.
    - [x] I have added tests.
      - `/test` for `lit` tests
      - `/unittest` for C++ tests
      - `/python/test` for end-to-end tests
    - [ ] This PR does not need a test because `FILL THIS IN`.
  - Select one of the following.
    - [x] I have not added any `lit` tests.
    - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)

  Fixes #9171

  Signed-off-by: Gregory Shimansky <gregory.shimansky@intel.com>

* [feat] Add sentinel when async compile (#9251)

* [triton_kernels] Fix device descriptor allocator to keep a pool (#9259)

  When used in a CUDA graph, the per-stream tensor allocator could leave dangling pointers in the CUDA graph. Suppose that during graph capture an 8 KB tensor is allocated as Triton scratch memory and recorded in the CUDA graph. If the allocator later needs to resize and allocates 16 KB, the previous tensor is decref'd and gc'd by Python. When replaying the CUDA graph, the 8 KB tensor that was captured is now dangling. Fix this by keeping old allocations alive.

* [AMD] Add clampf lowering via v_med3 (#9256)

  `v_med3_*` selects the median of three inputs, or returns the minimum if any of them is NaN.

* [PROTON] Further reduce unnecessary locks (#9257)

* [AMD] Added i8xi8xi32 v3 to wmma database (#9267)

* [AMD][gfx1250] Added missing f64.16x16x4.f64 to wmma database (#9271)

  Added support for the f64 case on the GFX1250 arch. Note:
  - We are skipping `llvm.amdgcn.wmma.f32.32x16x128.f4` support because there is no fp4 builtin type in Triton.
  - `llvm.amdgcn.wmma.f32.16x16x128.f8f6f4` can be added to the WMMA database as well. This requires refactoring; currently it is hard-coded and bypasses the database.

* [ci] Pin pandas < 3.0 (#9273)

  We're seeing test failures in proton, and a pandas update looks like the cause. Let's pin the version for now.
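The `v_med3` clamp lowering above relies on the identity that, for `lo <= hi`, `clamp(x, lo, hi)` equals the median of `{x, lo, hi}`. A minimal sketch of that equivalence (plain Python, not the backend lowering; the hardware's NaN behavior described above is not modeled here):

```python
def med3(a, b, c):
    """Median of three values, mirroring what v_med3_* computes for non-NaN inputs."""
    return sorted([a, b, c])[1]

def clampf(x, lo, hi):
    """Reference clamp: min(max(x, lo), hi)."""
    return min(max(x, lo), hi)

# For lo <= hi the two agree on every input.
for x in [-2.0, 0.0, 0.5, 1.0, 3.7]:
    assert med3(x, 0.0, 1.0) == clampf(x, 0.0, 1.0)
```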
* [BACKEND] Add support for out of tree Triton Dialect Plugins (#8523)

* Fix dotCanBeProperlyAsync when wgmma is not yielded by loop (#9274)

* [Backend][AMD] Change membarFilter to take bufferIDs into account (#9265)

  When a local memory location is deallocated and later reused, a `local_load` from the old allocation can still be in flight when a subsequent `async_copy` writes into the reallocated memory, which creates a data hazard because the membarFilter suppressed the barrier. We need to insert a `ttg.barrier local` to make sure the local load has finished fetching the data. Add Allocation to membarFilter to do that: only filter out a barrier when we are sure that the local_load and the async op have the same bufferID. If they access different bufferIDs, we shouldn't filter the barrier.

* [Kernels] Don't flatten persistent hopper mixed precision matmul (#9279)

  For some reason, overlapping the epilogue with the prologue of the next tile is actually slower here. Disabling it gives a 150 GBps speedup on H200, from 2650 GBps to 2800 GBps. My hypothesis is that because we use occupancy here, it allows the other block in the SM to better utilise the tensor cores while we process the epilogue. Similar to a ping-pong schedule, but done by the warp scheduler.

* [AMD][NFC] Used IsFPClass MLIR op for checkIsNan (#9276)

  Removes an explicit intrinsic call and replaces it with a dedicated MLIR op.

* Use variadic argument pre-compiled cuda launcher (#6788)

  * Create a new variadic launcher in driver.c.
  * Remove the C string / string substitution logic that creates a cuda kernel from driver.py.
  * Add logic to parse arguments and remove constexprs / flatten tuples in the new launcher.

  The launch overhead measured using the scripts in the comments below shows no regressions.

* [NVIDIA,Membar] Resurrect NVIDIA::canSkipBarSync (#9246)

  I was debugging Membar and was confused why `triton-opt --test-print-membar` gives different results than the normal compiler pass. It turns out this `canSkipBarSync` filter is enabled in testing mode but NOT in real compilation. It was removed during Blackwell enablement, presumably because it makes assumptions about the kernel structure. This scales it back to something more conservative and re-enables it in normal compilation. With this I see an 18 GBps improvement in persistent bf16 x mxfp4 MoE.

* [Docs] Clarify divisibility reset logic for contiguous dimensions in AxisInfo (#9266)

  Improves the code documentation in `AxisInfo.cpp`, specifically within the `MulIOp` visitor. The logic `lhsDivisibility = 1` when `contiguity > 1` is a critical safety guardrail; without explicit explanation it looks like an arbitrary pessimization. This change clarifies that:
  1. Contiguous sequences (e.g., `range(0, N)`) inherently have a GCD of 1 across the dimension.
  2. Propagating the base value's divisibility would lead to incorrect alignment assumptions in downstream passes (like `Coalesce` or `LoadStoreOp` vectorization), potentially causing illegal memory accesses or miscompiled code.

  This documentation helps future contributors understand the interaction between `Contiguity` and `Divisibility` without needing to derive the number theory from scratch.

* Fix infinite loop in dotCanBeProperlyAsync (#9282)

  The `checkOperand` traversal can run forever if either:
  1. A block argument participates in a cycle containing only permitted instructions.
  2. A block argument is defined outside of `forOp`, in which case we never advance transitiveOperand.

  To fix (1), track the set of visited block arguments. If we visit the same block argument again, we are in a cycle originating in the init value of the iter arg, which is outside the loop. To fix (2), check for values defined outside the loop as we iterate. This way, we know that if we are evaluating a block argument, it must be an iter arg of the loop.
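The visited-set fix for case (1) above can be sketched in isolation. This is a hypothetical model, not Triton's actual MLIR traversal: `producer` stands in for "the transitive operand that defines this value inside the loop".

```python
def traces_inside_loop(value, producer):
    """Walk a chain of defining values. `producer` maps a value to the operand
    it forwards. Returns False if the walk revisits a value (a cycle through a
    loop-carried iter arg, i.e. the chain originates outside the loop), and
    True once it reaches a value with no in-loop producer."""
    visited = set()
    v = value
    while v in producer:
        if v in visited:
            return False  # revisited: cycle through an iter arg
        visited.add(v)
        v = producer[v]
    return True

# A two-value cycle terminates instead of looping forever:
assert traces_inside_loop("a", {"a": "b", "b": "a"}) is False
assert traces_inside_loop("a", {"a": "b"}) is True
```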
* [triton_kernels] Fix test case for distributed routing kernels (#9258)

* [AMD][gfx1250] Add Assumptions and Fix Predicate in MXGEMM Kernel (#9285)

  This PR:
  - adds assumptions on the loop boundary
  - fixes a predicate to eliminate readfirstlane instrs
  - moves `static_profile` and `composition` to a shared location

* [AMD] Don't use s_waitcnt to lower global barrier for now (#9287)

  Atomics tests are failing a small percentage of the time because of an invalid transformation in the backend. This will be reverted once the backend is fixed and LLVM is bumped.

* Adding Triton community meetup notes for 20260106 (#9288)

  Topics and discussions covered:
  - Update on triton-shared (Haishan Zhu and Nhat Nguyen, Meta)
  - Update on the plugin system infrastructure: what's upstream today and the roadmap (Puyan Lotfi and Corbin Robeck, Meta)
  - Status of the Triton Plugin repository and an example: loop unfolding (Simon Waters, kernelize.ai)

* [AMD][gfx1250] Enable some TDM features (#9283)

  This pull request adds basic threading for supporting descriptor load/store. More changes are coming. For now this allows running:

  `python3 -m pytest -s test/unit/language/test_tensor_descriptor.py::test_make_tensor_descriptor_matmul[128-128-16-1-1]`

* [Consan] Reduce number of configs in coverage tests (#9294)

* Revert "[AMD][gfx1250] Added missing f64.16x16x4.f64 to wmma database" (#9291)

  This reverts https://github.com/triton-lang/triton/commit/53b0eafd76debe074965a5d751dd21c593097eb2.

* [AMD] Add PrepareIfCombining pass to enable scf.if combining (#9253)

  Add a new pass that moves operations out from between scf.if pairs that share the same condition. This enables the canonicalizer to combine adjacent if operations. Also remove the moveUpTranspose optimization from the ReorderInstructions pass, as we no longer need it with this new pass.

* [Backend] Separate out additive kReg smem padding contribution calculation (#9286)

  LLVM was struggling to use immediate values in the padded offset calculation. This change makes things easier for LLVM, and we get a nice reduction in register usage. For non-padded pathways, there should be no difference in the LLVM IR generated.

* [ANALYSIS] Enhance divisibility handling in AxisInfo for addition and subtraction operations (#9297)

  Unfortunately it's probably difficult to derive a very generic rule for all ops in Triton, though these `affine` ops share many common characteristics.

* [Nvidia] Enable TMA im2col mode -- Tensor Descriptor (#9225)

  This is the second PR in a series that enables TMA im2col mode (in addition to the existing tiled mode) for NVIDIA GPUs. The goal of the series is to support TMA im2col mode in the Gluon DSL.
  - First PR: https://github.com/triton-lang/triton/pull/9202
  - -> Second PR: https://github.com/triton-lang/triton/pull/9225

  PTX ISA documentation for TMA im2col mode: https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-im2col-mode
  TMA tensor descriptor documentation: https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html

  Summary of changes:
  - Add a general `TT_TensorDescInterface`
  - Add `TTNG_TensorDescIm2ColType` for the Nvidia TMA im2col mode
  - Add a lit test

  Note: the additional im2col parameters `elementStrides`, `pixelBoxLowerCorner`, `pixelBoxUpperCorner`, `channelsPerPixel`, and `pixelsPerColumn` are passed to the tensor descriptor from the host at runtime, so they do not need to be in the kernel IR. In im2col mode, `blockType` must be rank-2: [`channelsPerPixel` x `pixelsPerColumn`].
  Co-authored-by: Thomas Raoux <thomas.raoux@openai.com>

* [AMD][gluon][gfx1250] Support tensor async scatter (#9299)

  Supports tensor async scatter using TDM. Gluon only at the moment.

* [GLUON] Add get_view() to Gluon layouts (#9270)

  Adds a new method `get_view(shape, use_hw_view)` to all Gluon layout classes. It can be called from outside a kernel context, which is useful in development for inspecting and debugging layouts. Exports a new `gluon_ir.get_layout_view` function that creates a temporary MLIR context and converts to a LinearLayout. It returns the same output as `triton-tensor-layout`, but without needing to convert to TTG IR. Tested by asserting the output is the same as the CLI tool's.

  Example:

  ```python
  layout = gl.BlockedLayout([2, 1], [1, 4], [1, 2], [1, 0])
  print(layout.get_view([2, 8]))

  [[T0:0, T1:0, T2:0, T3:0, T4:0, T5:0, T6:0, T7:0]
   [T0:1, T1:1, T2:1, T3:1, T4:1, T5:1, T6:1, T7:1]]

  layout = gl.NVMMASharedLayout.get_default_for([16, 16], gl.float16)
  print(layout.get_view([16, 16]))

  [[( 0: 0),( 0: 1),( 0: 2),( 0: 3),( 0: 4),( 0: 5),( 0: 6),( 0: 7),( 0: 8),( 0: 9),( 0:10),( 0:11),( 0:12),( 0:13),( 0:14),( 0:15)]
   [( 1: 0),( 1: 1),( 1: 2),( 1: 3),( 1: 4),( 1: 5),( 1: 6),( 1: 7),( 1: 8),( 1: 9),( 1:10),( 1:11),( 1:12),( 1:13),( 1:14),( 1:15)]
   [( 2: 0),( 2: 1),( 2: 2),( 2: 3),( 2: 4),( 2: 5),( 2: 6),( 2: 7),( 2: 8),( 2: 9),( 2:10),( 2:11),( 2:12),( 2:13),( 2:14),( 2:15)]
   [( 3: 0),( 3: 1),( 3: 2),( 3: 3),( 3: 4),( 3: 5),( 3: 6),( 3: 7),( 3: 8),( 3: 9),( 3:10),( 3:11),( 3:12),( 3:13),( 3:14),( 3:15)]
   [( 4: 8),( 4: 9),( 4:10),( 4:11),( 4:12),( 4:13),( 4:14),( 4:15),( 4: 0),( 4: 1),( 4: 2),( 4: 3),( 4: 4),( 4: 5),( 4: 6),( 4: 7)]
   [( 5: 8),( 5: 9),( 5:10),( 5:11),( 5:12),( 5:13),( 5:14),( 5:15),( 5: 0),( 5: 1),( 5: 2),( 5: 3),( 5: 4),( 5: 5),( 5: 6),( 5: 7)]
   [( 6: 8),( 6: 9),( 6:10),( 6:11),( 6:12),( 6:13),( 6:14),( 6:15),( 6: 0),( 6: 1),( 6: 2),( 6: 3),( 6: 4),( 6: 5),( 6: 6),( 6: 7)]
   [( 7: 8),( 7: 9),( 7:10),( 7:11),( 7:12),( 7:13),( 7:14),( 7:15),( 7: 0),( 7: 1),( 7: 2),( 7: 3),( 7: 4),( 7: 5),( 7: 6),( 7: 7)]
   [( 8: 0),( 8: 1),( 8: 2),( 8: 3),( 8: 4),( 8: 5),( 8: 6),( 8: 7),( 8: 8),( 8: 9),( 8:10),( 8:11),( 8:12),( 8:13),( 8:14),( 8:15)]
   [( 9: 0),( 9: 1),( 9: 2),( 9: 3),( 9: 4),( 9: 5),( 9: 6),( 9: 7),( 9: 8),( 9: 9),( 9:10),( 9:11),( 9:12),( 9:13),( 9:14),( 9:15)]
   [(10: 0),(10: 1),(10: 2),(10: 3),(10: 4),(10: 5),(10: 6),(10: 7),(10: 8),(10: 9),(10:10),(10:11),(10:12),(10:13),(10:14),(10:15)]
   [(11: 0),(11: 1),(11: 2),(11: 3),(11: 4),(11: 5),(11: 6),(11: 7),(11: 8),(11: 9),(11:10),(11:11),(11:12),(11:13),(11:14),(11:15)]
   [(12: 8),(12: 9),(12:10),(12:11),(12:12),(12:13),(12:14),(12:15),(12: 0),(12: 1),(12: 2),(12: 3),(12: 4),(12: 5),(12: 6),(12: 7)]
   [(13: 8),(13: 9),(13:10),(13:11),(13:12),(13:13),(13:14),(13:15),(13: 0),(13: 1),(13: 2),(13: 3),(13: 4),(13: 5),(13: 6),(13: 7)]
   [(14: 8),(14: 9),(14:10),(14:11),(14:12),(14:13),(14:14),(14:15),(14: 0),(14: 1),(14: 2),(14: 3),(14: 4),(14: 5),(14: 6),(14: 7)]
   [(15: 8),(15: 9),(15:10),(15:11),(15:12),(15:13),(15:14),(15:15),(15: 0),(15: 1),(15: 2),(15: 3),(15: 4),(15: 5),(15: 6),(15: 7)]]
  ```

* [Nvidia] Enable TMA im2col mode - Fix tma load op (#9303)

  This is the third PR in the TMA im2col series.
  - First PR: https://github.com/triton-lang/triton/pull/9202
  - Second PR: https://github.com/triton-lang/triton/pull/9225
  - -> Third PR: https://github.com/triton-lang/triton/pull/9303

  This PR fixes the AsyncTMACopyGlobalToLocalOp:
  - Delete TTNG_TensorModeAttr, since we can infer the descriptor mode from the desc type
  - Remove TTNG_TensorModeAttr from TTNG_AsyncTMACopyGlobalToLocalOp
  - Fix the op verification and lit test accordingly

* [BACKEND][AMD] Add option to swap MIR (#8711)

  We use TRITON_SWAP_MIR to specify the path to an externally provided MIR file that is used for compilation to ASM in the LLVM backend, overriding the one generated by the regular pipeline. This can be used to apply custom transformations on the MIR and use it to generate the binary.

* [BACKEND][AMD] Use addOccurrence for LLVM options to properly disable schedulers (#9311)

  Use addOccurrence instead of setValue when setting LLVM command-line options like enable-misched=false. This is necessary because LLVM's scheduler passes check getNumOccurrences() to determine whether the option was explicitly set on the command line (see llvm/lib/CodeGen/MachineScheduler.cpp). Also add a test to verify that the MIR swap pipeline starts before machine-scheduler and that both machine-scheduler and the post-RA scheduler are disabled.
* [PROTON][BLACKWELL] Enable low-overhead hardware trace (#9307)

* Increment index in plugin example (#9315)

  In the Triton Plugin example, the plugin pass enumerator does not increment the index into `passNames`. If this example were used elsewhere with multiple passes, this would overwrite the first returned pass name. It's a minor tweak and this is just an example, but it seems best to fix this before it gets heavy use.

* [triton] Add `tl.cat(can_reorder=False)` implementation (#9312)

  This resurrects the old PR that replaced the implementation entirely. It also fixes `tl.cat` to be equivalent in semantics to `torch.cat`.

* [BACKEND][AMD] Add ScopedNoAliasAAWrapperPass to MIR swap pipeline (#9309)

* Fix tolerance for float8 x mx combinations in matmul tests (#9316)

  Increases the tolerance from 3e-2 to 2e-1 for float8 x mx combinations to account for compounded quantization errors.

* [triton_kernels] Unfuse fma in reduce kernel for numeric stability (#9320)

* [Nvidia] Enable TMA im2col mode - driver support (#9305)

  This is the fourth PR in the TMA im2col series.
  - First PR: https://github.com/triton-lang/triton/pull/9202
  - Second PR: https://github.com/triton-lang/triton/pull/9225
  - Third PR: https://github.com/triton-lang/triton/pull/9303
  - -> Fourth PR: https://github.com/triton-lang/triton/pull/9305

  This PR adds the driver function for creating the tensor descriptor for TMA im2col mode. The driver function is documented at https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html
* [PROTON] Fix default value for TRITON_ENABLE_HW_TRACE in CuptiProfiler (#9324)

* [AMD][gfx1250] Update f16 GEMM examples (#9326)

  * Factored out common utilities
  * Added support for L2 prefetch

  Co-authored-by: Alexander Weinrauch <Alexander.Weinrauch@amd.com>
  Co-authored-by: Jungwook Park <jungwook.park@amd.com>
  Co-authored-by: Xiaohu Guo <Xiaohu.Guo@amd.com>

* [Backend] Bump LLVM to 2eb709b95d8f (#9264)

  Bumps LLVM to `2eb709b95d8f` to consume two fixes related to `gfx1250`:
  - https://github.com/llvm/llvm-project/pull/176206
  - https://github.com/llvm/llvm-project/pull/176355

  Co-authored-by: neildhar <neildhar@users.noreply.github.com>
  Co-authored-by: Lei Zhang <antiagainst@gmail.com>

* [triton_kernels] Add unpadded batch size handling to reduce (#9332)

* [AMD] Use v_perm instruction for convert_layout acceleration (#9014)

  This PR introduces an AMD-specific ttg->llvm pattern that uses v_perm instructions instead of combinations of shifts and logical operations. Limitations of this pattern:
  - Applied only to 8-bit data types
  - The conversion is required to be bijective
  - No permutation across threads in the workgroup

  Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>

* [Kernels] Re-enable 4-warp persistent kernel (#9331)

  It looks like some of the interceding changes have removed the code sequence that was causing issues with ptxas. I also had to disable `warp_specialize`, as I'm getting an invalid IR error from inside the pass.

* Partial revert of "[AMD][NFC] cleanup llvm instrinsic calls; replaced with rocdl ops" (#9334)

  `getClusterCTAId` should return the local CTA id in the cluster. https://github.com/triton-lang/triton/pull/9222 replaced it with the overall cluster id, which is wrong. I could not find a ROCDL op exposing the workgroup id inside the cluster, so this PR reverts it back to the intrinsic.

* Revert "Fix tolerance for float8 x mx combinations in matmul tests (#9316)" (#9338)

  This reverts commit bcbcabdd0cff6539c7168299075992b2a23ff38e.
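The `v_perm` acceleration above works because a byte-level layout conversion is a byte permutation, which `v_perm_b32` performs in a single instruction. The following is an illustrative software model of that instruction, not the backend code: each selector byte picks one byte from the 8-byte concatenation of the two source dwords (selector values of 8 and above have special hardware meanings that are not modeled here).

```python
def v_perm(src0: int, src1: int, sel: int) -> int:
    """Model of v_perm_b32: bytes 0-3 of the pool come from src1, bytes 4-7
    from src0; selector byte i (little-endian) chooses result byte i."""
    pool = src1.to_bytes(4, "little") + src0.to_bytes(4, "little")
    # & 0x7 keeps the index in range, ignoring the special selector encodings.
    out = bytes(pool[(sel >> (8 * i)) & 0x7] for i in range(4))
    return int.from_bytes(out, "little")

# Byte-reverse src1 with a single "instruction":
assert v_perm(0xAABBCCDD, 0x11223344, 0x00010203) == 0x44332211
```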
* [AMD] NFC: Commonise some checks into functions (#9335)

  This makes feature checks a bit tidier.

* [FRONTEND] Add option to round f32 to tf32 in descriptor (#9295)

  Adds an option to round float32 inputs to tf32 values on the fly.

* [AMD][Gluon] Warp-pipeline fixes and priority hint (#9301)

  These changes sync internal warp-pipeline fixes into upstream to align correctness (warp size), configurability (priority hints), and safety (dependency analysis / barrier placement) across the Gluon and LLVM lowering paths.
  - Correct warp size handling in the warp-pipeline for the given AMDGPU HW arch.
  - Introduce optional priority hints on warp-pipeline stages, enabling more explicit scheduling intent for stage ordering/perf tuning.
  - Fix a bug in warp-pipeline dependency analysis.

* [AMD][gfx1250] Skip dot tests for bf16x3 and bf16x6 types (#9343)

* Apply rounding to TF32 in MoE (#9296)

* [AMD][BACKEND] Fix TDM shape adjustment to factor in the CGA offset (#9341)

  TDM needs to adjust the shape in the tensor descriptor to account for the tile offset when computing OOB. However, we did not take the CGA offset into account. I adjusted the test to overallocate the buffers so the test can catch out-of-bounds writes and we do not rely on a random segfault to detect the OOB write.

* [AMD][gluon][gfx1250] Add tensor async gather support using TDM (#9313)

  Implements tensor async_gather using TDM, in a similar fashion to https://github.com/triton-lang/triton/pull/9299, on Gluon.

* [AMD] Replace ReorderInstructions with MoveUpPrologueLoads (#9328)

  After the recent changes, the ReorderInstructions pass had only one optimization left: moving prologue loads early for prefetching. Add a new pass for that optimization, refactor the implementation, and add more tests.
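The f32-to-tf32 rounding mentioned above can be illustrated at the bit level. TF32 keeps the f32 sign and 8-bit exponent but only the top 10 mantissa bits; the sketch below uses plain truncation and is not Triton's actual conversion (which may use a different rounding mode):

```python
import struct

def f32_to_tf32(x: float) -> float:
    """Truncate an f32 value to tf32 precision by zeroing the low 13 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # keep sign, 8-bit exponent, and top 10 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

assert f32_to_tf32(1.0) == 1.0  # exactly representable values pass through
```

Truncation is idempotent, so applying the conversion twice gives the same value as applying it once.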
* [AMD] Support TDM AsyncWait in UpdateAsyncWaitCount (#9352) Since TDM scatter and gather might produce more than 1 intrinsic/assembly instruction we need to compute the correct waitcnt based on the #intrinsics instead of the number of TGGIR ops. This piggy-backs on the analysis pass we already use for `ttg.async_copy_global_to_local`. This allows us to use the number of TTGIR TDM ops in Gluon and also works with the token based approach from the pipeliner. * [AMD][BACKEND] Support multi-CTA for `AMDWmmaEncodingAttr` (#9340) Adjusts the verifier of DotOperandEncoding to allow for different CGALayouts between the accumulator and the operands. The dot verifier ensure that the accumulator does not broadcast and A is broadcasting along dim1 and B along dim0. If this turns out to be too restrictive, e.g. if the case comes up where we want to do duplicated work on CTAs, we can relax this later. * [Gluon] Refactor multi-CTA support of PaddedSharedLayouts (#9336) Refactors PaddedSharedLayouts to make them work nicer in multi-CTA. - Instead of `block_bases` it now holds the `CGALayout` in Gluon. - `with_identity_for` has to use the `shape_per_cta` to properly build the identity `linear_component` - `get_padded_shared_layout` should multiply the CTA and CGA layout - Adjust the verifier to broadcast in the `block` dimension. We still do not support broadcasts in the `offset` dimension * [AMD] Introduce PartitionedSharedEncodingAttr (#9314) This PR introduces PartitionSharedEncodingAttr. An encoding for tensors whose elements are partitioned across multiple separate shared memory allocations. This reduces shared memory partition conflicts by splitting a tensor along a specific dimension into separate allocations. Parameters: - numPartitions: Number of distinct memory partitions (and separate buffers). Buffers in different partitions MUST be placed in different physical shared memory slots. - numGroups: Number of groups. 
Each group contains numPartitions consecutive pieces of the tensor. - partitionDim: The dimension along which the tensor is partitioned. - partitionLayout: The shared memory layout used within each piece. The total number of logical pieces is numPartitions * numGroups. Pieces are organized as: [Group 0: pieces 0..numPartitions-1] [Group 1: pieces numPartitions..2*numPartitions-1] ... ## Memory Allocation The allocator creates numPartitions buffers (NOT numLogicalPieces buffers). Each buffer contains all pieces from all groups that belong to that partition, concatenated together. Buffer size = pieceSize * numGroups. For example, with numPartitions=2, numGroups=4 and partitionDim=0 on a [128, 32] tensor: - Total 8 logical pieces, each of size [16, 32] (pieceSize = 16*32 elements) - Piece layout: [0, 1, 2, 3, 4, 5, 6, 7] - Partitions: 0 1 0 1 0 1 0 1 - Groups: |Grp0||Grp1||Grp2||Grp3| Physical allocation (2 buffers, each containing 4 pieces): - Buffer 0 (Partition 0): [Piece0 | Piece2 | Piece4 | Piece6] - Buffer 1 (Partition 1): [Piece1 | Piece3 | Piece5 | Piece7] Physical allocation guarantee: Buffers in different partitions MUST reside in distinct physical shared memory partitions. TODO: Implement lowering of operations with PartitionSharedEncodingAttr. --------- Co-authored-by: Ognjen Plavsic <plognjen@amd.com> * [AMD][GLUON] Add multi-CTA GEMM example for `gfx1250` (#9342) Adds a multi-CTA GEMM example for `gfx1250` where we multicast A and B sub-tiles which are shared by CTAs of the cluster. `ctas_per_cga` is uses to control the cluster size and how we partition the accumulator across CTAs. 
This PR requires the following PRs to pass its correctness tests:
- https://github.com/triton-lang/triton/pull/9341
- https://github.com/triton-lang/triton/pull/9340
- https://github.com/triton-lang/triton/pull/9336
- https://github.com/triton-lang/triton/pull/9334

* [AMD] Support 8-Warp Pingpong and Refactor MXGEMM Kernel on GFX1250 (#9356) This PR:
- Refactored the MXGEMM kernel to support various schedules
- Supported 8-warp scheduling and 8-warp pingpong scheduling

---------

Co-authored-by: Lei Zhang <antiagainst@gmail.com>

* [TRITON_KERNELS] some more tweaks (#9350)

* [AMD][GLUON] Support AutoLayout for `offsets` and `mask` in `buffer_store` (#9353) `buffer_store` can infer the layout of `mask` and `offsets` based on the layout of `stored_value`. Currently we only infer the `mask` layout from `offsets`, which is unnecessarily restrictive.

* [docs] Add topk operation to language documentation (#9345)

## Summary
- Add `topk` to the Scan/Sort Ops section in `triton.language.rst`
- Add a docstring to the `topk` function describing parameters, return value, and a usage example

## Motivation
Fixes #9278 The `topk` operation is implemented and exported in `triton.language` but was not mentioned in the API documentation at https://triton-lang.org/main/python-api/triton.language.html

## Changes
1. **docs/python-api/triton.language.rst**: Added a `topk` entry under the Scan/Sort Ops section
2. **python/triton/language/standard.py**: Added a comprehensive docstring to the `topk` function including:
   - Description of functionality
   - Parameter documentation
   - Return type
   - Usage example

## Testing
Documentation builds correctly with the new entry.

* [AMD][BACKEND] Support padding in TDM store if interval equals the inner dimension (#9360) TDM store does not support padding. However, we can adjust the tile dim to include padding and shrink the tensor shape to drop stores for padding bytes. This only works for a single padding interval that equals the inner dimension. Other padding configs are not supported.
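The `topk` operation documented above can be illustrated with a plain-Python reference. This is only a sketch of top-k semantics for the docs entry, not Triton's parallel implementation, and the exact `tl.topk` signature may differ:

```python
# Reference semantics for a top-k operation, as a plain-Python sketch.
# Triton's real topk runs as a parallel kernel over tensors; this only
# illustrates what the result should contain.
def topk_ref(values, k, descending=True):
    """Return the k largest (or smallest) elements, in sorted order."""
    return sorted(values, reverse=descending)[:k]

print(topk_ref([3, 1, 4, 1, 5, 9, 2, 6], k=3))  # -> [9, 6, 5]
```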
* [AMD][gfx1250] Add clamp operand to WMMA int intrinsic (#9359)

* reland upgrade ptxas for blackwell (#9363)

* [AMD][gfx1250] Roll up some small changes (#9365)
- Register gfx1250 to various target capability queries
- Tighten some error checking logic
- Adjust tests for gfx1250

* Revert "[AMD] Introduce PartitionedSharedEncodingAttr" (#9367) Reverts triton-lang/triton#9314, which caused a functional regression in the fence insertion pass.

* [PROTON] Vector metrics (#9329)

* Fix static srcMap bug causing multi-arch FP8 compilation failures (#9364) Compiling FP8 kernels for multiple GPU architectures in the same process fails with "failed to legalize operation 'tt.fp_to_fp'". The bug is in ElementwiseOpToLLVM.cpp's `FpToFpOpConversion::getConversionFunc()`. The srcMap that maps (srcType, dstType, roundingMode) to converter functions was declared `static`, causing it to be initialized once with the first architecture's isaFamily. Subsequent architectures would reuse the wrong converters. Example failure scenario:
1. Compile an FP8 kernel for gfx950 (CDNA4) first - the static srcMap is initialized with CDNA4 converters (e.g., Fp8E4M3FNUZ_to_Fp16)
2. Compile the same kernel for gfx942 (CDNA3) - the srcMap is already initialized and returns CDNA4 converters - CDNA3 lacks the CDNA4-specific instructions, so compilation fails

The performance impact of removing `static` is negligible, since the map has only ~25 entries and is built per-conversion (not per-instruction).

* [TRITON_KERNELS] added global scale to reduction (#9372)

* [AMD][gfx1250][gluon] TDM: Fix OOB handling for Scatter/Gather (#9371) Fixes the incorrect offset handling when the tile size `N` is not a multiple of `BLOCK_N`, causing the last block of a TDM Scatter/Gather to be a partial block. This happened because, unlike in `fillTDMDescriptor` for standard TDM, the tensor shapes weren't updated based on offsets within `fillTDMDescriptorForGatherScatter` for gather/scatter.

* [AMD][GLUON] Add 4 and 8 warps stream-k gluon kernels for gfx1250 (#9370)
- add an 8-warp pingpong stream-k kernel implementation
- persistent loop prefetch to enable overlap between prologue and epilogue
- implement 4-warp stream-k specific partial tile processing with atomic-based spinning locks

* Consider rematerialisation cost when hoisting over ext (#9194) Avoid hoisting converts if the cost of rematerialising the slice and the new convert is higher than the original convert. To do this, factor out the cost computation from `backwardRematerialization` into `isRematBeneficial` and call it from `hoistConvertOnTopOfExtOrBroadcast`. The added test demonstrates a simple example where the hoisted convert is more expensive than the original.
It is also possible to extend this to the other hoisting functions, although it requires slightly more care (e.g. a convert hoisted into a conditional should probably carry less cost).

* Fix assert in `_p_matmul` (#9376) The current check prevents calling the persistent matmul on MXFP8, though per the comment it is intended to block MXFP4 weights only.

* [Backend] Bump to llvm/llvm-project@ac5dc54d5091 (#9333)

* Reapply "[AMD] Introduce PartitionedSharedEncodingAttr" (#9367) (#9374) The initial patch was reverted because ProxyFenceInsertion.cpp still used getBufferIds instead of getAllBufferIdsWithAliases. Previously, getBufferIds returned buffer IDs for both a value and its aliases. With partitioned tensor support, a single value can now hold multiple buffers, so the API was split: getBufferIds returns only the buffer IDs for a value, while getAllBufferIdsWithAliases includes aliased buffer IDs as well.

Co-authored-by: Ognjen Plavsic <plognjen@amd.com>

* [AMD] Support PaddedSharedLayout in TDM Gather on GFX1250 (#9369) This PR supports `PaddedSharedLayout` in TDM Gather in limited cases.
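The rematerialisation-cost check from #9194 above boils down to a comparison: hoist only if rematerialising the slice plus the new convert is no more expensive than the original convert. A minimal sketch, with all names and cost units purely illustrative (this is not Triton's actual `isRematBeneficial` API):

```python
# Hypothetical sketch of the cost check described above. Costs are abstract
# instruction-count estimates; the real analysis walks the IR slice.
def is_remat_beneficial(slice_cost, new_convert_cost, original_convert_cost):
    # Hoisting pays off only if duplicating the slice and emitting the new
    # convert is not more expensive than keeping the original convert.
    return slice_cost + new_convert_cost <= original_convert_cost

# A hoist that must rematerialise an expensive slice is rejected:
print(is_remat_beneficial(slice_cost=10, new_convert_cost=4,
                          original_convert_cost=8))  # -> False
```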
* [AMD][gfx1250] test_dot fix for small K (#9358)

* [ConSan] First pass at improving ConSan compile times (#9366) When compiling `01-attention-forward.py` with consan enabled, before:
```
# times.ir_initialization=790100     0.79 sec
# stage='ttgir' duration=182893      0.18 sec
# stage='llir' duration=99643940     99.6 sec
# stage='llvmir' duration=27658211   27.7 sec
# stage='ptx' duration=11120513      11.1 sec
# stage='cubin' duration=1149355658  19.15 min
```
After:
```
# times.ir_initialization=796533     0.79 sec
# stage='ttgir' duration=184720      0.18 sec
# stage='llir' duration=16957192     17.0 sec
# stage='llvmir' duration=3735579    3.74 sec
# stage='ptx' duration=1972309       1.97 sec
# stage='cubin' duration=34357675    34.36 sec
```
This PR does quite a number of things at once:
* Custom CanonicalizeLLVMIR pass that adds a pattern for `select %false|%true, %a, %b`, since the LLVM dialect is missing this (and is opposed to adding it)
* Cache global constants to avoid creating many copies of the same string when lowering asserts
* Fix warp specialize lowering to handle function calls and deduplicate barrier lowering code between the NVIDIA and AMD backends. To support function calls, non-kernel functions are rewritten to accept a barrier handle argument that is passed down from the call site
* Rewrite `createMultiColumnMask` to generate a constant tensor rather than computing it from a bunch of `make_range` and masking. This single function was generating gigabytes of IR
* Pick warp-local layouts in consan instrumentation. Previously, consan used thread-local layouts where every thread has a copy of the tensor. This was to avoid using shared memory. We can switch to warp-local layouts, where each warp has a copy of the tensor distributed across its threads, to reduce the generated IR (and register usage) by a factor of 32, plus some extra IR needed for shuffles.
* To support warp-local layouts, I added a `uniform` flag to `tt.assert` and replaced `tti.experiment_assert_in_thread` with a `tt.reduce` + `tt.assert uniform`. Uniform just means only the first thread in the warp group will trigger the assert, since the condition is uniform
* I added an `always_use_warp_shuffle` function-level flag to force `convert_layout` lowering to use warp shuffles even when the performance heuristic picks shared memory, to avoid using shared memory for the layout conversions inside consan helpers
* Changed the lowering of `arith.constant` with a non-splat dense elements attribute to generate a constant global array from which each thread loads
* Outlined the global stores in the main function into a helper function to reduce bloat by deduplicating. This also enables separate compilation later.

* [AMD][BACKEND] Fix shared layout order for AsyncCopy on GFX9 (#9373) The heuristic used `getElementsPerThread` instead of `getContigPerThread` to retrieve the contiguity. It also has to clamp the `vecSize` to a width supported by the hardware to see if we need to carry over the `regOrder`. Before this PR, for the added lit tests we selected an order for the shared layout which does not allow for coalesced direct-to-lds writes.

* [AMD][GFX1250] Roll up some minor changes (#9380)
* Add pattern to optimize tanh
* Fix asserts for WMMA
* Fix some TDM field updates

* [AMD][BACKEND] Fix RangeAnalysis tripCount calculation (#9383) We currently check whether the `optional<bool>` holds a value instead of checking the actual held value to decide if we take the upper or the lower limit. If the optional is empty we take the minimum to preserve the behavior from before this PR. Note that the first adjusted lit test only tests whether we fold an `scf.ifOp` after the `scf.forLoop`, so the changes in the loop are not removing the actual test.

* [AMD][GLUON] Add CGA Layout to wmma scale (#9381) Attach CGA Layout for scale's linear layout.
This is needed when we use wmma scaled in multi-cta kernels.

* [AMD][gluon] Add gfx1250 warp-pipeline f16 GEMM example (#9382) Provide a warp-pipeline TDM GEMM kernel with shared-memory buffering.

* add a second gb200 node to CI (#9384)

* [BACKEND] Improve and simplify ReduceOp's lowering (#9219) Stacked PRs:
- #9327
- #9318
- #9317
- #9221
- #9220
- -> #9219

### [BACKEND] Improve and simplify ReduceOp's lowering

We implement a LinearLayout-based `ReduceOp` lowering. This has a number of benefits:
- The logic is noticeably simpler, as we barely have to implement anything. ConvertLayout and some LL helpers do all the heavy lifting
- We get shmem swizzling for free
- We sometimes save a shmem round-trip (before, we did it unconditionally)
- It is now clear that we have a `tmpLl` variable we can carefully choose (we'll do so in a future PR)
- It opens the door to returning an arbitrary layout (fusing a `convert_layout` into this op)
- It is now really simple to generalise this op to perform cross-cluster reductions, provided that `convert_layout` supports them
- We fix some latent issues the previous implementation had when run on arbitrary linear layouts. We add a funky regression test that used to fail and now passes
- All this while being LOC-neutral!

In future PRs we will improve the choice of `tmpLl` to avoid the last `convert_layout` in many cases, and we will pack the inputs in shmem to be able to vectorize the load/stores for full reductions with multiple inputs. This PR was the result of quite a long (but rather successful) vibe-coding session together with `gpt-5.2-codex`. I found it particularly useful to be able to emit a ConvertLayout within this lowering rather than having to call the lowering of the function manually. This simplifies the code quite a bit and I would have struggled to convince MLIR to do so myself.
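The `tt.reduce` + `tt.assert uniform` rewrite from the ConSan entry above can be sketched in plain Python. This is purely illustrative (not Triton's API): instead of every thread asserting its own element, the per-element conditions are first reduced, so a single uniform assert fires per warp group:

```python
# Illustrative sketch of replacing a per-thread assert with a
# reduce + uniform assert. "conds" stands in for a per-thread
# condition tensor; names here are not real Triton functions.
def assert_in_thread(conds, msg):
    for c in conds:          # conceptually: one assert per thread
        assert c, msg

def assert_uniform(conds, msg):
    reduced = all(conds)     # tt.reduce over the condition tensor
    assert reduced, msg      # tt.assert uniform: fires once, condition is uniform

conds = [True] * 32
assert_in_thread(conds, "out-of-bounds access")
assert_uniform(conds, "out-of-bounds access")  # same observable behaviour
```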
* [BACKEND] Perform tree reductions on in-thread values (#9220) Stacked PRs:
- #9327
- #9318
- #9317
- #9221
- -> #9220

### [BACKEND] Perform tree reductions on in-thread values

We generate ternary trees for suitable integer ops and binary trees for everything else. We manually generate `{add,mul}.{f16,f32}x2` ops. This brings a speed-up to some gluon attention kernels.

* [Nvidia] Enable TMA im2col mode - LLVM lowering (#9322)

# Summary
This is the fifth PR in a series that enables TMA im2col mode (in addition to the existing tiled mode) for NVIDIA GPUs. The goal of the series is to support TMA im2col mode in the Gluon DSL.
- First PR: https://github.com/triton-lang/triton/pull/9202
- Second PR: https://github.com/triton-lang/triton/pull/9225
- Third PR: https://github.com/triton-lang/triton/pull/9303
- Fourth PR: https://github.com/triton-lang/triton/pull/9305
- -> Fifth PR: https://github.com/triton-lang/triton/pull/9322

PTX ISA documentation for TMA im2col mode: https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-im2col-mode
TMA tensor descriptor documentation: https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html

# Summary of Changes
Added LLVM lowering logic for `AsyncTMACopyGlobalToLocalOpConversion` to support im2col mode.
## Im2col Mode Constraints

### pixelsPerColumn (non-contiguous dimension)
- **Maximum size**: 1024 elements
- **Corresponds to**: Spatial dimensions (N, D, H, W)
- **Block shape**: Restricted to match `shapePerCTA` (no splitting)
- **Rationale**: Avoids generating multiple TMA messages along spatial dimensions, eliminating complex offset calculations that would depend on input tensor shape and padding
- **Note**: 1024 is sufficient for most practical use cases

### channelsPerPixel (contiguous dimension)
- **Maximum size**: 256 elements, or the swizzle byte size if swizzle is enabled
- **Multiple messages**: Supported when the channel dimension exceeds the block size
- **Offset application**: Only coord[0] (the channel coordinate in PTX order) receives non-zero offsets

## Key Implementation Details
1. **Offset application**: For im2col mode, only the channel dimension receives non-zero offsets; spatial dimension offsets are always 0 (verified by assertion)
2. **Im2col offsets reversal**: Spatial offsets (e.g., `off_w`, `off_h`) are reversed to match PTX/CUDA innermost-to-outermost ordering, consistent with coordinate handling
3. **Alignment with tiled mode**: These constraints align with the tiled mode behavior used for GEMM operations

* [FPSAN] Introducing FpSan - floating point sanitizer (#9337) FpSanitizer ensures that arithmetically equivalent kernels produce the exact same results by replacing floating-point ops with integer ones. For every binary op we cast operands to int, perform the equivalent integer op, then cast back to float. Unary ops are replaced with identity. For dot ops we cast operands to int and perform a matmul in registers. The resulting kernel is of course numerically incorrect, but it should be independent of operation order. If two implementations match under fpsan, they are very likely algorithmically equivalent even if FP results differ due to ordering/reassociation/etc. Note about tensor memory: tcgen05_mma is lowered by loading operands into registers and manually emulating MMA. Because tcgen05_mma can be issued from a partition with 1 warp group while tmem load/stores require 4 or 8 warps, we replace all tmem uses with global scratch memory.

* [KERNELS] enable swap_xw on blackwell for non-mx matmuls (#9390) Helps significantly for ragged matmuls where the slice size is small; otherwise, blackwell will compile to mma.sync.

* [BACKEND] Implement support for cross-CTA tt.reduce (#9221) Stacked PRs:
- #9327
- #9318
- #9317
- -> #9221

### [BACKEND] Implement support for cross-CTA tt.reduce

The title of this PR is a bit of a lie.
Even though the lowering is now implemented to support cross-CTA reductions, it depends on `convert_layout` supporting them, and it doesn't currently support LinearLayouts. We should generalise this one first and then enable it here. We should also emit the correct cross-CTA barrier from `targetInfo` in the case of cross-CTA memory reuse. In this PR, we take the chance to also generalise the lowering to avoid convert layouts whenever possible.

* [UT] Generalize Proton tests by providing `device` fixture (#9351) The same approach as with other unit tests.

---------

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Co-authored-by: Keren Zhou <kerenzhou@openai.com>

* [Gluon] Add Cluster Launch Control (CLC) support for Blackwell GPUs (#9361) This adds support for NVIDIA's Cluster Launch Control (CLC) feature on Blackwell (SM100+) GPUs, enabling dynamic work distribution for persistent kernels. CLC allows running workers to cancel not-yet-launched clusters and take over their work, improving load balancing when SM availability varies.

New Gluon API (triton.experimental.gluon.language.nvidia.blackwell.clc):
- try_cancel(result, mbar): Issue an async CLC request to cancel a pending cluster
- is_canceled(result): Check if cancellation succeeded (returns non-zero)
- get_first_ctaid(result, dim): Get the canceled cluster's first CTA ID

MLIR ops added:
- ttng.clc_try_cancel: Lowers to clusterlaunchcontrol.try_cancel.async PTX
- ttng.clc_is_canceled: Lowers to clusterlaunchcontrol.query_cancel.is_canceled
- ttng.clc_get_first_ctaid: Lowers to clusterlaunchcontrol.query_cancel.get_first_ctaid

All ops include SM100+ compute capability checks and emit errors on older GPUs. A tutorial is included demonstrating a CLC matmul achieving 92.5% of cuBLAS performance on 8192x8192x8192 FP16 matrices.

---------

Co-authored-by: Peter Bell <peterbell10@openai.com>

* [ConSan] Use `-Ofc mid` for ConSan compilation (#9394) I tested a variety of `ptxas` flags and found the following (using `01-attention-forward.py`):
```
           compile (sec)   execution (sec)
-O0        19              81
-O1        49              22
-O3        52              21
-Ofc max   19              82
-Ofc mid   21              9
-Ofc min   21              9
-Ofc 0     19              21
```
Shockingly, `-Ofc mid|min` yield the best total compilation+execution times for consan.

* [TritonGPU] Run Gluon canonicalizer in LLVM lowering when ConSan is on (#9396) The regular canonicalizer would rewrite layout conversions.

* Add missing NVGPUToLLVMPass dependency (#9398) CMake can sometimes fail with the message below. NVGPUToLLVMPass has a dependency on TritonGPUOpInterfacesIncGen.
```
FAILED: Compilers/triton/third_party/nvidia/lib/NVGPUToLLVM/CMakeFiles/NVGPUToLLVM.dir/NVGPUToLLVMPass.cpp.o
In file included from /__w/1/s/src/Compilers/triton/third_party/nvidia/lib/NVGPUToLLVM/NVGPUToLLVMPass.cpp:11:
In file included from /__w/1/s/src/Compilers/triton/third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/Utility.h:8:
In file included from /__w/1/s/src/Compilers/triton/include/triton/Conversion/TritonGPUToLLVM/Utility.h:7:
In file included from /__w/1/s/src/Compilers/triton/include/triton/Conversion/TritonGPUToLLVM/TargetInfoBase.h:4:
In file included from /__w/1/s/src/Compilers/triton/include/triton/Conversion/MLIRTypes.h:5:
In file included from /__w/1/s/src/Compilers/triton/include/triton/Dialect/TritonGPU/IR/Dialect.h:11:
In file included from /__w/1/s/src/Compilers/triton/include/triton/Dialect/TritonGPU/IR/Attributes.h:6:
/__w/1/s/src/Compilers/triton/include/triton/Dialect/TritonGPU/IR/TritonGPUInterfaces.h:10:10: fatal error: 'triton/Dialect/TritonGPU/IR/OpInterfaces.h.inc' file not found
   10 | #include "triton/Dialect/TritonGPU/IR/OpInterfaces.h.inc"
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
* [Nvidia] Enable TMA im2col mode - Gluon API (#9391)

# Summary
This is the sixth PR in a series that enables TMA im2col mode (in addition to the existing tiled mode) for NVIDIA GPUs. The goal of the series is to support TMA im2col mode in the Gluon DSL.
- First PR: https://github.com/triton-lang/triton/pull/9202
- Second PR: https://github.com/triton-lang/triton/pull/9225
- Third PR: https://github.com/triton-lang/triton/pull/9303
- Fourth PR: https://github.com/triton-lang/triton/pull/9305
- Fifth PR: https://github.com/triton-lang/triton/pull/9322
- -> Sixth PR: https://github.com/triton-lang/triton/pull/9391

PTX ISA documentation for TMA im2col mode: https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-im2col-mode
TMA tensor descriptor documentation: https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TENSOR__MEMORY.html

# Summary of Changes
Added Gluon DSL frontend support for TMA im2col mode, completing the end-to-end path from the Python API to runtime descriptor creation.
- Extended `TensorDescriptor` with im2col fields (`mode`, `elementStrides`, `pixelBoxLowerCorner`, `pixelBoxUpperCorner`) and parameter validation
- Refactored `tensor_descriptor_type` into a base class with `tensor_descriptor_type` (tiled) and `tensor_descriptor_im2col_type` (im2col) subclasses
- Added an `offsets` parameter to `async_copy_global_to_shared` for im2col spatial offsets
- Updated specialization (`specialize.cc`) to emit `tensordesc_im2col<..., input_rank=N>` signatures and type parsing (`str_to_ty`) to reconstruct the correct types
- Updated the driver to parse im2col signatures and call `fill_tma_descriptor_im2col` at runtime
- Changed `amendFuncOp` to use `TensorDescInterface` instead of `TensorDescType` so im2col descriptors also receive `tt.nv_tma_desc`
- Added an end-to-end `test_tma_im2col` Gluon test
---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: peterbell10 <peterbell10@live.co.uk>

* [BACKEND] Support generic multi-cta convert_layouts (#9317) Stacked PRs:
- #9327
- #9318
- -> #9317

### [BACKEND] Support generic multi-cta convert_layouts

We generalise the swizzling algorithm to work with blocks and generalise most of the memory lowerings to support layouts with blocks. We remove the legacy lowering. The generic swizzling algorithm for blocks might be fine, but we didn't try to be super clever. There might be some perf left on the table. We can look into this at a later point if it becomes relevant. We also activate multi-cta reductions in the process and test both there.

TODO: Add some funky tests that just test the `convert_layout`, not the `convert_layout` within the reduction.
TODO: Check how to perform multiCTA barriers in AMD and perhaps merge cluster barriers into ttg.barrier, predicate broadcasting blocks, etc.

* [Membar] Membar pass for clusters (#9318) Stacked PRs:
- #9327
- -> #9318

### [Membar] Membar pass for clusters

The main invariant here is that Membar for CTAs only synchronises CTAs when their buffers did not alias in the ttgir but do alias after the Allocation pass. In other words, in Gluon, the user is in charge of manually synchronising the buffers they declare. For now, we always emit a full cluster barrier. We can improve this in the future by emitting `mbarrier`s that just synchronise subsets of the CTAs. For that we would need to be a bit more clever, as we would need to allocate some `mbarrier`s, but the Allocation pass has already run... We add a number of test cases with comments on which of them are expected and which can be improved.
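The cluster-membar invariant described above can be sketched as a simple overlap check: a barrier is only required when two buffers that were distinct at the ttgir level end up in overlapping shared-memory ranges after the Allocation pass. All names here are illustrative, not the actual Membar analysis code:

```python
# Hedged sketch of the invariant: barrier only when aliasing is *created*
# by the Allocation pass, i.e. buffers distinct in ttgir but overlapping
# in their assigned smem ranges. Ranges are (start, end) byte offsets.
def ranges_overlap(a, b):
    (a_start, a_end), (b_start, b_end) = a, b
    return a_start < b_end and b_start < a_end

def needs_cluster_barrier(same_ttgir_buffer, range_a, range_b):
    if same_ttgir_buffer:
        # In Gluon the user is in charge of synchronising buffers
        # they declared themselves.
        return False
    return ranges_overlap(range_a, range_b)

# Two distinct ttgir buffers packed into overlapping smem slots:
print(needs_cluster_barrier(False, (0, 1024), (512, 2048)))  # -> True
```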
* [Membar] Fix non-trivial function smem offsets (#9327) Codex rightly identified that we were not considering the offsets of functions in our membar analysis at https://github.com/triton-lang/triton/pull/9318#discussion_r2740883625 Codex then went on and fixed it and added a regression test.

* Add descending flag to topk (#9355)

* Remove TritonIR dependence on TritonGPUIR (#9392)

* [AMD] Update gfx1250 MXFP FA example kernel (#9414) Include some recent optimizations for the MXFP FA kernel.

* [FPSAN] Fix fpsan crash with warp specialization + tmem (#9415) FpSan replaces tensor memory with global scratch, and was missing correct handling of passing global memory pointers to the warp_specialize op. Also, use the `-Ofc mid` ptx compilation mode for fpsan compilation.

* [AMD][NFC][BACKEND] Use ROCDL ops for (cluster) async load (#9410) Replaces intrinsics for (cluster) async loads with ROCDL ops.

* [AMD] Fix gpu::BarrierOp lowering for gfx1250 (#9416) Update gpu::BarrierOp lowering for gfx1250.

* [AMD] Broadcasting in v_perm based conversion (#9354) This PR enables support for broadcasted layouts in AMD-specific convert_layout lowering:
- Generalizes code that previously assumed the number of src and dst values is equal

---------

Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>

* [NVIDIA] Fully restore canSkipBarSync (#9281)

* [CI] Remove unused commands (#9419) Calling `df` twice should not have any impact; this seems like an artifact from somewhere else that should be removed.

* [CI] Remove use of `LLVM_ENABLE_TERMINFO` (#9423) This is a follow-up to #9419 from looking at `llvm-build.yml`. When building LLVM, setting `LLVM_ENABLE_TERMINFO=OFF` results in a CMake warning:
```
CMake Warning:
  Manually-specified variables were not used by the project:
    LLVM_ENABLE_TERMINFO
```
I searched for uses of this in LLVM and there are none today. The latest reference I found in GitHub issues/PRs was from 2024.
That makes me relatively confident this is no longer used, so I'm proposing we remove it.

* [Cluster Membar] Add a cross-cta barrier at the end of the kernel (#9413) If we have an outstanding read or write (I think it'll always be a read) and we didn't have a TMEM deallocation, we should add a cluster-level barrier for correctness.

* [AMD][Gluon] Implement 8 wave F16 FA PingPong kernel (#9427) Ping Pong is another way to implement a performant attention kernel, allowing coexecution between two waves in the same SIMD. In this PR we introduce an 8-wave PP/warp-pipelined kernel and related num_warp changes. We also cleaned up the way to invoke/select different variants of the attention kernel.

Signed-off-by: Stanley Winata <stanley.winata@amd.com>

* [AMD][gfx1250] Adjust (BLOCK) M/N in f16 gemm examples (#9421)

Co-authored-by: Alexander Weinrauch <Alexander.Weinrauch@amd.com>

* [Tutorial] Fix a ty…
1 parent dfabf6a commit da2e93b

496 files changed

Lines changed: 39002 additions & 10816 deletions


.github/workflows/integration-tests-amd.yml (1 addition, 1 deletion)

```diff
@@ -85,7 +85,7 @@ jobs:
             ~/.triton/json
           key: ${{ runner.os }}-${{ runner.arch }}-llvm-${{ steps.cache-key.outputs.llvm }}-nvidia-${{ steps.cache-key.outputs.nvidia }}-json-${{ steps.cache-key.outputs.json }}
       - name: Install dependencies
-        run: apt-get install -y clang lld ccache
+        run: apt-get update && apt-get install -y clang lld ccache
       - name: Inspect cache directories
         run: |
           mkdir -p ~/.triton
```

.github/workflows/integration-tests-nvidia.yml

Lines changed: 7 additions & 6 deletions
```diff
@@ -9,15 +9,16 @@ on:
 
 jobs:
   integration-tests-nvidia:
-    runs-on: ${{ matrix.runner }}
+    name: integration-tests-nvidia (${{ matrix.config.name }})
+    runs-on: ${{ matrix.config.runs_on }}
     timeout-minutes: 60
     # Let A100 and H100 continue even if GB200 fails, as it's a bit flaky
-    continue-on-error: ${{ matrix.runner[0] == 'nvidia-gb200'}}
+    continue-on-error: ${{ startsWith(matrix.config.runner_type, 'nvidia-gb200') }}
     strategy:
       matrix:
-        runner: ${{ fromJson(inputs.matrix) }}
+        config: ${{ fromJson(inputs.matrix) }}
     env:
-      RUNNER_TYPE: ${{ matrix.runner[0] }}
+      RUNNER_TYPE: ${{ matrix.config.runner_type }}
       TRITON_BUILD_WITH_CCACHE: "true"
       TRITON_BUILD_WITH_CLANG_LLD: "TRUE"
       TRITON_USE_ASSERT_ENABLED_LLVM: "TRUE"
@@ -69,7 +70,7 @@ jobs:
         run: |
           echo "$HOME/.local/bin" >> $GITHUB_PATH
       - name: Setup Python environment for GB200
-        if: ${{ matrix.runner[0] == 'nvidia-gb200' }}
+        if: ${{ startsWith(matrix.config.runner_type, 'nvidia-gb200') }}
         run: |
           echo "/venv/bin" >> $GITHUB_PATH
           echo "VIRTUAL_ENV=/venv" >> $GITHUB_ENV
@@ -90,7 +91,7 @@ jobs:
       - name: Run python tests on CUDA
         run: make NUM_PROCS=24 test-unit
       - name: Run interpreter tests
-        if: ${{ matrix.runner[0] == 'nvidia-h100' }}
+        if: ${{ matrix.config.runner_type == 'nvidia-h100' }}
         run: make test-interpret
       - name: Run regression tests
         run: make test-regression
```

.github/workflows/llvm-build.yml

Lines changed: 2 additions & 6 deletions
```diff
@@ -103,8 +103,6 @@ jobs:
           sudo apt-get autoremove -y
           sudo apt-get clean
           df -h
-          echo "Removing large directories"
-          df -h
 
     - name: Configure, Build, Test, and Install LLVM (Ubuntu and macOS x64)
       if: matrix.config.arch == 'x64' && (matrix.config.target-os == 'ubuntu' || matrix.config.target-os == 'macos')
@@ -124,7 +122,6 @@
         -DLLVM_ENABLE_PROJECTS="mlir;lld"
         -DLLVM_INSTALL_UTILS=ON
         -DLLVM_TARGETS_TO_BUILD="host;NVPTX;AMDGPU"
-        -DLLVM_ENABLE_TERMINFO=OFF
         -DLLVM_ENABLE_ZSTD=OFF
         llvm-project/llvm
@@ -149,7 +146,6 @@
         -DLLVM_ENABLE_DIA_SDK=OFF
         -DLLVM_INSTALL_UTILS=ON
         -DLLVM_TARGETS_TO_BUILD="host;NVPTX;AMDGPU"
-        -DLLVM_ENABLE_TERMINFO=OFF
         -DLLVM_ENABLE_ZSTD=OFF
         llvm-project/llvm
@@ -214,7 +210,8 @@
         -DCMAKE_RANLIB="/usr/bin/aarch64-linux-gnu-ranlib" \
         -DCMAKE_STRIP="/usr/bin/aarch64-linux-gnu-strip" \
         -DCMAKE_SYSROOT=$SYSROOT \
-        -DLLVM_ENABLE_TERMINFO=OFF \
+        -DLLVM_INCLUDE_TESTS=OFF \
+        -DMLIR_INCLUDE_TESTS=OFF \
         llvm-project/llvm
       ninja -C llvm-project/build install
       tar czf "${{ env.llvm_install_dir }}.tar.gz" "${{ env.llvm_install_dir }}"
@@ -240,7 +237,6 @@
         -DLLVM_INSTALL_UTILS=ON
         -DLLVM_TARGETS_TO_BUILD="AArch64;NVPTX;AMDGPU"
         -DLLVM_USE_HOST_TOOLS=ON
-        -DLLVM_ENABLE_TERMINFO=OFF
         -DLLVM_ABI_BREAKING_CHECKS=FORCE_OFF
         llvm-project/llvm
```

.github/workflows/llvm-build/almalinux.Dockerfile

Lines changed: 0 additions & 1 deletion
```diff
@@ -35,7 +35,6 @@ RUN cmake -GNinja -Bbuild \
     -DLLVM_ENABLE_ASSERTIONS=ON \
     -DMLIR_ENABLE_BINDINGS_PYTHON=OFF \
     -DLLVM_ENABLE_PROJECTS="mlir;lld" \
-    -DLLVM_ENABLE_TERMINFO=OFF \
     -DLLVM_INSTALL_UTILS=ON \
     -DLLVM_TARGETS_TO_BUILD="host;NVPTX;AMDGPU" \
    -DLLVM_ENABLE_ZSTD=OFF \
```

.github/workflows/runner-preparation.yml

Lines changed: 2 additions & 2 deletions
```diff
@@ -95,11 +95,11 @@ jobs:
       if: env.enable_integration == 'true'
       run: |
         if [ x"${{ github.repository }}" == x"triton-lang/triton" ]; then
-          echo '::set-output name=matrix-NVIDIA::[["nvidia-a100"], ["nvidia-h100"], ["nvidia-gb200"]]'
+          echo '::set-output name=matrix-NVIDIA::[{"name":"nvidia-a100","runner_type":"nvidia-a100","runs_on":["nvidia-a100"]},{"name":"nvidia-h100","runner_type":"nvidia-h100","runs_on":["nvidia-h100"]},{"name":"nvidia-gb200","runner_type":"nvidia-gb200","runs_on":{"group":"gb200-runner-set"}}]'
           echo '::set-output name=matrix-AMD::[["self-hosted", "gfx90a"], ["amd-gfx942"], ["amd-gfx950"]]'
           echo '::set-output name=matrix-MACOS::[["macos-latest"]]'
         else
-          echo '::set-output name=matrix-NVIDIA::["ubuntu-latest"]'
+          echo '::set-output name=matrix-NVIDIA::[{"name":"ubuntu-latest","runner_type":"ubuntu-latest","runs_on":"ubuntu-latest"}]'
           echo '::set-output name=matrix-AMD::["ubuntu-latest"]'
           echo '::set-output name=matrix-MACOS::[["macos-latest"]]'
         fi
```
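The runner-preparation change above switches the NVIDIA matrix from bare runner-label lists to objects carrying `name`, `runner_type`, and `runs_on` fields, which the downstream workflow reads as `matrix.config.*`. A quick sketch of how the new payload parses (the JSON string is taken verbatim from the diff):

```python
import json

# The new matrix-NVIDIA payload emitted for triton-lang/triton.
matrix = json.loads(
    '[{"name":"nvidia-a100","runner_type":"nvidia-a100","runs_on":["nvidia-a100"]},'
    '{"name":"nvidia-h100","runner_type":"nvidia-h100","runs_on":["nvidia-h100"]},'
    '{"name":"nvidia-gb200","runner_type":"nvidia-gb200","runs_on":{"group":"gb200-runner-set"}}]'
)

# Each entry now carries an explicit runner_type, so workflow conditions like
# startsWith(matrix.config.runner_type, 'nvidia-gb200') no longer index into a list.
for config in matrix:
    print(config["name"], config["runner_type"], config["runs_on"])
```

Note that GB200 now targets a runner group (`runs_on` is an object) while the other entries keep plain label lists, something the old list-of-lists format could not express.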

.github/workflows/wheels.yml

Lines changed: 6 additions & 6 deletions
```diff
@@ -12,7 +12,7 @@ permissions: read-all
 jobs:
 
   Build-Wheels:
-    timeout-minutes: 120
+    timeout-minutes: 180
     runs-on: ${{ matrix.config.runs_on }}
 
     strategy:
@@ -99,12 +99,12 @@ jobs:
           path: ./wheelhouse/*.whl
 
       - name: Install Azure CLI
-        if: ${{ steps.check-version.outputs.new_commit == 'true' }}
+        if: ${{ steps.check-version.outputs.new_commit == 'true' && (github.event_name == 'schedule' || github.event_name == 'workflow_dispatch') }}
         run: |
           curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
 
       - name: Azure login
-        if: ${{ steps.check-version.outputs.new_commit == 'true' }}
+        if: ${{ steps.check-version.outputs.new_commit == 'true' && (github.event_name == 'schedule' || github.event_name == 'workflow_dispatch') }}
         uses: azure/login@v2
         with:
           client-id: ${{ secrets.AZURE_CLIENT_ID }}
@@ -113,20 +113,20 @@
 
       - id: generate-token
         name: Generate token
-        if: ${{ steps.check-version.outputs.new_commit == 'true' }}
+        if: ${{ steps.check-version.outputs.new_commit == 'true' && (github.event_name == 'schedule' || github.event_name == 'workflow_dispatch') }}
         run: |
           AZ_TOKEN=$(az account get-access-token --query accessToken)
           echo "::add-mask::$AZ_TOKEN"
           echo "access_token=$AZ_TOKEN" >> "$GITHUB_OUTPUT"
 
       - name: Publish wheels to Azure DevOps
-        if: ${{ steps.check-version.outputs.new_commit == 'true' }}
+        if: ${{ steps.check-version.outputs.new_commit == 'true' && (github.event_name == 'schedule' || github.event_name == 'workflow_dispatch') }}
         run: |
           python3 -m pip install twine
           python3 -m twine upload -r Triton-Nightly -u TritonArtifactsSP -p ${{ steps.generate-token.outputs.access_token }} --config-file utils/nightly.pypirc --non-interactive --verbose wheelhouse/*
 
       - name: Azure Logout
-        if: ${{ steps.check-version.outputs.new_commit == 'true' && (success() || failure()) }}
+        if: ${{ steps.check-version.outputs.new_commit == 'true' && (github.event_name == 'schedule' || github.event_name == 'workflow_dispatch') && (success() || failure()) }}
         run: |
           az logout
           az cache purge
```
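Every Azure publishing step in wheels.yml is now gated on the same compound condition. A pure-Python rendering of that expression (a sketch, not GitHub's expression evaluator):

```python
def should_publish(new_commit: str, event_name: str) -> bool:
    # Mirrors the workflow condition:
    #   new_commit == 'true' && (event_name == 'schedule' || event_name == 'workflow_dispatch')
    # GitHub Actions step outputs are strings, hence the 'true' comparison.
    return new_commit == "true" and event_name in ("schedule", "workflow_dispatch")

# With the extra event_name clause, only scheduled and manually dispatched
# runs publish; pushes and pull requests are skipped even with a new commit.
print(should_publish("true", "schedule"))
print(should_publish("true", "pull_request"))
```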

.gitignore

Lines changed: 3 additions & 0 deletions
```diff
@@ -70,6 +70,7 @@ ptxas-blackwell
 # Third-party include
 third_party/nvidia/backend/include
 third_party/nvidia/backend/lib/cupti
+third_party/nvidia/backend/lib/cupti-blackwell
 
 # Docs
 docs/_build/
@@ -93,3 +94,5 @@ docs/sg_execution_times.rst
 
 # macOS
 .DS_Store
+
+AGENTS.override.md
```

AGENTS.md

Lines changed: 10 additions & 0 deletions
```diff
@@ -0,0 +1,10 @@
+# Working on Triton
+
+## Build and Testing Guidelines
+- Before running any tests, run `make` in the triton directory to rebuild triton.
+- For compiler changes, add tests in `python/test/` (pytest) or `test/` (lit). Keep GPU-only tests in `python/test/unit/` or `python/test/gluon/`, name them `test_<feature>_<condition>`, and avoid creating new test files unless requested.
+- Run pytest with `-s --tb=short`. Run a single test with `pytest file.py::test_name`.
+- The build dir is given by `BUILD_DIR := $(shell cd python; $(PYTHON) -c 'from build_helpers import get_cmake_dir; print(get_cmake_dir())')`
+- Run lit from the build dir: `cd BUILD_DIR; ninja triton-opt; lit -v test/<path>.mlir` (example: `lit -v test/TritonNvidiaGPU/tmem_layouts.mlir`).
+- Lit tests can be run locally (no GPU required).
+- Compiler crashes sometimes print an MLIR reproducer (external_resources / mlir_reproducer). Save the full MLIR + `{-# ... #-}` metadata to `/tmp/<file>.mlir`, then run `triton-opt /tmp/<file>.mlir --run-reproducer` to reproduce locally.
```

CMakeLists.txt

Lines changed: 118 additions & 0 deletions
```diff
@@ -20,10 +20,126 @@ option(TRITON_BUILD_PYTHON_MODULE "Build Python Triton bindings" OFF)
 option(TRITON_BUILD_PROTON "Build the Triton Proton profiler" ON)
 option(TRITON_BUILD_UT "Build C++ Triton Unit Tests" ON)
 option(TRITON_BUILD_WITH_CCACHE "Build with ccache (if available)" ON)
+option(TRITON_OFFLINE_BUILD "Build without downloading dependencies" OFF)
 option(LLVM_BUILD_SHARED_LIBS
   "Build all libraries as shared libraries instead of static" OFF)
 set(TRITON_CODEGEN_BACKENDS "" CACHE STRING "Enable different codegen backends")
 
+set(TRITON_CACHE_PATH "" CACHE PATH "Path to triton cache")
+set(TRITON_LLVM_SYSTEM_SUFFIX "" CACHE STRING "Path to LLVM system suffix")
+set(LLVM_SYSPATH "" CACHE PATH "Path to system LLVM installation")
+set(JSON_SYSPATH "" CACHE PATH "Path to system nlohmann/json headers")
+set(TRITON_PTXAS_PATH "" CACHE FILEPATH "Path override for ptxas")
+set(TRITON_PTXAS_BLACKWELL_PATH "" CACHE FILEPATH "Path override for ptxas-blackwell")
+set(TRITON_CUOBJDUMP_PATH "" CACHE FILEPATH "Path override for cuobjdump")
+set(TRITON_NVDISASM_PATH "" CACHE FILEPATH "Path override for nvdisasm")
+set(TRITON_CUDACRT_PATH "" CACHE PATH "Path to CUDA CRT headers.")
+set(TRITON_CUDART_PATH "" CACHE PATH "Path to CUDA Runtime headers")
+set(TRITON_CUPTI_INCLUDE_PATH "" CACHE PATH "Path to CUPTI headers")
+set(TRITON_CUPTI_LIB_PATH "" CACHE PATH "Path to CUPTI libraries")
+
+if(NOT TRITON_CACHE_PATH)
+  message(FATAL_ERROR "TRITON_CACHE_PATH must be set or derivable from TRITON_HOME/HOME/USERPROFILE/HOMEPATH.")
+endif()
+
+set(TRITON_BUILD_HELPER_COMMON_ARGS --triton-cache-path "${TRITON_CACHE_PATH}")
+if("${TRITON_OFFLINE_BUILD}")
+  list(APPEND TRITON_BUILD_HELPER_COMMON_ARGS --triton-offline-build)
+endif()
+if(NOT "${TRITON_LLVM_SYSTEM_SUFFIX}" STREQUAL "")
+  list(APPEND TRITON_BUILD_HELPER_COMMON_ARGS --triton-llvm-system-suffix "${TRITON_LLVM_SYSTEM_SUFFIX}")
+endif()
+if(NOT "${LLVM_SYSPATH}" STREQUAL "")
+  list(APPEND TRITON_BUILD_HELPER_COMMON_ARGS --llvm-syspath "${LLVM_SYSPATH}")
+endif()
+if(NOT "${JSON_SYSPATH}" STREQUAL "")
+  list(APPEND TRITON_BUILD_HELPER_COMMON_ARGS --json-syspath "${JSON_SYSPATH}")
+endif()
+if(NOT "${TRITON_PTXAS_PATH}" STREQUAL "")
+  list(APPEND TRITON_BUILD_HELPER_COMMON_ARGS --triton-ptxas-path "${TRITON_PTXAS_PATH}")
+endif()
+if(NOT "${TRITON_PTXAS_BLACKWELL_PATH}" STREQUAL "")
+  list(APPEND TRITON_BUILD_HELPER_COMMON_ARGS --triton-ptxas-blackwell-path "${TRITON_PTXAS_BLACKWELL_PATH}")
+endif()
+if(NOT "${TRITON_CUOBJDUMP_PATH}" STREQUAL "")
+  list(APPEND TRITON_BUILD_HELPER_COMMON_ARGS --triton-cuobjdump-path "${TRITON_CUOBJDUMP_PATH}")
+endif()
+if(NOT "${TRITON_NVDISASM_PATH}" STREQUAL "")
+  list(APPEND TRITON_BUILD_HELPER_COMMON_ARGS --triton-nvdisasm-path "${TRITON_NVDISASM_PATH}")
+endif()
+if(NOT "${TRITON_CUDACRT_PATH}" STREQUAL "")
+  list(APPEND TRITON_BUILD_HELPER_COMMON_ARGS --triton-cudacrt-path "${TRITON_CUDACRT_PATH}")
+endif()
+if(NOT "${TRITON_CUDART_PATH}" STREQUAL "")
+  list(APPEND TRITON_BUILD_HELPER_COMMON_ARGS --triton-cudart-path "${TRITON_CUDART_PATH}")
+endif()
+if(NOT "${TRITON_CUPTI_INCLUDE_PATH}" STREQUAL "")
+  list(APPEND TRITON_BUILD_HELPER_COMMON_ARGS --triton-cupti-include-path "${TRITON_CUPTI_INCLUDE_PATH}")
+endif()
+if(NOT "${TRITON_CUPTI_LIB_PATH}" STREQUAL "")
+  list(APPEND TRITON_BUILD_HELPER_COMMON_ARGS --triton-cupti-lib-path "${TRITON_CUPTI_LIB_PATH}")
+endif()
+
+# Resolve third-party package paths in CMake so direct CMake builds don't rely on setup.py.
+set(TRITON_HAS_LLVM_SYSPATH OFF)
+if(NOT "${LLVM_SYSPATH}" STREQUAL "")
+  set(TRITON_HAS_LLVM_SYSPATH ON)
+  if(NOT DEFINED LLVM_INCLUDE_DIRS)
+    set(LLVM_INCLUDE_DIRS "${LLVM_SYSPATH}/include")
+  endif()
+  if(NOT DEFINED LLVM_LIBRARY_DIR)
+    set(LLVM_LIBRARY_DIR "${LLVM_SYSPATH}/lib")
+  endif()
+endif()
+
+if(NOT "${JSON_SYSPATH}" STREQUAL "" AND NOT DEFINED JSON_INCLUDE_DIR)
+  set(JSON_INCLUDE_DIR "${JSON_SYSPATH}/include")
+endif()
+
+# Regenerate configure outputs during `cmake --build` when helper inputs change.
+set_property(
+  DIRECTORY
+  APPEND
+  PROPERTY CMAKE_CONFIGURE_DEPENDS
+  "${CMAKE_CURRENT_SOURCE_DIR}/python/build_helpers.py"
+  "${CMAKE_CURRENT_SOURCE_DIR}/cmake/llvm-hash.txt"
+  "${CMAKE_CURRENT_SOURCE_DIR}/cmake/json-version.txt"
+)
+find_package(Python3 REQUIRED COMPONENTS Interpreter)
+set(TRITON_THIRD_PARTY_CMAKE_VARS_FILE "${CMAKE_CURRENT_BINARY_DIR}/triton-third-party-vars.cmake")
+execute_process(
+  COMMAND
+    ${Python3_EXECUTABLE}
+    "${CMAKE_CURRENT_SOURCE_DIR}/python/build_helpers.py"
+    write_thirdparty_cmake_vars
+    ${TRITON_BUILD_HELPER_COMMON_ARGS}
+    --output
+    "${TRITON_THIRD_PARTY_CMAKE_VARS_FILE}"
+    --packages llvm json
+  WORKING_DIRECTORY "${CMAKE_CURRENT_SOURCE_DIR}"
+  COMMAND_ERROR_IS_FATAL ANY
+)
+include("${TRITON_THIRD_PARTY_CMAKE_VARS_FILE}")
+
+if(TRITON_BUILD_PYTHON_MODULE)
+  set_property(
+    DIRECTORY
+    APPEND
+    PROPERTY CMAKE_CONFIGURE_DEPENDS
+    "${CMAKE_CURRENT_SOURCE_DIR}/python/build_helpers.py"
+    "${CMAKE_CURRENT_SOURCE_DIR}/cmake/nvidia-toolchain-version.json"
+  )
+  find_package(Python3 REQUIRED COMPONENTS Interpreter)
+  execute_process(
+    COMMAND ${Python3_EXECUTABLE}
+      ${CMAKE_CURRENT_SOURCE_DIR}/python/build_helpers.py
+      download_and_copy_dependencies
+      ${TRITON_BUILD_HELPER_COMMON_ARGS}
+    WORKING_DIRECTORY "${CMAKE_CURRENT_SOURCE_DIR}"
+    COMMAND_ERROR_IS_FATAL ANY
+  )
+endif()
+
 if(TRITON_BUILD_WITH_CCACHE)
   find_program(CCACHE_PROGRAM ccache)
   if(CCACHE_PROGRAM)
@@ -237,6 +353,7 @@ if(TRITON_BUILD_PYTHON_MODULE)
     MLIRIndexToLLVM
     MLIRGPUToROCDLTransforms
     MLIRUBToLLVM
+    MLIRPluginsLib
 
     # LLVM
     LLVMPasses
@@ -343,6 +460,7 @@ find_package(Threads REQUIRED)
 add_subdirectory(third_party/f2reduce)
 add_subdirectory(bin)
 add_subdirectory(test)
+add_subdirectory(examples)
 
 if(TRITON_BUILD_UT)
   add_subdirectory(unittest)
```
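The new CMakeLists logic above assembles `TRITON_BUILD_HELPER_COMMON_ARGS` by appending one CLI flag per non-empty cache variable before invoking `python/build_helpers.py`. A rough Python equivalent of that assembly pattern (the function and example values are illustrative, not part of the build):

```python
def build_helper_args(cache_path, offline=False, overrides=None):
    """Sketch of the CMake arg-assembly: one flag-value pair per non-empty setting."""
    args = ["--triton-cache-path", cache_path]
    if offline:
        args.append("--triton-offline-build")
    for flag, value in (overrides or {}).items():
        if value:  # empty-string cache variables are skipped, as in the CMake code
            args.extend([flag, value])
    return args

args = build_helper_args(
    "/home/user/.triton",
    offline=True,
    overrides={
        "--triton-ptxas-path": "/opt/cuda/bin/ptxas",  # hypothetical override
        "--llvm-syspath": "",                          # unset -> omitted
    },
)
print(args)
```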

MANIFEST.in

Lines changed: 3 additions & 0 deletions
```diff
@@ -14,3 +14,6 @@ include Makefile
 include python/build_helpers.py
 include python/requirements.txt
 include python/test-requirements.txt
+global-exclude __pycache__
+global-exclude __pycache__/*
+global-exclude *.py[cod]
```
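The `global-exclude` patterns added to MANIFEST.in use shell-glob syntax, so the single character class `*.py[cod]` covers all compiled Python artifacts. The stdlib `fnmatch` implements the same glob character-class semantics, which makes the pattern easy to sanity-check (a sketch; setuptools does its own pattern translation internally):

```python
from fnmatch import fnmatch

# *.py[cod] matches .pyc, .pyo and .pyd files, but not .py sources.
for name in ("mod.pyc", "mod.pyo", "mod.pyd", "mod.py"):
    print(name, fnmatch(name, "*.py[cod]"))
```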
