
Skip unnecessary boundary-check masks in RewriteTensorDescriptorToPointerPass#10034

Open
gkmhub wants to merge 1 commit into triton-lang:main from gkmhub:skip-boundary-check-descriptor-rewrite

Conversation


@gkmhub gkmhub commented Apr 14, 2026

Summary

RewriteTensorDescriptorToPointerPass unconditionally generates masked loads/stores for all tensor descriptor accesses. In contrast, the block pointer path (RewriteTensorPointerPass) respects boundary_check=[] and produces unmasked loads/stores when no boundary checking is needed.

This causes backends that lower tensor descriptors to pointer arithmetic (i.e., backends without TMA hardware) to always pay the cost of masked memory operations, even for provably in-bounds accesses (~20% perf regression in our benchmarks).
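For intuition, a masked block load behaves roughly like the following pure-Python sketch (illustrative only, not the actual Triton/MLIR lowering): out-of-bounds lanes read a padding value, and every lane pays for the bounds predicate — the overhead this PR avoids when the access is provably in-bounds.

```python
def masked_load(tensor, offsets, block_shape, padding=0):
    """Emulate a 2D masked block load: lanes that fall outside the
    tensor's extent read `padding` instead of the underlying memory."""
    rows, cols = len(tensor), len(tensor[0])
    r0, c0 = offsets
    br, bc = block_shape
    # Each lane evaluates an in-bounds predicate before reading.
    return [[tensor[r][c] if r < rows and c < cols else padding
             for c in range(c0, c0 + bc)]
            for r in range(r0, r0 + br)]
```

An unmasked load is just the plain block read with no per-lane predicate; eliding that predicate (and the padding select) is where the savings come from.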

This PR adds two complementary mechanisms to skip unnecessary mask generation:

  1. Static in-bounds analysis (isStaticallyInBounds): compile-time check that verifies offset[i] + blockShape[i] <= shape[i] for all dimensions. When provably in-bounds, the pass emits unmasked tt.load/tt.store.

  2. skip_boundary_check attribute: a new UnitAttr on tt.descriptor_load and tt.descriptor_store that frontends (e.g., Inductor) can set when they guarantee the access is in-bounds. On backends with TMA hardware, this attribute is simply ignored.
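A minimal sketch of the static check described in point 1, assuming constant offsets/shapes are modeled as Python ints and dynamic values as `None` (the real pass inspects MLIR constant ops; the helper name mirrors the PR's `isStaticallyInBounds` but this code is illustrative):

```python
def is_statically_in_bounds(offsets, block_shape, shape):
    """Return True only when offset[i] + blockShape[i] <= shape[i]
    is provable at compile time for every dimension."""
    for off, blk, dim in zip(offsets, block_shape, shape):
        if off is None or dim is None:
            return False  # dynamic value: conservatively keep the mask
        if off + blk > dim:
            return False  # access may run past the tensor boundary
    return True
```

Note the check is conservative: any dynamic offset or shape keeps the mask, which is exactly the case the `skip_boundary_check` attribute in point 2 is meant to cover.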

Fixes #10033

Files Changed

  • include/triton/Dialect/Triton/IR/TritonOps.td — add UnitAttr:$skip_boundary_check to DescriptorLoadOp and DescriptorStoreOp
  • lib/Dialect/Triton/Transforms/RewriteTensorDescriptorToPointer.cpp — add isStaticallyInBounds() and conditionally skip mask/other generation
  • python/src/ir.cc — thread skipBoundaryCheck through create_descriptor_load/create_descriptor_store
  • python/triton/language/core.py — add skip_boundary_check parameter to tensor_descriptor_base.load() and .store()
  • python/triton/language/semantic.py — thread parameter through descriptor_load/descriptor_store
  • python/triton/runtime/interpreter.py — accept (and ignore) the new parameter
  • test/Triton/tensor-descriptors-in-bounds.mlir — 10 LIT test cases covering in-bounds, out-of-bounds, and skip_boundary_check scenarios
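For reference, the per-element mask that the rewrite conditionally skips is essentially this predicate (pure-Python sketch of the 2D case, not the C++ implementation): when static analysis proves every entry true, the mask and the `other`/padding operand can both be dropped.

```python
def boundary_mask(offsets, block_shape, shape):
    """Per-element in-bounds predicate for a 2D block access:
    True where (offset + index) stays inside the tensor shape."""
    r0, c0 = offsets
    br, bc = block_shape
    rows, cols = shape
    return [[(r0 + i) < rows and (c0 + j) < cols
             for j in range(bc)]
            for i in range(br)]
```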

Test Plan

  • New MLIR LIT test (tensor-descriptors-in-bounds.mlir) with 10 cases:
    • In-bounds 2D load/store with zero offset
    • In-bounds 2D load with nonzero offset
    • Out-of-bounds load with dynamic shape/offset
    • Out-of-bounds load where offset+block exceeds shape
    • In-bounds 1D load
    • Out-of-bounds store with dynamic shape
    • skip_boundary_check attribute on load/store with dynamic shapes
  • Backward compatible: existing code without skip_boundary_check behaves identically
  • TMA backends unaffected (attribute is ignored)

…ptorToPointerPass

Summary:
CONTEXT: RewriteTensorDescriptorToPointerPass unconditionally generates
masked loads/stores for all tensor descriptor accesses, even when the
access is provably in-bounds. This causes ~20% perf regression vs the
block pointer path for backends without TMA hardware.

WHAT: Add two mechanisms to skip unnecessary mask generation:
1. Static in-bounds analysis (isStaticallyInBounds) that checks
   offset[i] + blockShape[i] <= shape[i] at compile time.
2. skip_boundary_check UnitAttr on descriptor_load/descriptor_store
   that frontends can set when they guarantee in-bounds access.

Fixes triton-lang#10033
@gkmhub gkmhub requested a review from ptillet as a code owner April 14, 2026 23:51
@ThomasRaoux
Collaborator

If the semantics of descriptor ops (which are meant to map to TMA-style hardware) are a problem, one alternative is to mimic what the front end currently does for block_ptr and implement array-indexing support purely in software: https://github.com/triton-lang/triton/blob/main/python/triton/language/core.py#L1743. We transitioned some kernels using this approach and it was fairly simple.

Development

Successfully merging this pull request may close these issues.

RewriteTensorDescriptorToPointerPass generates unnecessary boundary-check masks for in-bounds accesses
