Summary
Add a new air.rank operation and supporting air.universe type to the AIR dialect.
air.rank is a fourth hierarchy level that sits above air.launch and expresses a
communicating world of rank instances, where each instance corresponds to a complete
GPU device or a CPU host process (with its attached devices).
A detailed design document is in docs/AIRRankOp.md on the compute-model-explanation
branch. This issue captures the implementation work.
Motivation
The existing three-level hierarchy (air.launch → air.segment → air.herd) describes
work that runs on a single device. There is no dialect-level primitive for expressing
programs that span multiple GPUs or multiple hosts. Without such a primitive, multi-device
programs must be assembled entirely at the runtime or framework layer, losing the
compiler's ability to reason about inter-device data movement, resource allocation,
and communication scheduling.
air.rank fills this gap by extending the hierarchy upward:
air.rank — communicating world of devices or hosts ← NEW
air.launch — co-resident work on one device
air.segment — tile rectangle + L2 memory
air.herd — array of PEs
New Operations and Types
air.rank
[%token =] air.rank (%r₀, …, %rₙ) in (%sr₀=%M₀, …, %srₙ=%Mₙ)
    args(%a₀=%v₀, …) : <types>
    [universe = %u]
    [dependency = [%t₀, …]]
    [affinity = [%t₀, …]]
{
  …
  air.rank_terminator
}
Key properties:
- Defines an N-dimensional iteration space. Each point is a rank instance.
- Each rank instance maps to:
  - A distinct GPU device when nested inside air.launch.
  - A distinct host process (with attached devices) when at top level.
- The body is IsolatedFromAbove; values are passed via explicit kernel operands.
- The compiler is free to distribute the serial preamble of the body across all
  participating rank instances (SPMD fork semantics).
- May-be-parallel semantics (like air.launch): a concurrency token list is not
  permitted; dependency and affinity lists are permitted.
- A universe operand of type !air.universe constrains the physical pool from
  which rank instances are scheduled.
Terminator: air.rank_terminator (analogous to air.launch_terminator).
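As a rough illustration of the N-dimensional iteration space described above, the standalone C++ sketch below counts rank instances and converts between a multi-dimensional rank id and a flat index. The row-major ordering (last dimension fastest) and the names RankId, iterationSpaceSize, linearize and delinearize are assumptions made for this example; the issue does not specify an ordering, and none of these helpers exist in the AIR dialect.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper type: one coordinate per air.rank dimension.
using RankId = std::vector<int64_t>;

// Number of rank instances: the product of the per-dimension sizes.
int64_t iterationSpaceSize(const std::vector<int64_t> &sizes) {
  int64_t total = 1;
  for (int64_t s : sizes)
    total *= s;
  return total;
}

// Flatten a rank id to a single index.
// Assumption: row-major order, i.e. the last dimension varies fastest.
int64_t linearize(const RankId &id, const std::vector<int64_t> &sizes) {
  assert(id.size() == sizes.size());
  int64_t flat = 0;
  for (std::size_t d = 0; d < sizes.size(); ++d) {
    assert(id[d] >= 0 && id[d] < sizes[d]);
    flat = flat * sizes[d] + id[d];
  }
  return flat;
}

// Inverse of linearize. Precondition: every size is positive and
// 0 <= flat < iterationSpaceSize(sizes).
RankId delinearize(int64_t flat, const std::vector<int64_t> &sizes) {
  RankId id(sizes.size(), 0);
  for (std::size_t d = sizes.size(); d-- > 0;) {
    id[d] = flat % sizes[d];
    flat /= sizes[d];
  }
  return id;
}
```

Under these assumptions, a 2 x 3 rank space has six instances and the id (1, 2) maps to flat index 5. A flat index of this kind is one plausible way to derive a device ordinal or process rank, but the actual mapping is left to the lowering.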
air.universe type and air.universe.alloc
%u = air.universe.alloc(%capacity) : !air.universe
!air.universe is an opaque dialect type that represents a bounded pool of
concurrently available devices or host processes. Allocating a universe with
capacity = N instructs the runtime to guarantee that at least
min(N, iteration_space_size) rank instances may run concurrently. The handle
is passed to air.rank through its universe operand.
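The concurrency guarantee above is simple arithmetic, sketched here in standalone C++. guaranteedConcurrency restates min(N, iteration_space_size) from the text. worstCaseWaves is an illustrative consequence, not something the issue defines: it assumes the runtime provides exactly the guaranteed minimum and runs rank instances in sequential batches. Both function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Lower bound on rank instances that may run concurrently, as stated in the
// text: min(capacity, iteration_space_size).
int64_t guaranteedConcurrency(int64_t capacity, int64_t spaceSize) {
  assert(capacity >= 0 && spaceSize >= 0);
  return std::min(capacity, spaceSize);
}

// Assumption for illustration only: if the runtime provides exactly the
// guaranteed minimum, the ranks complete in ceil(spaceSize / concurrency)
// sequential batches.
int64_t worstCaseWaves(int64_t capacity, int64_t spaceSize) {
  int64_t concurrency = guaranteedConcurrency(capacity, spaceSize);
  if (concurrency == 0)
    return 0; // nothing to run, or no capacity was requested
  return (spaceSize + concurrency - 1) / concurrency;
}
```

For example, a universe of capacity 4 serving a 16-instance rank space guarantees 4 concurrent instances, which under the batching assumption means at most 4 waves. A capacity of 8 serving 2 instances guarantees only 2, since the bound is capped by the iteration space size.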
Implementation Scope
1. Dialect changes (mlir/include/air/Dialect/AIR/)
2. C++ op implementation (mlir/lib/Dialect/AIR/IR/)
3. Verifier rules
4. Passes / transforms (mlir/lib/Transform/)
5. Lowering to GPU backend (mlir/lib/Conversion/)
6. air.universe runtime support
7. Tests
8. Documentation
Non-Goals (for this issue)
- NPU backend lowering (NPUs are single-device; multi-chiplet support is future work).
- Formalising L4 memory space in air::MemorySpace (reserved for follow-on issue).
- Fine-grained sub-universe device selection (e.g. pinning to a specific interconnect
island); air.universe in this issue is cardinality-only.
References
- Design document: docs/AIRRankOp.md (branch compute-model-explanation)
- Existing hierarchy op pattern: AIR.td air_LaunchOp, air_SegmentOp
- Existing interface definitions: AIROpBase.td air_HierarchyInterface
- arXiv:2510.14871 — "From Loop Nests to Silicon: Mapping AI Workloads onto AMD NPUs
with MLIR-AIR"
Implementation Scope
1. Dialect changes (mlir/include/air/Dialect/AIR/)
- AIROpBase.td: Add !air.universe dialect type (or a new AIRTypes.td).
- AIR.td: air_RankOp following the pattern of air_LaunchOp / air_SegmentOp:
  - Traits: air_AsyncOpInterface, air_HierarchyInterface, AttrSizedOperandSegments,
    IsolatedFromAbove, AffineScope, SingleBlockImplicitTerminator<"RankTerminatorOp">.
  - Arguments: OptionalAttr<SymbolNameAttr>:$sym_name,
    Variadic<air_AsyncToken>:$async_dependencies, Variadic<Index>:$sizes,
    Variadic<AnyType>:$rank_operands, Optional<air_UniverseType>:$universe.
  - Results: Optional<air_AsyncToken>:$async_token.
- air_RankTerminatorOp.
- air_UniverseAllocOp.
2. C++ op implementation (mlir/lib/Dialect/AIR/IR/)
- […] getIds(), getSize(), getSizeOperands(), getNumKernelOperands(),
  getKernelOperands(), getKernelArguments(), getNumDims(), getId() for RankOp
  (same pattern as LaunchOp).
- […] air.rank.
- […] air.universe.alloc and the !air.universe type.
- […] RankOp (remove unused kernel operands).
3. Verifier rules
- air.rank must not carry a concurrency token list.
- […] air.launch, verify that air.rank bodies contain only air.launch ops
  (not bare air.segment / air.herd).
- […] air.rank bodies may contain air.launch ops.
- […] universe operand, if present, must be of type !air.universe.
- air.rank_terminator must have air.rank as its immediate parent.
4. Passes / transforms (mlir/lib/Transform/)
- […] air-rank-to-launch pass: lower air.rank to a loop (for single-process
  targets that simulate multi-device execution serially), threading rank index
  arguments into enclosed air.launch ops.
- […] HierarchyInterface traversal passes (e.g. loop unrolling, operand hoisting,
  async dependency analysis) to handle air.rank as the new outermost level.
5. Lowering to GPU backend (mlir/lib/Conversion/)
- air.rank at top level → one MPI process (or fork()-style process group)
  per rank instance; emit a process launch stub.
- air.rank inside air.launch → one HIP device context per rank instance;
  use hipSetDevice(rank_id) before enclosed air.launch lowering.
- air.channel at rank body scope → RCCL collective (if pattern matches) or
  peer-to-peer hipMemcpyPeer.
6. air.universe runtime support
- […] air_universe_alloc(size_t capacity) to the AIR runtime API, returning
  an opaque handle.
- […] air.rank execution function and used to pre-allocate process/device slots
  and communicators (MPI_Comm / rcclComm_t).
7. Tests
- test/dialect/AIR/:
  - […] air.rank and air.universe.alloc.
- test/Transform/ and test/Conversion/:
  - […] air-rank-to-launch serialisation pass.
  - […] hipSetDevice.
- […] output rows across two GPUs.
8. Documentation
- docs/AIRRankOp.md (already drafted on compute-model-explanation).
- […] docs/AIRComputeModel.md §1.1 to show the four-level hierarchy.
- […] docs/README.md index.
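The serialising behaviour of the air-rank-to-launch pass listed under Passes / transforms can be pictured with a small host-side model: visit every rank id in a plain loop and hand that id to the enclosed launch body. This is only a sketch of the intended semantics in standalone C++, not the pass itself. runRanksSerially, RankIndex and the callback signature are hypothetical, and the visiting order (last dimension fastest) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical helper type: one index per air.rank dimension.
using RankIndex = std::vector<int64_t>;

// Model of serial execution: call launchBody once per rank instance,
// passing the rank indices that the pass would thread into air.launch.
// Assumption: instances are visited with the last dimension varying fastest.
void runRanksSerially(
    const std::vector<int64_t> &sizes,
    const std::function<void(const RankIndex &)> &launchBody) {
  int64_t total = 1;
  for (int64_t s : sizes) {
    assert(s >= 0 && "rank sizes are assumed to be non-negative");
    total *= s;
  }
  RankIndex id(sizes.size(), 0);
  for (int64_t flat = 0; flat < total; ++flat) {
    // Recover the multi-dimensional rank id from the flat loop counter.
    int64_t rest = flat;
    for (std::size_t d = sizes.size(); d-- > 0;) {
      id[d] = rest % sizes[d];
      rest /= sizes[d];
    }
    launchBody(id);
  }
}
```

Calling runRanksSerially({2, 3}, body) invokes body six times, starting at (0, 0) and ending at (1, 2). A real pass would emit loop ops rather than run a callback, and the may-be-parallel semantics of air.rank is what makes a serial visiting order such as this one a legal schedule.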