This document describes the compute and memory model of the AIR dialect, defines the semantics of its operations, and explains how those operations map onto the two supported hardware backends: AMD NPU (AIE) and AMD GPU (ROCDL/HIP).
The AIR dialect is designed to be architecture-independent at the programming-model
level. A program expressed in terms of air.launch, air.segment, air.herd, and the
associated data-movement operations carries a well-defined meaning that can be compiled
to either backend without changing the source IR.
A detailed description of the dialect and its use for mapping AI workloads onto AMD NPUs can be found in:
E. Wang et al. "From Loop Nests to Silicon: Mapping AI Workloads onto AMD NPUs with MLIR-AIR". arXiv:2510.14871, October 2025.
The AIR model defines a three-level execution hierarchy:
air.launch — outermost level; groups co-resident work
air.segment — intermediate level; groups cores with shared on-device memory
air.herd — innermost level; an array of processing elements (PEs)
Each level defines an iteration space (a multi-dimensional index range) and a body
region that executes across that iteration space. The bodies of launch and segment
execute at the host/controller level, while the body of herd executes in data-
parallel fashion across each point of its iteration space.
Execution at each level is spatially parallel: points in the iteration space may execute concurrently on distinct hardware resources.
Operands from the enclosing scope are passed explicitly as kernel operands into each
level. Levels are IsolatedFromAbove, meaning they cannot implicitly capture values from
outer regions.
The AIR model defines three levels of memory, identified by the air::MemorySpace enum:
| Level | Enum value | Scope | Description |
|---|---|---|---|
L3 |
0 | System | Off-chip or host-accessible memory (e.g. DDR, host RAM) |
L2 |
1 | Segment | On-device memory shared across all PEs in a segment (e.g. on-chip SRAM, memory tiles) |
L1 |
2 | Herd tile | Per-PE scratchpad memory local to each processing element |
Memory at each level must be accessed through the appropriate data-movement operations. Direct reads or writes across levels (e.g., from L1 directly into L3) are not supported; all cross-level data movement is explicit.
Data movement between memory levels is expressed with:
air.dma_memcpy_nd— an N-dimensional DMA-style bulk copy between two memrefs. The source and destination may reside at any combination of memory levels.air.channel/air.channel.put/air.channel.get— point-to-point or broadcast communication channels that decouple producers from consumers.
All data-movement and hierarchy operations may execute asynchronously. An op that
produces an !air.token result is dispatched immediately and executes concurrently
with subsequent operations. Token values are used to express data, control, affinity,
and concurrency constraints (see §1.5). air.wait_all merges multiple tokens into a
single synchronization point. air.execute wraps sequential computation in an
asynchronous region.
Every AIR hierarchy operation has two distinct lifetimes:
- Resource lifespan — the interval during which the physical resources assigned to the operation (compute tiles, memory banks, DMA channels) are reserved and unavailable to other operations.
- Execution lifespan — the interval during which the nested operations within the body are actively executing.
The execution lifespan must be a subset of the resource lifespan: resources must be allocated before execution begins and must remain allocated until execution completes.
All token kinds share a single MLIR type: !air.token. The kind is
determined by which attribute list the token appears in at a consumer op, not
by the token's type. !air.token is the replacement for air.async.token;
existing uses of air.async.token are dependency tokens.
A token is created in one of two ways:
1. Explicit allocation:
%t = air.token.alloc : !air.token
Used to introduce a token that groups a set of ops under a shared affinity or concurrency constraint without one op "owning" the token.
2. Op result:
%t = air.segment @foo { … }
%t = air.herd tile (%x,%y) in (%sx=%Nx,%sy=%Ny) [dependency = [%dep]] { … }
The token becomes signaled when the op and all of its nested operations complete. The result is optional; omitting it gives the synchronous (blocking) form of the op.
Tokens are consumed by including them in named attribute lists on hierarchy and data-movement ops. An op may carry any combination of the three lists:
air.segment @bar [dependency = [%t0, %t1]]
[affinity = [%ta]]
[concurrency = [%tc]]
{ … }
| Attribute list | Constraint imposed | Effect on resource lifespan | Effect on execution lifespan |
|---|---|---|---|
dependency |
Op does not begin until all listed tokens are signaled (happens-before) | Compiler/runtime may reuse resources freely across the edge | Disjoint from the producers of the listed tokens |
affinity |
Op is placed on the same hardware resources as all other ops that list the same token | Disjoint — shared resources used sequentially | Disjoint — operations execute at different times |
concurrency |
Op must have an overlapping execution lifetime with all other ops that list the same token | Overlapping — all resource sets must be live simultaneously | Overlapping |
-
dependencylist: The direct successor ofair.async.tokendependency chains. The op waits for all listed tokens to be signaled before acquiring resources or beginning execution. The compiler may reuse the releasing op's resources for the waiting op once the token fires. -
affinitylist: All ops that share the same!air.tokenin theiraffinitylist are bound to the same hardware partition (e.g. the same set of tile columns on NPU, or the same CU on GPU). Because they share resources they must execute sequentially; both their resource and execution lifespans are disjoint in time. -
concurrencylist: All ops sharing the same token in theirconcurrencylist are required to have overlapping execution lifetimes. The compiler must allocate distinct, simultaneously live resource sets for each such op. This is the mechanism that enforces co-residency (e.g. ensuring two segments are stamped out on hardware at the same time).
Tokens are SSA values and follow normal MLIR dominance. When a token must be
passed into a hierarchy body it is threaded through the args(…) operand list.
The sole exception to free token passing is air.launch: only tokens used in a
dependency list may be passed into the body of a launch. Affinity and
concurrency tokens may not cross the launch boundary, because imposing shared-
resource or forced-overlap constraints on the may-be-parallel instances of a
launch would contradict its execution model.
// Asynchronous form — produces !air.token, non-blocking
[%token =] air.launch (%x₀, …, %xₙ) in (%sx₀=%N₀, …, %sxₙ=%Nₙ)
args(%a₀=%v₀, …) : <types>
[dependency = [%t₀, …]]
[affinity = [%t₀, …]]
{
…
air.launch_terminator
}
// Synchronous form — caller blocks until all launch instances complete
air.launch sync (%x₀, …, %xₙ) in (%sx₀=%N₀, …, %sxₙ=%Nₙ)
args(%a₀=%v₀, …) : <types>
[dependency = [%t₀, …]]
[affinity = [%t₀, …]]
{
…
air.launch_terminator
}
air.launch defines the outermost scope of an AIR program. It:
- Declares an optional iteration space of shape
N₀ × … × Nₙ. Each point(x₀, …, xₙ)in this space constitutes an independent launch instance. - Implements may-be-parallel semantics: the runtime is permitted to execute
instances concurrently, but is not required to. The compiler must not assume a
fixed ordering between instances. A
concurrencylist attribute is therefore not permitted onair.launch; expressing an explicit concurrency bound would impose ordering constraints that contradict may-be-parallel semantics. - Ensures that all
air.segmentoperations declared within the body (and their associated L2 allocations) are co-resident on the device when the launch body begins executing, guaranteeing a bounded resource footprint. - Passes L3 memrefs and scalar values into the body through explicit kernel
operands. The body is
IsolatedFromAboveand cannot implicitly capture values from the enclosing scope.
The launch iteration space expresses a pool of independent work units that the runtime should schedule as hardware becomes available. It does not guarantee simultaneous execution of all instances; it expresses that instances are independent and may be run in any order or concurrently at the runtime's discretion.
Use the launch iteration space when:
- Each instance is fully self-contained and produces no result that another instance depends on.
- The workload naturally decomposes into a variable number of tiles determined at runtime or by a host-side loop.
- Runtime flexibility in scheduling is acceptable or desirable (e.g. when multiple launch streams share device resources).
Do not use the launch iteration space when you need a guarantee that all
instances execute simultaneously — for that, use the air.segment iteration
space (§2.2), which provides always-concurrent semantics (all instances are
co-resident and active for the duration of the segment).
A common goal is to overlap computation on one output tile with data movement for the next — "software pipelining" across tiles. This cannot be expressed across separate launch instances, because dependency tokens are the only token kind that may cross the launch boundary, and they only express ordering, not overlap.
Instead, express inter-tile pipelining within a single launch body:
-
Segment-level pipelining: place two
air.segmentops inside the launch body — a compute segment and a prefetch segment — connected bydependencytokens. Both segments are co-resident (the co-residency guarantee applies to all segments in the same launch body), so their execution genuinely overlaps: the compute segment processes tile N while the prefetch segment loads tile N+1. -
Loop-based pipelining: use an
scf.forloop inside the launch body (or inside a segment) withiter_args-carried tokens to overlap consecutive tile iterations, as described in thescf.forsection of §2.2. This is the standard double-buffering pattern and is typically the most efficient approach.
The degree of actual concurrency achieved by a launch iteration space depends on the backend:
| Backend | Launch iteration space behaviour |
|---|---|
| GPU (ROCDL) | Maps directly to gpu.launch gridDim; all instances execute in parallel as GPU thread blocks. The GPU scheduler guarantees concurrent dispatch of all grid points subject to occupancy. |
| NPU (AIE) | The runtime schedules instances as hardware partitions become available. Instances may run sequentially if only one hardware partition is available for the launch, or concurrently if multiple are available. No concurrency is guaranteed. |
This asymmetry is intentional: GPU hardware is designed for SIMT-style uniform
parallel dispatch, while NPU execution is more task-queue-like with variable
partition availability. Programs that require a guaranteed minimum concurrency
should use air.segment iteration spaces or concurrency tokens on segments
within a launch body rather than relying on the launch iteration space.
The synchrony of air.launch is determined by whether a result value is bound,
not by a separate async keyword. The sync keyword is an explicit synonym for
the no-result form and is provided for readability.
| Form | Behaviour |
|---|---|
[%token =] air.launch (…) {…} |
Asynchronous: dispatches immediately; returns !air.token that becomes signaled when all instances complete. Caller may overlap subsequent work. |
air.launch (…) {…} |
Synchronous (implicit): blocks the calling thread until all launch instances complete. Equivalent to the sync form. |
air.launch sync (…) {…} |
Synchronous (explicit): identical to the no-result form; sync keyword makes the blocking intent unambiguous, particularly useful when the op appears alongside async launches. |
| List | Purpose |
|---|---|
dependency |
!air.token values that must be signaled before any instance begins executing. Establishes a data/control dependency. |
affinity |
!air.token values that bind this launch to the same hardware partition as other ops sharing the same token (e.g. a specific column range on NPU or a specific XCD on GPU). |
air.launch must not carry a concurrency list. Concurrency constraints
belong to operations with always-parallel semantics (air.segment, air.herd);
air.launch instead expresses may-be-parallel work whose degree of parallelism
is determined by available hardware resources at runtime. Similarly, affinity
and concurrency tokens from enclosing scopes may not be passed into the launch
body (see §1.5 scope rule).
Nesting constraint: air.segment ops may appear inside air.launch, but not the
reverse.
// Asynchronous form — produces !air.token, non-blocking
[%token =] air.segment [@name]
(%x₀, …, %xₙ) in (%sx₀=%N₀, …)
args(%a₀=%v₀, …) : <types>
[x_loc=<col>] [y_loc=<row>] [x_size=<cols>] [y_size=<rows>]
[dependency = [%t₀, …]]
[affinity = [%t₀, …]]
[concurrency = [%t₀, …]]
{
…
air.segment_terminator
}
// Synchronous form — caller blocks until the segment (and all its herds) complete
air.segment [@name] sync
(%x₀, …, %xₙ) in (%sx₀=%N₀, …)
args(%a₀=%v₀, …) : <types>
[x_loc=<col>] [y_loc=<row>] [x_size=<cols>] [y_size=<rows>]
[dependency = [%t₀, …]]
[affinity = [%t₀, …]]
[concurrency = [%t₀, …]]
{
…
air.segment_terminator
}
air.segment represents a physically contiguous grouping of processing
elements together with their associated L2 (on-device) memory. It:
- Groups all
air.herdoperations, L2 allocations, and inter-level data movement required to implement a coherent kernel. - Optionally defines an iteration space of shape
N₀ × … × Nₙ. When present, every point in the space instantiates an independent copy of the segment body. All instances run with overlapping lifetimes: the runtime guarantees that every instance is active simultaneously, so the full set of instances is co-resident on the device for the duration of the segment. This is a "stamp-out" of the segment across independent hardware partitions. Each instance occupies a contiguous rectangular block of compute resources (columns × rows of tiles, L2 banks, DMA engines). The placement of distinct instances across the iteration space is independent: instances need not be adjacent or form a globally contiguous region; each may be placed anywhere in the device array subject only to non-overlap with other live instances. - Has access to L2 memory (allocated within its body with
memref.allocat address space 1) and can access L3 memory through DMA or channel operations.
The body of a segment may contain operations that express both temporal
constraints (sequencing via !air.token dependency lists, air.wait_all,
scf.for iter_arg-carried ordering, scf.parallel init/reduce token
merging) and spatial constraints (placement of air.herd tiles via
x_loc/y_loc, channel routing that fixes communication topology).
As with air.launch, synchrony is determined by the presence or absence of a
result value; sync is an explicit synonym for the no-result form.
| Form | Behaviour |
|---|---|
[%token =] air.segment (…) {…} |
Asynchronous: returns !air.token immediately; the enclosing launch body may overlap subsequent work with this segment. Token signals when all herds and data-movement ops within the segment complete. |
air.segment (…) {…} |
Synchronous (implicit): blocks until all herds and data-movement operations within the segment complete. |
air.segment sync (…) {…} |
Synchronous (explicit): identical to the no-result form. |
The compiler must determine the total resource requirements of a segment statically, at compile time. The analysis is recursive and bottom-up: inner segments are fully analysed before the segments that contain them, so that each level can treat its nested segments as opaque resource blocks.
The analysis at each level has two stages.
Stage 1 — per-instance worst-case analysis.
The analysis covers four structural cases: leaf segments (direct air.herd
ops), containing segments (nested air.segment ops), scf.for loops (with
iter_arg-based pipelining), and scf.parallel regions (all instances
concurrent, tokens reduced).
Leaf segment (body contains air.herd ops, no nested air.segment):
- L2 memory: the maximum number of bytes in address space 1 that may be live simultaneously, taken over all possible execution paths (maximum across conditional branches; see below for loop-carried allocations).
- Compute tiles: the maximum number of
air.herdtiles simultaneously active, derived from the herd iteration spaces and anyx_size/y_sizeplacement constraints. - DMA channels: the maximum number of concurrently active
air.dma_memcpy_ndorair.channeloperations, determined from the async dependency graph.
scf.for and scf.parallel loops within a segment body:
scf.for — loop-carried token dependencies.
scf.for iterations are logically sequential in program order, but the async
token model allows controlled overlap between consecutive iterations. The
mechanism is iter_args: the token yielded by scf.yield at the end of
iteration N becomes the iter_arg received at the start of iteration N+1.
%t_final = scf.for %k = %lb to %ub step %s
iter_args(%t_prev = %t_init) -> !air.token {
// Ops that depend on %t_prev wait for iteration N-1 to reach a known point.
// Ops that do NOT depend on %t_prev may execute concurrently with N-1.
%t_dma = air.dma_memcpy_nd async [%t_prev] …
%t_compute = air.herd async [%t_dma] …
%t_out = air.wait_all async [%t_compute]
scf.yield %t_out : !air.token
}An op in iteration N+1 that lists %t_prev in its dependency list does not
begin until iteration N has signaled that token. An op that does not depend
on %t_prev may begin before iteration N's scf.yield fires, creating genuine
concurrency between consecutive iterations.
The pipeline depth D for a given resource type is the number of independent
token chains maintained as separate iter_args for that resource:
- D = 1 (single
iter_argchaining all activity): strictly sequential — only one iteration's worth of resources is active at a time. - D = 2 (double-buffering): one DMA and one compute op from different iterations are simultaneously active. The common pattern is to yield the compute token as one iter_arg and a separate prefetch token as another, allowing the next iteration's data fetch to overlap with the current iteration's compute.
- D > 2: multiple independent in-flight streams tracked via multiple
iter_args. Each stream contributes independently to the concurrent resource count.
Resource bounds inside an scf.for:
- L2 memory:
memref.allocresults that are live acrossscf.yield(loop-carried allocations) are permanently live for the entire loop and contribute to every iteration's baseline. Allocations freed beforescf.yieldrequire only one slot (reused each iteration). - Compute tiles: the maximum number of simultaneously active herd instances across D iterations. Multiply the per-iteration herd footprint by the pipeline depth D for the compute resource chain.
- DMA channels: similarly, multiply the per-iteration DMA count by the pipeline depth D for the DMA resource chain.
scf.parallel — broadcast and tree-reduction of tokens.
scf.parallel expresses a set of independent instances that may execute
concurrently. The async token model for scf.parallel has two parts:
%t_out = scf.parallel (%i, %j) = (%lb0, %lb1) to (%ub0, %ub1) step (%s0, %s1)
init (%t_in) -> !air.token {
// All instances may start as soon as %t_in is signaled.
%t_instance = air.channel.put async [%t_in] @ch[%i, %j] …
scf.reduce(%t_instance : !air.token) {
^bb0(%a: !air.token, %b: !air.token):
%t_merge = air.wait_all async [%a, %b]
scf.reduce.return %t_merge : !air.token
}
}
// %t_out is signaled only when ALL instances have completed.init (%t_in): a single token that all parallel instances must wait for before beginning. It acts as a broadcast dependency — every instance sees the same%t_inas a prerequisite.scf.reduceblock: performs a pairwise tree-reduction over the tokens produced by each instance. Each call to the reduce block receives two instance tokens (%a,%b) and usesair.wait_allto merge them. The final result%t_outis signaled only when all instances' tokens have been reduced to one.
Resource bounds inside scf.parallel: all instances execute concurrently, so
their resource requirements are summed (not maxed), exactly as for a segment
iteration space (Stage 2 above). The aggregate resource demand is the per-instance
bound multiplied by the total number of parallel instances.
Containing segment (body contains one or more nested air.segment ops):
Each nested segment is first fully analysed through its own Stage 1 and Stage 2 to produce a total resource footprint — a tuple (tiles, L2 bytes, DMA channels) representing the resources that nested segment holds across all of its iterations simultaneously. This footprint is then treated as an indivisible unit at the outer level. The containing segment's Stage 1 computes:
- Compute tiles: the maximum number of tile-rows × tile-columns
simultaneously reserved, obtained by summing the footprints of all
concurrently active nested segments (and any direct
air.herdops) in the outer body. "Concurrently active" is determined from the outer body's async dependency graph: nested segments linked only bydependencytokens are treated as sequential and their resources may be reused; nested segments with no ordering constraint between them, or linked byconcurrencytokens, are treated as simultaneous and their resources are summed. - L2 memory: maximum simultaneous L2 allocation across all concurrently active nested segments plus any direct allocations in the outer body.
- DMA channels: maximum concurrent DMA/channel ops, counting each nested segment's DMA footprint as a unit.
Stage 2 — scaling by the iteration space.
Because all instances of a segment's iteration space execute with overlapping
lifetimes, the per-instance bounds from Stage 1 are multiplied by the total
iteration count (N₀ × … × Nₙ) to obtain that segment's total resource
footprint. A segment with no iteration space has an implicit count of 1.
Stage 2 applies independently at each nesting level: the inner segment scales by its own iteration count first; the resulting total footprint is what the outer Stage 1 sees as a unit; the outer segment then scales its own per-instance result by its iteration count. Resources are therefore multiplied by the product of all iteration counts along the nesting path, reflecting the fact that all instances at every level are simultaneously live.
The total footprint describes the sum of per-instance blocks, not a single contiguous region: the compiler allocates a contiguous block per instance and may place those blocks anywhere in the device array that fits.
These statically computed totals are used to:
- Verify that the segment hierarchy (across all iterations at all levels) fits within the target hardware partition before lowering begins.
- Assign physical resources (tile columns/rows, memory banks, DMA engines) to each iteration at each level without runtime negotiation.
- Guarantee that all segment instances — and any other segments declared within
the same
air.launch— can be co-resident without resource conflicts.
Optional physical placement attributes (x_loc, y_loc, x_size, y_size)
annotate the column/row offset and extent of one instance on devices with
2D tile arrays. For a segment with an iteration space, the compiler assigns a
separate (x_loc, y_loc) origin to each iteration point independently; those
per-instance origins need not be adjacent. When these attributes are present
they are taken as authoritative for a single instance's contiguous footprint;
when absent the compiler derives a per-instance contiguous footprint from the
worst-case analysis above and places each instance freely.
The optional backend-granularity attribute communicates the hardware
granularity that this segment instance should map to on the target backend.
Currently defined values:
| Value | Meaning |
|---|---|
XCD |
The segment maps to one XCD (Accelerator Complex Die) on AMD MI3xx GPUs. Used with a multi-XCD air.launch iteration space to achieve full-device occupancy (see §4.2). |
When backend-granularity is absent the backend applies its default mapping
(e.g., one workgroup / thread block on GPU).
Nesting constraints:
air.segmentmust be nested insideair.launchor inside anotherair.segment(directly or indirectly — the outermost segment must be inside a launch).air.herdmust be nested insideair.segment. It is the innermost hierarchy level and must not containair.segmentor anotherair.herd.- The nesting depth of
air.segmentis unbounded by the model; backends may impose device-specific limits (e.g. two levels on MI3xx — XCD and CU).
// Synchronous form — acquires resources immediately, blocks until all PEs complete
air.herd [@name] tile (%x, %y) in (%sx=%Nx, %sy=%Ny)
args(%a₀=%v₀, …) : <types>
[x_loc=<col>] [y_loc=<row>]
[link_with="<object>"]
[dependency = [%t₀, …]]
[affinity = [%t₀, …]]
[concurrency = [%t₀, …]]
{
…
air.herd_terminator
}
// Asynchronous form — produces !air.token; acquires resources when deps met, signals on completion
[%token =] air.herd [@name]
tile (%x, %y) in (%sx=%Nx, %sy=%Ny)
args(%a₀=%v₀, …) : <types>
[x_loc=<col>] [y_loc=<row>]
[link_with="<object>"]
[dependency = [%t₀, …]]
[affinity = [%t₀, …]]
[concurrency = [%t₀, …]]
{
…
air.herd_terminator
}
air.herd defines a 1D or 2D array of processing elements (PEs) that all
execute the same body code in data-parallel fashion. It is the innermost level
of the hierarchy and maps directly to physical compute tiles (NPU) or GPU
threads:
- The iteration space
%Nx × %Nydetermines how many PE instances execute the body and defines a contiguous block of compute resources. All instances run concurrently on distinct hardware resources; this is always-parallel (not may-be-parallel). - Block arguments
%xand%ygive each instance its own coordinates, enabling each PE to independently address its portion of L1 or L2 memory. - Within the herd body,
L1memory (address space 2) is per-PE local scratchpad. - Data needed from L2 or L3 must be explicitly fetched via
air.dma_memcpy_ndor channel operations before use.
Operations within the herd body may impose both temporal constraints (explicit token dependencies ordering DMA fetches before compute) and spatial constraints (channel indices or tile-coordinate arithmetic that binds each PE to a specific region of a shared buffer).
The meaning of "contiguous" and the constraints imposed on iteration space dimensions depend on the target backend:
-
AIE (NPU): A herd defines a logical rectangular array of compute units. The compiler may reshape the iteration space (e.g., collapse a 2D herd into a 1D arrangement) via the
AIRCollapseHerdPass. Reshaping is inhibited automatically when the herd body uses cascade channels (channel_type = "npu_cascade"), because cascade connections are topology-dependent and cannot survive reindexing. Explicit placement attributes (x_loc,y_loc,x_size,y_size) on the enclosing segment also constrain the legal shapes by fixing the tile footprint. The pass accepts amax-col-sizeoption to bound the width of the collapsed arrangement. A dedicated per-herd attribute to disable reshaping independently of these conditions is not yet implemented. The resulting contiguous rectangular region of physical tiles is determined after any reshaping has been applied. -
GPU (AMD MI3xx family): A herd executes entirely within a single Compute Unit (CU), with PE instances mapped to individual warps. The combined resource requirements (register file, LDS, wavefront slots) of all PE instances must fit within one CU. On MI3xx devices a CU provides fewer than 32 wavefront slots, so the total number of PE instances (
%Nx × %Ny) is correspondingly limited.
| Form | Behaviour |
|---|---|
| No result (synchronous) | Acquires resources immediately; blocks the enclosing body until all PE instances complete; releases resources at the herd terminator. |
[%token =] (asynchronous) |
Acquires resources when all dependency tokens are signaled; returns !air.token immediately; releases resources when the token fires (all PE instances complete). |
Optional attributes x_loc/y_loc specify the base placement on 2D tile
arrays; each PE at tile (%x, %y) occupies physical column/row
(x_loc + %x, y_loc + %y). The link_with attribute names an external kernel
object to link into the herd at compile time.
air.dma_memcpy_nd (dst[dst_offsets][dst_sizes][dst_strides],
src[src_offsets][src_sizes][src_strides]) : (type_dst, type_src)
Describes an asynchronous N-dimensional strided bulk copy between two memrefs. The
[offsets][sizes][strides] triples follow the same convention as memref.subview:
offsets is the starting index in each dimension, sizes is the number of elements to
transfer in each dimension, and strides is the stride in units of elements.
The operation is direction-agnostic: either the source or the destination (or both) may be at any memory level. The direction (L3→L2, L2→L1, L1→L2, etc.) is inferred from the address spaces of the operand memrefs and mapped to the appropriate hardware mechanism.
An empty [offsets], [sizes], or [strides] list for a side means the entire memref
is addressed with unit strides.
Optional src_rank / dst_rank integer attributes name a peer rank in the
enclosing air.rank scope. When present, the corresponding memref is
interpreted as living on rank R's symmetric heap rather than on the local
process. The verifier requires the op to be enclosed by an air.rank and the
referenced memref to be air.symmetric-tagged (see §2.7). Lowering for the
GPU backend (planned: air-cross-rank-dma-to-mgpu) will expand these into
mgpuGetHeapBases()-based peer-VA arithmetic + mgpuMemcpy; the NPU
backend does not support these attributes.
// Read 1024 floats from rank 0's symmetric buffer into local L1.
air.dma_memcpy_nd (%local[][][], %sym[][][]) {src_rank = 0 : i64}
: (memref<1024xf32, 2>, memref<1024xf32, 0>)
// Channel declaration — at module scope
air.channel @name [dim₀, dim₁, …] {channel_type = "npu_dma_stream", depth = <N>}
// Synchronous put/get — block until the transfer completes
air.channel.put @name[indices] (src[offsets][sizes][strides]) : (type_src)
air.channel.get @name[indices] (dst[offsets][sizes][strides]) : (type_dst)
// Asynchronous put/get — return !air.token; transfer completes when token signals
[%token =] air.channel.put @name[indices] [dependency = [%t₀, …]]
(src[offsets][sizes][strides]) : (type_src)
[%token =] air.channel.get @name[indices] [dependency = [%t₀, …]]
(dst[offsets][sizes][strides]) : (type_dst)
Channels are declared once (at module scope) and used at distinct producer and consumer sites. They decouple the put and get sides, allowing the compiler to schedule them independently and to introduce double-buffering.
A channel may be an array (e.g., [4, 4] for a 4×4 array). The indices operand on
put/get selects the specific channel within the array.
The channel_type attribute controls the underlying mechanism. Values are
namespaced by backend: NPU (AIE) channels use the npu_ prefix; GPU channels
use the gpu_ prefix.
| Value | Mechanism |
|---|---|
"npu_dma_stream" (default) |
NPU: DMA engines with streaming (circuit-switched) interconnect |
"npu_dma_packet" |
NPU: DMA engines with packet-switched interconnect |
"npu_cascade" |
NPU: Core-to-core cascade connections between adjacent tiles |
"npu_mmio" |
NPU: Host-side MMIO blockwrites delivering a constant payload into a tile-local L1 buffer |
"gpu_symmetric_heap" |
GPU: Cross-rank messaging through the symmetric heap runtime (XGMI peer-mapped VMem). Requires an enclosing air.rank scope. |
The broadcast_shape attribute enables one-to-many communication following NumPy
broadcasting rules.
A channel has a finite buffer capacity set by the depth attribute (default 1).
The depth specifies how many transfers may be in-flight simultaneously:
| Depth | Effect |
|---|---|
| 1 (default) | Rendezvous: each put must be consumed by a get before the next put can proceed. |
| 2 | Double-buffering: producer may issue one transfer ahead of the consumer. |
| N | Producer may have up to N transfers outstanding before it must wait. |
Increasing depth relaxes the coupling between producer and consumer at the cost of additional buffer memory (each slot requires storage for one transfer's worth of data).
put and get are blocking at the channel boundary:
- A
putissued when the channel already holdsdepthunread transfers stalls until the consumer issues agetand frees a slot. - A
getissued when the channel holds no data stalls until the producer issues aput.
In the asynchronous form the operation is dispatched immediately and the returned
!air.token does not signal until the blocking condition is resolved and the transfer
is complete. This allows other work to proceed while the channel handshake is in
progress, but does not remove the fundamental flow-control constraint.
For a channel to make progress, every put must be matched by a get on the same
channel index, and vice versa. The compiler enforces the static balance condition:
- Along every possible execution path through the program the number of
putoperations and the number ofgetoperations at each channel index must be equal. - For channels inside loop bodies, balance must hold per iteration (equal puts and gets in the loop body).
- For channels connecting herds with different iteration spaces, the total transfer count
must balance:
(puts per producer instance) × (producer instances)must equal(gets per consumer instance) × (consumer instances). - For channels inside conditional branches, balance must hold independently on each branch.
A violation of the balance condition is a compile-time error.
A program deadlocks when a put is waiting for a get that can never execute, or a
get is waiting for a put that can never execute, and no other operation can break
the wait.
Minimal deadlock — put and get in the same sequential scope:
air.channel @C [] {depth = 1}
air.segment @deadlock {
// put blocks: channel is at capacity, waiting for a get to free space
air.channel.put @C[] (%src[][][]) : (memref<…, 1>)
// never reached — sequential control flow cannot reach get while put blocks
air.channel.get @C[] (%dst[][][]) : (memref<…, 1>)
air.segment_terminator
}The necessary condition for deadlock freedom is that for every channel there exists a
concurrent execution context in which the matching put and get can both
make progress simultaneously. In practice:
putandgeton the same channel must appear in different herds, different async branches, or different segment instances — any context that allows them to execute concurrently.- The communication graph (nodes = ops, edges = channel put→get dependencies) must be acyclic: a cycle means at least one op in the cycle is waiting on another in the same cycle, which can never be resolved.
air.execute wraps a sequential block of host-visible computation (e.g., memref.alloc,
arithmetic on scalars) in an asynchronous region. The region produces an
!air.token (dependency kind) that becomes ready when all operations inside have completed.
air.wait_all takes a variadic list of !air.token values and produces a single
token that becomes ready only when all of its inputs are ready. It acts as a join node
in the async dependency graph.
A memref.alloc op may carry the unit attribute air.symmetric to indicate
that the allocation should be backed by the symmetric heap runtime. Every
rank in the enclosing air.rank scope performs the same allocation in lockstep,
so each rank has a memref of the same size at the same offset within the heap.
Cross-rank addressing (via air.dma_memcpy_nd src_rank/dst_rank attributes
or air.channel with channel_type = "gpu_symmetric_heap") refers to peer
ranks' symmetric memrefs at the same logical offset.
%buf = memref.alloc() {air.symmetric} : memref<1024xf32>
The GPU lowering (planned: air-symmetric-alloc-to-mgpu) will route such
allocations through mgpuSymmetricAlloc (runtime_lib/airgpu/gpu_runtime.cpp)
instead of plain mgpuMemAlloc. Peer ranks' base pointers are obtained at
runtime via mgpuGetHeapBases(). The NPU backend does not interpret this
attribute.
On AMD Versal AI Engine (AIE) and Ryzen AI NPU targets the three-level hierarchy maps directly to physical hardware structures:
| AIR concept | AIE hardware |
|---|---|
air.launch |
Device-level kernel dispatch; L3 = DDR via NOC |
air.segment |
A contiguous rectangle of AIE tiles + associated memory tiles or URAMs (L2) |
air.herd tile (x, y) |
A single AIE compute tile at physical column/row offset (x_loc+x, y_loc+y) |
| L1 memory (space 2) | Per-tile local data memory (32 KB on AIE2) |
| L2 memory (space 1) | Shared memory tiles or URAMs accessible via the tile interconnect |
| L3 memory (space 0) | DDR accessible via AXI4-MM NOC |
| AIR operation | AIE mechanism |
|---|---|
air.dma_memcpy_nd (L3→L2) |
Shim DMA + AXI-S routing + memory tile DMA |
air.dma_memcpy_nd (L2→L1) |
Tile DMA + local AXI-S stream |
air.channel (dma_stream) |
DMA + circuit-switched streaming interconnect |
air.channel (dma_packet) |
DMA + packet-switched overlay network |
air.channel (cascade) |
Core cascade interface (direct register-to-register) |
The air-to-aie pass translates each air.herd into AIE tile operations (aie.core,
aie.buffer, aie.lock), routes channels through the switch boxes, and generates the
SHIM DMA configuration for L3 transfers.
Placement of a segment on physical tiles is guided by the optional x_loc/y_loc
(column/row offset) and x_size/y_size (column/row count) attributes on
air.segment. The herd's tile indices are added to the segment offset to obtain the
absolute physical tile location.
On AMD GPU targets the AIR hierarchy maps onto the GPU execution model. The
air-to-rocdl and air-gpu-outlining passes perform this translation. The
basic mapping uses three levels (launch → segment → herd); on MI3xx devices a
four-level mapping with nested segments is used to achieve full-device occupancy
(see §4.2).
| AIR concept | GPU concept |
|---|---|
air.launch (%bx,%by) in (%gbx,%gby) |
gpu.launch grid: gridDim = (gbx, gby, 1) |
air.segment |
Workgroup (thread block); the segment body runs within a single gpu.launch |
air.herd tile (%x,%y) in (%bx,%by) |
Thread block dimensions: blockDim = (bx, by, 1) |
Herd tile index (%x, %y) |
(threadIdx.x, threadIdx.y) |
Launch index (%bx, %by) |
(blockIdx.x, blockIdx.y) |
The air.launch iteration space becomes the grid (number of thread blocks), and
the air.herd iteration space becomes the block (number of threads per block).
The air.segment body is the thread-block body: code that runs once per workgroup before
and after the per-thread air.herd body.
After translation the hierarchy is flattened: air.segment and air.herd are erased
and their bodies are moved into the enclosing gpu.launch region. The
air-gpu-outlining pass then extracts the gpu.launch body into a gpu.func kernel
and injects the appropriate gpu.BlockIdOp, gpu.ThreadIdOp, gpu.GridDimOp, and
gpu.BlockDimOp intrinsics.
On AMD MI3xx devices the full device can be occupied using a four-level hierarchy built from a unit-iteration-space launch, two nested segments, and an innermost herd:
| Level | AIR construct | MI3xx granularity |
|---|---|---|
| 1 | air.launch (no iteration space — implicit single instance) |
Whole-device dispatch |
| 2 | Outer air.segment (backend-granularity=XCD, iteration space = 8) |
All 8 XCDs simultaneously — all instances stamp out concurrently |
| 3 | Inner air.segment (iteration space = CUs per XCD) |
One CU within the XCD — all instances stamp out concurrently |
| 4 | air.herd (iteration space ≤ 32) |
Concurrent warps on that CU |
air.launch carries no iteration space (an implicit single-point space).
Its sole purpose at this level is to anchor the co-residency guarantee for
everything nested inside.
Outer air.segment (backend-granularity=XCD) has an iteration space of
cardinality 8, one instance per XCD on the MI3xx device. Because segment
instances always execute with overlapping lifetimes, all 8 XCDs are occupied
simultaneously.
Inner air.segment is nested inside the outer and iterates over the CUs
within its XCD. Again all instances run concurrently, so every CU in the XCD
is active at the same time.
air.herd is nested inside the inner segment and targets the warps running
on that single CU. The iteration space must not exceed 32 PE instances (the
wavefront-slot limit of a MI3xx CU, as described in §2.3). Iterations within
the herd may communicate through:
- LDS memory (
L2address space) — per-CU local data store, shared among all warps in the herd. - Global memory (
L3address space) — device HBM, accessible to all levels of the hierarchy.
The backend-granularity=XCD attribute on the outer segment is the only
MI3xx-specific annotation required; all other tiling falls out of the standard
segment and herd iteration space mechanisms.
| AIR memory space | Enum | GPU address space | GPU scope |
|---|---|---|---|
| L3 (space 0) | MemorySpace::L3 |
0 (generic/global) | Device global memory (HBM) |
| L2 (space 1) | MemorySpace::L2 |
3 (local) | Workgroup (LDS/shared memory) |
| L1 (space 2) | MemorySpace::L1 |
5 (private) | Per-thread private memory (VGPRs/scratch) |
memref.alloc ops in L2 space are hoisted to gpu.launch workgroup attributions
(shared among all threads in the block). memref.alloc ops in L1 space become
private attributions (one copy per thread). Explicit memref.dealloc ops are
removed because GPU attributions have implicit kernel-scoped lifetimes.
air.dma_memcpy_nd operations are lowered to explicit SCF loops containing
memref.load / memref.store pairs. The offsets, sizes, and strides from the DMA
descriptor drive the loop bounds and index arithmetic. Subsequent lowering passes
(ROCDL, LLVM CodeGen) optimise these into coalesced global loads, LDS stores, or
register-to-register moves depending on the inferred address spaces.
gpu.barrier instructions (already present in well-formed AIR GPU programs or inserted
by the compiler) synchronize threads at workgroup boundaries between DMA-equivalent
loads into shared memory and subsequent compute.
The GPU test in test/gpu/4k_4k_mul/air_sync.mlir illustrates the model:
// 32×32 grid of segments, one per 128×128 output tile
air.launch (%bx, %by) in (%nbx=%c32, %nby=%c32)
args(%A=%arg0, %B=%arg1, %C=%arg2) : … {
air.segment @forward_0 args(%bx=%bx, %by=%by, %A=%A, %B=%B, %C=%C) : … {
// Segment body: runs once per workgroup
// L2 allocations — shared memory tiles
%As = memref.alloc() : memref<128x8xf32, 1> // L2 → LDS
%Bs = memref.alloc() : memref<8x128xf32, 1> // L2 → LDS
scf.for %k = 0 to 4096 step 8 {
// Load A and B tiles into shared memory (DMA equivalent: global → LDS)
// Uses gpu.thread_id to distribute loads across threads
…
gpu.barrier
// 256 herd tiles (threads) compute the outer product
air.herd @herd_0 tile (%tx, %ty) in (%ntx=%c256, %nty=%c1)
args(%As=%As, %Bs=%Bs, …) : … {
// L1 accumulation registers (VGPRs)
%acc = memref.alloc() : memref<64xf32, 2> // L1 → private/VGPRs
…
gpu.barrier
}
}
// Write accumulated results back to L3 (global memory)
}
}The mapping:
air.launch→gpu.launchwithgridDim = (32, 32, 1)air.segment→ workgroup body (executed once per thread block)air.herd tile (%tx, .) in (%c256, %c1)→blockDim = (256, 1, 1),threadIdx.x=%tx- L2 memrefs (space 1) → LDS (shared memory)
- L1 memrefs (space 2) → VGPRs / private scratch
The full GPU lowering pipeline is:
air-to-rocdl— Flattensair.launch/air.segment/air.herdinto agpu.launch, converts L1/L2 allocations to workgroup/private attributions, and lowersair.dma_memcpy_ndto SCF loops.air-gpu-outlining— Outlines thegpu.launchbody into agpu.funcinside agpu.module, injectsgpu.BlockIdOp,gpu.ThreadIdOp, etc., and converts the launch site togpu.launch_func.- LLVM lowering — Lowers affine, SCF, and CF dialects.
- ROCDL binary generation — Attaches a ROCDL target descriptor (
rocdl-attach- target), convertsgpu.moduleto ROCDL IR (convert-gpu-to-rocdl), and serialises the GPU binary withgpu-module-to-binary.
See buildingGPU.md for build instructions and the complete
aircc compilation pipeline.
| Concept | NPU (AIE) | GPU (ROCDL) |
|---|---|---|
air.launch iteration point |
Device-level work unit | One GPU thread block |
air.segment |
Rectangle of AIE tiles + memory tiles | Thread block workgroup body |
air.herd tile |
Single AIE compute tile | Single GPU thread |
| L1 (space 2) | 32 KB tile-local data memory | Thread-private VGPRs / scratch |
| L2 (space 1) | Memory tiles / URAMs | LDS (shared memory, ~64 KB / CU) |
| L3 (space 0) | DDR via NOC | HBM via global memory |
dma_memcpy_nd (intra-rank) |
AIE Shim/Tile DMA engines | SCF load/store loops |
dma_memcpy_nd (cross-rank, src_rank/dst_rank) |
— | Symmetric heap peer addressing (planned) |
channel (npu_dma_stream) |
Streaming AXI-S switch | n/a |
channel (npu_dma_packet) |
Packet-switched AXI-S overlay | n/a |
channel (npu_cascade) |
Core cascade interface | n/a |
channel (npu_mmio) |
Host MMIO blockwrite | n/a |
channel (gpu_symmetric_heap) |
n/a | Symmetric heap peer addressing (planned) |
air.symmetric memref alloc |
n/a | mgpuSymmetricAlloc (planned) |
| Synchronization | AIE locks | gpu.barrier (intra-rank), mgpuBarrier (cross-rank) |
!air.token (dependency) |
AIE runtime completion signals | GPU stream/event dependencies |