New op: air.rank — multi-device / multi-host parallelism above air.launch #1414

@SamuelBayliss

Summary

Add a new air.rank operation and supporting air.universe type to the AIR dialect.
air.rank is a fourth hierarchy level that sits above air.launch and expresses a
communicating world of rank instances, where each instance corresponds to a complete
GPU device or a CPU host process (with its attached devices).

A detailed design document is in docs/AIRRankOp.md on the compute-model-explanation
branch. This issue captures the implementation work.


Motivation

The existing three-level hierarchy (air.launch → air.segment → air.herd) describes
work that runs on a single device. There is no dialect-level primitive for expressing
programs that span multiple GPUs or multiple hosts. Without such a primitive, multi-device
programs must be assembled entirely at the runtime or framework layer, losing the
compiler's ability to reason about inter-device data movement, resource allocation,
and communication scheduling.

air.rank fills this gap by extending the hierarchy upward:

air.rank           — communicating world of devices or hosts    ← NEW
  air.launch       — co-resident work on one device
    air.segment    — tile rectangle + L2 memory
      air.herd     — array of PEs

New Operations and Types

air.rank

[%token =] air.rank (%r₀, …, %rₙ) in (%sr₀=%M₀, …, %srₙ=%Mₙ)
           args(%a₀=%v₀, …) : <types>
           [universe = %u]
           [dependency  = [%t₀, …]]
           [affinity    = [%t₀, …]]
           {
  …
  air.rank_terminator
}

Key properties:

  • Defines an N-dimensional iteration space. Each point is a rank instance.
  • Each rank instance maps to:
    • A distinct GPU device when nested inside air.launch.
    • A distinct host process (with attached devices) when at top level.
  • The body is IsolatedFromAbove; values are passed via explicit kernel operands.
  • The compiler is free to distribute the serial preamble of the body across all
    participating rank instances (SPMD fork semantics).
  • May-be-parallel semantics (like air.launch): a concurrency token list is
    not permitted; dependency and affinity lists are permitted.
  • A universe operand of type !air.universe constrains the physical pool from
    which rank instances are scheduled.

Terminator: air.rank_terminator (analogous to air.launch_terminator).
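
To make the "N-dimensional iteration space" concrete, here is a minimal C++ sketch of how a flat rank number could be mapped to the per-dimension rank ids (%r₀, …, %rₙ). The helper names rankCount and rankIds are hypothetical, and the row-major ordering (last dimension fastest) is an assumption; the issue does not specify an ordering.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Total number of rank instances: the product of the per-dimension sizes.
std::size_t rankCount(const std::vector<std::size_t> &sizes) {
  std::size_t n = 1;
  for (std::size_t s : sizes)
    n *= s;
  return n;
}

// Convert a flat rank number into N-dimensional rank ids.
// Assumption: the last dimension varies fastest (row-major order).
std::vector<std::size_t> rankIds(std::size_t flat,
                                 const std::vector<std::size_t> &sizes) {
  assert(flat < rankCount(sizes) && "flat index outside the iteration space");
  std::vector<std::size_t> ids(sizes.size());
  for (std::size_t d = sizes.size(); d-- > 0;) {
    ids[d] = flat % sizes[d];  // position within dimension d
    flat /= sizes[d];          // carry the remainder to the outer dimensions
  }
  return ids;
}
```

With sizes {2, 3} there are six rank instances, and flat rank 4 corresponds to ids (1, 1) under this ordering.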

air.universe type and air.universe.alloc

%u = air.universe.alloc(%capacity) : !air.universe

!air.universe is an opaque dialect type that represents a bounded pool of
concurrently available devices or host processes. Allocating a universe with
capacity = N instructs the runtime to guarantee that at least
min(N, iteration_space_size) rank instances may run concurrently. The handle
is passed to air.rank through its universe operand.
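
The concurrency guarantee above is simple arithmetic, sketched below in C++. The function name guaranteedConcurrency is hypothetical; it only restates min(N, iteration_space_size), taking the iteration-space size to be the product of the air.rank sizes.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Lower bound on the number of rank instances that may run concurrently,
// for a universe of the given capacity and an air.rank with these sizes.
std::size_t guaranteedConcurrency(std::size_t capacity,
                                  const std::vector<std::size_t> &sizes) {
  std::size_t iterationSpaceSize = 1;
  for (std::size_t s : sizes)
    iterationSpaceSize *= s;
  return std::min(capacity, iterationSpaceSize);
}
```

For a 2×4 rank space (eight instances), a capacity of 4 guarantees four concurrent instances, while a capacity of 16 guarantees all eight.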


Implementation Scope

1. Dialect changes (mlir/include/air/Dialect/AIR/)

  • AIROpBase.td: Add !air.universe dialect type (or a new AIRTypes.td).
  • AIR.td:
    • Add air_RankOp following the pattern of air_LaunchOp/air_SegmentOp:
      • Traits: air_AsyncOpInterface, air_HierarchyInterface,
        AttrSizedOperandSegments, IsolatedFromAbove, AffineScope,
        SingleBlockImplicitTerminator<"RankTerminatorOp">.
      • Arguments: OptionalAttr<SymbolNameAttr>:$sym_name,
        Variadic<air_AsyncToken>:$async_dependencies,
        Variadic<Index>:$sizes,
        Variadic<AnyType>:$rank_operands,
        Optional<air_UniverseType>:$universe.
      • Result: Optional<air_AsyncToken>:$async_token.
    • Add air_RankTerminatorOp.
    • Add air_UniverseAllocOp.

2. C++ op implementation (mlir/lib/Dialect/AIR/IR/)

  • Implement getIds(), getSize(), getSizeOperands(),
    getNumKernelOperands(), getKernelOperands(), getKernelArguments(),
    getNumDims(), getId() for RankOp (same pattern as LaunchOp).
  • Implement custom assembly format (parser/printer) for air.rank.
  • Implement air.universe.alloc and the !air.universe type.
  • Add canonicalizer for RankOp (remove unused kernel operands).

3. Verifier rules

  • air.rank must not carry a concurrency token list.
  • When nested inside air.launch, verify that air.rank bodies contain
    only air.launch ops (not bare air.segment/air.herd).
  • When at top level, air.rank bodies may contain air.launch ops.
  • universe operand, if present, must be of type !air.universe.
  • air.rank_terminator must have air.rank as its immediate parent.
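
The rules above can be sketched on a toy model, independent of MLIR. ToyRankOp, verifyRank and verifyRankTerminator are hypothetical stand-ins, not the real mlir::Operation or verifier API; an empty string means "no error". The sketch assumes the top-level case places no restriction on body ops, since the list only says such bodies may contain air.launch.

```cpp
#include <string>
#include <vector>

// Toy stand-in for an air.rank op (illustrative only).
struct ToyRankOp {
  std::vector<std::string> bodyOps;  // names of ops directly in the body
  bool hasConcurrencyList;           // true if a concurrency token list is present
  std::string universeType;          // empty when there is no universe operand
};

// Returns a message for the first violated rule, or "" if all rules hold.
std::string verifyRank(const ToyRankOp &rank, bool nestedInLaunch) {
  if (rank.hasConcurrencyList)
    return "air.rank must not carry a concurrency token list";
  if (!rank.universeType.empty() && rank.universeType != "!air.universe")
    return "universe operand must be of type !air.universe";
  if (nestedInLaunch) {
    for (const std::string &op : rank.bodyOps) {
      if (op == "air.rank_terminator")
        continue;  // the terminator is always allowed
      if (op != "air.launch")
        return "air.rank nested in air.launch may contain only air.launch ops";
    }
  }
  return "";
}

// The terminator rule only needs the name of the immediate parent op.
std::string verifyRankTerminator(const std::string &parentName) {
  if (parentName != "air.rank")
    return "air.rank_terminator must have air.rank as its immediate parent";
  return "";
}
```

In the real dialect these checks would presumably live in the ops' verify() hooks; the toy model is only meant to pin down which inputs each rule rejects.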

4. Passes / transforms (mlir/lib/Transform/)

  • air-rank-to-launch pass: lower air.rank to a loop (for single-process
    targets that simulate multi-device execution serially), threading rank index
    arguments into enclosed air.launch ops.
  • Update existing HierarchyInterface traversal passes (e.g. loop unrolling,
    operand hoisting, async dependency analysis) to handle air.rank as the new
    outermost level.
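
As a rough picture of what the serialised form computes, the C++ sketch below visits every point of the rank iteration space in turn and hands the rank ids to a callback standing in for the enclosed air.launch ops. runRanksSerially and RankBody are hypothetical names, and the visiting order (last dimension fastest) is an assumption; the actual pass would emit loops in IR rather than run anything.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Stand-in for the rank body: receives the rank ids of one instance.
using RankBody = std::function<void(const std::vector<std::size_t> &)>;

// Execute all rank instances one after another on a single process.
// Returns the number of instances executed.
std::size_t runRanksSerially(const std::vector<std::size_t> &sizes,
                             const RankBody &body) {
  for (std::size_t s : sizes)
    if (s == 0)
      return 0;  // empty iteration space: nothing to run

  std::vector<std::size_t> ids(sizes.size(), 0);
  std::size_t executed = 0;
  for (;;) {
    body(ids);
    ++executed;
    // Odometer-style increment: bump the last dimension, carrying outward.
    std::size_t d = sizes.size();
    while (d > 0 && ++ids[d - 1] == sizes[d - 1]) {
      ids[d - 1] = 0;  // this dimension wrapped; carry into the next one
      --d;
    }
    if (d == 0)
      return executed;  // every dimension wrapped: space exhausted
  }
}
```

Because air.rank has may-be-parallel semantics, a serial order like this is one legal schedule among many, which is what makes the loop lowering valid for single-process targets.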

5. Lowering to GPU backend (mlir/lib/Conversion/)

  • air.rank at top level → one MPI process (or fork()-style process group)
    per rank instance; emit a process launch stub.
  • air.rank inside air.launch → one HIP device context per rank instance;
    use hipSetDevice(rank_id) before enclosed air.launch lowering.
  • air.channel at rank body scope → RCCL collective (if pattern matches) or
    peer-to-peer hipMemcpyPeer.

6. air.universe runtime support

  • Add air_universe_alloc(size_t capacity) to the AIR runtime API, returning
    an opaque handle.
  • The handle is passed to the air.rank execution function and used to
    pre-allocate process/device slots and communicators (MPI_Comm / rcclComm_t).
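
One possible shape for the opaque handle is sketched below in C++. Only air_universe_alloc(size_t capacity) is named in this issue; the handle typedef, the struct contents, air_universe_capacity and air_universe_free are hypothetical, and a real implementation would also hold the pre-allocated slots and communicators.

```cpp
#include <cstddef>
#include <new>

// Opaque to callers; the single field here is illustrative only.
struct air_universe {
  std::size_t capacity;  // requested number of concurrent rank slots
};
typedef air_universe *air_universe_t;  // hypothetical handle type name

// Named in the issue: allocate a universe with the given capacity.
// Returns nullptr if allocation fails.
air_universe_t air_universe_alloc(std::size_t capacity) {
  return new (std::nothrow) air_universe{capacity};
}

// Hypothetical accessor: capacity of a universe, or 0 for a null handle.
std::size_t air_universe_capacity(air_universe_t u) {
  return u ? u->capacity : 0;
}

// Hypothetical counterpart to air_universe_alloc.
void air_universe_free(air_universe_t u) { delete u; }
```

Keeping the struct definition out of the public header would let the runtime add MPI or RCCL state later without changing the API surface.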

7. Tests

  • FileCheck / lit tests in test/dialect/AIR/:
    • Round-trip parsing/printing of air.rank and air.universe.alloc.
    • Verifier rejection tests (concurrency list, bad nesting, wrong universe type).
  • Lowering tests in test/Transform/ and test/Conversion/:
    • air-rank-to-launch serialisation pass.
    • GPU lowering with hipSetDevice.
  • Integration test (optional, requires multi-GPU): two-rank GEMM splitting
    output rows across two GPUs.

8. Documentation

  • docs/AIRRankOp.md (already drafted on compute-model-explanation).
  • Update docs/AIRComputeModel.md §1.1 to show the four-level hierarchy.
  • Update docs/README.md index.

Non-Goals (for this issue)

  • NPU backend lowering (NPUs are single-device; multi-chiplet support is future work).
  • Formalising L4 memory space in air::MemorySpace (reserved for follow-on issue).
  • Fine-grained sub-universe device selection (e.g. pinning to a specific interconnect
    island); air.universe in this issue is cardinality-only.

References

  • Design document: docs/AIRRankOp.md (branch compute-model-explanation)
  • Existing hierarchy op pattern: AIR.td air_LaunchOp, air_SegmentOp
  • Existing interface definitions: AIROpBase.td air_HierarchyInterface
  • arXiv:2510.14871 — "From Loop Nests to Silicon: Mapping AI Workloads onto AMD NPUs
    with MLIR-AIR"
