New op: air.rank — multi-device / multi-host parallelism above air.launch

## Summary

Add a new `air.rank` operation and a supporting `!air.universe` type to the AIR dialect.
`air.rank` is a fourth hierarchy level that sits above `air.launch` and expresses a
communicating world of rank instances, where each instance corresponds to a complete
GPU device or a CPU host process (with its attached devices).

A detailed design document is in `docs/AIRRankOp.md` on the `compute-model-explanation`
branch. This issue captures the implementation work.

---

## Motivation

The existing three-level hierarchy (`air.launch` → `air.segment` → `air.herd`) describes
work that runs on a **single device**. There is no dialect-level primitive for expressing
programs that span multiple GPUs or multiple hosts. Without such a primitive, multi-device
programs must be assembled entirely at the runtime or framework layer, losing the
compiler's ability to reason about inter-device data movement, resource allocation,
and communication scheduling.

`air.rank` fills this gap by extending the hierarchy upward:

```
air.rank           — communicating world of devices or hosts    ← NEW
  air.launch       — co-resident work on one device
    air.segment    — tile rectangle + L2 memory
      air.herd     — array of PEs
```

---

## New Operations and Types

### `air.rank`

```
[%token =] air.rank (%r₀, …, %rₙ) in (%sr₀=%M₀, …, %srₙ=%Mₙ)
           args(%a₀=%v₀, …) : <types>
           [universe = %u]
           [dependency  = [%t₀, …]]
           [affinity    = [%t₀, …]]
           {
  …
  air.rank_terminator
}
```

**Key properties:**

- Defines an N-dimensional iteration space. Each point is a **rank instance**.
- Each rank instance maps to:
  - A distinct **GPU device** when nested inside `air.launch`.
  - A distinct **host process** (with attached devices) when at top level.
- The body is `IsolatedFromAbove`; values are passed via explicit kernel operands.
- The compiler is free to distribute the serial preamble of the body across all
  participating rank instances (SPMD fork semantics).
- **May-be-parallel** semantics (like `air.launch`): a `concurrency` token list is
  not permitted, while `dependency` and `affinity` lists are.
- A `universe` operand of type `!air.universe` constrains the physical pool from
  which rank instances are scheduled.

**Terminator:** `air.rank_terminator` (analogous to `air.launch_terminator`).

### `air.universe` type and `air.universe.alloc`

```
%u = air.universe.alloc(%capacity) : !air.universe
```

`!air.universe` is an opaque dialect type that represents a bounded pool of
concurrently available devices or host processes. Allocating a universe with
`capacity = N` instructs the runtime to guarantee that at least
`min(N, iteration_space_size)` rank instances may run concurrently. The handle
is passed to `air.rank` through its `universe` operand.
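
As a worked illustration of the concurrency bound above, the following minimal sketch computes `min(N, iteration_space_size)` for a given capacity and rank-space shape. The helper names `iterationSpaceSize` and `guaranteedConcurrency` are hypothetical and are not part of the proposed dialect or runtime API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Total number of rank instances: the product of the per-dimension sizes.
// An empty size list is treated as a single (zero-dimensional) instance.
std::uint64_t iterationSpaceSize(const std::vector<std::uint64_t> &sizes) {
  std::uint64_t total = 1;
  for (std::uint64_t s : sizes) {
    assert(s > 0 && "each rank dimension must be non-empty");
    total *= s;
  }
  return total;
}

// Lower bound on the number of rank instances that may run concurrently
// for a universe allocated with the given capacity (hypothetical helper).
std::uint64_t guaranteedConcurrency(std::uint64_t capacity,
                                    const std::vector<std::uint64_t> &sizes) {
  return std::min(capacity, iterationSpaceSize(sizes));
}
```

For a 2×3 rank space (six instances), a capacity of 4 yields a bound of 4, while a capacity of 8 is capped at 6 because there are only six instances to run.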

---

## Implementation Scope

### 1. Dialect changes (`mlir/include/air/Dialect/AIR/`)

- [ ] **`AIROpBase.td`**: Add the `!air.universe` dialect type (here or in a new `AIRTypes.td`).
- [ ] **`AIR.td`**:
  - Add `air_RankOp` following the pattern of `air_LaunchOp`/`air_SegmentOp`:
    - Traits: `air_AsyncOpInterface`, `air_HierarchyInterface`,
      `AttrSizedOperandSegments`, `IsolatedFromAbove`, `AffineScope`,
      `SingleBlockImplicitTerminator<"RankTerminatorOp">`.
    - Arguments: `OptionalAttr<SymbolNameAttr>:$sym_name`,
      `Variadic<air_AsyncToken>:$async_dependencies`,
      `Variadic<Index>:$sizes`,
      `Variadic<AnyType>:$rank_operands`,
      `Optional<air_UniverseType>:$universe`.
    - Result: `Optional<air_AsyncToken>:$async_token`.
  - Add `air_RankTerminatorOp`.
  - Add `air_UniverseAllocOp`.

### 2. C++ op implementation (`mlir/lib/Dialect/AIR/IR/`)

- [ ] Implement `getIds()`, `getSize()`, `getSizeOperands()`,
  `getNumKernelOperands()`, `getKernelOperands()`, `getKernelArguments()`,
  `getNumDims()`, `getId()` for `RankOp` (same pattern as `LaunchOp`).
- [ ] Implement custom assembly format (parser/printer) for `air.rank`.
- [ ] Implement `air.universe.alloc` and the `!air.universe` type.
- [ ] Add canonicalizer for `RankOp` (remove unused kernel operands).

### 3. Verifier rules

- [ ] `air.rank` must not carry a `concurrency` token list.
- [ ] When nested inside `air.launch`, verify that `air.rank` bodies contain
  only `air.launch` ops (not bare `air.segment`/`air.herd`).
- [ ] When at top level, `air.rank` bodies may contain `air.launch` ops.
- [ ] `universe` operand, if present, must be of type `!air.universe`.
- [ ] `air.rank_terminator` must have `air.rank` as its immediate parent.

### 4. Passes / transforms (`mlir/lib/Transform/`)

- [ ] **`air-rank-to-launch`** pass: lower `air.rank` to a loop (for single-process
  targets that simulate multi-device execution serially), threading rank index
  arguments into enclosed `air.launch` ops.
- [ ] Update existing `HierarchyInterface` traversal passes (e.g. loop unrolling,
  operand hoisting, async dependency analysis) to handle `air.rank` as the new
  outermost level.
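
The serial simulation that `air-rank-to-launch` targets can be pictured as a loop nest over the rank iteration space. The sketch below is a plain C++ model of that behaviour, not the pass itself: `forEachRankSerial` and `RankIndex` are hypothetical names, and the row-major visiting order (last dimension fastest) is an assumption, since the issue does not fix an order.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

using RankIndex = std::vector<std::size_t>;

// Runs `body` once per rank instance, one after another, passing the
// N-dimensional rank index. Returns the number of instances visited.
// Visiting order is row-major (an assumption of this sketch).
std::size_t forEachRankSerial(
    const std::vector<std::size_t> &sizes,
    const std::function<void(const RankIndex &)> &body) {
  for (std::size_t s : sizes)
    if (s == 0)
      return 0; // Empty iteration space: nothing to run.

  RankIndex idx(sizes.size(), 0);
  std::size_t visited = 0;
  while (true) {
    body(idx); // Stands in for the enclosed air.launch ops.
    ++visited;

    // Odometer-style increment, starting from the last dimension.
    bool advanced = false;
    std::size_t d = sizes.size();
    while (d-- > 0) {
      if (++idx[d] < sizes[d]) {
        advanced = true;
        break;
      }
      idx[d] = 0; // This digit wrapped; carry into the next one.
    }
    if (!advanced)
      return visited; // Every dimension wrapped: the space is exhausted.
  }
}
```

In the real pass the body would be the cloned region of `air.rank`, with the index values replacing the rank block arguments.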

### 5. Lowering to GPU backend (`mlir/lib/Conversion/`)

- [ ] `air.rank` at top level → one MPI process (or `fork()`-style process group)
  per rank instance; emit a process launch stub.
- [ ] `air.rank` inside `air.launch` → one HIP device context per rank instance;
  use `hipSetDevice(rank_id)` before enclosed `air.launch` lowering.
- [ ] `air.channel` at rank body scope → RCCL collective (if pattern matches) or
  peer-to-peer `hipMemcpyPeer`.
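
`hipSetDevice` takes a single integer device ordinal, so an N-dimensional rank index has to be flattened before the call. The issue does not specify how; the sketch below assumes a row-major flattening, and `linearRankId` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Flattens an N-dimensional rank index into a single ordinal using
// row-major order (last dimension fastest). The ordering is an assumption
// of this sketch, not something the design document is known to mandate.
int linearRankId(const std::vector<std::size_t> &index,
                 const std::vector<std::size_t> &sizes) {
  assert(index.size() == sizes.size() && "index and sizes must have equal rank");
  std::size_t linear = 0;
  for (std::size_t d = 0; d < sizes.size(); ++d) {
    assert(index[d] < sizes[d] && "rank index out of bounds");
    linear = linear * sizes[d] + index[d];
  }
  return static_cast<int>(linear);
}
```

For a 2×3 rank space, index (1, 2) maps to ordinal 5, which a lowering could then pass to `hipSetDevice`.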

### 6. `air.universe` runtime support

- [ ] Add `air_universe_alloc(size_t capacity)` to the AIR runtime API, returning
  an opaque handle.
- [ ] The handle is passed to the `air.rank` execution function and used to
  pre-allocate process/device slots and communicators (`MPI_Comm` / RCCL `ncclComm_t`).
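
One possible shape for the opaque handle is sketched below. Only `air_universe_alloc(size_t capacity)` comes from this issue; the struct layout and the `air_universe_capacity`, `air_universe_acquire`, and `air_universe_free` functions are hypothetical, and a real implementation would also hold communicators and need thread safety.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Callers only ever see a pointer to this struct; the layout is an
// illustrative assumption and would stay private to the runtime.
struct air_universe {
  std::size_t capacity;
  std::vector<bool> slot_in_use; // One entry per process/device slot.
};
using air_universe_t = air_universe *;

// Named in the issue: allocates a universe with `capacity` slots.
air_universe_t air_universe_alloc(std::size_t capacity) {
  assert(capacity > 0 && "a universe needs at least one slot");
  return new air_universe{capacity, std::vector<bool>(capacity, false)};
}

// Hypothetical: reports the cardinality of the universe.
std::size_t air_universe_capacity(air_universe_t u) { return u->capacity; }

// Hypothetical: claims the lowest free slot and returns its index, or
// returns the capacity itself when every slot is already taken.
std::size_t air_universe_acquire(air_universe_t u) {
  for (std::size_t i = 0; i < u->capacity; ++i) {
    if (!u->slot_in_use[i]) {
      u->slot_in_use[i] = true;
      return i;
    }
  }
  return u->capacity;
}

// Hypothetical: releases the universe and all of its slots.
void air_universe_free(air_universe_t u) { delete u; }
```

Keeping the handle cardinality-only matches the non-goal stated below: nothing here selects specific devices or interconnect islands.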

### 7. Tests

- [ ] **FileCheck / lit tests** in `test/dialect/AIR/`:
  - Round-trip parsing/printing of `air.rank` and `air.universe.alloc`.
  - Verifier rejection tests (concurrency list, bad nesting, wrong universe type).
- [ ] **Lowering tests** in `test/Transform/` and `test/Conversion/`:
  - `air-rank-to-launch` serialisation pass.
  - GPU lowering with `hipSetDevice`.
- [ ] **Integration test** (optional, requires multi-GPU): two-rank GEMM splitting
  output rows across two GPUs.

### 8. Documentation

- [ ] `docs/AIRRankOp.md` (already drafted on `compute-model-explanation`).
- [ ] Update `docs/AIRComputeModel.md` §1.1 to show the four-level hierarchy.
- [ ] Update `docs/README.md` index.

---

## Non-Goals (for this issue)

- NPU backend lowering (NPUs are single-device; multi-chiplet support is future work).
- Formalising `L4` memory space in `air::MemorySpace` (reserved for follow-on issue).
- Fine-grained sub-universe device selection (e.g. pinning to a specific interconnect
  island); `air.universe` in this issue is cardinality-only.

---

## References

- Design document: `docs/AIRRankOp.md` (branch `compute-model-explanation`)
- Existing hierarchy op pattern: `AIR.td` `air_LaunchOp`, `air_SegmentOp`
- Existing interface definitions: `AIROpBase.td` `air_HierarchyInterface`
- arXiv:2510.14871 — "From Loop Nests to Silicon: Mapping AI Workloads onto AMD NPUs
  with MLIR-AIR"
