[IRON] Add specialized FIFO subclasses (Cascade / Packet / Accum / Sparse / Memtile / VariableRate)#3039
matteius wants to merge 21 commits into
Reserve five FIFO subclass slots in aie.iron's __init__.py as NotImplementedError-raising stubs. Subsequent commits in this PR replace each stub with a real class:
- CascadeFifo (cascade-stream ObjectFifo subclass)
- PacketFifo (packet-switched / pktMerge / TLAST / OoO BD)
- AccumFifo (FP32 inter-tile accumulator state passing)
- SparseFifo (on-the-fly N:M sparsity decompression on S2MM)
- MemtileAggregator (memtile-mediated fan-in helper)
Adds an explicit __all__ enumerating both the existing primitives and the five new reservation slots so the public surface stays discoverable without runtime side effects.
Signed-off-by: Matt Davis <matt@opensensor.io>
… + LLVM exception) first-class IRON primitive at python/iron/cascade.py.
The new CascadeFifo class mirrors aie.iron.ObjectFifo's constructor surface (producer / consumer endpoints, dtype, name, handshake-size knob) but lowers through the cascade physical channel -- emitting an aie.cascade_flow op via resolve() that the placer pass converts into per-tile aie.configure_cascade ops.
Architectural references:
- AM020 Ch. 4 p. 67: 512-bit cascade stream between adjacent CoreTiles.
- AM020 Appendix A p. 80 Figure 45: vertical+horizontal cascade topology.
- aie.put_cascade / aie.get_cascade: the cascade write/read MLIR ops the C++ kernel emits via put_mcd / get_scd_v16int32 intrinsics inside its core_fn body. CascadeFifo only emits placement + cascade_flow.
- aie.cascade_flow: the declarative connection op the placer lowers.
usable for callers building chain-of-N topologies; behavioural parity with the wrapper is asserted in tests/test_iron_cascade_fifo.py.
the CascadeFifo slot + the matching import line are touched in this PR.
for all subsequent fork PRs:
- Apache-2.0 + LLVM exception headers on every new fork file.
- Signed-off-by trailer per LLVM's DCO model.
- Repo-root THIRD_PARTY_NOTICES.md updated in the outer repo.
Tests:
- test/iron/test_cascade_fifo.py: surface, validation, lowering, and parity with the dialect-level cascade_flow op.
Signed-off-by: Matt Davis <matt@opensensor.io>
…h. 4 p. 67)
first-class IRON dataflow primitive sibling to ObjectFifo and CascadeFifo.
AccumFifo persists 512-bit BM accumulator state across two boundaries:
1. Across timesteps within a tile (BM-to-BM register move; AM020 Ch. 4
p. 67 "Move one 512-bit accumulator register to another in one cycle").
Lowering: no MLIR op emitted; the C++ kernel keeps an aie::accum local
hot across the worker's while(true) iteration boundary.
2. Across tiles via cascade-stream BM transfer (AM020 Ch. 4 p. 67
"Cascade stream connects the AIE-MLs in a chain ... transfer an
accumulator register (512-bit) from one to the next"). Lowering:
aie.cascade_flow(prod_tile, cons_tile) between vertically-adjacent
CoreTiles. Vertical adjacency on AIE2P is the only geometry T7-IRON
verified on silicon; horizontal cascade is documented (AM020 App. A
p. 80 Fig. 45) but un-tested -- a UserWarning is raised for
non-vertical placements rather than a hard reject.
AccumFifo(producer, consumer, dtype="accfloat", lanes=16)
af.prod() -> AccumFifoHandle (acquire / release no-ops; cascade
wire is per-cycle handshaked at the dialect intrinsic
level, intra-tile is register-aliased)
af.cons() -> AccumFifoHandle
Rationale for sibling class vs ObjectFifo flag: ObjectFifo is memref-
typed (DMA-copies memref words). The accumulator register is not a
memref word -- it's a hardware register-file slice (AM020 Ch. 4
p. 65-67). Modeling this on ObjectFifo would either force a fictitious
memref<16xf32> the lowering ignores, or burden every ObjectFifo
consumer with an accumulator-mode invariant. A sibling class with the
same prod/cons surface keeps the abstraction clean.
dtype validation:
- "acc32" (int32 accumulator)
- "acc64" (int64 paired-lane accumulator -- 8 lanes x 64 bits)
- "acc48" is explicitly rejected (AIE1-only; AIE-ML / AIE2P drops it
per AM020 Ch. 4 p. 65)
Lane-count validation: enforces the AM020 Ch. 4 p. 67 cascade-transfer
width of exactly 512 bits/cycle. lanes=16 is the only legal value for
accfloat / acc32; lanes=8 for acc64.
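The lane rule above follows from dividing the 512-bit cascade width by the accumulator element width. A minimal sketch of that check, assuming a hypothetical helper name (`check_accum_shape`) and assumed exception types, not the code in this PR:

```python
# Hypothetical table: element width in bits for each accumulator dtype named above.
ACC_LANE_BITS = {"accfloat": 32, "acc32": 32, "acc64": 64}
CASCADE_BITS = 512  # cascade-transfer width cited from AM020 Ch. 4 p. 67


def check_accum_shape(dtype: str, lanes: int) -> int:
    """Return the total transfer width in bits, or raise if the shape is illegal."""
    if dtype == "acc48":
        # Rejected explicitly: described above as AIE1-only.
        raise ValueError("acc48 is AIE1-only and not accepted")
    if dtype not in ACC_LANE_BITS:
        raise ValueError(f"unknown accumulator dtype: {dtype!r}")
    expected = CASCADE_BITS // ACC_LANE_BITS[dtype]
    if lanes != expected:
        raise ValueError(f"{dtype} needs lanes={expected}, got lanes={lanes}")
    return lanes * ACC_LANE_BITS[dtype]


print(check_accum_shape("accfloat", 16))  # 16 lanes x 32 bits
print(check_accum_shape("acc64", 8))      # 8 lanes x 64 bits
```

Both legal shapes come out to the same 512-bit total, which is why only one lane count per dtype is accepted.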
AccumFifoHandle subclasses ObjectFifoHandle so existing
isinstance(arg, ObjectFifoHandle) dispatch in Worker.fn_args accepts
fn_args dispatch, the inheritance is no longer load-bearing for that
purpose, but documents AccumFifo as a fifo-shaped abstraction.
Tests at test/iron/test_accum_fifo.py cover three layers:
- Surface: API shape, dtype/lane validation, intra-tile vs inter-tile
detection, error messages, isinstance compatibility.
- Lowering: intra-tile emits no cascade_flow op; inter-tile emits one.
- Precision: synthetic LSTM cell (96 hidden, 200 timesteps, matching
invariant matches FP32 PyTorch reference within 1e-5 max-abs, vs the
Reservation slot in python/iron/__init__.py is replaced
slots (CascadeFifo, PacketFifo, SparseFifo, MemtileAggregator)
Signed-off-by: Matt Davis <matt@opensensor.io>
… (foundation for PacketFifo) fn_args resolution as the single biggest blocker for promoting ObjectFifo subclasses (PacketFifo, CascadeFifo, AccumFifo, SparseFifo) to first-class IRON primitives. The original implementation hard-coded the type-dispatch chain to recognize only ObjectFifoHandle, so any new FifoHandle subclass would have to fork worker.py.
This commit refactors that dispatch into a registry pattern:
* python/iron/dataflow/fifo_handle_registry.py -- new module exposing register_fifo_handle (decorator + function-call forms), unregister_fifo_handle, get_registered_handle_classes, and dispatch_fn_arg. Reverse-insertion order ensures more-specific subclasses (registered later) win the isinstance() walk.
* python/iron/dataflow/__init__.py -- pre-registers ObjectFifoHandle with a handler that reproduces the original Worker.__init__ bookkeeping bit-for-bit (sets arg.endpoint = worker; appends to worker._fifos). Backward-compat anchor: every Phase 1 design that passes ObjectFifoHandle through fn_args still works without modification.
* python/iron/worker.py -- replaces the hard-coded isinstance(arg, ObjectFifoHandle) branch with a dispatch_fn_arg(...) call. Buffer / ObjectFifo / WorkerRuntimeBarrier branches unchanged.
* test/iron/test_worker_fifo_handle_extension.py -- 14 tests covering pre-registration, regression guard for ObjectFifoHandle, custom subclass dispatch, runtime registration, reverse-order precedence, decorator + function-call forms, snapshot context manager, error handling, and public-surface stability.
without further changes to worker.py.
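The reverse-insertion-order walk can be illustrated with a toy registry. This is a sketch only: the function names echo the commit message, but the `(worker, arg)` handler signature, the storage, and the toy classes (`BaseHandle`, `PacketHandle`, `ToyWorker`) are assumptions rather than the module's actual code.

```python
from typing import Any, Callable, Dict

# Toy stand-in for the registry; a dict keeps insertion order in Python 3.7+.
_handlers: Dict[type, Callable[[Any, Any], None]] = {}


def register_fifo_handle(cls: type, handler: Callable[[Any, Any], None]) -> None:
    """Function-call form only; the (worker, arg) signature is assumed."""
    _handlers[cls] = handler


def dispatch_fn_arg(worker: Any, arg: Any) -> bool:
    """Walk registrations newest-first so later (more specific) classes win."""
    for cls in reversed(list(_handlers)):
        if isinstance(arg, cls):
            _handlers[cls](worker, arg)
            return True
    return False  # not a registered handle; the caller tries its other branches


class BaseHandle:  # hypothetical stand-in for a base handle class
    pass


class PacketHandle(BaseHandle):  # hypothetical, more specific subclass
    pass


class ToyWorker:
    def __init__(self) -> None:
        self.log = []


register_fifo_handle(BaseHandle, lambda w, a: w.log.append("base"))
register_fifo_handle(PacketHandle, lambda w, a: w.log.append("packet"))

worker = ToyWorker()
dispatch_fn_arg(worker, PacketHandle())  # matches both classes; subclass wins
dispatch_fn_arg(worker, BaseHandle())    # matches only the base class
print(worker.log)
```

A `PacketHandle` instance passes both `isinstance()` checks, so walking newest-first is what routes it to the subclass handler.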
Promote the three AM020-documented variable-rate hardware primitives
- pktMerge N:1 header-based routing (AM020 Ch. 2 Figure 17)
- S2MM finish-on-TLAST stream end (AM020 Ch. 2 p. 27)
- Out-of-order BD processing (AM020 Ch. 5 p. 74)
PacketFifo mirrors ObjectFifo's prod() / cons() user-facing surface
but lowers to aie.packetflow ops with per-packet header-based routing
through the AXI stream switch fabric -- a different runtime mechanism
from ObjectFifo's shared-memory + lock model. A sibling class (rather
than an ObjectFifo flag) keeps the abstraction clean and lets the
lowering emit packetflow ops directly.
API:
PacketFifo(producers, consumers,
header_dtype="uint8",
merge_strategy="round-robin"|"priority",
packet_ids=..., # auto-assigned if omitted
keep_pkt_header=True, # False -> finish-on-TLAST
obj_type=..., depth=2)
PacketFifoHandle subclasses ObjectFifoHandle for surface compatibility,
time so Worker.fn_args dispatch recognizes it without modifying
worker.py. The reverse-insertion-order walk in dispatch_fn_arg picks
PacketFifoHandle over ObjectFifoHandle when both isinstance() checks
match -- exactly the property the registry was designed for.
gap entries recorded in
Test coverage in test/iron/test_packet_fifo.py:
- Surface tests: API shape, dtype/strategy/packet_id validation, error
messages (16 tests)
- Handle surface: producer/consumer construction, idempotency,
send_with_header / recv_header asymmetry, ObjectFifoHandle subclass
invariant (8 tests)
- Registry integration: PacketFifoHandle registered after import,
dispatch_fn_arg recognizes it, Worker.fn_args records it on _fifos,
reverse-insertion-order walk picks subclass over base (4 tests)
- Behavioral toy: 3-producer-1-consumer round-robin merge yields the
union of inputs without drops; per-producer ordering preserved;
finish-on-TLAST flag plumbed through; priority strategy + N:M
construction validated (5 tests)
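The properties the behavioural toy checks (no drops, per-producer ordering preserved) can be reproduced with a small host-side model. A minimal sketch, assuming a hypothetical `round_robin_merge` helper; it models arbitration order only and says nothing about the hardware pktMerge timing:

```python
from collections import deque


def round_robin_merge(streams):
    """Merge N producer streams into one, visiting producers in turn.

    Returns (producer_id, item) pairs so ordering can be checked per producer.
    """
    queues = [deque(s) for s in streams]
    merged = []
    while any(queues):  # an empty deque is falsy
        for pid, q in enumerate(queues):
            if q:
                merged.append((pid, q.popleft()))
    return merged


streams = [[1, 2, 3], [10], [20, 21]]
merged = round_robin_merge(streams)

# Per-producer view of the merged stream, used to check ordering.
per_producer = [[item for pid, item in merged if pid == p] for p in range(len(streams))]
print(merged)
```

Splitting the merged stream back out by producer id recovers each input stream unchanged, which is the "union without drops" and "ordering preserved" pair in one check.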
Refs:
- python/iron/dataflow/fifo_handle_registry.py
Signed-off-by: Matt Davis <matt@opensensor.io>
…AM020 Ch. 5 p. 74)
vs the 2-into-1 fallback) to a first-class IRON helper. Encapsulates
the canonical AIE-ML / AIE2P memtile-mediated fan-in pattern documented
by AM020 Ch. 5 p. 74 (memtile S2MM channels 0..3 with east/west
neighbour access) + Figures 22+23 + the "Dataflow Mapping 1/2/3"
diagrams.
API:
MemtileAggregator(n_producers, producer_obj_type, joined_obj_type,
layout="slab", depth=2, tile=AnyMemTile, name=...)
.producer(i) / .producers() -> ObjectFifoHandle (per-tile producer)
.consumer(depth=...) -> ObjectFifoHandle (joined consumer)
.offsets / .sub_fifos / .joined_fifo (introspection)
Validates the flat-concat invariant at construction time and surfaces
discovery) in both the class docstring and a clear NotImplementedError
that fires for the layout="window" reservation slot until Phase 3
extends the helper with explicit dims_to_stream inference.
MemtileAggregator slot is touched; CascadeFifo / PacketFifo /
AccumFifo / SparseFifo slots are left untouched for their owning
tasks).
Tests: test/iron/test_memtile_aggregator.py covers construction
validation, the flat-concat invariant, per-producer / consumer
handle accessors, layout vocabulary enforcement, the memtile DM
budget check, and byte-equality with Phase 1's hand-rolled
join_offsets=[0, 2048, 4096, 6144] + obj_types triple. Pure-Python
(no MLIR context required); runs after the fork wheel rebuild.
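The join_offsets value quoted above is consistent with four equally sized producers laid end to end. A sketch of that arithmetic, assuming the flat-concat invariant means "producer sizes sum to the joined size" and using a hypothetical helper name; the unit of the offsets is whatever unit the sizes are given in:

```python
from itertools import accumulate


def flat_concat_offsets(producer_sizes, joined_size):
    """Offsets at which each producer's slab starts inside the joined object."""
    if sum(producer_sizes) != joined_size:
        # Assumed form of the flat-concat invariant.
        raise ValueError("producer sizes must sum to the joined size")
    # Running totals, shifted by one so the first slab starts at 0.
    return [0, *accumulate(producer_sizes)][:-1]


offsets = flat_concat_offsets([2048, 2048, 2048, 2048], 8192)
print(offsets)
```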
Refs: AM020 Ch. 5 p. 74 (memtile DMA channel layout), Ch. 5 p. 71
(5D address generation), Table 14 (memtile DM = 512 KiB on AIE-ML,
Phase-2 follow-up.
Signed-off-by: Matt Davis <matt@opensensor.io>
Promotes AIE-ML / AIE2P compute-tile S2MM decompression + MM2S compression
hardware (AM020 Ch. 1 p. 15 + Ch. 2 p. 27 + Ch. 5 p. 74) to a first-class
IRON dataflow primitive at python/iron/sparse.py.
The new SparseFifo class subclasses ObjectFifo (composes-by-subclassing —
inherits storage / depth / dimsToStream / dimsFromStream / pad+repeat+iter
machinery) and adds N:M structured-sparsity kwargs (sparsity_pattern, N, M,
allow_unverified). Producer-side sees compressed data, consumer-side sees
dense data; the on-tile decompressor re-injects zeros at the position-map
gaps before the data lands in tile DM. SparseFifoHandle subclasses
ObjectFifoHandle so Worker.fn_args' isinstance(arg, ObjectFifoHandle) check
SparseFifoHandle handler over the parent ObjectFifoHandle handler in the
Lowering model
--------------
SparseFifo.resolve() calls the standard ObjectFifo lowering then attaches
five discardable attributes to the lowered ObjectFifoCreateOp:
aie.compress_mm2s (BoolAttr) — flips Enable_Compression on producer BD
aie.decompress_s2mm (BoolAttr) — flips Enable_Compression on consumer BD
aie.sparsity_pattern (StringAttr "N:M")
aie.sparsity_n (i32)
aie.sparsity_m (i32)
The BD-emit pass keys off these to flip the per-channel
Enable_Compression bit (lib/Dialect/AIE/Util/aie_registers_aie2.json
documents this as "Enable Compression (MM2S), decompression (S2MM).
Only effective if channel has (de)compression enabled"). If the active
backend hasn't been taught about these attributes (early AIE2P
silicon-driver stack), the design still compiles and runs as a vanilla
ObjectFifo — degraded mode is observable via the runtime DMA-volume
Pattern validation
------------------
AM020-verified set {(1,2), (1,4), (2,4)} accepted by default (Ch. 1 p. 15
cites these for "CNN and RNN application" with RNN explicitly named —
hatch). Structural rules enforced eagerly at construction time:
M >= 2; 0 < N < M; N and M are int.
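The structural rules and the verified-set check can be sketched as one eager validator. The helper name, the exception types, and the ordering of checks are assumptions; only the rules themselves and the three verified patterns come from the text above.

```python
# Patterns listed above as AM020-verified.
VERIFIED_NM = {(1, 2), (1, 4), (2, 4)}


def validate_nm(n, m, allow_unverified=False):
    """Raise on an illegal N:M pattern; return (n, m) when accepted."""
    if not isinstance(n, int) or not isinstance(m, int):
        raise TypeError("N and M must be int")
    if m < 2:
        raise ValueError("M must be >= 2")
    if not 0 < n < m:
        raise ValueError("N must satisfy 0 < N < M")
    if (n, m) not in VERIFIED_NM and not allow_unverified:
        raise ValueError(f"{n}:{m} is not in the verified set")
    return n, m


print(validate_nm(2, 4))
print(validate_nm(3, 8, allow_unverified=True))
```

Structural rules still apply when `allow_unverified=True`; the flag only widens the set of accepted well-formed patterns.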
Registry hook-up
----------------
register_fifo_handle(SparseFifoHandle, _sparse_fifo_handle_handler) runs
at module import time; handler mirrors the pre-registered
ObjectFifoHandle bookkeeping (arg.endpoint = worker;
forward-looking no-op (Worker.__init__ falls back to the hard-coded
isinstance(arg, ObjectFifoHandle) branch which still accepts
the SparseFifoHandle handler over the ObjectFifoHandle handler.
Tests
-----
23 in-fork tests at test/iron/test_sparse_fifo.py covering surface (real
impl not stub; SparseFifoHandle subclassing; module constants),
validation (rejects unsupported pattern tag, M<2, N>=M, N=0, unverified
pattern by default; allow_unverified accepts unverified; non-int N/M;
non-Tile producer), pattern correctness (each group of M has exactly
M-N zeros after pruning, parametrized for 1:2 / 1:4 / 2:4; decompressed
matmul bit-equal to sparse reference for 1:2 + 2:4), lowering (resolve()
emits aie.objectfifo with all 5 sparsity attrs; idempotent on
double-resolve), registry (SparseFifoHandle registered; dispatch_fn_arg
matches), diagnostics (handle exposes N/M/compression_ratio/
sparsity_pattern/sparse_fifo properties; __str__ includes pattern + N/M;
module __all__ stable).
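The pattern-correctness properties can be reproduced with a pure-Python reference. This is a sketch under stated assumptions: magnitude-based pruning and the value-plus-position encoding are illustrative choices, not the hardware's wire format or the test file's actual helpers.

```python
def prune_nm(values, n, m):
    """Keep the n largest-magnitude entries in each group of m; zero the rest."""
    out = []
    for start in range(0, len(values), m):
        group = values[start:start + m]
        keep = sorted(range(len(group)), key=lambda i: abs(group[i]), reverse=True)[:n]
        out.extend(v if i in keep else 0 for i, v in enumerate(group))
    return out


def compress_nm(dense, n, m):
    """Store n (value, position) pairs per group of m; input must be N:M compliant."""
    vals, idx = [], []
    for start in range(0, len(dense), m):
        group = dense[start:start + m]
        nz = [i for i, v in enumerate(group) if v != 0]
        if len(nz) > n:
            raise ValueError("group is not N:M compliant")
        # Pad with zero-valued positions so every group stores exactly n entries.
        pad = [i for i in range(len(group)) if i not in nz][: n - len(nz)]
        for i in sorted(nz + pad):
            idx.append(i)
            vals.append(group[i])
    return vals, idx


def decompress_nm(vals, idx, n, m):
    """Re-insert zeros at the positions that were not stored."""
    dense = [0] * ((len(vals) // n) * m)
    for k, (v, i) in enumerate(zip(vals, idx)):
        dense[(k // n) * m + i] = v
    return dense


weights = [3, -1, 1, 7, 2, 2, -9, 1]
pruned = prune_nm(weights, 2, 4)
vals, idx = compress_nm(pruned, 2, 4)
restored = decompress_nm(vals, idx, 2, 4)
print(pruned, vals, idx)
```

With nonzero inputs each group of four ends up with exactly two zeros, the compressed payload holds half as many values as the dense vector, and decompression returns the pruned vector unchanged.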
Architectural references
------------------------
- AM020 Ch. 1 p. 15: AIE-ML supports structured sparsity for "CNN and
RNN application" (RNN explicitly named).
- AM020 Ch. 2 p. 27: "Adds decompression to the two S2MM channels" +
"Adds compression to the two MM2S channels".
- AM020 Ch. 5 p. 74: memtile compression / decompression.
- lib/Dialect/AIE/Util/aie_registers_aie2.json: BD field
Enable_Compression bit ("Only effective if channel has (de)compression
enabled").
AIE2P caveat
------------
AM020 documents AIE-ML's supported sparsity patterns. AIE2P inherits the
compute-tile DMA Enable_Compression bit but the accepted N:M patterns
documents the divergence and falls back to dense weights. SparseFifo
itself remains usable on AIE-ML targets even if AIE2P diverges; the
divergence is silicon/runtime, not API.
-------------------------
Only the SparseFifo slot region of python/iron/__init__.py is touched
(CascadeFifo / PacketFifo / AccumFifo / MemtileAggregator stubs untouched
per the parallel-agent serialization rule). Atomic-heredoc-commit
Signed-off-by: Matt Davis <matt@opensensor.io>
PacketFifoHandle.all_of_endpoints() now returns endpoint-typed objects (was: raw Tile objects). AccumFifoHandle.__init__ now chains through super().__init__() (was: bypassed, leaving _object_fifo unset).
Fix shape (Option A: normalize the subclass contract; the alternative -- relaxing iron/program.py's walk -- moves the divergence into the consumer and is easier to regress). Both subclasses now satisfy the same contract that ObjectFifoHandle's all_of_endpoints() exposes: a list of objects each carrying a .tile attribute, which is what iron/program.py:81's tile-collection walk ([e.tile for e in fifo.all_of_endpoints()]) requires.
Implementation note for AccumFifoHandle: rather than chain super().__init__() (which would require a memref-typed obj_type and ObjectFifo depth/dims that AccumFifo intentionally does not have -- a cascade transfer is one accumulator per cycle, no circular buffer), this commit overrides all_of_endpoints() on the subclass to walk the AccumFifo's prod/cons handles directly. Same end-state (callers see endpoint-typed objects with a .tile attribute), narrower change.
PacketFifoHandle: all_of_endpoints() now walks the parent PacketFifo's prod_handles + cons_handles, surfacing each handle's .endpoint (the live Worker instance attached by Worker.fn_args registry dispatch) when set, falling back to ObjectFifoEndpoint(tile) wrappers when the topology is inspected pre-Worker construction.
Coverage: new fork-internal test file test/iron/test_fifo_handle_program_walk.py (8 tests) exercises both broken paths -- the literal expression at iron/program.py:81 -- one set per handle type, plus a cross-cutting roll-up that walks the live registry.
Existing test verdict: 58/59 pass in test/iron/test_packet_fifo.py + test/iron/test_accum_fifo.py. The one failing test (test_packet_fifo_accepts_explicit_packet_ids) fails on main as well -- it constructs PacketFifo with packet_ids=[0x10, 0x20] but 0x20=32 is rejected by the 5-bit pkt_id range cap [0, 31]. Pre-existing, unrelated to this fix.
Closes:
Signed-off-by: Matt Davis <matteius@gmail.com>
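The failing-test explanation comes down to a range check on a 5-bit field. A minimal sketch, assuming a hypothetical `check_packet_ids` helper and an assumed exception type:

```python
PKT_ID_BITS = 5  # width of the pkt_id field quoted above


def check_packet_ids(packet_ids, bits=PKT_ID_BITS):
    """Raise if any id falls outside [0, 2**bits - 1]; otherwise return the list."""
    limit = (1 << bits) - 1  # 31 for a 5-bit field
    for pid in packet_ids:
        if not 0 <= pid <= limit:
            raise ValueError(f"packet id {pid:#x} outside [0, {limit}]")
    return list(packet_ids)


print(check_packet_ids([0x10, 0x1F]))  # both fit in 5 bits
```

`0x10` (16) fits, while `0x20` (32) is one past the largest 5-bit value, so a test constructing `packet_ids=[0x10, 0x20]` trips the check regardless of the endpoint fix.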
Adds VariableRateFifo + VariableRateFifoHandle, a primitive for
streams where the producer chooses per object whether to forward
or discard the buffer. Sibling to PacketFifo: PacketFifo handles
the N-into-1 fan-in side of variable-rate dataflow, VariableRateFifo
handles the single-producer conditional-forward side.
API:
fifo = VariableRateFifo(producer_tile, consumer_tile, depth, dtype)
with worker.runtime() as r:
out = fifo.acquire_producer()
if predicate(out):
fifo.forward_producer() # publishes the slot
else:
fifo.discard_producer() # frees the slot without publishing
Files:
- python/iron/variable_rate.py: VariableRateFifo + handle classes.
- python/iron/VARIABLE_RATE_DESIGN.md: design notes covering the
ObjectFifo extension, the producer/consumer state machine, and
the parallel to PacketFifo.
- python/iron/__init__.py: register VariableRateFifo in the public
surface alongside the other specialized FIFO subclasses.
- programming_examples/basic/variable_rate_filter/: minimal end-to-end
example showing a filter that forwards only objects whose first
byte is even (drops ~50% of input), with a CPU reference and
byte-equality check against the kernel output.
Note: the MLIR-side lowering passes that observe discardable attrs
(skip-unroll for variable-rate fifos, attr propagation through
split-fifo) are intentionally not in this PR — they will land
separately once the Python API surface is reviewed.
Signed-off-by: Matt Davis <matt@opensensor.io>
``aie.compress_mm2s`` / ``aie.decompress_s2mm`` discardable attrs to
the lowered ``aie.objectfifo.create`` op (per
``python/iron/sparse.py``), but the BD-emit pass at
``lib/Dialect/AIEX/Transforms/AIEDmaToNpu.cpp:655`` hardcoded
``Enable_Compression = 0`` on the AIE2/AIE2P tile DMA BD word. As a
(accuracy contract MET offline) but the wire-level DMA stayed at the
54.8 MB dense baseline rather than the ~27 MB 2:4 N:M-compressed
target.
Plumbing path -- three-hop discardable-attr propagation, no TableGen
schema change:
1. ``AIEObjectFifoStatefulTransform.cpp::createBd[Block]``: when
lowering ``aie.objectfifo.create`` to per-tile DMABDOps, read the
SparseFifo discardable attrs from the originating
ObjectFifoCreateOp (which is about to be erased at the end of
this pass). If ``aie.compress_mm2s = true`` and the BD's channel
direction is MM2S, OR ``aie.decompress_s2mm = true`` and the
direction is S2MM, attach the boolean ``aie.enable_compression =
true`` on the new DMABDOp. ``createBd`` now returns the DMABDOp
so the caller can decorate it.
2. ``AIEDMATasksToNPU.cpp::rewriteSingleBD``: when converting a
DMABDOp inside a ``aiex.dma_configure_task`` body to an
``aiex.npu.writebd`` op, copy the ``aie.enable_compression``
discardable attr from the source DMABDOp onto the new
NpuWriteBdOp.
3. ``AIEDmaToNpu.cpp::WriteBdToBlockWritePattern`` (the line-655
callsite the gap report named): when packing the
``aiex.npu.writebd`` op into the AIE2 tile-DMA 6-word block-write
payload, read ``aie.enable_compression`` off the NpuWriteBdOp and
OR bit 31 (``Enable_Compression``) of ``DMA_BDX_1`` accordingly.
Default branch is the pre-existing behaviour: no attrs on the
ObjectFifoCreateOp -> no attr on the DMABDOp -> no attr on the
NpuWriteBdOp -> ``Enable_Compression = 0``. The negative case is
regression-protected by the new lit test.
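The decision made at the first hop can be written as a small predicate. This is a Python model of the rule described in step 1, with a hypothetical function name and a plain dict standing in for the op's discardable attributes; the real logic lives in the C++ pass.

```python
def bd_enable_compression(fifo_attrs: dict, direction: str) -> bool:
    """Decide whether a BD for the given channel direction gets the marker."""
    if direction == "MM2S":
        return bool(fifo_attrs.get("aie.compress_mm2s", False))
    if direction == "S2MM":
        return bool(fifo_attrs.get("aie.decompress_s2mm", False))
    raise ValueError(f"unknown channel direction: {direction!r}")


sparse_attrs = {"aie.compress_mm2s": True, "aie.decompress_s2mm": True}
print(bd_enable_compression(sparse_attrs, "MM2S"))
print(bd_enable_compression({}, "S2MM"))  # default branch: no attrs, no marker
```

An ObjectFifoCreateOp without the sparse attributes yields `False` for both directions, which corresponds to the pre-existing `Enable_Compression = 0` behaviour.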
AM020 references (cited in ``python/iron/sparse.py`` module
docstring):
- AM020 Ch. 2 p. 27: compute-tile DMA adds compression to two MM2S
channels and decompression to two S2MM channels.
- AM020 Ch. 1 p. 15: AIE-ML supported N:M structured-sparsity
patterns (1:2, 1:4, 2:4) for CNN/RNN application; AIE2P caveat
documented in ``aie.iron.sparse``.
- ``lib/Dialect/AIE/Util/aie_registers_aie2.json`` BD field
``Enable_Compression``: "Enable Compression (MM2S),
decompression (S2MM). Only effective if channel has
(de)compression enabled".
This change leaves the AM020-verified-pattern check (and the
``allow_unverified=True`` escape hatch) at the IRON Python layer
asked for compression and the lowering produced the discardable
attrs, the bit gets flipped. Pattern-arch validation stays out of
the C++ pass.
New lit test ``test/Conversion/DmaToNpu/dma_to_npu_sparse_compression.mlir``:
- Positive case: ``aiex.npu.writebd { ..., aie.enable_compression =
true }`` -> ``DMA_BDX_1`` second word is ``0xD53D0000`` =
``3577544704`` (bit 31 set on top of the dense
``0x553D0000``/``1430061056`` reference shared with
``dma_to_npu_core_tile.mlir``).
- Negative case (regression-protect): no ``aie.enable_compression``
attr -> ``DMA_BDX_1`` stays at ``1430061056``.
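The two constants in the lit test differ only in bit 31. A quick arithmetic check, using a hypothetical `pack_bd_word1` helper that models just the OR step (the remaining BD fields are taken as given in the dense reference word):

```python
ENABLE_COMPRESSION_BIT = 1 << 31  # bit 31 of the BD word, per the description above
DENSE_WORD = 0x553D0000           # dense reference value from the lit test


def pack_bd_word1(base_word: int, enable_compression: bool) -> int:
    """OR in the Enable_Compression bit when requested; otherwise pass through."""
    return base_word | ENABLE_COMPRESSION_BIT if enable_compression else base_word


print(hex(pack_bd_word1(DENSE_WORD, True)), pack_bd_word1(DENSE_WORD, True))
print(hex(pack_bd_word1(DENSE_WORD, False)), pack_bd_word1(DENSE_WORD, False))
```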
Build verification: each touched ``.cpp`` compiles cleanly under the
existing Ninja rules (tested
``obj.AIETransforms.dir/AIEObjectFifoStatefulTransform.cpp.o``,
``obj.AIEXTransforms.dir/AIEDMATasksToNPU.cpp.o``, and
``obj.AIEXTransforms.dir/AIEDmaToNpu.cpp.o`` targets); the static
libraries ``libAIETransforms.a`` and ``libAIEXTransforms.a`` link
successfully. The final ``aie-opt`` link fails locally on a
host-config issue (lld absent; pre-existing CMake configuration uses
``-fuse-ld=lld``) -- handing off to the orchestrator's wheel-rebuild
Signed-off-by: Matt Davis <matteius@gmail.com>
…r propagation
Two changes mirroring the SparseFifo discardable-attr propagation:
1. unrollForLoops: skip ObjectFifoAcquireOp on creators carrying aie.variable_rate. The producer's loop body for a variable-rate fifo contains a conditional acquire/release that the LCM-unroll math cannot model; the runtime-counter machinery handles asymmetric rates correctly without unrolling. If the loop has only variable-rate accesses, the loop is left alone (treated like a loop with no objectfifo accesses).
2. Split-fifo attr propagation: copy aie.variable_rate from the original ObjectFifoCreateOp to the consumer-side fifo so both halves carry the marker.
Adds two lit tests:
- variable_rate_fifo_attr_propagation.mlir
- variable_rate_fifo_skip_unroll.mlir
- MemtileAggregator: drop the layout='window' parameter that was
reserved-but-unimplemented (raised NotImplementedError on use).
Slab is the only supported layout; window-major callers should drop
down to ObjectFifo.prod().join with explicit dims_to_stream. Removes
layout from __init__ signature, _VALID_LAYOUTS constant, validation,
the .layout property, the docstring discussion, the __str__, and the
two associated tests.
- AccumFifo precision tests: rename to reflect what they actually
measure (numpy reference baselines for the LSTM workload, not tests
of the AccumFifo lowering). Drop the 'load-bearing falsifiable claim'
framing from docstrings.
test_accum_fifo_invariant_hits_1e_minus_5_precision_target
-> test_fp32_lstm_reference_matches_pytorch
(a fixture sanity check; FP32 numpy LSTM matches FP32 pytorch)
test_accum_fifo_beats_bf16_baseline_by_three_orders_of_magnitude
-> test_fp32_reference_beats_bf16_writeback_by_3oom
(workload sensitivity characterization, not AccumFifo behaviour)
Helper renamed: _lstm_cell_with_accum_fifo -> _lstm_cell_fp32_reference
- SparseFifo round-trip test: rename + reframe.
test_decompressed_matmul_bit_equal_to_sparse_reference
-> test_nm_compression_roundtrip_is_lossless_on_compliant_input
Comment block above clarifies these are property tests for the N:M
compression format itself (numpy reference), not tests of SparseFifo's
lowering or silicon behaviour.
Address review issue 8: the prior example always called acquire/release on the producer side and never invoked discard(1), so it didn't actually demonstrate variable-rate behaviour. Restructure the example kernel to a Python-level deterministic alternating-skip pattern: every other window is forwarded via the C++ copy kernel + acquire/release; the alternate window is dropped via out_handle.discard(1). The producer's loop now has asymmetric acquire/release counts on the variable-rate output fifo, which is what the aie.variable_rate=true marker tells the lowering pass to expect. The C++ kernel is simplified to a void window-copy (skip decision moved to the Python layer). Module comment is rewritten to describe the deterministic-skip pattern accurately, with a note that predicate-decided runtime skips would require a first-class scf.if lowering not currently in IRON Python.
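A CPU reference for the deterministic alternating-skip pattern can be sketched in a few lines. Which window is forwarded first is not stated above, so the `forward_even` default is an assumption, as is the helper name:

```python
def alternating_skip_reference(data: bytes, window: int, forward_even: bool = True) -> bytes:
    """Return the bytes a consumer would see if every other window is discarded."""
    windows = [data[i:i + window] for i in range(0, len(data), window)]
    start = 0 if forward_even else 1
    return b"".join(windows[start::2])  # keep every second window


data = bytes(range(8))
print(list(alternating_skip_reference(data, window=2)))
print(list(alternating_skip_reference(data, window=2, forward_even=False)))
```

The output is half the input length when the window count is even, which matches the asymmetric acquire/release counts the producer loop is described as having.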
…Snapshot export)
- register_fifo_handle: re-registering the *same* callable for the same class is now an idempotent no-op. This makes module reloads, repeated imports, and test harnesses safe. Registering a *different* handler still raises ValueError to catch accidental override.
- _RegistrySnapshot: removed from __all__. The leading underscore already signals 'test-internal use only'; the class is still accessible via direct import for tests that need it.
- Updated test_double_registration_raises -> split into test_double_registration_with_different_handler_raises + test_double_registration_with_same_handler_is_idempotent.
…FifoHandle
Address review issue 7: AccumFifoHandle and PacketFifoHandle deliberately do not call super().__init__() because ObjectFifoHandle's constructor requires an ObjectFifo with semantics (depth, dims_from_stream_per_cons, _get_endpoint) that AccumFifo and PacketFifo do not have. They instead stub the attributes ObjectFifoHandle exposes as properties directly.
Add an explicit class-docstring '.. note::' section in each documenting:
- which attributes are stubbed
- why super() is bypassed
- that all_of_endpoints (which traverses _object_fifo) is overridden
- that the proper long-term fix is a shared narrower base class
This is a documentation fix, not a behavioural change. The shared base class refactor is intentionally out of scope for the initial primitive landing; the bypass pattern is rare in upstream IRON code, and explicit documentation keeps the next maintainer from being surprised. (SparseFifoHandle and VariableRateFifoHandle inherit normally from ObjectFifoHandle and do call super().__init__; only AccumFifoHandle and PacketFifoHandle exhibit the bypass pattern.)
…e context
Address review issue 5. The previous test suite covered the construction surface, registry integration, and host-side behavioural simulation, but didn't exercise PacketFifo.resolve() against a real MLIR context -- so a mismatch with the dialect's packetflow() signature would not have been caught.
Add two new tests in test/iron/test_packet_fifo.py:
- test_resolve_emits_packetflow_op_in_module: builds a 2-producer / 1-consumer PacketFifo inside an aie.device body, calls resolve(), and asserts the resulting MLIR contains 2 aie.packetflow ops (one per producer).
- test_resolve_idempotent_does_not_emit_twice: verifies the resolve() reentrancy guard.
Pattern mirrors the existing test_resolve_emits_cascade_flow_op_in_module in test_cascade_fifo.py. Verified the call-site signature matches dialects.aie.packetflow's __init__ (pkt_id, source, source_port, source_channel, dests, keep_pkt_header).
707a7df to e036b85
Pull request overview
This PR extends the IRON Python API with multiple specialized FIFO abstractions (e.g., cascade, packet, sparse, variable-rate, memtile aggregation) and wires them through MLIR lowering/passes so they work end-to-end, while keeping the existing ObjectFifo API unchanged. It also introduces an extensible Worker.fn_args dispatch registry so new *FifoHandle subclasses can integrate without editing worker.py.
Changes:
- Add fifo_handle_registry and update Worker.__init__ to use registry-driven fn_args dispatch (with ObjectFifoHandle pre-registered for backward compatibility).
- Add new IRON primitives (notably SparseFifo, VariableRateFifo, CascadeFifo, MemtileAggregator) and corresponding tests/examples.
- Extend pass-side plumbing for sparse compression (aie.enable_compression) and variable-rate unroll skipping / split-fifo attr propagation.
Reviewed changes
Copilot reviewed 29 out of 29 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| test/objectFifo-stateful-transform/variable_rate_fifo_skip_unroll.mlir | Lit test for excluding variable-rate fifos from LCM-based loop unrolling. |
| test/objectFifo-stateful-transform/variable_rate_fifo_attr_propagation.mlir | Lit test for split-fifo propagation with aie.variable_rate. |
| test/objectFifo-stateful-transform/sparse_fifo_split_attr_propagation.mlir | Lit test for split-fifo propagation of sparse compression attrs. |
| test/iron/test_worker_fifo_handle_extension.py | Pytest coverage for the new registry dispatch API (idempotency, precedence, snapshotting). |
| test/iron/test_sparse_fifo.py | Pytest coverage for SparseFifo surface + validation + lowering metadata behavior. |
| test/iron/test_packet_fifo.py | Pytest coverage for PacketFifo surface/registry/toy behavior + lowering smoke tests. |
| test/iron/test_memtile_aggregator.py | Pytest coverage for MemtileAggregator construction and equivalence checks (currently has syntax/API issues). |
| test/iron/test_fifo_handle_program_walk.py | Contract tests for *FifoHandle.all_of_endpoints() to support Program walks. |
| test/iron/test_cascade_fifo.py | Pytest coverage for CascadeFifo surface + lowering (aie.cascade_flow) + idempotency. |
| test/iron/test_accum_fifo.py | Pytest coverage for AccumFifo surface + warnings + lowering + reference fixtures. |
| test/Conversion/DmaToNpu/dma_to_npu_sparse_compression.mlir | Lit test pinning the final hop of sparse compression bit emission in NPU BD words. |
| python/iron/worker.py | Switch fn_args handling to registry-driven dispatch via dispatch_fn_arg. |
| python/iron/variable_rate.py | Implement VariableRateFifo + handle + attr pinning + registry registration. |
| python/iron/sparse.py | Implement SparseFifo + handle + sparsity validation + attr pinning + registry registration. |
| python/iron/memtile.py | Implement MemtileAggregator helper around ObjectFifo.prod().join() topology. |
| python/iron/dataflow/fifo_handle_registry.py | New registry module: register/unregister/list/dispatch + snapshot helper for tests. |
| python/iron/dataflow/__init__.py | Pre-register ObjectFifoHandle handler to preserve prior Worker bookkeeping behavior. |
| python/iron/cascade.py | Implement CascadeFifo lowering to aie.cascade_flow and placement-facing surface. |
| python/iron/__init__.py | Export the new primitives/handles from aie.iron. |
| python/iron/VARIABLE_RATE_DESIGN.md | Design doc capturing rationale and lowering model for VariableRateFifo. |
| programming_examples/basic/variable_rate_filter/variable_rate_filter.py | End-to-end example exercising VariableRateFifoHandle.discard(1). |
| programming_examples/basic/variable_rate_filter/filter_first_byte_even.cc | Companion C++ kernel (currently a pure copy) for the example. |
| programming_examples/basic/variable_rate_filter/README.md | Documentation for the variable-rate example and expected MLIR markers. |
| programming_examples/basic/variable_rate_filter/Makefile | Build plumbing for the variable-rate example. |
| lib/Dialect/AIEX/Transforms/AIEDmaToNpu.cpp | Emit Enable_Compression bit (word[1] bit 31) when aie.enable_compression is present. |
| lib/Dialect/AIEX/Transforms/AIEDMATasksToNPU.cpp | Forward aie.enable_compression from aie.dma_bd to aiex.npu.writebd. |
| lib/Dialect/AIE/Transforms/AIEObjectFifoStatefulTransform.cpp | Propagate sparse compression attrs into BDs, propagate attrs through split-fifo, and skip variable-rate fifos in LCM-unroll. |
The earlier mass-strip pass over project-internal task IDs left several
files with structural damage (orphaned triple-quote opens swallowing
function bodies, broken comment fragments, and one missing list entry
that made SparseFifo() crash by default). All flagged by Copilot;
none surfaced earlier because most are inside docstrings (still
syntactically valid) or invariants tested only when an actual handle
is constructed.
Fixes:
- test/iron/test_memtile_aggregator.py: a stray '"""' on
_make_t53m_aggregator's first body line silently turned the
helper's body PLUS five subsequent test functions into one giant
docstring, leaving the helper as 'return None' and dropping
test_producer_returns_object_fifo_handle, test_producers_returns_n_handles_in_order,
test_producer_index_out_of_range_raises, test_consumer_returns_object_fifo_handle,
and test_offsets_match_flat_concat from pytest collection.
Restored a proper one-line docstring + function body; reconstructed
the test_offsets_match_flat_concat function body (the closing
'"""' on line 176 was its docstring tail). Also dropped
agg.layout assertion that referenced the now-removed property.
- python/iron/sparse.py: added (2, 4) back to
_AM020_VERIFIED_NM_PATTERNS. The strip ate the line, and the
defaults are N=2 M=4 — so SparseFifo() with no args raised on
every construction.
- python/iron/worker.py: rewrote the FIFO-handle dispatch comment
block (mid-sentence fragments + unmatched parenthesis from line
drops) into complete sentences.
- test/objectFifo-stateful-transform/variable_rate_fifo_attr_propagation.mlir:
fixed the orphaned 'or wedge under the split-fifo + cross-column
path that the' fragment in the header — restored a complete
sentence describing what the lit test verifies.
- programming_examples/basic/variable_rate_filter/{variable_rate_filter.py,README.md,filter_first_byte_even.cc}:
prior commit changed the example to a Python-level alternating
skip pattern (calling discard(1) in the skip branch), but the
comments + README still claimed the C++ kernel did a per-window
predicate check. Updated both to describe the deterministic
Python-skip + window-copy kernel that the code actually
implements; noted that runtime-decided per-window skips would
require a first-class scf.if lowering.
- Final residual sweep: removed remaining 'Phase 1' / 'bio-on-XDNA'
references from accum.py / dataflow/__init__.py / memtile.py /
test_cascade_fifo.py / test_worker_fifo_handle_extension.py.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
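The sparse.py fix described in the commit message above comes down to a default-argument invariant: the default (N, M) pair must be a member of the verified-pattern set, otherwise every no-argument construction raises. A minimal sketch of that check follows; `validate_nm` and `_VERIFIED_NM_PATTERNS` are illustrative stand-ins, not the actual contents of python/iron/sparse.py.

```python
# Stand-in for _AM020_VERIFIED_NM_PATTERNS; only (2, 4) is named in the source.
_VERIFIED_NM_PATTERNS = frozenset({(2, 4)})


def validate_nm(n=2, m=4, verified=_VERIFIED_NM_PATTERNS):
    """Return (n, m) if the pattern is in the verified set, else raise.

    Hypothetical validator: the defaults mirror the N=2 M=4 defaults
    mentioned in the commit message.
    """
    if (n, m) not in verified:
        raise ValueError(f"N:M pattern {n}:{m} is not in the verified set")
    return (n, m)


def default_construction_raises(verified):
    """Report whether a no-argument call fails against the given set."""
    try:
        validate_nm(verified=verified)
    except ValueError:
        return True
    return False
```

With the (2, 4) entry present the default call succeeds; with an empty set, `default_construction_raises(frozenset())` returns `True`, which is the failure mode the commit describes.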
Pull request overview
Copilot reviewed 30 out of 30 changed files in this pull request and generated 8 comments.
Adds a basic programming example that measures per-launch dispatch overhead on AIE2P silicon by exposing a trivial single-tile passthrough kernel under a parameterised IRON topology. The example varies two orthogonal knobs (n_chunks and dense_bytes) so a host driver script can regress per-launch wall against each and attribute the per-launch floor to:
(a) xrt::run.wait() return-path overhead
(b) instruction-stream upload cost
(c) per-chunk shim-DMA setup overhead × N_CHUNKS
(d) AIE2P firmware dispatcher per-launch handshake
Files:
- dispatch_overhead_bisector.py: IRON topology with N_CHUNKS separate fill/drain tasks per launch (so each chunk lowers to its own shim BD; the IRON access-pattern collapse is opted out of via per-chunk TensorAccessPattern).
- passthrough.cc: 64-byte-vectorised memcpy compute kernel (no arithmetic; pushes compute to the noise floor so the host-runner wall captures dispatch-layer cost).
- test.cpp: host runner; reports per-iteration wall-time distribution as KEY=VALUE lines for machine parsing, plus an optional per-iter CSV.
- Makefile: builds one variant by default; `all-variants` builds the representative bisection sweep.
- README.md: methodology + build/run/sweep instructions.
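The regression step mentioned above (per-launch wall against n_chunks) can be sketched as an ordinary least-squares line fit, where the slope estimates the per-chunk cost and the intercept estimates the fixed per-launch floor. This is only an illustration: `fit_line` is a hypothetical helper, and the numbers are synthetic values constructed to lie on a line, not measurements from the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic inputs, constructed as wall = 100 + 5 * n_chunks purely to
# exercise the fit. They are NOT measured on hardware.
n_chunks = [1, 2, 4, 8]
wall_us = [105.0, 110.0, 120.0, 140.0]

per_chunk_us, floor_us = fit_line(n_chunks, wall_us)
```

Repeating the same fit against dense_bytes, with n_chunks held fixed, would separate the second knob in the same way.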
Six items, all addressed:
- python/iron/cascade.py: CascadeFifo.resolve() previously set
self._resolving = True before the tile-op precondition checks and
never cleared it on exception, leaving the instance permanently
non-resolvable. Move precondition checks BEFORE setting _resolving,
and wrap the actual emission in try/finally so failures don't latch
the flag. Also add an idempotency guard on self._op so a successful
resolve() is a no-op on re-entry.
- python/iron/sparse.py: deduplicate the (2, 4) entry in
_AM020_VERIFIED_NM_PATTERNS (added twice during the previous fix;
set semantics dedupe at runtime but the literal is misleading).
- python/iron/variable_rate.py: fix the malformed "Architectural
references" bullet list in the module docstring (orphan
continuation lines + truncated bullet from the earlier strip pass).
- test/iron/test_fifo_handle_program_walk.py: replace the
'"""+ : contract tests..."""' diff-artifact module docstring with
a clean one-liner.
- programming_examples/basic/variable_rate_filter/{variable_rate_filter.py,README.md}:
replace hardcoded /home/matteius/... paths in the build snippets
with placeholder syntax (<path/to/...>).
- python/iron/{accum.py,VARIABLE_RATE_DESIGN.md}, test/iron/test_accum_fifo.py:
replace remaining "T7-IRON" references (the strip regex required a
digit after the dot, which ".IRON" fails to match) with hardware-
grounded language about the well-trodden cascade-stream path.
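The CascadeFifo.resolve() fix listed first above combines three ideas: precondition checks before the in-progress flag is set, try/finally around emission so a failure cannot latch the flag, and an early return once an op exists. A minimal sketch of that control flow follows; the class, its `tiles_placed` field, the re-entrancy error, and the `_emit` stand-in are all assumptions for illustration, not the code in python/iron/cascade.py.

```python
class ResolvableSketch:
    """Hypothetical model of the resolve() control flow described above."""

    def __init__(self, tiles_placed=True):
        self.tiles_placed = tiles_placed
        self._resolving = False
        self._op = None
        self.emit_count = 0

    def _emit(self):
        # Stand-in for emitting the dialect op; returns an opaque object.
        self.emit_count += 1
        return object()

    def resolve(self):
        # Idempotency guard: a successful resolve() is a no-op on re-entry.
        if self._op is not None:
            return self._op
        # Preconditions are checked BEFORE the flag is set, so a failed
        # check leaves the instance resolvable later.
        if not self.tiles_placed:
            raise ValueError("endpoints must be placed before resolve()")
        if self._resolving:
            raise RuntimeError("resolve() re-entered while in progress")
        self._resolving = True
        try:
            self._op = self._emit()
        finally:
            # Cleared on success and on exception alike.
            self._resolving = False
        return self._op
```

A caller can therefore fix the placement after a failed attempt and call `resolve()` again on the same instance.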
Three more subsystems promoted to the public bionpu package:
src/bionpu/data/:
- canonical_sites.py (276 LOC): Cas-OFFinder TSV normaliser. Now the
canonical home; bionpu.verify.crispr imports from here. Adds
serialize_canonical() helper so verify can compute SHA-256 over
the canonical wire form without touching the filesystem.
- fetchers/{doench_2016,guide_seq,pod5_hg002,reference_genomes}.py
+ fetchers/__init__.py (495 LOC framework): per-dataset public
fetchers with SHA-pinning, framework documents the requirements
every dataset entry must satisfy.
- load_smoke.py (113 LOC): in-repo smoke fixture loaders.
src/bionpu/quant/:
- calibrate.py (215 LOC): ONNX quantization calibration driver
(thin wrapper around onnxruntime.quantization).
- passport.py (196 LOC): quantization passport — every quantized
model in the repo carries one (calibration source, op recipe,
reproducibility hash).
- peano_export.py (102 LOC): quantized ONNX -> MLIR-AIE -> xclbin
lowering hook.
src/bionpu/iron_extensions/:
- cascade_stream.py (387 LOC): cascade-chain IRON helper. Largely
superseded by mlir-aie's CascadeFifo (Xilinx/mlir-aie#3039) but
still consumed by 5 genetics files.
Internal bits NOT migrated (kept in genetics-private):
- bionpu/report/* — gaps-yaml aggregator + writeup pipeline; tightly
coupled to internal task-tracking format.
Refactor:
- src/bionpu/verify/_crispr_canonical.py removed; bionpu.verify.crispr
now imports from bionpu.data.canonical_sites (the public canonical
home).
License: Apache-2.0 + LLVM exception → GPL-3.0 across all migrated
files. Project-internal task IDs / outer-repo paths scrubbed.
All 18 verify-harness tests still pass; 71/71 Python files in the
public package parse.
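The serialize_canonical() idea mentioned above, hashing a canonical wire form in memory instead of a file on disk, can be sketched as follows. The serialisation shown here (sorted rows, tab-separated fields, one row per line, UTF-8) is an assumption chosen for illustration; the actual wire form defined in canonical_sites.py is not given in this thread.

```python
import hashlib


def serialize_canonical_sketch(rows):
    """Hypothetical stand-in for serialize_canonical().

    Sorts rows so the output does not depend on input order, then joins
    fields with tabs and rows with newlines.
    """
    lines = ["\t".join(str(field) for field in row) for row in sorted(rows)]
    return ("\n".join(lines) + "\n").encode("utf-8")


def canonical_sha256(rows):
    """SHA-256 hex digest over the in-memory canonical bytes."""
    return hashlib.sha256(serialize_canonical_sketch(rows)).hexdigest()
```

Because the digest is taken over bytes produced in memory, two result sets that differ only in row order hash identically and no temporary file is needed.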
…le findings
The scrubber on the initial migrations missed several patterns:
- Hyphenated task IDs (T7-IRON, T1-swarm): regex required \.\d after T<n>
- Investigation-phase labels (Followup A/B/C/E/G, Stage \d)
- Tilde-rooted absolute paths (~/xdna-bringup/...)
- Project-internal filename refs (gaps.yaml, umbrella PRD, gap-id)
Pass:
- 47 files mass-scrubbed of the missed patterns (34 code + 13 docs).
- src/bionpu/iron_extensions/INVENTORY.md: rewritten from a 405-line internal investigation report into a brief module note. The cascade-stream feasibility outcome is now upstream aie.iron.CascadeFifo (Xilinx/mlir-aie#3039); this module is documented as a back-compat shim that will be removed once the upstream merge is the mlir-aie floor.
- src/bionpu/kernels/basecalling/lstm_cell_bf16_acc_cascade/DESIGN.md: rewritten from a 406-line investigation log (Followup A-G stages, WEDGE-vs-PASS status flips, hypothesis falsification entries) into a 60-line design note covering the architectural rationale (FP32-cascade vs bf16-writeback precision wall, AM020 references, topology diagram, file inventory, known limitations).
- README.md: dropped the dead 'v0.1 release notes' link to a tag that doesn't exist; rewrote the Status section to point at the new docs/STATUS.md per-subsystem inventory.
- docs/STATUS.md: NEW. Explicit per-subsystem table of what works end-to-end vs what's an extracted module that needs a v0.2 driver to drive on hardware. Distinguishes ✅ working / ✅ extracted / ⚠️ deprecated / ⚠️ v0.2 scope.
- docs/REPRODUCE.md: rewrote from placeholder text into a real reproduction recipe: hardware prerequisites, software install, pip-install bionpu, byte-equality smoke check + negative control, manual kernel build, host-runner invocation, energy-methodology pointer, sanity-log discipline.
- CHANGELOG.md: NEW. v0.1 release notes covering what landed and what's deferred to v0.2.
Other per-kernel DESIGN.md files were checked: only the one cascade file carried a substantive investigation log; the others were already clean after the earlier scrub passes. All 18 verify-harness tests still pass; 70/70 Python files parse.
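The missed-pattern class described in this commit message (a scrub regex that requires `.<digit>` after `T<n>` and therefore skips hyphenated IDs) is easy to reproduce. The two patterns below are hypothetical reconstructions for illustration; the actual scrubber regexes are not shown in this thread.

```python
import re

# Hypothetical strict pattern: T<n> must be followed by ".<digit>".
STRICT_TASK_ID = re.compile(r"T\d+\.\d+")

# Hypothetical broadened pattern: also accepts a hyphenated suffix.
BROAD_TASK_ID = re.compile(r"T\d+(?:\.\d+|-[A-Za-z]\w*)")


def find_task_ids(pattern, text):
    """Return every substring of text that the pattern matches."""
    return pattern.findall(text)


sample = "see T5.3 and T7-IRON, also T1-swarm"
strict_hits = find_task_ids(STRICT_TASK_ID, sample)
broad_hits = find_task_ids(BROAD_TASK_ID, sample)
```

Here the strict pattern finds only the dotted ID, while the broadened one also catches the two hyphenated forms named in the commit message.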
Hi @matteius -- this is neat work you've started! I was hoping to get a high-level overview of:
We typically discuss larger design decisions in an issue and/or discussion before considering integration, so we'd like to engage in some thought before considering ObjectFIFO redesigns.
Hi @hunhoffe -- context first: I set out to implement genomics algorithms on the AIE2P NPU in my Ryzen AI laptop, working backward from AMD's documentation (AM020 / AM029) on what the hardware supports. Working PoCs are at https://github.com/opensensor/bionpu -- the primitives in this PR fell out of repeatedly re-discovering the same composition decisions across those workloads. High-level answers below.
Concrete limitations behind each primitive
Each one exposes an AM020 / AIE2P hardware feature that doesn't currently have a first-class IRON surface; workloads re-glue dialect ops + ad-hoc Python per use site:
The bundling reflects how they compose in the bionpu workloads:
Staging
One way to split, in increasing review surface:
Does that map to what you'd find reviewable, or would you split it differently?
Process
Would issues for the Tier 3 + Tier 4 design surfaces (
Hello @matteius! Thanks for your response. We are eager to ensure mlir-air + IRON can support use cases for application development, and it seems you have found a few gaps in feature coverage. We'd like to work with you to get those gaps filled. The path forward that makes the most sense to me is to:
For the breakdown of capabilities/PRs, I think the clearest path to success would be to:
Let us know if you have further questions; I'm excited to see what comes of your work so far!
I also wanted to ask -- in opensensor/bionpu, do you have a specific application/algorithm focus, or are you just generally exploring the space?
Thanks @hunhoffe, I appreciate the staging proposal. It's a reasonable path for upstream integration.
As for that, I recommend reading through: https://github.com/opensensor/bionpu/blob/main/docs/AIE2P-CRISPR-shape-validation.md -- By selectively engaging the NPU for parts of the CRISPR search path I arrived at a 4-5x speedup on a CRISPR editor I was building.
I'm going to step back from upstreaming this work for now. The repackaging effort (top-level issue, per-feature PRs with baseline-vs-new framing, and the iterative review cycles for each) is more bandwidth than I wish to commit to right now alongside other things I have going on. I'll continue developing on my fork and will keep it rebased on
For anyone landing on this thread looking for the working code: the primitives live on
Thanks again for engaging with it -- no hard feelings on the process ask; it's the right ask for a project of this scope. It's just not the right time on my end, always hustling to make ends meet.
Summary
Adds six specialized FIFO subclasses to the IRON Python API plus the matching MLIR-pass plumbing so the primitives work end-to-end. Each primitive composes existing mlir-aie / AIE-ML / AIE2P features into a reusable abstraction; the existing ObjectFifo API is unchanged.
Primitives
- CascadeFifo: cascade_flow dialect op behind the same producer/consumer surface as ObjectFifo.
- PacketFifo
- AccumFifo
- SparseFifo: Enable_Compression plumbing through AIEDmaToNpu / AIEDMATasksToNPU and split-fifo attr propagation in AIEObjectFifoStatefulTransform.
- MemtileAggregator
- VariableRateFifo: unrollForLoops skip + split-fifo attr propagation in AIEObjectFifoStatefulTransform.
Foundation
- Worker.fn_args FifoHandle registry (python/iron/dataflow/fifo_handle_registry.py): replaces the hard-coded isinstance(arg, ObjectFifoHandle) branch with an extensible dispatch registry. New FIFO subclasses register their own handle type without touching worker.py.
- *FifoHandle subclass contract normalization: fixes two latent breakages where Program._walk_object_fifos and iron/program.py:81's literal-type-check rejected FifoHandle subclasses. Pure broadening: every existing ObjectFifoHandle use still works.
Diff stats
29 files, 7992 insertions(+), 26 deletions(-) split across:
- python/iron/{cascade,packet,accum,sparse,memtile,variable_rate}.py: six new modules.
- python/iron/dataflow/fifo_handle_registry.py: registry foundation.
- python/iron/{__init__.py,worker.py,dataflow/__init__.py}: wire in the new subclasses.
- lib/Dialect/AIE/Transforms/AIEObjectFifoStatefulTransform.cpp, lib/Dialect/AIEX/Transforms/{AIEDmaToNpu,AIEDMATasksToNPU}.cpp: pass-side plumbing for SparseFifo's BD Enable_Compression bit and VariableRateFifo's loop-unroll skip.
- test/iron/test_*.py: eight new pytest modules covering surface, registry dispatch, lowering (resolve() against an aie.device context), and per-primitive behavioural toys.
- test/objectFifo-stateful-transform/{sparse_fifo_split_attr_propagation,variable_rate_fifo_attr_propagation,variable_rate_fifo_skip_unroll}.mlir, test/Conversion/DmaToNpu/dma_to_npu_sparse_compression.mlir: lit tests for the pass-side changes.
- programming_examples/basic/variable_rate_filter/: minimal end-to-end example exercising discard(1) on a Python-level alternating skip pattern.
- python/iron/VARIABLE_RATE_DESIGN.md: design notes.
Review feedback addressed
(This description reflects the post-review-fix state; v1 was rewritten in place.)
- SparseFifo and VariableRateFifo work end-to-end out of the box; the prior shape would have silently degraded to vanilla ObjectFifo lowering on an aie-opt without the matching pass changes.
- test_fp32_lstm_reference_matches_pytorch, test_fp32_reference_beats_bf16_writeback_by_3oom): they're numpy reference baselines for the LSTM workload, not tests of the AccumFifo lowering. The "load-bearing falsifiable claim" framing is dropped.
- test_resolve_emits_packetflow_op_in_module, test_resolve_idempotent_does_not_emit_twice) so the dialects.aie.packetflow call signature is exercised end-to-end.
- super().__init__() is bypassed, what attributes are stubbed, and the long-term direction (a shared narrower base class). SparseFifoHandle and VariableRateFifoHandle inherit normally.
- discard(1) on a Python-level alternating skip pattern (the prior example always called acquire/release and never discard).
- test_nm_compression_roundtrip_is_lossless_on_compliant_input) and reframed as a property test for the N:M compression format itself, not a test of SparseFifo's lowering or silicon behaviour.
- MemtileAggregator layout="window" dead code removed (parameter, validator, NotImplementedError block, property, tests).
- register_fifo_handle brittleness fixed: re-registering the same handler for an already-registered class is now an idempotent no-op (module reloads / repeated imports stay safe). Re-registering a different handler still raises.
- _RegistrySnapshot removed from __all__ (the leading underscore signals "test-internal use only").
- Co-Authored-By: Claude trailers removed from all commit messages.
Test plan
- dma_to_npu_sparse_compression.mlir, sparse_fifo_split_attr_propagation.mlir, variable_rate_fifo_attr_propagation.mlir, variable_rate_fifo_skip_unroll.mlir.
- programming_examples/basic/variable_rate_filter/ builds cleanly and produces the expected MLIR (aie.variable_rate = true on both producer and consumer fifos after the split-fifo propagation).
Happy to split into per-primitive PRs if maintainers prefer, but this bucket lands together so each primitive is end-to-end-functional out of the box.
Signed-off-by: Matt Davis <matt@opensensor.io>