[IRON] Add specialized FIFO subclasses (Cascade / Packet / Accum / Sparse / Memtile / VariableRate) by matteius · Pull Request #3039 · Xilinx/mlir-aie

matteius · 2026-04-27T03:15:17Z

Summary

Adds six specialized FIFO subclasses to the IRON Python API plus the matching MLIR-pass plumbing so the primitives work end-to-end. Each primitive composes existing mlir-aie / AIE-ML / AIE2P features into a reusable abstraction; the existing ObjectFifo API is unchanged.

Primitives

| Subclass | What it adds |
| --- | --- |
| `CascadeFifo` | First-class cascade-stream ObjectFifo subclass; wraps the `cascade_flow` dialect op behind the same producer/consumer surface as `ObjectFifo`. |
| `PacketFifo` | Packet-switched / pktMerge N:1 / TLAST / out-of-order BD primitive (AM020 Ch. 2 Fig. 17 + Ch. 2 p. 27 + Ch. 5 p. 74). |
| `AccumFifo` | FP32 inter-tile accumulator state passing: persists 512-bit BM register state across timesteps within a tile and across tiles via cascade-stream BM transfer (AM020 Ch. 4 p. 67). |
| `SparseFifo` | On-the-fly N:M sparsity decompression on S2MM with the matching pass-side BD `Enable_Compression` plumbing through `AIEDmaToNpu` / `AIEDMATasksToNPU` and split-fifo attr propagation in `AIEObjectFifoStatefulTransform`. |
| `MemtileAggregator` | Memtile-mediated 4-into-1 fan-in helper (AM020 Ch. 5 p. 74). |
| `VariableRateFifo` | Producer-side conditional-forward FIFO with the matching `unrollForLoops` skip + split-fifo attr propagation in `AIEObjectFifoStatefulTransform`. |

Foundation

Worker.fn_args FifoHandle registry (python/iron/dataflow/fifo_handle_registry.py) — replaces the hard-coded isinstance(arg, ObjectFifoHandle) branch with an extensible dispatch registry. New FIFO subclasses register their own handle type without touching worker.py.
*FifoHandle subclass contract normalization — fixes two latent breakages where Program._walk_object_fifos and iron/program.py:81's literal-type-check rejected FifoHandle subclasses. Pure broadening — every existing ObjectFifoHandle use still works.

Diff stats

29 files, 7992 insertions(+), 26 deletions(-) split across:

python/iron/{cascade,packet,accum,sparse,memtile,variable_rate}.py — six new modules.
python/iron/dataflow/fifo_handle_registry.py — registry foundation.
python/iron/{__init__.py,worker.py,dataflow/__init__.py} — wire in the new subclasses.
lib/Dialect/AIE/Transforms/AIEObjectFifoStatefulTransform.cpp, lib/Dialect/AIEX/Transforms/{AIEDmaToNpu,AIEDMATasksToNPU}.cpp — pass-side plumbing for SparseFifo's BD Enable_Compression bit and VariableRateFifo's loop-unroll skip.
test/iron/test_*.py — eight new pytest modules covering surface, registry dispatch, lowering (resolve() against an aie.device context), and per-primitive behavioural toys.
test/objectFifo-stateful-transform/{sparse_fifo_split_attr_propagation,variable_rate_fifo_attr_propagation,variable_rate_fifo_skip_unroll}.mlir, test/Conversion/DmaToNpu/dma_to_npu_sparse_compression.mlir — lit tests for the pass-side changes.
programming_examples/basic/variable_rate_filter/ — minimal end-to-end example exercising discard(1) on a Python-level alternating skip pattern.
python/iron/VARIABLE_RATE_DESIGN.md — design notes.

Review feedback addressed

(This description reflects the post-review-fix state; v1 was rewritten in place.)

Project-internal task IDs scrubbed from source, docstrings, tests, lit tests, MLIR pass comments, and commit messages. Hardware-grounded references (AM020 chapter / page citations, register names, dialect ops) are preserved.
Bucket-2 lowering passes included so SparseFifo and VariableRateFifo work end-to-end out of the box; the prior shape would have silently degraded to vanilla ObjectFifo lowering on an aie-opt without the matching pass changes.
Bugfix commits squashed into the commits that introduced them (CascadeFifo's two follow-ups; AccumFifo's threshold-relax; the three Bucket-2 SparseFifo fix-up commits; the two task-ID strip commits). Final history is 17 commits.
AccumFifo precision tests renamed to reflect what they actually measure (test_fp32_lstm_reference_matches_pytorch, test_fp32_reference_beats_bf16_writeback_by_3oom) — they're numpy reference baselines for the LSTM workload, not tests of the AccumFifo lowering. The "load-bearing falsifiable claim" framing is dropped.
PacketFifo lowering tests added (test_resolve_emits_packetflow_op_in_module, test_resolve_idempotent_does_not_emit_twice) so the dialects.aie.packetflow call signature is exercised end-to-end.
AccumFifoHandle / PacketFifoHandle parent-constructor bypass documented explicitly in each class docstring with an explanation of why super().__init__() is bypassed, what attributes are stubbed, and the long-term direction (a shared narrower base class). SparseFifoHandle and VariableRateFifoHandle inherit normally.
VariableRateFifo example rewritten to actually exercise discard(1) on a Python-level alternating skip pattern (the prior example always called acquire/release and never discard).
SparseFifo round-trip test renamed (test_nm_compression_roundtrip_is_lossless_on_compliant_input) and reframed as a property test for the N:M compression format itself, not a test of SparseFifo's lowering or silicon behaviour.
MemtileAggregator layout="window" dead code removed (parameter, validator, NotImplementedError block, property, tests).
register_fifo_handle brittleness fixed: re-registering the same handler for an already-registered class is now an idempotent no-op (module reloads / repeated imports stay safe). Re-registering a different handler still raises.
_RegistrySnapshot removed from __all__ (the leading underscore signals "test-internal use only").
Co-Authored-By: Claude trailers removed from all commit messages.

Test plan

All eight new pytest modules pass against a wheel-built install.
New lit tests pass: dma_to_npu_sparse_compression.mlir, sparse_fifo_split_attr_propagation.mlir, variable_rate_fifo_attr_propagation.mlir, variable_rate_fifo_skip_unroll.mlir.
programming_examples/basic/variable_rate_filter/ builds cleanly and produces the expected MLIR (aie.variable_rate = true on both producer and consumer fifos after the split-fifo propagation).
Upstream lit / pytest CI to confirm nothing existing regresses.
Silicon validation: each primitive is tested at the Python+lowering layer; on-silicon dispatch requires the consumer kernel side which is out of scope here.

Happy to split into per-primitive PRs if maintainers prefer, but this bucket lands together so each primitive is end-to-end-functional out of the box.

Signed-off-by: Matt Davis <matt@opensensor.io>

Reserve five FIFO subclass slots in aie.iron's __init__.py as NotImplementedError-raising stubs. Subsequent commits in this PR replace each stub with a real class:

- CascadeFifo (cascade-stream ObjectFifo subclass)
- PacketFifo (packet-switched / pktMerge / TLAST / OoO BD)
- AccumFifo (FP32 inter-tile accumulator state passing)
- SparseFifo (on-the-fly N:M sparsity decompression on S2MM)
- MemtileAggregator (memtile-mediated fan-in helper)

Adds an explicit __all__ enumerating both the existing primitives and the five new reservation slots so the public surface stays discoverable without runtime side effects.

Signed-off-by: Matt Davis <matt@opensensor.io>

… + LLVM exception) first-class IRON primitive at python/iron/cascade.py.

The new CascadeFifo class mirrors aie.iron.ObjectFifo's constructor surface (producer / consumer endpoints, dtype, name, handshake-size knob) but lowers through the cascade physical channel, emitting an aie.cascade_flow op via resolve() that the placer pass converts into per-tile aie.configure_cascade ops.

Architectural references:

- AM020 Ch. 4 p. 67: 512-bit cascade stream between adjacent CoreTiles.
- AM020 Appendix A p. 80 Figure 45: vertical+horizontal cascade topology.
- aie.put_cascade / aie.get_cascade: the cascade write/read MLIR ops the C++ kernel emits via put_mcd / get_scd_v16int32 intrinsics inside its core_fn body. CascadeFifo only emits placement + cascade_flow.
- aie.cascade_flow: the declarative connection op the placer lowers.

usable for callers building chain-of-N topologies; behavioural parity with the wrapper is asserted in tests/test_iron_cascade_fifo.py.

the CascadeFifo slot + the matching import line are touched in this PR.

for all subsequent fork PRs:

- Apache-2.0 + LLVM exception headers on every new fork file.
- Signed-off-by trailer per LLVM's DCO model.
- Repo-root THIRD_PARTY_NOTICES.md updated in the outer repo.

Tests:

- test/iron/test_cascade_fifo.py: surface, validation, lowering, and parity with the dialect-level cascade_flow op.

Signed-off-by: Matt Davis <matt@opensensor.io>

…h. 4 p. 67) first-class IRON dataflow primitive sibling to ObjectFifo and CascadeFifo.

AccumFifo persists 512-bit BM accumulator state across two boundaries:

1. Across timesteps within a tile (BM-to-BM register move; AM020 Ch. 4 p. 67 "Move one 512-bit accumulator register to another in one cycle"). Lowering: no MLIR op emitted; the C++ kernel keeps an aie::accum local hot across the worker's while(true) iteration boundary.
2. Across tiles via cascade-stream BM transfer (AM020 Ch. 4 p. 67 "Cascade stream connects the AIE-MLs in a chain ... transfer an accumulator register (512-bit) from one to the next"). Lowering: aie.cascade_flow(prod_tile, cons_tile) between vertically-adjacent CoreTiles.

Vertical adjacency on AIE2P is the only geometry T7-IRON verified on silicon; horizontal cascade is documented (AM020 App. A p. 80 Fig. 45) but un-tested -- a UserWarning is raised for non-vertical placements rather than a hard reject.

    AccumFifo(producer, consumer, dtype="accfloat", lanes=16)
    af.prod() -> AccumFifoHandle   (acquire / release no-ops; cascade wire is per-cycle handshaked at the dialect intrinsic level, intra-tile is register-aliased)
    af.cons() -> AccumFifoHandle

Rationale for sibling class vs ObjectFifo flag: ObjectFifo is memref-typed (DMA-copies memref words). The accumulator register is not a memref word -- it's a hardware register-file slice (AM020 Ch. 4 p. 65-67). Modeling this on ObjectFifo would either force a fictitious memref<16xf32> the lowering ignores, or burden every ObjectFifo consumer with an accumulator-mode invariant. A sibling class with the same prod/cons surface keeps the abstraction clean.

dtype validation:

- "acc32" (int32 accumulator)
- "acc64" (int64 paired-lane accumulator -- 8 lanes x 64 bits)
- "acc48" is explicitly rejected (AIE1-only; AIE-ML / AIE2P drops it per AM020 Ch. 4 p. 65)

Lane-count validation: enforces the AM020 Ch. 4 p. 67 cascade-transfer width of exactly 512 bits/cycle. lanes=16 is the only legal value for accfloat / acc32; lanes=8 for acc64.

AccumFifoHandle subclasses ObjectFifoHandle so existing isinstance(arg, ObjectFifoHandle) dispatch in Worker.fn_args accepts fn_args dispatch, the inheritance is no longer load-bearing for that purpose, but documents AccumFifo as a fifo-shaped abstraction.

Tests at test/iron/test_accum_fifo.py cover three layers:

- Surface: API shape, dtype/lane validation, intra-tile vs inter-tile detection, error messages, isinstance compatibility.
- Lowering: intra-tile emits no cascade_flow op; inter-tile emits one.
- Precision: synthetic LSTM cell (96 hidden, 200 timesteps, matching invariant matches FP32 PyTorch reference within 1e-5 max-abs, vs the

Reservation slot in python/iron/__init__.py is replaced slots (CascadeFifo, PacketFifo, SparseFifo, MemtileAggregator)

Signed-off-by: Matt Davis <matt@opensensor.io>
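The dtype and lane-count rules above reduce to a single width check. The following is a minimal sketch, assuming each accumulator dtype maps to a fixed per-lane width; the function name `validate_accum_config` and the table `_LANE_BITS` are hypothetical and not taken from python/iron/accum.py.

```python
# Hypothetical sketch of the 512-bit cascade-width rule described above.
# The dtype names come from the commit message; everything else is illustrative.
CASCADE_BITS = 512

# Assumed per-lane widths: accfloat and acc32 are 32-bit, acc64 is 64-bit.
_LANE_BITS = {"accfloat": 32, "acc32": 32, "acc64": 64}


def validate_accum_config(dtype: str, lanes: int) -> int:
    """Return the lane count if (dtype, lanes) fills exactly 512 bits."""
    if dtype == "acc48":
        # The commit message describes acc48 as AIE1-only and rejected outright.
        raise ValueError("acc48 is not available on AIE-ML / AIE2P")
    if dtype not in _LANE_BITS:
        raise ValueError(f"unknown accumulator dtype: {dtype!r}")
    expected = CASCADE_BITS // _LANE_BITS[dtype]
    if lanes != expected:
        raise ValueError(
            f"{dtype} needs lanes={expected} to fill {CASCADE_BITS} bits, got {lanes}"
        )
    return lanes


def is_rejected(dtype: str, lanes: int) -> bool:
    """Helper for demonstrations: True when validation raises ValueError."""
    try:
        validate_accum_config(dtype, lanes)
    except ValueError:
        return True
    return False
```

Under these assumptions, `accfloat` and `acc32` accept only 16 lanes (16 x 32 bits) and `acc64` accepts only 8 lanes (8 x 64 bits), which matches the legal values listed in the commit message.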

… (foundation for PacketFifo)

fn_args resolution as the single biggest blocker for promoting ObjectFifo subclasses (PacketFifo, CascadeFifo, AccumFifo, SparseFifo) to first-class IRON primitives. The original implementation hard-coded the type-dispatch chain to recognize only ObjectFifoHandle, so any new FifoHandle subclass would have to fork worker.py. This commit refactors that dispatch into a registry pattern:

* python/iron/dataflow/fifo_handle_registry.py -- new module exposing register_fifo_handle (decorator + function-call forms), unregister_fifo_handle, get_registered_handle_classes, and dispatch_fn_arg. Reverse-insertion order ensures more-specific subclasses (registered later) win the isinstance() walk.
* python/iron/dataflow/__init__.py -- pre-registers ObjectFifoHandle with a handler that reproduces the original Worker.__init__ bookkeeping bit-for-bit (sets arg.endpoint = worker; appends to worker._fifos). Backward-compat anchor: every Phase 1 design that passes ObjectFifoHandle through fn_args still works without modification.
* python/iron/worker.py -- replaces the hard-coded isinstance(arg, ObjectFifoHandle) branch with a dispatch_fn_arg(...) call. Buffer / ObjectFifo / WorkerRuntimeBarrier branches unchanged.
* test/iron/test_worker_fifo_handle_extension.py -- 14 tests covering pre-registration, regression guard for ObjectFifoHandle, custom subclass dispatch, runtime registration, reverse-order precedence, decorator + function-call forms, snapshot context manager, error handling, and public-surface stability.

without further changes to worker.py.
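The reverse-insertion-order walk can be illustrated with a small stand-alone registry. This is a sketch under stated assumptions: the function names `register_fifo_handle` and `dispatch_fn_arg` are borrowed from the commit message, but the bodies, the `Toy*` classes, and the `log` attribute are hypothetical and do not reproduce fifo_handle_registry.py.

```python
from typing import Any, Callable, Dict

Handler = Callable[[Any, Any], None]

# Illustrative registry; a dict preserves insertion order in Python 3.7+.
_HANDLERS: Dict[type, Handler] = {}


def register_fifo_handle(cls: type, handler: Handler) -> None:
    _HANDLERS[cls] = handler


def dispatch_fn_arg(arg: Any, worker: Any) -> bool:
    """Run the newest matching handler; return False if nothing matches."""
    # Newest-first walk: a subclass registered after its base wins.
    for cls in reversed(list(_HANDLERS)):
        if isinstance(arg, cls):
            _HANDLERS[cls](arg, worker)
            return True
    return False


class ToyObjectFifoHandle:  # hypothetical stand-in for ObjectFifoHandle
    pass


class ToyPacketFifoHandle(ToyObjectFifoHandle):  # hypothetical subclass
    pass


class ToyWorker:
    def __init__(self) -> None:
        self._fifos = []
        self.log = []  # records which handler ran, for demonstration only


def _base_handler(arg: Any, worker: ToyWorker) -> None:
    # Mirrors the bookkeeping the commit message describes.
    arg.endpoint = worker
    worker._fifos.append(arg)
    worker.log.append("base")


def _packet_handler(arg: Any, worker: ToyWorker) -> None:
    arg.endpoint = worker
    worker._fifos.append(arg)
    worker.log.append("packet")


register_fifo_handle(ToyObjectFifoHandle, _base_handler)
register_fifo_handle(ToyPacketFifoHandle, _packet_handler)

worker = ToyWorker()
handled_sub = dispatch_fn_arg(ToyPacketFifoHandle(), worker)
handled_base = dispatch_fn_arg(ToyObjectFifoHandle(), worker)
handled_other = dispatch_fn_arg(object(), worker)
```

A `ToyPacketFifoHandle` instance passes both `isinstance()` checks, yet only the later-registered handler runs; an argument with no registered class falls through so the caller can try its other branches.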

Promote the three AM020-documented variable-rate hardware primitives

- pktMerge N:1 header-based routing (AM020 Ch. 2 Figure 17)
- S2MM finish-on-TLAST stream end (AM020 Ch. 2 p. 27)
- Out-of-order BD processing (AM020 Ch. 5 p. 74)

PacketFifo mirrors ObjectFifo's prod() / cons() user-facing surface but lowers to aie.packetflow ops with per-packet header-based routing through the AXI stream switch fabric -- a different runtime mechanism from ObjectFifo's shared-memory + lock model. A sibling class (rather than an ObjectFifo flag) keeps the abstraction clean and lets the lowering emit packetflow ops directly.

API:

    PacketFifo(producers, consumers,
               header_dtype="uint8",
               merge_strategy="round-robin"|"priority",
               packet_ids=...,         # auto-assigned if omitted
               keep_pkt_header=True,   # False -> finish-on-TLAST
               obj_type=..., depth=2)

PacketFifoHandle subclasses ObjectFifoHandle for surface compatibility, time so Worker.fn_args dispatch recognizes it without modifying worker.py. The reverse-insertion-order walk in dispatch_fn_arg picks PacketFifoHandle over ObjectFifoHandle when both isinstance() checks match -- exactly the property the registry was designed for.

gap entries recorded in

Test coverage in test/iron/test_packet_fifo.py:

- Surface tests: API shape, dtype/strategy/packet_id validation, error messages (16 tests)
- Handle surface: producer/consumer construction, idempotency, send_with_header / recv_header asymmetry, ObjectFifoHandle subclass invariant (8 tests)
- Registry integration: PacketFifoHandle registered after import, dispatch_fn_arg recognizes it, Worker.fn_args records it on _fifos, reverse-insertion-order walk picks subclass over base (4 tests)
- Behavioral toy: 3-producer-1-consumer round-robin merge yields the union of inputs without drops; per-producer ordering preserved; finish-on-TLAST flag plumbed through; priority strategy + N:M construction validated (5 tests)

Refs:

- python/iron/dataflow/fifo_handle_registry.py

Signed-off-by: Matt Davis <matt@opensensor.io>
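The "behavioral toy" property (no drops, per-producer order preserved) can be shown with a host-side model. This is a sketch only: `round_robin_merge` is a hypothetical helper, it tags each item with the producer index as a stand-in for a packet ID, and it does not model the stream switch or the tests in test/iron/test_packet_fifo.py.

```python
from collections import deque
from typing import Iterable, List, Sequence, Tuple


def round_robin_merge(producers: Sequence[Iterable[int]]) -> List[Tuple[int, int]]:
    """Merge N producer streams into one, visiting producers in turn.

    Each output entry is (producer_index, payload); the index plays the role
    of a packet header in this toy model.
    """
    queues = [deque(p) for p in producers]
    merged: List[Tuple[int, int]] = []
    while any(queues):
        for idx, queue in enumerate(queues):
            if queue:  # an exhausted producer is simply skipped
                merged.append((idx, queue.popleft()))
    return merged


inputs = [[1, 2, 3], [10], [20, 21]]
merged = round_robin_merge(inputs)

# Recover each producer's stream from the tagged output.
per_producer = [[v for i, v in merged if i == idx] for idx in range(len(inputs))]
```

Splitting the merged stream back out by tag returns each producer's original sequence, which is the ordering property the commit message lists; the total length equals the sum of the inputs, so nothing is dropped.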

…AM020 Ch. 5 p. 74) vs the 2-into-1 fallback) to a first-class IRON helper. Encapsulates the canonical AIE-ML / AIE2P memtile-mediated fan-in pattern documented by AM020 Ch. 5 p. 74 (memtile S2MM channels 0..3 with east/west neighbour access) + Figures 22+23 + the "Dataflow Mapping 1/2/3" diagrams.

API:

    MemtileAggregator(n_producers, producer_obj_type, joined_obj_type,
                      layout="slab", depth=2, tile=AnyMemTile, name=...)
    .producer(i) / .producers()  -> ObjectFifoHandle (per-tile producer)
    .consumer(depth=...)         -> ObjectFifoHandle (joined consumer)
    .offsets / .sub_fifos / .joined_fifo  (introspection)

Validates the flat-concat invariant at construction time and surfaces discovery) in both the class docstring and a clear NotImplementedError that fires for the layout="window" reservation slot until Phase 3 extends the helper with explicit dims_to_stream inference.

MemtileAggregator slot is touched; CascadeFifo / PacketFifo / AccumFifo / SparseFifo slots are left untouched for their owning tasks).

Tests: test/iron/test_memtile_aggregator.py covers construction validation, the flat-concat invariant, per-producer / consumer handle accessors, layout vocabulary enforcement, the memtile DM budget check, and byte-equality with Phase 1's hand-rolled join_offsets=[0, 2048, 4096, 6144] + obj_types triple. Pure-Python (no MLIR context required); runs after the fork wheel rebuild.

Refs: AM020 Ch. 5 p. 74 (memtile DMA channel layout), Ch. 5 p. 71 (5D address generation), Table 14 (memtile DM = 512 KiB on AIE-ML, Phase-2 follow-up.

Signed-off-by: Matt Davis <matt@opensensor.io>
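The join offsets quoted above follow from a running sum over producer sizes. Below is a minimal sketch, assuming the "flat-concat invariant" means the joined object is exactly the concatenation of the producer objects; `slab_offsets` and `check_flat_concat` are hypothetical names, not the helper's actual methods.

```python
from itertools import accumulate
from typing import List, Sequence


def slab_offsets(producer_sizes: Sequence[int]) -> List[int]:
    """Offset of each producer's slab inside the joined buffer (in elements)."""
    # Prefix sums, shifted so the first producer starts at 0.
    return [0, *accumulate(producer_sizes)][:-1]


def check_flat_concat(producer_sizes: Sequence[int], joined_size: int) -> None:
    """Assumed invariant: producer slabs tile the joined buffer exactly."""
    total = sum(producer_sizes)
    if total != joined_size:
        raise ValueError(
            f"joined object holds {joined_size} elements, producers supply {total}"
        )


sizes = [2048, 2048, 2048, 2048]  # four equal producers, as in the quoted offsets
check_flat_concat(sizes, joined_size=8192)
offsets = slab_offsets(sizes)
```

With four producers of 2048 elements each, the computed offsets are `[0, 2048, 4096, 6144]`, the same list the commit message cites for the hand-rolled `join_offsets`.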

Promotes AIE-ML / AIE2P compute-tile S2MM decompression + MM2S compression hardware (AM020 Ch. 1 p. 15 + Ch. 2 p. 27 + Ch. 5 p. 74) to a first-class IRON dataflow primitive at python/iron/sparse.py.

The new SparseFifo class subclasses ObjectFifo (composes-by-subclassing: inherits storage / depth / dimsToStream / dimsFromStream / pad+repeat+iter machinery) and adds N:M structured-sparsity kwargs (sparsity_pattern, N, M, allow_unverified). Producer-side sees compressed data, consumer-side sees dense data; the on-tile decompressor re-injects zeros at the position-map gaps before the data lands in tile DM.

SparseFifoHandle subclasses ObjectFifoHandle so Worker.fn_args' isinstance(arg, ObjectFifoHandle) check SparseFifoHandle handler over the parent ObjectFifoHandle handler in the

Lowering model
--------------

SparseFifo.resolve() calls the standard ObjectFifo lowering then attaches five discardable attributes to the lowered ObjectFifoCreateOp:

    aie.compress_mm2s    (BoolAttr) -- flips Enable_Compression on producer BD
    aie.decompress_s2mm  (BoolAttr) -- flips Enable_Compression on consumer BD
    aie.sparsity_pattern (StringAttr "N:M")
    aie.sparsity_n       (i32)
    aie.sparsity_m       (i32)

The BD-emit pass keys off these to flip the per-channel Enable_Compression bit (lib/Dialect/AIE/Util/aie_registers_aie2.json documents this as "Enable Compression (MM2S), decompression (S2MM). Only effective if channel has (de)compression enabled"). If the active backend hasn't been taught about these attributes (early AIE2P silicon-driver stack), the design still compiles and runs as a vanilla ObjectFifo -- degraded mode is observable via the runtime DMA-volume

Pattern validation
------------------

AM020-verified set {(1,2), (1,4), (2,4)} accepted by default (Ch. 1 p. 15 cites these for "CNN and RNN application" with RNN explicitly named -- hatch). Structural rules enforced eagerly at construction time: M >= 2; 0 < N < M; N and M are int.

Registry hook-up
----------------

register_fifo_handle(SparseFifoHandle, _sparse_fifo_handle_handler) runs at module import time; handler mirrors the pre-registered ObjectFifoHandle bookkeeping (arg.endpoint = worker; forward-looking no-op (Worker.__init__ falls back to the hard-coded isinstance(arg, ObjectFifoHandle) branch which still accepts the SparseFifoHandle handler over the ObjectFifoHandle handler.

Tests
-----

23 in-fork tests at test/iron/test_sparse_fifo.py covering:

- surface (real impl not stub; SparseFifoHandle subclassing; module constants)
- validation (rejects unsupported pattern tag, M<2, N>=M, N=0, unverified pattern by default; allow_unverified accepts unverified; non-int N/M; non-Tile producer)
- pattern correctness (each group of M has exactly M-N zeros after pruning, parametrized for 1:2 / 1:4 / 2:4; decompressed matmul bit-equal to sparse reference for 1:2 + 2:4)
- lowering (resolve() emits aie.objectfifo with all 5 sparsity attrs; idempotent on double-resolve)
- registry (SparseFifoHandle registered; dispatch_fn_arg matches)
- diagnostics (handle exposes N/M/compression_ratio/sparsity_pattern/sparse_fifo properties; __str__ includes pattern + N/M; module __all__ stable)

Architectural references
------------------------

- AM020 Ch. 1 p. 15: AIE-ML supports structured sparsity for "CNN and RNN application" (RNN explicitly named).
- AM020 Ch. 2 p. 27: "Adds decompression to the two S2MM channels" + "Adds compression to the two MM2S channels".
- AM020 Ch. 5 p. 74: memtile compression / decompression.
- lib/Dialect/AIE/Util/aie_registers_aie2.json: BD field Enable_Compression bit ("Only effective if channel has (de)compression enabled").

AIE2P caveat
------------

AM020 documents AIE-ML's supported sparsity patterns. AIE2P inherits the compute-tile DMA Enable_Compression bit but the accepted N:M patterns documents the divergence and falls back to dense weights. SparseFifo itself remains usable on AIE-ML targets even if AIE2P diverges; the divergence is silicon/runtime, not API.

-------------------------

Only the SparseFifo slot region of python/iron/__init__.py is touched (CascadeFifo / PacketFifo / AccumFifo / MemtileAggregator stubs untouched per the parallel-agent serialization rule). Atomic-heredoc-commit

Signed-off-by: Matt Davis <matt@opensensor.io>
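The structural rules and the round-trip property can be sketched in plain Python. This is a toy model under stated assumptions: the function names are hypothetical, and the (position, value) pair list used as the compressed form is an illustration, not the hardware's actual wire format or the code in python/iron/sparse.py.

```python
from typing import List, Sequence, Tuple

# The set the commit message calls AM020-verified.
VERIFIED_PATTERNS = {(1, 2), (1, 4), (2, 4)}

Group = List[Tuple[int, int]]  # (position within group, value) pairs


def validate_pattern(n: int, m: int, allow_unverified: bool = False) -> None:
    """Apply the structural rules, then the verified-set check."""
    if not isinstance(n, int) or not isinstance(m, int):
        raise TypeError("N and M must be int")
    if m < 2 or not 0 < n < m:
        raise ValueError(f"need M >= 2 and 0 < N < M, got N={n}, M={m}")
    if (n, m) not in VERIFIED_PATTERNS and not allow_unverified:
        raise ValueError(f"{n}:{m} is not in the verified set")


def compress(dense: Sequence[int], n: int, m: int) -> List[Group]:
    """Keep at most N non-zeros per group of M (toy position-map format)."""
    if len(dense) % m:
        raise ValueError("length must be a multiple of M")
    groups: List[Group] = []
    for start in range(0, len(dense), m):
        nonzero = [(i, v) for i, v in enumerate(dense[start:start + m]) if v != 0]
        if len(nonzero) > n:
            raise ValueError("input is not N:M compliant")
        groups.append(nonzero)
    return groups


def decompress(groups: Sequence[Group], m: int) -> List[int]:
    """Re-inject zeros at the positions the map leaves empty."""
    dense: List[int] = []
    for nonzero in groups:
        block = [0] * m
        for pos, value in nonzero:
            block[pos] = value
        dense.extend(block)
    return dense


weights = [0, 3, 0, 0, 5, 0, 0, 7]  # 2:4 compliant: at most 2 non-zeros per 4
validate_pattern(2, 4)
restored = decompress(compress(weights, 2, 4), 4)
```

On compliant input the round trip is lossless, which is the property the renamed test `test_nm_compression_roundtrip_is_lossless_on_compliant_input` is described as checking for the format itself.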

PacketFifoHandle.all_of_endpoints() now returns endpoint-typed objects (was: raw Tile objects). AccumFifoHandle.__init__ now chains through super().__init__() (was: bypassed, leaving _object_fifo unset).

Fix shape (Option A: normalize the subclass contract; the alternative -- relaxing iron/program.py's walk -- moves the divergence into the consumer and is easier to regress). Both subclasses now satisfy the same contract that ObjectFifoHandle's all_of_endpoints() exposes: a list of objects each carrying a .tile attribute, which is what iron/program.py:81's tile-collection walk ([e.tile for e in fifo.all_of_endpoints()]) requires.

Implementation note for AccumFifoHandle: rather than chain super().__init__() (which would require a memref-typed obj_type and ObjectFifo depth/dims that AccumFifo intentionally does not have -- a cascade transfer is one accumulator per cycle, no circular buffer), this commit overrides all_of_endpoints() on the subclass to walk the AccumFifo's prod/cons handles directly. Same end-state (callers see endpoint-typed objects with a .tile attribute), narrower change.

PacketFifoHandle: all_of_endpoints() now walks the parent PacketFifo's prod_handles + cons_handles, surfacing each handle's .endpoint (the live Worker instance attached by Worker.fn_args registry dispatch) when set, falling back to ObjectFifoEndpoint(tile) wrappers when the topology is inspected pre-Worker construction.

Coverage: new fork-internal test file test/iron/test_fifo_handle_program_walk.py (8 tests) exercises both broken paths -- the literal expression at iron/program.py:81 -- one set per handle type, plus a cross-cutting roll-up that walks the live registry.

Existing test verdict: 58/59 pass in test/iron/test_packet_fifo.py + test/iron/test_accum_fifo.py. The one failing test (test_packet_fifo_accepts_explicit_packet_ids) fails on main as well -- it constructs PacketFifo with packet_ids=[0x10, 0x20] but 0x20=32 is rejected by the 5-bit pkt_id range cap [0, 31]. Pre-existing, unrelated to this fix.

Closes:

Signed-off-by: Matt Davis <matteius@gmail.com>
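The endpoint contract described above (every element exposes `.tile`, with a wrapper fallback before a Worker is attached) can be modelled with a few dataclasses. All class names here are hypothetical stand-ins; in particular `EndpointWrapper` only imitates the role the commit message assigns to `ObjectFifoEndpoint(tile)`.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass(frozen=True)
class ToyTile:
    col: int
    row: int


@dataclass
class EndpointWrapper:  # stand-in for the pre-Worker fallback wrapper
    tile: ToyTile


@dataclass
class ToyWorker:  # stand-in for a live Worker; it also carries a tile
    tile: ToyTile


@dataclass
class ToyHandle:
    tile: ToyTile
    endpoint: Optional[ToyWorker] = None  # set later by fn_args dispatch


class ToyPacketFifo:
    def __init__(self, prod_handles: List[ToyHandle], cons_handles: List[ToyHandle]):
        self.prod_handles = prod_handles
        self.cons_handles = cons_handles

    def all_of_endpoints(self) -> List[Union[ToyWorker, EndpointWrapper]]:
        # Prefer the attached worker; otherwise wrap the bare tile so that
        # every returned object still has a .tile attribute.
        return [
            h.endpoint if h.endpoint is not None else EndpointWrapper(h.tile)
            for h in self.prod_handles + self.cons_handles
        ]


prod = ToyHandle(ToyTile(0, 2), endpoint=ToyWorker(ToyTile(0, 2)))
cons = ToyHandle(ToyTile(0, 3))  # inspected before any worker is attached
fifo = ToyPacketFifo([prod], [cons])

# The same expression shape the commit message quotes from iron/program.py.
tiles = [e.tile for e in fifo.all_of_endpoints()]
```

Because both branches return an object with `.tile`, the tile-collection walk works whether or not workers have been constructed yet.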

Adds VariableRateFifo + VariableRateFifoHandle, a primitive for streams where the producer chooses per object whether to forward or discard the buffer. Sibling to PacketFifo: PacketFifo handles the N-into-1 fan-in side of variable-rate dataflow, VariableRateFifo handles the single-producer conditional-forward side.

API:

    fifo = VariableRateFifo(producer_tile, consumer_tile, depth, dtype)
    with worker.runtime() as r:
        out = fifo.acquire_producer()
        if predicate(out):
            fifo.forward_producer()   # publishes the slot
        else:
            fifo.discard_producer()   # frees the slot without publishing

Files:

- python/iron/variable_rate.py: VariableRateFifo + handle classes.
- python/iron/VARIABLE_RATE_DESIGN.md: design notes covering the ObjectFifo extension, the producer/consumer state machine, and the parallel to PacketFifo.
- python/iron/__init__.py: register VariableRateFifo in the public surface alongside the other specialized FIFO subclasses.
- programming_examples/basic/variable_rate_filter/: minimal end-to-end example showing a filter that forwards only objects whose first byte is even (drops ~50% of input), with a CPU reference and byte-equality check against the kernel output.

Note: the MLIR-side lowering passes that observe discardable attrs (skip-unroll for variable-rate fifos, attr propagation through split-fifo) are intentionally not in this PR; they will land separately once the Python API surface is reviewed.

Signed-off-by: Matt Davis <matt@opensensor.io>

``aie.compress_mm2s`` / ``aie.decompress_s2mm`` discardable attrs to the lowered ``aie.objectfifo.create`` op (per ``python/iron/sparse.py``), but the BD-emit pass at ``lib/Dialect/AIEX/Transforms/AIEDmaToNpu.cpp:655`` hardcoded ``Enable_Compression = 0`` on the AIE2/AIE2P tile DMA BD word. As a (accuracy contract MET offline) but the wire-level DMA stayed at the 54.8 MB dense baseline rather than the ~27 MB 2:4 N:M-compressed target.

Plumbing path -- three-hop discardable-attr propagation, no TableGen schema change:

1. ``AIEObjectFifoStatefulTransform.cpp::createBd[Block]``: when lowering ``aie.objectfifo.create`` to per-tile DMABDOps, read the SparseFifo discardable attrs from the originating ObjectFifoCreateOp (which is about to be erased at the end of this pass). If ``aie.compress_mm2s = true`` and the BD's channel direction is MM2S, OR ``aie.decompress_s2mm = true`` and the direction is S2MM, attach the boolean ``aie.enable_compression = true`` on the new DMABDOp. ``createBd`` now returns the DMABDOp so the caller can decorate it.
2. ``AIEDMATasksToNPU.cpp::rewriteSingleBD``: when converting a DMABDOp inside a ``aiex.dma_configure_task`` body to an ``aiex.npu.writebd`` op, copy the ``aie.enable_compression`` discardable attr from the source DMABDOp onto the new NpuWriteBdOp.
3. ``AIEDmaToNpu.cpp::WriteBdToBlockWritePattern`` (the line-655 callsite the gap report named): when packing the ``aiex.npu.writebd`` op into the AIE2 tile-DMA 6-word block-write payload, read ``aie.enable_compression`` off the NpuWriteBdOp and OR bit 31 (``Enable_Compression``) of ``DMA_BDX_1`` accordingly.

Default branch is the pre-existing behaviour: no attrs on the ObjectFifoCreateOp -> no attr on the DMABDOp -> no attr on the NpuWriteBdOp -> ``Enable_Compression = 0``. The negative case is regression-protected by the new lit test.

AM020 references (cited in ``python/iron/sparse.py`` module docstring):

- AM020 Ch. 2 p. 27: compute-tile DMA adds compression to two MM2S channels and decompression to two S2MM channels.
- AM020 Ch. 1 p. 15: AIE-ML supported N:M structured-sparsity patterns (1:2, 1:4, 2:4) for CNN/RNN application; AIE2P caveat documented in ``aie.iron.sparse``.
- ``lib/Dialect/AIE/Util/aie_registers_aie2.json`` BD field ``Enable_Compression``: "Enable Compression (MM2S), decompression (S2MM). Only effective if channel has (de)compression enabled".

This change leaves the AM020-verified-pattern check (and the ``allow_unverified=True`` escape hatch) at the IRON Python layer asked for compression and the lowering produced the discardable attrs, the bit gets flipped. Pattern-arch validation stays out of the C++ pass.

New lit test ``test/Conversion/DmaToNpu/dma_to_npu_sparse_compression.mlir``:

- Positive case: ``aiex.npu.writebd { ..., aie.enable_compression = true }`` -> ``DMA_BDX_1`` second word is ``0xD53D0000`` = ``3577544704`` (bit 31 set on top of the dense ``0x553D0000``/``1430061056`` reference shared with ``dma_to_npu_core_tile.mlir``).
- Negative case (regression-protect): no ``aie.enable_compression`` attr -> ``DMA_BDX_1`` stays at ``1430061056``.

Build verification: each touched ``.cpp`` compiles cleanly under the existing Ninja rules (tested ``obj.AIETransforms.dir/AIEObjectFifoStatefulTransform.cpp.o``, ``obj.AIEXTransforms.dir/AIEDMATasksToNPU.cpp.o``, and ``obj.AIEXTransforms.dir/AIEDmaToNpu.cpp.o`` targets); the static libraries ``libAIETransforms.a`` and ``libAIEXTransforms.a`` link successfully. The final ``aie-opt`` link fails locally on a host-config issue (lld absent; pre-existing CMake configuration uses ``-fuse-ld=lld``) -- handing off to the orchestrator's wheel-rebuild

Signed-off-by: Matt Davis <matteius@gmail.com>
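The two constants in the lit test differ only in bit 31, which a few lines of arithmetic confirm. This sketch reproduces just the bit manipulation; `pack_bd_word1` is a hypothetical helper name and does not model the rest of the 6-word block-write payload.

```python
ENABLE_COMPRESSION_BIT = 31          # bit position named in the commit message
DENSE_BD_WORD1 = 0x553D0000          # dense reference value from the lit test


def pack_bd_word1(base_word: int, enable_compression: bool) -> int:
    """OR the Enable_Compression bit into a 32-bit BD word when requested."""
    if enable_compression:
        return base_word | (1 << ENABLE_COMPRESSION_BIT)
    return base_word


compressed_word = pack_bd_word1(DENSE_BD_WORD1, enable_compression=True)
dense_word = pack_bd_word1(DENSE_BD_WORD1, enable_compression=False)
```

Here `compressed_word` is `0xD53D0000` (3577544704 in decimal) and `dense_word` stays at `0x553D0000` (1430061056), matching the positive and negative cases the lit test checks.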

…r propagation

Two changes mirroring the SparseFifo discardable-attr propagation:

1. unrollForLoops: skip ObjectFifoAcquireOp on creators carrying aie.variable_rate. The producer's loop body for a variable-rate fifo contains a conditional acquire/release that the LCM-unroll math cannot model; the runtime-counter machinery handles asymmetric rates correctly without unrolling. If the loop has only variable-rate accesses, the loop is left alone (treated like a loop with no objectfifo accesses).
2. Split-fifo attr propagation: copy aie.variable_rate from the original ObjectFifoCreateOp to the consumer-side fifo so both halves carry the marker.

Adds two lit tests:

- variable_rate_fifo_attr_propagation.mlir
- variable_rate_fifo_skip_unroll.mlir
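The skip rule can be pictured with a simplified model of the unroll-factor computation. This is a sketch resting on an assumption: it treats the unroll factor as the LCM of the depths of the fifos accessed in the loop, which is a simplification for illustration and not a transcription of the C++ pass. The `FifoAccess` type and `unroll_factor` function are hypothetical.

```python
from dataclasses import dataclass
from math import lcm
from typing import Optional, Sequence


@dataclass
class FifoAccess:
    depth: int                   # number of objects in the fifo
    variable_rate: bool = False  # models the aie.variable_rate marker


def unroll_factor(accesses: Sequence[FifoAccess]) -> Optional[int]:
    """LCM of fixed-rate fifo depths; None means leave the loop alone."""
    depths = [a.depth for a in accesses if not a.variable_rate]
    if not depths:
        # Only variable-rate (or no) accesses: treated like a loop with no
        # objectfifo accesses, so no unrolling is attempted.
        return None
    return lcm(*depths)


mixed = unroll_factor([FifoAccess(2), FifoAccess(3, variable_rate=True)])
fixed = unroll_factor([FifoAccess(2), FifoAccess(3)])
only_variable = unroll_factor([FifoAccess(3, variable_rate=True)])
```

In this model a variable-rate fifo simply drops out of the LCM, so it no longer inflates the unroll factor, and a loop touching only variable-rate fifos is not unrolled at all. `math.lcm` requires Python 3.9 or newer.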

- MemtileAggregator: drop the layout='window' parameter that was reserved-but-unimplemented (raised NotImplementedError on use). Slab is the only supported layout; window-major callers should drop down to ObjectFifo.prod().join with explicit dims_to_stream. Removes layout from __init__ signature, _VALID_LAYOUTS constant, validation, the .layout property, the docstring discussion, the __str__, and the two associated tests.

- AccumFifo precision tests: rename to reflect what they actually measure (numpy reference baselines for the LSTM workload, not tests of the AccumFifo lowering). Drop the 'load-bearing falsifiable claim' framing from docstrings.

    test_accum_fifo_invariant_hits_1e_minus_5_precision_target
      -> test_fp32_lstm_reference_matches_pytorch
         (a fixture sanity check; FP32 numpy LSTM matches FP32 pytorch)
    test_accum_fifo_beats_bf16_baseline_by_three_orders_of_magnitude
      -> test_fp32_reference_beats_bf16_writeback_by_3oom
         (workload sensitivity characterization, not AccumFifo behaviour)

  Helper renamed: _lstm_cell_with_accum_fifo -> _lstm_cell_fp32_reference

- SparseFifo round-trip test: rename + reframe.

    test_decompressed_matmul_bit_equal_to_sparse_reference
      -> test_nm_compression_roundtrip_is_lossless_on_compliant_input

  Comment block above clarifies these are property tests for the N:M compression format itself (numpy reference), not tests of SparseFifo's lowering or silicon behaviour.

Address review issue 8: the prior example always called acquire/release on the producer side and never invoked discard(1), so it didn't actually demonstrate variable-rate behaviour. Restructure the example kernel to a Python-level deterministic alternating-skip pattern: every other window is forwarded via the C++ copy kernel + acquire/release; the alternate window is dropped via out_handle.discard(1). The producer's loop now has asymmetric acquire/release counts on the variable-rate output fifo, which is what the aie.variable_rate=true marker tells the lowering pass to expect. The C++ kernel is simplified to a void window-copy (skip decision moved to the Python layer). Module comment is rewritten to describe the deterministic-skip pattern accurately, with a note that predicate-decided runtime skips would require a first-class scf.if lowering not currently in IRON Python.
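A host-side reference for the alternating-skip pattern makes the asymmetric counts concrete. This is a minimal sketch with two assumptions that the commit message does not state: that even-indexed windows are the forwarded ones, and that forwarding is a plain byte copy. The function name is hypothetical and is not the example's actual CPU reference.

```python
from typing import List, Sequence, Tuple


def alternating_skip_reference(windows: Sequence[bytes]) -> Tuple[List[bytes], int]:
    """Return (forwarded windows, number of discards) for an alternating skip.

    Assumption: window 0 is forwarded, window 1 is discarded, and so on.
    """
    forwarded: List[bytes] = []
    discards = 0
    for index, window in enumerate(windows):
        if index % 2 == 0:
            forwarded.append(bytes(window))  # models copy kernel + release
        else:
            discards += 1                    # models out_handle.discard(1)
    return forwarded, discards


windows = [bytes([i] * 4) for i in range(5)]
forwarded, discards = alternating_skip_reference(windows)
```

For five input windows the consumer sees three and the producer discards two, so the producer-side and consumer-side counts differ, which is the asymmetry the `aie.variable_rate` marker is described as signalling.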

…Snapshot export)

- register_fifo_handle: re-registering the *same* callable for the same class is now an idempotent no-op. This makes module reloads, repeated imports, and test harnesses safe. Registering a *different* handler still raises ValueError to catch accidental override.
- _RegistrySnapshot: removed from __all__. The leading underscore already signals 'test-internal use only'; the class is still accessible via direct import for tests that need it.
- Updated test_double_registration_raises -> split into test_double_registration_with_different_handler_raises + test_double_registration_with_same_handler_is_idempotent.

…FifoHandle

Address review issue 7: AccumFifoHandle and PacketFifoHandle deliberately do not call super().__init__() because ObjectFifoHandle's constructor requires an ObjectFifo with semantics (depth, dims_from_stream_per_cons, _get_endpoint) that AccumFifo and PacketFifo do not have. They instead stub the attributes ObjectFifoHandle exposes as properties directly.

Add an explicit class-docstring '.. note::' section in each documenting:

- which attributes are stubbed
- why super() is bypassed
- that all_of_endpoints (which traverses _object_fifo) is overridden
- that the proper long-term fix is a shared narrower base class

This is a documentation fix, not a behavioural change. The shared base class refactor is intentionally out of scope for the initial primitive landing; the bypass pattern is rare in upstream IRON code, and explicit documentation keeps the next maintainer from being surprised.

(SparseFifoHandle and VariableRateFifoHandle inherit normally from ObjectFifoHandle and do call super().__init__; only AccumFifoHandle and PacketFifoHandle exhibit the bypass pattern.)

…e context

Address review issue 5. The previous test suite covered the construction surface, registry integration, and host-side behavioural simulation, but didn't exercise PacketFifo.resolve() against a real MLIR context, so a mismatch with the dialect's packetflow() signature would not have been caught.

Add two new tests in test/iron/test_packet_fifo.py:

- test_resolve_emits_packetflow_op_in_module: builds a 2-producer / 1-consumer PacketFifo inside an aie.device body, calls resolve(), and asserts the resulting MLIR contains 2 aie.packetflow ops (one per producer).
- test_resolve_idempotent_does_not_emit_twice: verifies the resolve() reentrancy guard.

Pattern mirrors the existing test_resolve_emits_cascade_flow_op_in_module in test_cascade_fifo.py. Verified the call-site signature matches dialects.aie.packetflow's __init__ (pkt_id, source, source_port, source_channel, dests, keep_pkt_header).

Copilot

Pull request overview

This PR extends the IRON Python API with multiple specialized FIFO abstractions (e.g., cascade, packet, sparse, variable-rate, memtile aggregation) and wires them through MLIR lowering/passes so they work end-to-end, while keeping the existing ObjectFifo API unchanged. It also introduces an extensible Worker.fn_args dispatch registry so new *FifoHandle subclasses can integrate without editing worker.py.

Changes:

Add fifo_handle_registry and update Worker.__init__ to use registry-driven fn_args dispatch (with ObjectFifoHandle pre-registered for backward compatibility).
Add new IRON primitives (notably SparseFifo, VariableRateFifo, CascadeFifo, MemtileAggregator) and corresponding tests/examples.
Extend pass-side plumbing for sparse compression (aie.enable_compression) and variable-rate unroll skipping / split-fifo attr propagation.
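To illustrate the unroll skipping mentioned in the last item, here is a small sketch. It assumes the unroll factor is the least common multiple of the depths of the fifos accessed in a loop, and it models fifos as plain dictionaries with a hypothetical `variable_rate` flag; the real pass operates on MLIR ops in C++, so this only shows the arithmetic.

```python
from math import lcm  # Python 3.9+


def unroll_factor(fifos):
    """LCM of fifo depths, ignoring fifos marked variable-rate."""
    depths = [f["depth"] for f in fifos if not f.get("variable_rate", False)]
    return lcm(*depths)  # lcm() with no arguments returns 1


fifos = [
    {"name": "in", "depth": 2},
    {"name": "mid", "depth": 3},
    {"name": "out", "depth": 5, "variable_rate": True},
]
factor = unroll_factor(fifos)
factor_without_skip = lcm(*(f["depth"] for f in fifos))
```

With the variable-rate fifo excluded the factor stays at 6 instead of 30, and a loop touching only variable-rate fifos is not unrolled at all.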

Reviewed changes

Copilot reviewed 29 out of 29 changed files in this pull request and generated 8 comments.


File	Description
test/objectFifo-stateful-transform/variable_rate_fifo_skip_unroll.mlir	Lit test for excluding variable-rate fifos from LCM-based loop unrolling.
test/objectFifo-stateful-transform/variable_rate_fifo_attr_propagation.mlir	Lit test for split-fifo propagation with `aie.variable_rate`.
test/objectFifo-stateful-transform/sparse_fifo_split_attr_propagation.mlir	Lit test for split-fifo propagation of sparse compression attrs.
test/iron/test_worker_fifo_handle_extension.py	Pytest coverage for the new registry dispatch API (idempotency, precedence, snapshotting).
test/iron/test_sparse_fifo.py	Pytest coverage for SparseFifo surface + validation + lowering metadata behavior.
test/iron/test_packet_fifo.py	Pytest coverage for PacketFifo surface/registry/toy behavior + lowering smoke tests.
test/iron/test_memtile_aggregator.py	Pytest coverage for MemtileAggregator construction and equivalence checks (currently has syntax/API issues).
test/iron/test_fifo_handle_program_walk.py	Contract tests for `*FifoHandle.all_of_endpoints()` to support `Program` walks.
test/iron/test_cascade_fifo.py	Pytest coverage for CascadeFifo surface + lowering (`aie.cascade_flow`) + idempotency.
test/iron/test_accum_fifo.py	Pytest coverage for AccumFifo surface + warnings + lowering + reference fixtures.
test/Conversion/DmaToNpu/dma_to_npu_sparse_compression.mlir	Lit test pinning the final hop of sparse compression bit emission in NPU BD words.
python/iron/worker.py	Switch `fn_args` handling to registry-driven dispatch via `dispatch_fn_arg`.
python/iron/variable_rate.py	Implement VariableRateFifo + handle + attr pinning + registry registration.
python/iron/sparse.py	Implement SparseFifo + handle + sparsity validation + attr pinning + registry registration.
python/iron/memtile.py	Implement MemtileAggregator helper around `ObjectFifo.prod().join()` topology.
python/iron/dataflow/fifo_handle_registry.py	New registry module: register/unregister/list/dispatch + snapshot helper for tests.
python/iron/dataflow/__init__.py	Pre-register `ObjectFifoHandle` handler to preserve prior Worker bookkeeping behavior.
python/iron/cascade.py	Implement CascadeFifo lowering to `aie.cascade_flow` and placement-facing surface.
python/iron/__init__.py	Export the new primitives/handles from `aie.iron`.
python/iron/VARIABLE_RATE_DESIGN.md	Design doc capturing rationale and lowering model for VariableRateFifo.
programming_examples/basic/variable_rate_filter/variable_rate_filter.py	End-to-end example exercising `VariableRateFifoHandle.discard(1)`.
programming_examples/basic/variable_rate_filter/filter_first_byte_even.cc	Companion C++ kernel (currently a pure copy) for the example.
programming_examples/basic/variable_rate_filter/README.md	Documentation for the variable-rate example and expected MLIR markers.
programming_examples/basic/variable_rate_filter/Makefile	Build plumbing for the variable-rate example.
lib/Dialect/AIEX/Transforms/AIEDmaToNpu.cpp	Emit `Enable_Compression` bit (word[1] bit 31) when `aie.enable_compression` is present.
lib/Dialect/AIEX/Transforms/AIEDMATasksToNPU.cpp	Forward `aie.enable_compression` from `aie.dma_bd` to `aiex.npu.writebd`.
lib/Dialect/AIE/Transforms/AIEObjectFifoStatefulTransform.cpp	Propagate sparse compression attrs into BDs, propagate attrs through split-fifo, and skip variable-rate fifos in LCM-unroll.
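The AIEDmaToNpu row above places `Enable_Compression` at bit 31 of word[1]. A minimal sketch of that bit manipulation follows; the helper name and the three-word list are hypothetical, and the other bits of the BD words are treated as opaque.

```python
ENABLE_COMPRESSION_BIT = 31  # word[1] bit 31, per the file summary above
WORD_MASK = 0xFFFFFFFF


def with_enable_compression(bd_words, enabled):
    """Return a copy of bd_words with the compression bit set or cleared."""
    words = list(bd_words)
    if enabled:
        words[1] |= 1 << ENABLE_COMPRESSION_BIT
    else:
        words[1] &= ~(1 << ENABLE_COMPRESSION_BIT)
    words[1] &= WORD_MASK  # keep the value within 32 bits
    return words


plain = [0x0, 0x1234, 0x0]
compressed = with_enable_compression(plain, True)
restored = with_enable_compression(compressed, False)
```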


The earlier mass-strip pass over project-internal task IDs left several files with structural damage (orphaned triple-quote opens swallowing function bodies, broken comment fragments, and one missing list entry that made SparseFifo() crash by default). All flagged by Copilot; none surfaced earlier because most are inside docstrings (still syntactically valid) or invariants tested only when an actual handle is constructed.

Fixes:

- test/iron/test_memtile_aggregator.py: a stray '"""' on _make_t53m_aggregator's first body line silently turned the helper's body PLUS five subsequent test functions into one giant docstring, leaving the helper as 'return None' and dropping test_producer_returns_object_fifo_handle, test_producers_returns_n_handles_in_order, test_producer_index_out_of_range_raises, test_consumer_returns_object_fifo_handle, and test_offsets_match_flat_concat from pytest collection. Restored a proper one-line docstring + function body; reconstructed the test_offsets_match_flat_concat function body (the closing '"""' on line 176 was its docstring tail). Also dropped agg.layout assertion that referenced the now-removed property.
- python/iron/sparse.py: added (2, 4) back to _AM020_VERIFIED_NM_PATTERNS. The strip ate the line, and the defaults are N=2 M=4, so SparseFifo() with no args raised on every construction.
- python/iron/worker.py: rewrote the FIFO-handle dispatch comment block (mid-sentence fragments + unmatched parenthesis from line drops) into complete sentences.
- test/objectFifo-stateful-transform/variable_rate_fifo_attr_propagation.mlir: fixed the orphaned 'or wedge under the split-fifo + cross-column path that the' fragment in the header; restored a complete sentence describing what the lit test verifies.
- programming_examples/basic/variable_rate_filter/{variable_rate_filter.py,README.md,filter_first_byte_even.cc}: prior commit changed the example to a Python-level alternating skip pattern (calling discard(1) in the skip branch), but the comments + README still claimed the C++ kernel did a per-window predicate check. Updated both to describe the deterministic Python-skip + window-copy kernel that the code actually implements; noted that runtime-decided per-window skips would require a first-class scf.if lowering.
- Final residual sweep: removed remaining 'Phase 1' / 'bio-on-XDNA' references from accum.py / dataflow/__init__.py / memtile.py / test_cascade_fifo.py / test_worker_fifo_handle_extension.py.
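The first fix above describes a failure mode that is easy to reproduce: a stray triple quote keeps the file syntactically valid while swallowing later functions. The sketch below uses a made-up source string (not the real test file) and the standard `ast` module to show that only the helper survives parsing and that it returns None.

```python
import ast

# Made-up damaged file: the stray opening quote on the helper's first body
# line runs until the next triple quote, which is another docstring's tail.
DAMAGED_SOURCE = '''
def make_aggregator():
    """
    agg = object()
    return agg

def test_helper_returns_object():
    assert make_aggregator() is not None

def test_offsets_docstring_tail():
    Offsets match the flat concatenation.
    """
    assert True
'''

tree = ast.parse(DAMAGED_SOURCE)  # parses without a SyntaxError
collected = [node.name for node in tree.body if isinstance(node, ast.FunctionDef)]

namespace = {}
exec(compile(tree, "<damaged>", "exec"), namespace)
result = namespace["make_aggregator"]()
```

Here `collected` holds only the helper, so a test runner would silently collect zero tests from this file; a check on the collected test count is one way to catch this kind of damage.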

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 30 out of 30 changed files in this pull request and generated 8 comments.


Adds a basic programming example that measures per-launch dispatch overhead on AIE2P silicon by exposing a trivial single-tile passthrough kernel under a parameterised IRON topology. The example varies two orthogonal knobs (n_chunks and dense_bytes) so a host driver script can regress per-launch wall time against each and attribute the per-launch floor to:

(a) xrt::run.wait() return-path overhead
(b) instruction-stream upload cost
(c) per-chunk shim-DMA setup overhead × N_CHUNKS
(d) AIE2P firmware dispatcher per-launch handshake

Files:

- dispatch_overhead_bisector.py: IRON topology with N_CHUNKS separate fill/drain tasks per launch (so each chunk lowers to its own shim BD; the IRON access-pattern collapse is opted out of via per-chunk TensorAccessPattern).
- passthrough.cc: 64-byte-vectorised memcpy compute kernel (no arithmetic; pushes compute to the noise floor so the host-runner wall time captures dispatch-layer cost).
- test.cpp: host runner; reports per-iteration wall-time distribution as KEY=VALUE lines for machine parsing, plus an optional per-iter CSV.
- Makefile: builds one variant by default; `all-variants` builds the representative bisection sweep.
- README.md: methodology + build/run/sweep instructions.
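A host driver for such a sweep might parse the runner's KEY=VALUE lines and fit a line of wall time against n_chunks, reading the slope as per-chunk cost and the intercept as the per-launch floor. The sketch below is an assumption-laden illustration: the key names `N_CHUNKS` and `MEDIAN_US` are hypothetical, and the numbers are synthetic placeholders, not measurements.

```python
def parse_key_values(text):
    """Parse KEY=VALUE lines into a dict, skipping lines without '='."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            result[key.strip()] = value.strip()
    return result


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic runner output, one string per variant (placeholder values).
SYNTHETIC_RUNS = [
    "N_CHUNKS=1\nMEDIAN_US=105.0",
    "N_CHUNKS=2\nMEDIAN_US=110.0",
    "N_CHUNKS=4\nMEDIAN_US=120.0",
    "N_CHUNKS=8\nMEDIAN_US=140.0",
]

records = [parse_key_values(run) for run in SYNTHETIC_RUNS]
xs = [float(r["N_CHUNKS"]) for r in records]
ys = [float(r["MEDIAN_US"]) for r in records]
per_chunk_us, floor_us = fit_line(xs, ys)
```

Separating items (a), (b), and (d) would need further knobs or instrumentation; a single regression against n_chunks only isolates the per-chunk term (c) from the combined remainder.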

Six items, all addressed:

- python/iron/cascade.py: CascadeFifo.resolve() previously set self._resolving = True before the tile-op precondition checks and never cleared it on exception, leaving the instance permanently non-resolvable. Move precondition checks BEFORE setting _resolving, and wrap the actual emission in try/finally so failures don't latch the flag. Also add an idempotency guard on self._op so a successful resolve() is a no-op on re-entry.
- python/iron/sparse.py: deduplicate the (2, 4) entry in _AM020_VERIFIED_NM_PATTERNS (added twice during the previous fix; set semantics dedupe at runtime but the literal is misleading).
- python/iron/variable_rate.py: fix the malformed "Architectural references" bullet list in the module docstring (orphan continuation lines + truncated bullet from the earlier strip pass).
- test/iron/test_fifo_handle_program_walk.py: replace the '"""+ : contract tests..."""' diff-artifact module docstring with a clean one-liner.
- programming_examples/basic/variable_rate_filter/{variable_rate_filter.py,README.md}: replace hardcoded /home/matteius/... paths in the build snippets with placeholder syntax (<path/to/...>).
- python/iron/{accum.py,VARIABLE_RATE_DESIGN.md}, test/iron/test_accum_fifo.py: replace remaining "T7-IRON" references (the strip regex required a digit after the dot, which ".IRON" fails to match) with hardware-grounded language about the well-trodden cascade-stream path.
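The resolve() fix in the first item follows a common guard ordering: idempotency check, then preconditions, then the in-progress flag wrapped in try/finally. The toy class below sketches that ordering; `ToyCascade` and its fields are hypothetical, and the tuple stands in for the emitted op.

```python
class ToyCascade:
    """Hypothetical stand-in showing the guard ordering, not CascadeFifo."""

    def __init__(self, src_tile, dst_tile):
        self.src_tile = src_tile
        self.dst_tile = dst_tile
        self._resolving = False
        self._op = None
        self.emit_count = 0

    def resolve(self):
        if self._op is not None:
            return self._op  # already resolved: no-op on re-entry
        # Preconditions run BEFORE the flag is set, so a failure here
        # cannot leave the instance latched as "resolving".
        if self.src_tile is None or self.dst_tile is None:
            raise ValueError("both tiles must be set before resolve()")
        if self._resolving:
            raise RuntimeError("resolve() re-entered while in progress")
        self._resolving = True
        try:
            self.emit_count += 1
            self._op = ("cascade_flow", self.src_tile, self.dst_tile)
        finally:
            self._resolving = False  # cleared even if emission raises
        return self._op


fifo = ToyCascade(src_tile=None, dst_tile=(0, 3))
try:
    fifo.resolve()
except ValueError:
    pass
fifo.src_tile = (0, 2)   # fix the precondition and retry
first = fifo.resolve()
second = fifo.resolve()
```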

Three more subsystems promoted to the public bionpu package:

src/bionpu/data/:

- canonical_sites.py (276 LOC): Cas-OFFinder TSV normaliser. Now the canonical home; bionpu.verify.crispr imports from here. Adds serialize_canonical() helper so verify can compute SHA-256 over the canonical wire form without touching the filesystem.
- fetchers/{doench_2016,guide_seq,pod5_hg002,reference_genomes}.py + fetchers/__init__.py (495 LOC framework): per-dataset public fetchers with SHA-pinning; the framework documents the requirements every dataset entry must satisfy.
- load_smoke.py (113 LOC): in-repo smoke fixture loaders.

src/bionpu/quant/:

- calibrate.py (215 LOC): ONNX quantization calibration driver (thin wrapper around onnxruntime.quantization).
- passport.py (196 LOC): quantization passport; every quantized model in the repo carries one (calibration source, op recipe, reproducibility hash).
- peano_export.py (102 LOC): quantized ONNX -> MLIR-AIE -> xclbin lowering hook.

src/bionpu/iron_extensions/:

- cascade_stream.py (387 LOC): cascade-chain IRON helper. Largely superseded by mlir-aie's CascadeFifo (Xilinx/mlir-aie#3039) but still consumed by 5 genetics files.

Internal bits NOT migrated (kept in genetics-private):

- bionpu/report/*: gaps-yaml aggregator + writeup pipeline; tightly coupled to internal task-tracking format.

Refactor:

- src/bionpu/verify/_crispr_canonical.py removed; bionpu.verify.crispr now imports from bionpu.data.canonical_sites (the public canonical home).

License: Apache-2.0 + LLVM exception → GPL-3.0 across all migrated files. Project-internal task IDs / outer-repo paths scrubbed. All 18 verify-harness tests still pass; 71/71 Python files in the public package parse.

…le findings

The scrubber on the initial migrations missed several patterns:

- Hyphenated task IDs (T7-IRON, T1-swarm): the regex required \.\d after T<n>
- Investigation-phase labels (Followup A/B/C/E/G, Stage \d)
- Tilde-rooted absolute paths (~/xdna-bringup/...)
- Project-internal filename refs (gaps.yaml, umbrella PRD, gap-id)

Pass:

- 47 files mass-scrubbed of the missed patterns (34 code + 13 docs).
- src/bionpu/iron_extensions/INVENTORY.md: rewritten from a 405-line internal investigation report into a brief module note. The cascade-stream feasibility outcome is now upstream aie.iron.CascadeFifo (Xilinx/mlir-aie#3039); this module is documented as a back-compat shim that will be removed once the upstream merge is the mlir-aie floor.
- src/bionpu/kernels/basecalling/lstm_cell_bf16_acc_cascade/DESIGN.md: rewritten from a 406-line investigation log (Followup A-G stages, WEDGE-vs-PASS status flips, hypothesis falsification entries) into a 60-line design note covering the architectural rationale (FP32-cascade vs bf16-writeback precision wall, AM020 references, topology diagram, file inventory, known limitations).
- README.md: dropped the dead 'v0.1 release notes' link to a tag that doesn't exist; rewrote the Status section to point at the new docs/STATUS.md per-subsystem inventory.
- docs/STATUS.md: NEW. Explicit per-subsystem table of what works end-to-end vs what's an extracted module that needs a v0.2 driver to drive on hardware. Distinguishes ✅ working / ✅ extracted / ⚠️ deprecated / ⚠️ v0.2 scope.
- docs/REPRODUCE.md: rewrote from placeholder text into a real reproduction recipe: hardware prerequisites, software install, pip-install bionpu, byte-equality smoke check + negative control, manual kernel build, host-runner invocation, energy-methodology pointer, sanity-log discipline.
- CHANGELOG.md: NEW. v0.1 release notes covering what landed and what's deferred to v0.2.

Other per-kernel DESIGN.md files were checked: only the one cascade file carried a substantive investigation log; the others were already clean after the earlier scrub passes. All 18 verify-harness tests still pass; 70/70 Python files parse.
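The missed-pattern problem in the first list above can be reproduced with two regular expressions. Both patterns below are illustrative guesses: the commit message only says the original regex required `\.\d` after `T<n>`, so `DOTTED_ONLY` is an assumed reconstruction and `WITH_HYPHENATED` is one possible broadening, not the scrubber's actual rule.

```python
import re

# Assumed shape of the original rule: T<n> followed by a dot and digits.
DOTTED_ONLY = re.compile(r"\bT\d+\.\d+\b")

# One possible broadening that also accepts a hyphenated word suffix.
WITH_HYPHENATED = re.compile(r"\bT\d+(?:\.\d+|-[A-Za-z]\w*)\b")

samples = ["see T7.2 for details", "the T7-IRON path", "from T1-swarm runs"]

dotted_hits = [bool(DOTTED_ONLY.search(s)) for s in samples]
broadened_hits = [bool(WITH_HYPHENATED.search(s)) for s in samples]
scrubbed = [WITH_HYPHENATED.sub("<task-id>", s) for s in samples]
```

A broader pattern like this can also match unrelated identifiers of the same shape, so a scrub pass built on it would still want a manual review of the resulting diff.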

hunhoffe · 2026-04-28T15:42:50Z

Hi @matteius -- this is neat work you've started! I was hoping to get a high-level overview of:

What concrete problems/limitations require these changes?
Is there a way to stage this work into smaller pieces?

We typically discuss larger design decisions in an issue and/or discussion before considering integration, so we'd like to engage in some thought before considering ObjectFIFO redesigns.

matteius · 2026-04-28T16:05:09Z

Hi @hunhoffe — context first: I set out to implement genomics algorithms on the AIE2P NPU in my Ryzen AI laptop, working backward from AMD's documentation (AM020 / AM029) on what the hardware supports. Working PoCs are at https://github.com/opensensor/bionpu — the primitives in this PR fell out of repeatedly re-discovering the same composition decisions across those workloads. High-level answers below.

Concrete limitations behind each primitive

Each one exposes an AM020 / AIE2P hardware feature that doesn't currently have a first-class IRON surface; workloads re-glue dialect ops + ad-hoc Python per use site:

Primitive	Limitation removed	Hardware ref
`CascadeFifo`	`cascade_flow` is a raw dialect op without an IRON producer/consumer surface	AM020 Ch. 4
`PacketFifo`	`packetflow` + TLAST + out-of-order BD reimplemented per workload	AM020 Ch. 2 Fig. 17 + Ch. 5 p. 74
`AccumFifo`	512-bit BM-register accumulator state across timesteps and tiles via cascade-stream BM transfer	AM020 Ch. 4 p. 67
`SparseFifo`	BD `Enable_Compression` bit unreachable from IRON; needs DMA-pass plumbing through `AIEDmaToNpu` / `AIEDMATasksToNPU`	AM029 N:M decomp
`MemtileAggregator`	4-into-1 memtile fan-in topology re-derived per workload	AM020 Ch. 5 p. 74
`VariableRateFifo`	Producer-side `discard` for sparse-emit ring buffers — needs `unrollForLoops` skip + split-fifo attr propagation	mlir-aie split-fifo / unroll

The bundling reflects how they compose in the bionpu workloads: CascadeFifo + AccumFifo + SparseFifo together for LSTM cascade chains (basecalling), PacketFifo + VariableRateFifo + MemtileAggregator together for sparse-emit scan patterns (CRISPR genome scan).

Staging

One way to split, in increasing review surface:

1. Foundation. Worker.fn_args FifoHandle registry + two *FifoHandle-subclass-contract broadenings in Program._walk_object_fifos and iron/program.py. Pure broadening; existing ObjectFifoHandle paths unchanged.
2. Pure-Python wrappers. CascadeFifo, PacketFifo, MemtileAggregator. Existing dialect ops behind the IRON producer/consumer surface; no MLIR pass changes.
3. Pass-side primitives. SparseFifo (BD Enable_Compression through three passes), VariableRateFifo (unroll skip + split-fifo attr propagation).
4. AccumFifo. FP32 inter-tile state + cascade-stream BM-register transfer; the largest abstraction-shape question.

Does that map to what you'd find reviewable, or would you split it differently?

Process

Would issues for the Tier 3 + Tier 4 design surfaces (SparseFifo, VariableRateFifo, AccumFifo) be the right starting point? That's where pass-side and abstraction-shape decisions benefit most from discussion before code review. For Tier 1 + Tier 2 (API-shaped, no pass changes) — would you want issues for those too, or are small per-primitive PRs OK once the foundation broadening is in?

hunhoffe · 2026-04-29T17:09:59Z

Hello @matteius! Thanks for your response. We are eager to ensure mlir-air + IRON can support use cases for application development, and it seems you have found a few gaps in feature coverage. We'd like to work with you to get those gaps filled.

The path forward that makes the most sense to me is to:

Create a top-level issue to keep track of associated PRs and encapsulate the overall goals
Break into a series of PRs based on priorities that are sequentially submitted

For the breakdown of capabilities/PRs, I think the clearest path to success would be to:

Start with any needed extensions to mlir-aie. This is highest priority for us as this represents a concrete expressibility gap between hardware features and the core mlir-aie dialect. Make sure each PR only contains one feature and adequate test coverage of that feature.
Break each top-level IRON feature into a separate PR. That PR description needs to include how you would express the pattern without the feature (e.g., baseline) vs how you would express the pattern with the new feature in order to demonstrate the clear benefit of each proposed change. Each feature will also need clear testing and framing within repo documentation and/or programming examples.

Let us know if you have further questions; I'm excited to see what comes of your work so far!

I also wanted to ask -- in opensensor/bionpu, do you have a specific applications/algorithm focus, or are you just generally exploring the space?

matteius · 2026-04-30T02:44:13Z

Thanks @hunhoffe, I appreciate the staging proposal. It's a reasonable path for upstream integration.

> do you have a specific applications/algorithm focus, or are you just generally exploring the space?

For that, I recommend reading through https://github.com/opensensor/bionpu/blob/main/docs/AIE2P-CRISPR-shape-validation.md. By selectively engaging the NPU for parts of the CRISPR search path, I arrived at a 4-5x speedup on a CRISPR editor I was building.

I'm going to step back from upstreaming this work for now. The repackaging effort (top-level issue, per-feature PRs with baseline-vs-new framing, and the iterative review cycles for each) is more bandwidth than I wish to commit to right now alongside other things I have going on.

I'll continue developing on my fork and will keep it rebased on Xilinx/mlir-aie:main periodically so the bionpu workloads stay current. If any of the primitives turn out to be useful to the broader community and someone wants to drive an upstream effort using this branch as a reference, I'm happy to support that by answering questions, clarifying design decisions, and pointing at the AM020 / AM029 sections each primitive is grounded in.

For anyone landing on this thread looking for the working code: the primitives live on opensensor:feature/iron-fifo-primitives and are exercised end-to-end in the bionpu research repo here: https://github.com/opensensor/bionpu

Thanks again for engaging with it. No hard feelings on the process ask; it's the right ask for a project of this scope. It's just not the right time on my end, as I'm always hustling to make ends meet.

hunhoffe · 2026-05-06T16:40:13Z

Hi @matteius! I understand. I've created an issue so this PR doesn't get too lost: #3050

Feel free to propose additional contributions or file issues in the future!

Matt Davis added 17 commits April 26, 2026 23:54

[IRON] Strip project-internal task references from __init__.py

779baa0

matteius force-pushed the feature/iron-fifo-primitives branch from 707a7df to e036b85 Compare April 27, 2026 03:55

matteius marked this pull request as ready for review April 27, 2026 04:00

Copilot AI review requested due to automatic review settings April 27, 2026 04:00

matteius requested review from abisca, andrej, denolf, erwei-xilinx, fifield, hunhoffe, jackl-xilinx, jgmelber, pvasireddy-amd and stephenneuendorffer as code owners April 27, 2026 04:00

Copilot started reviewing on behalf of matteius April 27, 2026 04:02

Copilot AI reviewed Apr 27, 2026


Matt Davis and others added 2 commits April 27, 2026 00:14

Update python/iron/sparse.py

7d2a761

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

matteius requested a review from Copilot April 27, 2026 04:24

Copilot started reviewing on behalf of matteius April 27, 2026 04:25

Copilot AI reviewed Apr 27, 2026


Matt Davis added 2 commits April 27, 2026 00:31

hunhoffe mentioned this pull request May 6, 2026

ObjectFIFO Feature Analysis & New Proposed Features #3050

Open
