[IRON] Add specialized FIFO subclasses (Cascade / Packet / Accum / Sparse / Memtile / VariableRate)#3039

Open
matteius wants to merge 21 commits into Xilinx:main from opensensor:feature/iron-fifo-primitives

@matteius matteius commented Apr 27, 2026

Summary

Adds six specialized FIFO subclasses to the IRON Python API plus the matching MLIR-pass plumbing so the primitives work end-to-end. Each primitive composes existing mlir-aie / AIE-ML / AIE2P features into a reusable abstraction; the existing ObjectFifo API is unchanged.

Primitives

| Subclass | What it adds |
| --- | --- |
| CascadeFifo | First-class cascade-stream ObjectFifo subclass; wraps the cascade_flow dialect op behind the same producer/consumer surface as ObjectFifo. |
| PacketFifo | Packet-switched / pktMerge N:1 / TLAST / out-of-order BD primitive (AM020 Ch. 2 Fig. 17 + Ch. 2 p. 27 + Ch. 5 p. 74). |
| AccumFifo | FP32 inter-tile accumulator state passing; persists 512-bit BM register state across timesteps within a tile and across tiles via cascade-stream BM transfer (AM020 Ch. 4 p. 67). |
| SparseFifo | On-the-fly N:M sparsity decompression on S2MM with the matching pass-side BD Enable_Compression plumbing through AIEDmaToNpu / AIEDMATasksToNPU and split-fifo attr propagation in AIEObjectFifoStatefulTransform. |
| MemtileAggregator | Memtile-mediated 4-into-1 fan-in helper (AM020 Ch. 5 p. 74). |
| VariableRateFifo | Producer-side conditional-forward FIFO with the matching unrollForLoops skip + split-fifo attr propagation in AIEObjectFifoStatefulTransform. |

Foundation

  • Worker.fn_args FifoHandle registry (python/iron/dataflow/fifo_handle_registry.py) — replaces the hard-coded isinstance(arg, ObjectFifoHandle) branch with an extensible dispatch registry. New FIFO subclasses register their own handle type without touching worker.py.
  • *FifoHandle subclass contract normalization — fixes two latent breakages where Program._walk_object_fifos and iron/program.py:81's literal-type-check rejected FifoHandle subclasses. Pure broadening — every existing ObjectFifoHandle use still works.

Diff stats

29 files, 7992 insertions(+), 26 deletions(-) split across:

  • python/iron/{cascade,packet,accum,sparse,memtile,variable_rate}.py — six new modules.
  • python/iron/dataflow/fifo_handle_registry.py — registry foundation.
  • python/iron/{__init__.py,worker.py,dataflow/__init__.py} — wire in the new subclasses.
  • lib/Dialect/AIE/Transforms/AIEObjectFifoStatefulTransform.cpp, lib/Dialect/AIEX/Transforms/{AIEDmaToNpu,AIEDMATasksToNPU}.cpp — pass-side plumbing for SparseFifo's BD Enable_Compression bit and VariableRateFifo's loop-unroll skip.
  • test/iron/test_*.py — eight new pytest modules covering surface, registry dispatch, lowering (resolve() against an aie.device context), and per-primitive behavioural toys.
  • test/objectFifo-stateful-transform/{sparse_fifo_split_attr_propagation,variable_rate_fifo_attr_propagation,variable_rate_fifo_skip_unroll}.mlir, test/Conversion/DmaToNpu/dma_to_npu_sparse_compression.mlir — lit tests for the pass-side changes.
  • programming_examples/basic/variable_rate_filter/ — minimal end-to-end example exercising discard(1) on a Python-level alternating skip pattern.
  • python/iron/VARIABLE_RATE_DESIGN.md — design notes.

Review feedback addressed

(This description reflects the post-review-fix state; v1 was rewritten in place.)

  • Project-internal task IDs scrubbed from source, docstrings, tests, lit tests, MLIR pass comments, and commit messages. Hardware-grounded references (AM020 chapter / page citations, register names, dialect ops) are preserved.
  • Bucket-2 lowering passes included so SparseFifo and VariableRateFifo work end-to-end out of the box; the prior shape would have silently degraded to vanilla ObjectFifo lowering on an aie-opt without the matching pass changes.
  • Bugfix commits squashed into the commits that introduced them (CascadeFifo's two follow-ups; AccumFifo's threshold-relax; the three Bucket-2 SparseFifo fix-up commits; the two task-ID strip commits). Final history is 17 commits.
  • AccumFifo precision tests renamed to reflect what they actually measure (test_fp32_lstm_reference_matches_pytorch, test_fp32_reference_beats_bf16_writeback_by_3oom) — they're numpy reference baselines for the LSTM workload, not tests of the AccumFifo lowering. The "load-bearing falsifiable claim" framing is dropped.
  • PacketFifo lowering tests added (test_resolve_emits_packetflow_op_in_module, test_resolve_idempotent_does_not_emit_twice) so the dialects.aie.packetflow call signature is exercised end-to-end.
  • AccumFifoHandle / PacketFifoHandle parent-constructor bypass documented explicitly in each class docstring with an explanation of why super().__init__() is bypassed, what attributes are stubbed, and the long-term direction (a shared narrower base class). SparseFifoHandle and VariableRateFifoHandle inherit normally.
  • VariableRateFifo example rewritten to actually exercise discard(1) on a Python-level alternating skip pattern (the prior example always called acquire/release and never discard).
  • SparseFifo round-trip test renamed (test_nm_compression_roundtrip_is_lossless_on_compliant_input) and reframed as a property test for the N:M compression format itself, not a test of SparseFifo's lowering or silicon behaviour.
  • MemtileAggregator layout="window" dead code removed (parameter, validator, NotImplementedError block, property, tests).
  • register_fifo_handle brittleness fixed: re-registering the same handler for an already-registered class is now an idempotent no-op (module reloads / repeated imports stay safe). Re-registering a different handler still raises.
  • _RegistrySnapshot removed from __all__ (the leading underscore signals "test-internal use only").
  • Co-Authored-By: Claude trailers removed from all commit messages.

Test plan

  • All eight new pytest modules pass against a wheel-built install.
  • New lit tests pass: dma_to_npu_sparse_compression.mlir, sparse_fifo_split_attr_propagation.mlir, variable_rate_fifo_attr_propagation.mlir, variable_rate_fifo_skip_unroll.mlir.
  • programming_examples/basic/variable_rate_filter/ builds cleanly and produces the expected MLIR (aie.variable_rate = true on both producer and consumer fifos after the split-fifo propagation).
  • Upstream lit / pytest CI to confirm nothing existing regresses.
  • Silicon validation: each primitive is tested at the Python+lowering layer; on-silicon dispatch requires the consumer kernel side which is out of scope here.

Happy to split into per-primitive PRs if maintainers prefer, but this bucket lands together so each primitive is end-to-end-functional out of the box.

Signed-off-by: Matt Davis <matt@opensensor.io>

Matt Davis added 17 commits April 26, 2026 23:54
Reserve five FIFO subclass slots in aie.iron's __init__.py as
NotImplementedError-raising stubs. Subsequent commits in this PR
replace each stub with a real class:

- CascadeFifo (cascade-stream ObjectFifo subclass)
- PacketFifo (packet-switched / pktMerge / TLAST / OoO BD)
- AccumFifo (FP32 inter-tile accumulator state passing)
- SparseFifo (on-the-fly N:M sparsity decompression on S2MM)
- MemtileAggregator (memtile-mediated fan-in helper)

Adds an explicit __all__ enumerating both the existing primitives and
the five new reservation slots so the public surface stays
discoverable without runtime side effects.

Signed-off-by: Matt Davis <matt@opensensor.io>
… + LLVM exception)

first-class IRON primitive at python/iron/cascade.py. The new CascadeFifo
class mirrors aie.iron.ObjectFifo's constructor surface (producer / consumer
endpoints, dtype, name, handshake-size knob) but lowers through the cascade
physical channel — emitting an aie.cascade_flow op via resolve() that the
placer pass converts into per-tile aie.configure_cascade ops.

Architectural references:
- AM020 Ch. 4 p. 67: 512-bit cascade stream between adjacent CoreTiles.
- AM020 Appendix A p. 80 Figure 45: vertical+horizontal cascade topology.
- aie.put_cascade / aie.get_cascade: the cascade write/read MLIR ops the
  C++ kernel emits via put_mcd / get_scd_v16int32 intrinsics inside its
  core_fn body. CascadeFifo only emits placement + cascade_flow.
- aie.cascade_flow: the declarative connection op the placer lowers.

usable for callers building chain-of-N topologies; behavioural parity with
the wrapper is asserted in tests/test_iron_cascade_fifo.py.

the CascadeFifo slot + the matching import line are touched in this PR.

for all subsequent fork PRs:
- Apache-2.0 + LLVM exception headers on every new fork file.
- Signed-off-by trailer per LLVM's DCO model.
- Repo-root THIRD_PARTY_NOTICES.md updated in the outer repo.

Tests:
- test/iron/test_cascade_fifo.py: surface, validation, lowering, and
  parity with the dialect-level cascade_flow op.

Signed-off-by: Matt Davis <matt@opensensor.io>
…h. 4 p. 67)

first-class IRON dataflow primitive sibling to ObjectFifo and CascadeFifo.

AccumFifo persists 512-bit BM accumulator state across two boundaries:

1. Across timesteps within a tile (BM-to-BM register move; AM020 Ch. 4
   p. 67 "Move one 512-bit accumulator register to another in one cycle").
   Lowering: no MLIR op emitted; the C++ kernel keeps an aie::accum local
   hot across the worker's while(true) iteration boundary.

2. Across tiles via cascade-stream BM transfer (AM020 Ch. 4 p. 67
   "Cascade stream connects the AIE-MLs in a chain ... transfer an
   accumulator register (512-bit) from one to the next"). Lowering:
   aie.cascade_flow(prod_tile, cons_tile) between vertically-adjacent
   CoreTiles. Vertical adjacency on AIE2P is the only geometry T7-IRON
   verified on silicon; horizontal cascade is documented (AM020 App. A
   p. 80 Fig. 45) but un-tested -- a UserWarning is raised for
   non-vertical placements rather than a hard reject.

    AccumFifo(producer, consumer, dtype="accfloat", lanes=16)
    af.prod() -> AccumFifoHandle (acquire / release no-ops; cascade
                 wire is per-cycle handshaked at the dialect intrinsic
                 level, intra-tile is register-aliased)
    af.cons() -> AccumFifoHandle

Rationale for sibling class vs ObjectFifo flag: ObjectFifo is memref-
typed (DMA-copies memref words). The accumulator register is not a
memref word -- it's a hardware register-file slice (AM020 Ch. 4
p. 65-67). Modeling this on ObjectFifo would either force a fictitious
memref<16xf32> the lowering ignores, or burden every ObjectFifo
consumer with an accumulator-mode invariant. A sibling class with the
same prod/cons surface keeps the abstraction clean.

dtype validation:
- "accfloat" (FP32 accumulator)
- "acc32"    (int32 accumulator)
- "acc64"    (int64 paired-lane accumulator -- 8 lanes x 64 bits)
- "acc48" is explicitly rejected (AIE1-only; AIE-ML / AIE2P drop it
  per AM020 Ch. 4 p. 65)

Lane-count validation: enforces the AM020 Ch. 4 p. 67 cascade-transfer
width of exactly 512 bits/cycle. lanes=16 is the only legal value for
accfloat / acc32; lanes=8 for acc64.
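
The lane rule above amounts to lanes x element width = 512 bits. A minimal sketch of that check, assuming a hypothetical helper `validate_accum_shape` and a hypothetical width table (neither is taken from python/iron/accum.py):

```python
# Hypothetical sketch of the 512-bit lane check; names are illustrative,
# not taken from python/iron/accum.py.
CASCADE_BITS = 512

# Assumed element widths in bits per accumulator dtype.
_ACC_BITS = {"accfloat": 32, "acc32": 32, "acc64": 64}


def validate_accum_shape(dtype: str, lanes: int) -> None:
    """Raise ValueError unless lanes * element width is exactly 512 bits."""
    if dtype == "acc48":
        raise ValueError("acc48 is AIE1-only and is rejected")
    if dtype not in _ACC_BITS:
        raise ValueError(f"unknown accumulator dtype: {dtype!r}")
    expected = CASCADE_BITS // _ACC_BITS[dtype]
    if lanes != expected:
        raise ValueError(f"{dtype} requires lanes={expected}, got {lanes}")


validate_accum_shape("accfloat", 16)  # 16 x 32 bits
validate_accum_shape("acc64", 8)      # 8 x 64 bits
```

Deriving the expected lane count from the width table keeps the check to a single rule instead of one branch per dtype.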

AccumFifoHandle subclasses ObjectFifoHandle so existing
isinstance(arg, ObjectFifoHandle) dispatch in Worker.fn_args accepts
fn_args dispatch, the inheritance is no longer load-bearing for that
purpose, but documents AccumFifo as a fifo-shaped abstraction.

Tests at test/iron/test_accum_fifo.py cover three layers:
- Surface: API shape, dtype/lane validation, intra-tile vs inter-tile
  detection, error messages, isinstance compatibility.
- Lowering: intra-tile emits no cascade_flow op; inter-tile emits one.
- Precision: synthetic LSTM cell (96 hidden, 200 timesteps, matching
  invariant matches FP32 PyTorch reference within 1e-5 max-abs, vs the

Reservation slot in python/iron/__init__.py is replaced
slots (CascadeFifo, PacketFifo, SparseFifo, MemtileAggregator)

Signed-off-by: Matt Davis <matt@opensensor.io>
… (foundation for PacketFifo)

fn_args resolution as the single biggest blocker for promoting
ObjectFifo subclasses (PacketFifo, CascadeFifo, AccumFifo, SparseFifo)
to first-class IRON primitives. The original implementation hard-coded
the type-dispatch chain to recognize only ObjectFifoHandle, so any new
FifoHandle subclass would have to fork worker.py.

This commit refactors that dispatch into a registry pattern:

* python/iron/dataflow/fifo_handle_registry.py -- new module exposing
  register_fifo_handle (decorator + function-call forms),
  unregister_fifo_handle, get_registered_handle_classes, and
  dispatch_fn_arg. Reverse-insertion order ensures more-specific
  subclasses (registered later) win the isinstance() walk.

* python/iron/dataflow/__init__.py -- pre-registers ObjectFifoHandle
  with a handler that reproduces the original Worker.__init__
  bookkeeping bit-for-bit (sets arg.endpoint = worker; appends to
  worker._fifos). Backward-compat anchor: every Phase 1 design that
  passes ObjectFifoHandle through fn_args still works without
  modification.

* python/iron/worker.py -- replaces the hard-coded
  isinstance(arg, ObjectFifoHandle) branch with a dispatch_fn_arg(...)
  call. Buffer / ObjectFifo / WorkerRuntimeBarrier branches unchanged.

* test/iron/test_worker_fifo_handle_extension.py -- 14 tests covering
  pre-registration, regression guard for ObjectFifoHandle, custom
  subclass dispatch, runtime registration, reverse-order precedence,
  decorator + function-call forms, snapshot context manager, error
  handling, and public-surface stability.

without further changes to worker.py.
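
The reverse-insertion-order walk described above can be illustrated with a small stand-alone model. This is a sketch under assumptions, not the contents of fifo_handle_registry.py: the handler signature `(arg, worker)` and the `BaseHandle` / `PacketHandle` classes are hypothetical stand-ins.

```python
# Hypothetical sketch of reverse-insertion-order dispatch; this is not the
# code in fifo_handle_registry.py, only an illustration of the idea.
from typing import Any, Callable, Dict

_REGISTRY: Dict[type, Callable[[Any, Any], None]] = {}


def register_fifo_handle(cls: type, handler: Callable[[Any, Any], None]) -> None:
    """Record a handler for cls; dicts preserve insertion order."""
    _REGISTRY[cls] = handler


def dispatch_fn_arg(arg: Any, worker: Any) -> bool:
    """Run the most recently registered handler whose class matches arg."""
    for cls in reversed(list(_REGISTRY)):
        if isinstance(arg, cls):
            _REGISTRY[cls](arg, worker)
            return True
    return False


class BaseHandle: ...
class PacketHandle(BaseHandle): ...


calls = []
register_fifo_handle(BaseHandle, lambda arg, worker: calls.append("base"))
register_fifo_handle(PacketHandle, lambda arg, worker: calls.append("packet"))

dispatch_fn_arg(PacketHandle(), worker=None)  # subclass handler wins
dispatch_fn_arg(BaseHandle(), worker=None)    # only the base handler matches
```

Because the subclass is registered after its base, walking the registry newest-first reaches the more specific handler before the base-class `isinstance()` check can match.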
Promote the three AM020-documented variable-rate hardware primitives

- pktMerge N:1 header-based routing  (AM020 Ch. 2 Figure 17)
- S2MM finish-on-TLAST stream end    (AM020 Ch. 2 p. 27)
- Out-of-order BD processing         (AM020 Ch. 5 p. 74)

PacketFifo mirrors ObjectFifo's prod() / cons() user-facing surface
but lowers to aie.packetflow ops with per-packet header-based routing
through the AXI stream switch fabric -- a different runtime mechanism
from ObjectFifo's shared-memory + lock model. A sibling class (rather
than an ObjectFifo flag) keeps the abstraction clean and lets the
lowering emit packetflow ops directly.

API:
    PacketFifo(producers, consumers,
               header_dtype="uint8",
               merge_strategy="round-robin"|"priority",
               packet_ids=...,         # auto-assigned if omitted
               keep_pkt_header=True,   # False -> finish-on-TLAST
               obj_type=..., depth=2)

PacketFifoHandle subclasses ObjectFifoHandle for surface compatibility,
time so Worker.fn_args dispatch recognizes it without modifying
worker.py. The reverse-insertion-order walk in dispatch_fn_arg picks
PacketFifoHandle over ObjectFifoHandle when both isinstance() checks
match -- exactly the property the registry was designed for.

gap entries recorded in

Test coverage in test/iron/test_packet_fifo.py:
- Surface tests: API shape, dtype/strategy/packet_id validation, error
  messages (16 tests)
- Handle surface: producer/consumer construction, idempotency,
  send_with_header / recv_header asymmetry, ObjectFifoHandle subclass
  invariant (8 tests)
- Registry integration: PacketFifoHandle registered after import,
  dispatch_fn_arg recognizes it, Worker.fn_args records it on _fifos,
  reverse-insertion-order walk picks subclass over base (4 tests)
- Behavioral toy: 3-producer-1-consumer round-robin merge yields the
  union of inputs without drops; per-producer ordering preserved;
  finish-on-TLAST flag plumbed through; priority strategy + N:M
  construction validated (5 tests)
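
The properties the behavioral toy checks (no drops, per-producer ordering preserved) can be modelled without hardware. The sketch below is a pure-Python illustration under assumptions; `round_robin_merge` is a hypothetical name, and the producer index stands in for the packet ID.

```python
# Illustrative model of an N:1 round-robin merge; pure Python, no hardware.
from typing import Iterable, List, Tuple


def round_robin_merge(streams: List[Iterable[int]]) -> List[Tuple[int, int]]:
    """Merge streams in round-robin order, tagging each item with its
    producer index (standing in for the packet ID)."""
    iterators = [iter(s) for s in streams]
    active = list(range(len(iterators)))
    merged = []
    while active:
        for idx in list(active):
            try:
                merged.append((idx, next(iterators[idx])))
            except StopIteration:
                # This producer is exhausted; stop polling it.
                active.remove(idx)
    return merged


merged = round_robin_merge([[1, 2, 3], [10, 20], [100]])
```

Filtering `merged` by producer index recovers each input stream in its original order, which is the ordering property the toy test asserts.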

Refs:
- python/iron/dataflow/fifo_handle_registry.py

Signed-off-by: Matt Davis <matt@opensensor.io>
…AM020 Ch. 5 p. 74)

vs the 2-into-1 fallback) to a first-class IRON helper. Encapsulates
the canonical AIE-ML / AIE2P memtile-mediated fan-in pattern documented
by AM020 Ch. 5 p. 74 (memtile S2MM channels 0..3 with east/west
neighbour access) + Figures 22+23 + the "Dataflow Mapping 1/2/3"
diagrams.

API:
  MemtileAggregator(n_producers, producer_obj_type, joined_obj_type,
                    layout="slab", depth=2, tile=AnyMemTile, name=...)
  .producer(i) / .producers() -> ObjectFifoHandle (per-tile producer)
  .consumer(depth=...)        -> ObjectFifoHandle (joined consumer)
  .offsets / .sub_fifos / .joined_fifo (introspection)

Validates the flat-concat invariant at construction time and surfaces
discovery) in both the class docstring and a clear NotImplementedError
that fires for the layout="window" reservation slot until Phase 3
extends the helper with explicit dims_to_stream inference.

MemtileAggregator slot is touched; CascadeFifo / PacketFifo /
AccumFifo / SparseFifo slots are left untouched for their owning
tasks).

Tests: test/iron/test_memtile_aggregator.py covers construction
validation, the flat-concat invariant, per-producer / consumer
handle accessors, layout vocabulary enforcement, the memtile DM
budget check, and byte-equality with Phase 1's hand-rolled
join_offsets=[0, 2048, 4096, 6144] + obj_types triple. Pure-Python
(no MLIR context required); runs after the fork wheel rebuild.
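
The flat-concat invariant and the join offsets quoted above can be reproduced with a few lines of arithmetic. This is a sketch under assumptions: `slab_offsets` is a hypothetical helper, and four equal producers of 2048 units each is inferred from the quoted offsets rather than stated in the PR.

```python
# Hypothetical helper; MemtileAggregator's real internals may differ.
from itertools import accumulate
from typing import List


def slab_offsets(producer_sizes: List[int], joined_size: int) -> List[int]:
    """Return each producer's start offset in the joined buffer, enforcing
    that the producer slabs tile the joined buffer exactly."""
    if sum(producer_sizes) != joined_size:
        raise ValueError(
            f"flat-concat invariant violated: {sum(producer_sizes)} != {joined_size}"
        )
    # Running totals give the end of each slab; shift by one to get starts.
    return [0] + list(accumulate(producer_sizes))[:-1]


offsets = slab_offsets([2048] * 4, joined_size=8192)  # [0, 2048, 4096, 6144]
```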

Refs: AM020 Ch. 5 p. 74 (memtile DMA channel layout), Ch. 5 p. 71
(5D address generation), Table 14 (memtile DM = 512 KiB on AIE-ML,
Phase-2 follow-up.

Signed-off-by: Matt Davis <matt@opensensor.io>
Promotes AIE-ML / AIE2P compute-tile S2MM decompression + MM2S compression
hardware (AM020 Ch. 1 p. 15 + Ch. 2 p. 27 + Ch. 5 p. 74) to a first-class
IRON dataflow primitive at python/iron/sparse.py.

The new SparseFifo class subclasses ObjectFifo (composes-by-subclassing —
inherits storage / depth / dimsToStream / dimsFromStream / pad+repeat+iter
machinery) and adds N:M structured-sparsity kwargs (sparsity_pattern, N, M,
allow_unverified). Producer-side sees compressed data, consumer-side sees
dense data; the on-tile decompressor re-injects zeros at the position-map
gaps before the data lands in tile DM. SparseFifoHandle subclasses
ObjectFifoHandle so Worker.fn_args' isinstance(arg, ObjectFifoHandle) check
SparseFifoHandle handler over the parent ObjectFifoHandle handler in the

Lowering model
--------------
SparseFifo.resolve() calls the standard ObjectFifo lowering then attaches
five discardable attributes to the lowered ObjectFifoCreateOp:

  aie.compress_mm2s       (BoolAttr) — flips Enable_Compression on producer BD
  aie.decompress_s2mm     (BoolAttr) — flips Enable_Compression on consumer BD
  aie.sparsity_pattern    (StringAttr "N:M")
  aie.sparsity_n          (i32)
  aie.sparsity_m          (i32)

The BD-emit pass keys off these to flip the per-channel
Enable_Compression bit (lib/Dialect/AIE/Util/aie_registers_aie2.json
documents this as "Enable Compression (MM2S), decompression (S2MM).
Only effective if channel has (de)compression enabled"). If the active
backend hasn't been taught about these attributes (early AIE2P
silicon-driver stack), the design still compiles and runs as a vanilla
ObjectFifo — degraded mode is observable via the runtime DMA-volume

Pattern validation
------------------
AM020-verified set {(1,2), (1,4), (2,4)} accepted by default (Ch. 1 p. 15
cites these for "CNN and RNN application" with RNN explicitly named —
hatch). Structural rules enforced eagerly at construction time:
M >= 2; 0 < N < M; N and M are int.
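
The structural rules and the verified-set check can be sketched as follows. The function name, the exception types, and the explicit exclusion of `bool` are assumptions for illustration; only the rules themselves come from the text above.

```python
# Sketch of the eager checks; the function name and error text are assumptions.
VERIFIED_PATTERNS = {(1, 2), (1, 4), (2, 4)}


def validate_nm(n, m, allow_unverified=False):
    """Apply the structural rules, then the verified-set check."""
    # bool is a subclass of int, so it is excluded explicitly (an assumption).
    for value in (n, m):
        if not isinstance(value, int) or isinstance(value, bool):
            raise TypeError("N and M must be int")
    if m < 2:
        raise ValueError("M must be >= 2")
    if not 0 < n < m:
        raise ValueError("N must satisfy 0 < N < M")
    if (n, m) not in VERIFIED_PATTERNS and not allow_unverified:
        raise ValueError(f"{n}:{m} is not in the verified set")


validate_nm(2, 4)                         # accepted by default
validate_nm(3, 8, allow_unverified=True)  # accepted only with the opt-in
```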

Registry hook-up
----------------
register_fifo_handle(SparseFifoHandle, _sparse_fifo_handle_handler) runs
at module import time; handler mirrors the pre-registered
ObjectFifoHandle bookkeeping (arg.endpoint = worker;
forward-looking no-op (Worker.__init__ falls back to the hard-coded
isinstance(arg, ObjectFifoHandle) branch which still accepts
the SparseFifoHandle handler over the ObjectFifoHandle handler.

Tests
-----
23 in-fork tests at test/iron/test_sparse_fifo.py covering surface (real
impl not stub; SparseFifoHandle subclassing; module constants),
validation (rejects unsupported pattern tag, M<2, N>=M, N=0, unverified
pattern by default; allow_unverified accepts unverified; non-int N/M;
non-Tile producer), pattern correctness (each group of M has exactly
M-N zeros after pruning, parametrized for 1:2 / 1:4 / 2:4; decompressed
matmul bit-equal to sparse reference for 1:2 + 2:4), lowering (resolve()
emits aie.objectfifo with all 5 sparsity attrs; idempotent on
double-resolve), registry (SparseFifoHandle registered; dispatch_fn_arg
matches), diagnostics (handle exposes N/M/compression_ratio/
sparsity_pattern/sparse_fifo properties; __str__ includes pattern + N/M;
module __all__ stable).
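
The round-trip property exercised by the pattern-correctness tests can be modelled in pure Python. This sketch assumes a simple values-plus-positions encoding; the format the hardware actually puts on the wire is not described in this PR and may differ, and both function names are hypothetical.

```python
# Pure-Python model of N:M compression; the on-wire format used by the
# hardware is not described here and may differ.
from typing import List, Tuple


def compress_nm(dense: List[float], n: int, m: int) -> Tuple[List[float], List[int]]:
    """Keep n values per group of m, plus their in-group positions."""
    if len(dense) % m:
        raise ValueError("length must be a multiple of M")
    values, positions = [], []
    for start in range(0, len(dense), m):
        group = dense[start:start + m]
        nonzero = [i for i, v in enumerate(group) if v != 0]
        if len(nonzero) > n:
            raise ValueError("input is not N:M compliant")
        # Pad with zero-valued slots so every group stores exactly n entries.
        spare = [i for i in range(m) if i not in nonzero]
        kept = sorted(nonzero + spare[: n - len(nonzero)])
        values += [group[i] for i in kept]
        positions += kept
    return values, positions


def decompress_nm(values, positions, n, m):
    """Re-inject zeros at the positions the compressed form skipped."""
    dense = [0.0] * (len(values) // n * m)
    for k, (value, pos) in enumerate(zip(values, positions)):
        dense[(k // n) * m + pos] = value
    return dense


dense = [0.0, 3.0, 0.0, 1.5, 2.0, 0.0, 0.0, 0.0]
values, positions = compress_nm(dense, n=2, m=4)
restored = decompress_nm(values, positions, n=2, m=4)
```

For a 2:4 pattern the value payload is half the dense length. The position map is extra metadata on top of that.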

Architectural references
------------------------
- AM020 Ch. 1 p. 15: AIE-ML supports structured sparsity for "CNN and
  RNN application" (RNN explicitly named).
- AM020 Ch. 2 p. 27: "Adds decompression to the two S2MM channels" +
  "Adds compression to the two MM2S channels".
- AM020 Ch. 5 p. 74: memtile compression / decompression.
- lib/Dialect/AIE/Util/aie_registers_aie2.json: BD field
  Enable_Compression bit ("Only effective if channel has (de)compression
  enabled").

AIE2P caveat
------------
AM020 documents AIE-ML's supported sparsity patterns. AIE2P inherits the
compute-tile DMA Enable_Compression bit but the accepted N:M patterns
documents the divergence and falls back to dense weights. SparseFifo
itself remains usable on AIE-ML targets even if AIE2P diverges; the
divergence is silicon/runtime, not API.

-------------------------
Only the SparseFifo slot region of python/iron/__init__.py is touched
(CascadeFifo / PacketFifo / AccumFifo / MemtileAggregator stubs untouched
per the parallel-agent serialization rule). Atomic-heredoc-commit

Signed-off-by: Matt Davis <matt@opensensor.io>
PacketFifoHandle.all_of_endpoints() now returns endpoint-typed objects (was: raw Tile objects).
AccumFifoHandle now overrides all_of_endpoints() to return endpoint-typed objects (was: inherited while super().__init__() was bypassed, leaving _object_fifo unset).

Fix shape: Option A, normalize the subclass contract. The alternative, relaxing
iron/program.py's walk, moves the divergence into the consumer and is easier to
regress. Both subclasses now satisfy the same contract that ObjectFifoHandle's
all_of_endpoints() exposes: a list of objects each carrying a .tile attribute,
which is what iron/program.py:81's tile-collection walk
([e.tile for e in fifo.all_of_endpoints()]) requires.

Implementation note for AccumFifoHandle: rather than chain super().__init__()
(which would require a memref-typed obj_type and ObjectFifo depth/dims that
AccumFifo intentionally does not have -- a cascade transfer is one accumulator
per cycle, no circular buffer), this commit overrides all_of_endpoints() on the
subclass to walk the AccumFifo's prod/cons handles directly. Same end-state
(callers see endpoint-typed objects with a .tile attribute), narrower change.

PacketFifoHandle: all_of_endpoints() now walks the parent PacketFifo's
prod_handles + cons_handles, surfacing each handle's .endpoint (the live Worker
instance attached by Worker.fn_args registry dispatch) when set, falling back
to ObjectFifoEndpoint(tile) wrappers when the topology is inspected pre-Worker
construction.

Coverage: new fork-internal test file test/iron/test_fifo_handle_program_walk.py
(8 tests) exercises both broken paths -- the literal expression at
iron/program.py:81 -- one set per handle type, plus a cross-cutting roll-up
that walks the live registry.

Existing test verdict: 58/59 pass in test/iron/test_packet_fifo.py +
test/iron/test_accum_fifo.py. The one failing test
(test_packet_fifo_accepts_explicit_packet_ids) fails on main as well -- it
constructs PacketFifo with packet_ids=[0x10, 0x20] but 0x20=32 is rejected by
the 5-bit pkt_id range cap [0, 31]. Pre-existing, unrelated to this fix.
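
The failure above comes down to range arithmetic: a 5-bit field holds 0 through 31, and 0x20 is 32. A minimal sketch, with `check_packet_ids` as a hypothetical helper name:

```python
# Sketch of a 5-bit packet-ID range check; the helper name is hypothetical.
PKT_ID_BITS = 5
MAX_PKT_ID = (1 << PKT_ID_BITS) - 1  # 31


def check_packet_ids(packet_ids):
    """Return the IDs that fall outside [0, 31]."""
    return [pid for pid in packet_ids if not 0 <= pid <= MAX_PKT_ID]


out_of_range = check_packet_ids([0x10, 0x20])  # 0x10 = 16 fits, 0x20 = 32 does not
```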

Closes:

Signed-off-by: Matt Davis <matteius@gmail.com>
Adds VariableRateFifo + VariableRateFifoHandle, a primitive for
streams where the producer chooses per object whether to forward
or discard the buffer. Sibling to PacketFifo: PacketFifo handles
the N-into-1 fan-in side of variable-rate dataflow, VariableRateFifo
handles the single-producer conditional-forward side.

API:
  fifo = VariableRateFifo(producer_tile, consumer_tile, depth, dtype)
  with worker.runtime() as r:
      out = fifo.acquire_producer()
      if predicate(out):
          fifo.forward_producer()  # publishes the slot
      else:
          fifo.discard_producer()  # frees the slot without publishing

Files:
- python/iron/variable_rate.py: VariableRateFifo + handle classes.
- python/iron/VARIABLE_RATE_DESIGN.md: design notes covering the
  ObjectFifo extension, the producer/consumer state machine, and
  the parallel to PacketFifo.
- python/iron/__init__.py: register VariableRateFifo in the public
  surface alongside the other specialized FIFO subclasses.
- programming_examples/basic/variable_rate_filter/: minimal end-to-end
  example showing a filter that forwards only objects whose first
  byte is even (drops ~50% of input), with a CPU reference and
  byte-equality check against the kernel output.

Note: the MLIR-side lowering passes that observe discardable attrs
(skip-unroll for variable-rate fifos, attr propagation through
split-fifo) are intentionally not in this PR — they will land
separately once the Python API surface is reviewed.

Signed-off-by: Matt Davis <matt@opensensor.io>
``aie.compress_mm2s`` / ``aie.decompress_s2mm`` discardable attrs to
the lowered ``aie.objectfifo.create`` op (per
``python/iron/sparse.py``), but the BD-emit pass at
``lib/Dialect/AIEX/Transforms/AIEDmaToNpu.cpp:655`` hardcoded
``Enable_Compression = 0`` on the AIE2/AIE2P tile DMA BD word. As a
(accuracy contract MET offline) but the wire-level DMA stayed at the
54.8 MB dense baseline rather than the ~27 MB 2:4 N:M-compressed
target.

Plumbing path -- three-hop discardable-attr propagation, no TableGen
schema change:

  1. ``AIEObjectFifoStatefulTransform.cpp::createBd[Block]``: when
     lowering ``aie.objectfifo.create`` to per-tile DMABDOps, read the
     SparseFifo discardable attrs from the originating
     ObjectFifoCreateOp (which is about to be erased at the end of
     this pass). If ``aie.compress_mm2s = true`` and the BD's channel
     direction is MM2S, OR ``aie.decompress_s2mm = true`` and the
     direction is S2MM, attach the boolean ``aie.enable_compression =
     true`` on the new DMABDOp. ``createBd`` now returns the DMABDOp
     so the caller can decorate it.

  2. ``AIEDMATasksToNPU.cpp::rewriteSingleBD``: when converting a
     DMABDOp inside a ``aiex.dma_configure_task`` body to an
     ``aiex.npu.writebd`` op, copy the ``aie.enable_compression``
     discardable attr from the source DMABDOp onto the new
     NpuWriteBdOp.

  3. ``AIEDmaToNpu.cpp::WriteBdToBlockWritePattern`` (the line-655
     callsite the gap report named): when packing the
     ``aiex.npu.writebd`` op into the AIE2 tile-DMA 6-word block-write
     payload, read ``aie.enable_compression`` off the NpuWriteBdOp and
     OR bit 31 (``Enable_Compression``) of ``DMA_BDX_1`` accordingly.

Default branch is the pre-existing behaviour: no attrs on the
ObjectFifoCreateOp -> no attr on the DMABDOp -> no attr on the
NpuWriteBdOp -> ``Enable_Compression = 0``. The negative case is
regression-protected by the new lit test.
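
The three hops and the default branch can be modelled with plain dicts standing in for MLIR ops. This is a toy illustration only: the function names are hypothetical and none of this is the C++ pass code.

```python
# Toy model of the three-hop propagation using plain dicts in place of MLIR
# ops; every function name here is illustrative only.

def bd_attrs(fifo_attrs: dict, direction: str) -> dict:
    """Hop 1: derive the DMABDOp attr from the fifo attrs and BD direction."""
    enable = (
        (fifo_attrs.get("aie.compress_mm2s", False) and direction == "MM2S")
        or (fifo_attrs.get("aie.decompress_s2mm", False) and direction == "S2MM")
    )
    return {"aie.enable_compression": True} if enable else {}


def writebd_attrs(bd: dict) -> dict:
    """Hop 2: copy the attr from the DMABDOp onto the NpuWriteBdOp."""
    return {k: v for k, v in bd.items() if k == "aie.enable_compression"}


def compression_bit(writebd: dict) -> int:
    """Hop 3: the value written to Enable_Compression."""
    return 1 if writebd.get("aie.enable_compression", False) else 0


sparse = {"aie.compress_mm2s": True, "aie.decompress_s2mm": True}
bit_sparse = compression_bit(writebd_attrs(bd_attrs(sparse, "S2MM")))
bit_dense = compression_bit(writebd_attrs(bd_attrs({}, "S2MM")))
```

With no attrs on the fifo, every hop passes an empty dict along and the bit stays 0, which mirrors the default branch described above.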

AM020 references (cited in ``python/iron/sparse.py`` module
docstring):

- AM020 Ch. 2 p. 27: compute-tile DMA adds compression to two MM2S
  channels and decompression to two S2MM channels.
- AM020 Ch. 1 p. 15: AIE-ML supported N:M structured-sparsity
  patterns (1:2, 1:4, 2:4) for CNN/RNN application; AIE2P caveat
  documented in ``aie.iron.sparse``.
- ``lib/Dialect/AIE/Util/aie_registers_aie2.json`` BD field
  ``Enable_Compression``: "Enable Compression (MM2S),
  decompression (S2MM). Only effective if channel has
  (de)compression enabled".

This change leaves the AM020-verified-pattern check (and the
``allow_unverified=True`` escape hatch) at the IRON Python layer
asked for compression and the lowering produced the discardable
attrs, the bit gets flipped. Pattern-arch validation stays out of
the C++ pass.

New lit test ``test/Conversion/DmaToNpu/dma_to_npu_sparse_compression.mlir``:

- Positive case: ``aiex.npu.writebd { ..., aie.enable_compression =
  true }`` -> ``DMA_BDX_1`` second word is ``0xD53D0000`` =
  ``3577544704`` (bit 31 set on top of the dense
  ``0x553D0000``/``1430061056`` reference shared with
  ``dma_to_npu_core_tile.mlir``).
- Negative case (regression-protect): no ``aie.enable_compression``
  attr -> ``DMA_BDX_1`` stays at ``1430061056``.
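
The expected words in both cases follow from setting bit 31 on the dense reference value. The arithmetic can be checked directly (variable names are illustrative):

```python
# Check the lit-test constants: bit 31 set on top of the dense reference word.
ENABLE_COMPRESSION_BIT = 31

dense_word = 0x553D0000
compressed_word = dense_word | (1 << ENABLE_COMPRESSION_BIT)

print(hex(compressed_word), compressed_word)
print(hex(dense_word), dense_word)
```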

Build verification: each touched ``.cpp`` compiles cleanly under the
existing Ninja rules (tested
``obj.AIETransforms.dir/AIEObjectFifoStatefulTransform.cpp.o``,
``obj.AIEXTransforms.dir/AIEDMATasksToNPU.cpp.o``, and
``obj.AIEXTransforms.dir/AIEDmaToNpu.cpp.o`` targets); the static
libraries ``libAIETransforms.a`` and ``libAIEXTransforms.a`` link
successfully. The final ``aie-opt`` link fails locally on a
host-config issue (lld absent; pre-existing CMake configuration uses
``-fuse-ld=lld``) -- handing off to the orchestrator's wheel-rebuild

Signed-off-by: Matt Davis <matteius@gmail.com>
…r propagation

Two changes mirroring the SparseFifo discardable-attr propagation:

1. unrollForLoops: skip ObjectFifoAcquireOp on creators carrying
   aie.variable_rate. The producer's loop body for a variable-rate
   fifo contains a conditional acquire/release that the LCM-unroll
   math cannot model; the runtime-counter machinery handles
   asymmetric rates correctly without unrolling. If the loop has
   only variable-rate accesses, the loop is left alone (treated
   like a loop with no objectfifo accesses).

2. Split-fifo attr propagation: copy aie.variable_rate from the
   original ObjectFifoCreateOp to the consumer-side fifo so both
   halves carry the marker.

Adds two lit tests:
  - variable_rate_fifo_attr_propagation.mlir
  - variable_rate_fifo_skip_unroll.mlir
- MemtileAggregator: drop the layout='window' parameter that was
  reserved-but-unimplemented (raised NotImplementedError on use).
  Slab is the only supported layout; window-major callers should drop
  down to ObjectFifo.prod().join with explicit dims_to_stream. Removes
  layout from __init__ signature, _VALID_LAYOUTS constant, validation,
  the .layout property, the docstring discussion, the __str__, and the
  two associated tests.

- AccumFifo precision tests: rename to reflect what they actually
  measure (numpy reference baselines for the LSTM workload, not tests
  of the AccumFifo lowering). Drop the 'load-bearing falsifiable claim'
  framing from docstrings.

  test_accum_fifo_invariant_hits_1e_minus_5_precision_target
    -> test_fp32_lstm_reference_matches_pytorch
       (a fixture sanity check; FP32 numpy LSTM matches FP32 pytorch)
  test_accum_fifo_beats_bf16_baseline_by_three_orders_of_magnitude
    -> test_fp32_reference_beats_bf16_writeback_by_3oom
       (workload sensitivity characterization, not AccumFifo behaviour)

  Helper renamed: _lstm_cell_with_accum_fifo -> _lstm_cell_fp32_reference

- SparseFifo round-trip test: rename + reframe.
  test_decompressed_matmul_bit_equal_to_sparse_reference
    -> test_nm_compression_roundtrip_is_lossless_on_compliant_input
  Comment block above clarifies these are property tests for the N:M
  compression format itself (numpy reference), not tests of SparseFifo's
  lowering or silicon behaviour.
Address review issue 8: the prior example always called acquire/release
on the producer side and never invoked discard(1), so it didn't actually
demonstrate variable-rate behaviour.

Restructure the example kernel to a Python-level deterministic
alternating-skip pattern: every other window is forwarded via the C++
copy kernel + acquire/release; the alternate window is dropped via
out_handle.discard(1). The producer's loop now has asymmetric
acquire/release counts on the variable-rate output fifo, which is what
the aie.variable_rate=true marker tells the lowering pass to expect.

The C++ kernel is simplified to a void window-copy (skip decision
moved to the Python layer). Module comment is rewritten to describe
the deterministic-skip pattern accurately, with a note that
predicate-decided runtime skips would require a first-class scf.if
lowering not currently in IRON Python.
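
The alternating-skip pattern can be modelled on the host as below. This is a toy sketch: `ToyVariableRateOutput` and `run_producer` are hypothetical names, and the class only counts forwards and discards. It does not reproduce the real handle's acquire/release semantics or any lowering behaviour.

```python
class ToyVariableRateOutput:
    """Hypothetical host-side stand-in for a variable-rate output handle."""

    def __init__(self):
        self.forwarded = []
        self.discards = 0

    def forward(self, window):
        # Stands in for acquire -> window-copy kernel -> release.
        self.forwarded.append(list(window))

    def discard(self, count):
        # Stands in for dropping `count` windows without forwarding them.
        self.discards += count


def run_producer(windows, out):
    """Forward even-indexed windows and drop odd-indexed ones."""
    for index, window in enumerate(windows):
        if index % 2 == 0:
            out.forward(window)
        else:
            out.discard(1)
    return out
```

Because the skip decision depends only on the loop index, it is known when the Python program is traced, which is the distinction the commit message draws against predicate-decided runtime skips.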
…Snapshot export)

- register_fifo_handle: re-registering the *same* callable for the
  same class is now an idempotent no-op. This makes module reloads,
  repeated imports, and test harnesses safe. Registering a *different*
  handler still raises ValueError to catch accidental override.

- _RegistrySnapshot: removed from __all__. The leading underscore
  already signals 'test-internal use only'; the class is still
  accessible via direct import for tests that need it.

- Updated test_double_registration_raises -> split into
  test_double_registration_with_different_handler_raises +
  test_double_registration_with_same_handler_is_idempotent.
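
The registration rule described above could look roughly like this. It is a sketch only: the module-level `_REGISTRY` dict, the `(handle_cls, handler)` signature, and the MRO-based lookup in `dispatch_fn_arg` are assumptions for illustration, not the actual code in `fifo_handle_registry.py`.

```python
_REGISTRY = {}  # hypothetical storage: handle class -> handler callable


def register_fifo_handle(handle_cls, handler):
    """Register a handler; same handler twice is a no-op, a different one raises."""
    existing = _REGISTRY.get(handle_cls)
    if existing is handler:
        return  # idempotent: safe under module reloads and repeated imports
    if existing is not None:
        raise ValueError(
            f"{handle_cls.__name__} already has a different handler registered"
        )
    _REGISTRY[handle_cls] = handler


def dispatch_fn_arg(arg):
    """Call the handler of the most specific registered class (assumed rule)."""
    for cls in type(arg).__mro__:
        handler = _REGISTRY.get(cls)
        if handler is not None:
            return handler(arg)
    return None  # no registered handler for this argument type
```

Comparing handlers by identity (`is`) rather than equality keeps the no-op case narrow: only the exact same callable is treated as a repeat registration.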
…FifoHandle

Address review issue 7: AccumFifoHandle and PacketFifoHandle deliberately
do not call super().__init__() because ObjectFifoHandle's constructor
requires an ObjectFifo with semantics (depth, dims_from_stream_per_cons,
_get_endpoint) that AccumFifo and PacketFifo do not have. They instead
stub the attributes ObjectFifoHandle exposes as properties directly.

Add an explicit class-docstring '.. note::' section in each documenting:
  - which attributes are stubbed
  - why super() is bypassed
  - that all_of_endpoints (which traverses _object_fifo) is overridden
  - that the proper long-term fix is a shared narrower base class

This is a documentation fix, not a behavioural change. The shared base
class refactor is intentionally out of scope for the initial primitive
landing; the bypass pattern is rare in upstream IRON code, and explicit
documentation keeps the next maintainer from being surprised.

(SparseFifoHandle and VariableRateFifoHandle inherit normally from
ObjectFifoHandle and do call super().__init__; only AccumFifoHandle
and PacketFifoHandle exhibit the bypass pattern.)
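
The bypass pattern can be shown in miniature. Every class and attribute name below is a hypothetical stand-in; the sketch does not mirror the real `ObjectFifoHandle` constructor or attribute set.

```python
class BaseHandle:
    """Stand-in for a base handle whose constructor needs a full FIFO object."""

    def __init__(self, fifo):
        self._object_fifo = fifo
        self._depth = fifo.depth  # fails for owners that have no depth

    @property
    def depth(self):
        return self._depth

    def all_of_endpoints(self):
        return self._object_fifo.endpoints()


class StubbedHandle(BaseHandle):
    """Subclass that skips super().__init__() and stubs the attributes itself."""

    def __init__(self, owner):
        # Deliberately no super().__init__(): `owner` lacks the FIFO semantics
        # the base constructor expects, so the backing attributes are set here.
        self._object_fifo = None
        self._depth = 1
        self._owner = owner

    def all_of_endpoints(self):
        # Overridden because the base version traverses _object_fifo.
        return list(self._owner.endpoints)
```

The subclass still passes `isinstance(handle, BaseHandle)` checks, which is the practical reason for inheriting at all; a narrower shared base class would remove the need for the stubs.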
…e context

Address review issue 5. The previous test suite covered the construction
surface, registry integration, and host-side behavioural simulation, but
didn't exercise PacketFifo.resolve() against a real MLIR context — so a
mismatch with the dialect's packetflow() signature would not have been
caught.

Add two new tests in test/iron/test_packet_fifo.py:

- test_resolve_emits_packetflow_op_in_module: builds a 2-producer /
  1-consumer PacketFifo inside an aie.device body, calls resolve(),
  and asserts the resulting MLIR contains 2 aie.packetflow ops (one
  per producer).

- test_resolve_idempotent_does_not_emit_twice: verifies the resolve()
  reentrancy guard.

Pattern mirrors the existing test_resolve_emits_cascade_flow_op_in_module
in test_cascade_fifo.py. Verified the call-site signature matches
dialects.aie.packetflow's __init__ (pkt_id, source, source_port,
source_channel, dests, keep_pkt_header).
@matteius matteius force-pushed the feature/iron-fifo-primitives branch from 707a7df to e036b85 Compare April 27, 2026 03:55
@matteius matteius marked this pull request as ready for review April 27, 2026 04:00
Copilot AI review requested due to automatic review settings April 27, 2026 04:00
Contributor

Copilot AI left a comment


Pull request overview

This PR extends the IRON Python API with multiple specialized FIFO abstractions (e.g., cascade, packet, sparse, variable-rate, memtile aggregation) and wires them through MLIR lowering/passes so they work end-to-end, while keeping the existing ObjectFifo API unchanged. It also introduces an extensible Worker.fn_args dispatch registry so new *FifoHandle subclasses can integrate without editing worker.py.

Changes:

  • Add fifo_handle_registry and update Worker.__init__ to use registry-driven fn_args dispatch (with ObjectFifoHandle pre-registered for backward compatibility).
  • Add new IRON primitives (notably SparseFifo, VariableRateFifo, CascadeFifo, MemtileAggregator) and corresponding tests/examples.
  • Extend pass-side plumbing for sparse compression (aie.enable_compression) and variable-rate unroll skipping / split-fifo attr propagation.

Reviewed changes

Copilot reviewed 29 out of 29 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| `test/objectFifo-stateful-transform/variable_rate_fifo_skip_unroll.mlir` | Lit test for excluding variable-rate fifos from LCM-based loop unrolling. |
| `test/objectFifo-stateful-transform/variable_rate_fifo_attr_propagation.mlir` | Lit test for split-fifo propagation with aie.variable_rate. |
| `test/objectFifo-stateful-transform/sparse_fifo_split_attr_propagation.mlir` | Lit test for split-fifo propagation of sparse compression attrs. |
| `test/iron/test_worker_fifo_handle_extension.py` | Pytest coverage for the new registry dispatch API (idempotency, precedence, snapshotting). |
| `test/iron/test_sparse_fifo.py` | Pytest coverage for SparseFifo surface + validation + lowering metadata behavior. |
| `test/iron/test_packet_fifo.py` | Pytest coverage for PacketFifo surface/registry/toy behavior + lowering smoke tests. |
| `test/iron/test_memtile_aggregator.py` | Pytest coverage for MemtileAggregator construction and equivalence checks (currently has syntax/API issues). |
| `test/iron/test_fifo_handle_program_walk.py` | Contract tests for `*FifoHandle.all_of_endpoints()` to support Program walks. |
| `test/iron/test_cascade_fifo.py` | Pytest coverage for CascadeFifo surface + lowering (aie.cascade_flow) + idempotency. |
| `test/iron/test_accum_fifo.py` | Pytest coverage for AccumFifo surface + warnings + lowering + reference fixtures. |
| `test/Conversion/DmaToNpu/dma_to_npu_sparse_compression.mlir` | Lit test pinning the final hop of sparse compression bit emission in NPU BD words. |
| `python/iron/worker.py` | Switch fn_args handling to registry-driven dispatch via dispatch_fn_arg. |
| `python/iron/variable_rate.py` | Implement VariableRateFifo + handle + attr pinning + registry registration. |
| `python/iron/sparse.py` | Implement SparseFifo + handle + sparsity validation + attr pinning + registry registration. |
| `python/iron/memtile.py` | Implement MemtileAggregator helper around ObjectFifo.prod().join() topology. |
| `python/iron/dataflow/fifo_handle_registry.py` | New registry module: register/unregister/list/dispatch + snapshot helper for tests. |
| `python/iron/dataflow/__init__.py` | Pre-register ObjectFifoHandle handler to preserve prior Worker bookkeeping behavior. |
| `python/iron/cascade.py` | Implement CascadeFifo lowering to aie.cascade_flow and placement-facing surface. |
| `python/iron/__init__.py` | Export the new primitives/handles from aie.iron. |
| `python/iron/VARIABLE_RATE_DESIGN.md` | Design doc capturing rationale and lowering model for VariableRateFifo. |
| `programming_examples/basic/variable_rate_filter/variable_rate_filter.py` | End-to-end example exercising VariableRateFifoHandle.discard(1). |
| `programming_examples/basic/variable_rate_filter/filter_first_byte_even.cc` | Companion C++ kernel (currently a pure copy) for the example. |
| `programming_examples/basic/variable_rate_filter/README.md` | Documentation for the variable-rate example and expected MLIR markers. |
| `programming_examples/basic/variable_rate_filter/Makefile` | Build plumbing for the variable-rate example. |
| `lib/Dialect/AIEX/Transforms/AIEDmaToNpu.cpp` | Emit Enable_Compression bit (word[1] bit 31) when aie.enable_compression is present. |
| `lib/Dialect/AIEX/Transforms/AIEDMATasksToNPU.cpp` | Forward aie.enable_compression from aie.dma_bd to aiex.npu.writebd. |
| `lib/Dialect/AIE/Transforms/AIEObjectFifoStatefulTransform.cpp` | Propagate sparse compression attrs into BDs, propagate attrs through split-fifo, and skip variable-rate fifos in LCM-unroll. |
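
The summary for AIEDmaToNpu.cpp places Enable_Compression at bit 31 of BD word 1. As a rough illustration of what setting that bit means on a list of 32-bit words (the helper name and the list-of-ints representation are assumptions; the real pass is C++ and builds the words differently):

```python
ENABLE_COMPRESSION_WORD = 1   # word index, per the file summary
ENABLE_COMPRESSION_BIT = 31   # bit index within that word, per the file summary
WORD_MASK = 0xFFFFFFFF


def set_enable_compression(bd_words, enabled):
    """Return a copy of bd_words with the compression bit set or cleared."""
    words = list(bd_words)
    flag = 1 << ENABLE_COMPRESSION_BIT
    if enabled:
        words[ENABLE_COMPRESSION_WORD] |= flag
    else:
        words[ENABLE_COMPRESSION_WORD] &= ~flag & WORD_MASK
    return words
```

All other bits of word 1 are left untouched, which is the property a lit test pinning the emitted BD words would be expected to check.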


Comment thread test/iron/test_memtile_aggregator.py Outdated
Comment thread test/iron/test_memtile_aggregator.py Outdated
Comment thread python/iron/worker.py Outdated
Comment thread test/objectFifo-stateful-transform/variable_rate_fifo_attr_propagation.mlir Outdated
Comment thread programming_examples/basic/variable_rate_filter/variable_rate_filter.py Outdated
Comment thread programming_examples/basic/variable_rate_filter/README.md Outdated
Comment thread python/iron/sparse.py
Comment thread test/iron/test_memtile_aggregator.py
Matt Davis and others added 2 commits April 27, 2026 00:14
The earlier mass-strip pass over project-internal task IDs left several
files with structural damage (orphaned triple-quote opens swallowing
function bodies, broken comment fragments, and one missing list entry
that made SparseFifo() crash by default). All flagged by Copilot;
none surfaced earlier because most are inside docstrings (still
syntactically valid) or invariants tested only when an actual handle
is constructed.

Fixes:

- test/iron/test_memtile_aggregator.py: a stray '"""' on
  _make_t53m_aggregator's first body line silently turned the
  helper's body PLUS five subsequent test functions into one giant
  docstring, leaving the helper as 'return None' and dropping
  test_producer_returns_object_fifo_handle, test_producers_returns_n_handles_in_order,
  test_producer_index_out_of_range_raises, test_consumer_returns_object_fifo_handle,
  and test_offsets_match_flat_concat from pytest collection.
  Restored a proper one-line docstring + function body; reconstructed
  the test_offsets_match_flat_concat function body (the closing
  '"""' on line 176 was its docstring tail). Also dropped
  agg.layout assertion that referenced the now-removed property.
- python/iron/sparse.py: added (2, 4) back to
  _AM020_VERIFIED_NM_PATTERNS. The strip ate the line, and the
  defaults are N=2 M=4 — so SparseFifo() with no args raised on
  every construction.
- python/iron/worker.py: rewrote the FIFO-handle dispatch comment
  block (mid-sentence fragments + unmatched parenthesis from line
  drops) into complete sentences.
- test/objectFifo-stateful-transform/variable_rate_fifo_attr_propagation.mlir:
  fixed the orphaned 'or wedge under the split-fifo + cross-column
  path that the' fragment in the header — restored a complete
  sentence describing what the lit test verifies.
- programming_examples/basic/variable_rate_filter/{variable_rate_filter.py,README.md,filter_first_byte_even.cc}:
  prior commit changed the example to a Python-level alternating
  skip pattern (calling discard(1) in the skip branch), but the
  comments + README still claimed the C++ kernel did a per-window
  predicate check. Updated both to describe the deterministic
  Python-skip + window-copy kernel that the code actually
  implements; noted that runtime-decided per-window skips would
  require a first-class scf.if lowering.
- Final residual sweep: removed remaining 'Phase 1' / 'bio-on-XDNA'
  references from accum.py / dataflow/__init__.py / memtile.py /
  test_cascade_fifo.py / test_worker_fifo_handle_extension.py.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 30 out of 30 changed files in this pull request and generated 8 comments.



Comment thread python/iron/cascade.py Outdated
Comment thread python/iron/sparse.py Outdated
Comment thread python/iron/variable_rate.py Outdated
Comment thread test/iron/test_fifo_handle_program_walk.py Outdated
Comment thread programming_examples/basic/variable_rate_filter/variable_rate_filter.py Outdated
Comment thread programming_examples/basic/variable_rate_filter/README.md Outdated
Comment thread python/iron/VARIABLE_RATE_DESIGN.md Outdated
Comment thread test/iron/test_accum_fifo.py
Matt Davis added 2 commits April 27, 2026 00:31
Adds a basic programming example that measures per-launch dispatch
overhead on AIE2P silicon by exposing a trivial single-tile passthrough
kernel under a parameterised IRON topology.

The example varies two orthogonal knobs (n_chunks and dense_bytes) so a
host driver script can regress per-launch wall against each and
attribute the per-launch floor to:

  (a) xrt::run.wait() return-path overhead
  (b) instruction-stream upload cost
  (c) per-chunk shim-DMA setup overhead × N_CHUNKS
  (d) AIE2P firmware dispatcher per-launch handshake

Files:

- dispatch_overhead_bisector.py: IRON topology with N_CHUNKS separate
  fill/drain tasks per launch (so each chunk lowers to its own shim
  BD; the IRON access-pattern collapse is opted out of via per-chunk
  TensorAccessPattern).
- passthrough.cc: 64-byte-vectorised memcpy compute kernel (no
  arithmetic — pushes compute to the noise floor so the host-runner
  wall captures dispatch-layer cost).
- test.cpp: host runner; reports per-iteration wall-time distribution
  as KEY=VALUE lines for machine parsing, plus an optional per-iter
  CSV.
- Makefile: builds one variant by default; `all-variants` builds the
  representative bisection sweep.
- README.md: methodology + build/run/sweep instructions.
Six items, all addressed:

- python/iron/cascade.py: CascadeFifo.resolve() previously set
  self._resolving = True before the tile-op precondition checks and
  never cleared it on exception, leaving the instance permanently
  non-resolvable. Move precondition checks BEFORE setting _resolving,
  and wrap the actual emission in try/finally so failures don't latch
  the flag. Also add an idempotency guard on self._op so a successful
  resolve() is a no-op on re-entry.

- python/iron/sparse.py: deduplicate the (2, 4) entry in
  _AM020_VERIFIED_NM_PATTERNS (added twice during the previous fix;
  set semantics dedupe at runtime but the literal is misleading).

- python/iron/variable_rate.py: fix the malformed "Architectural
  references" bullet list in the module docstring (orphan
  continuation lines + truncated bullet from the earlier strip pass).

- test/iron/test_fifo_handle_program_walk.py: replace the
  '"""+ : contract tests..."""' diff-artifact module docstring with
  a clean one-liner.

- programming_examples/basic/variable_rate_filter/{variable_rate_filter.py,README.md}:
  replace hardcoded /home/matteius/... paths in the build snippets
  with placeholder syntax (<path/to/...>).

- python/iron/{accum.py,VARIABLE_RATE_DESIGN.md}, test/iron/test_accum_fifo.py:
  replace remaining "T7-IRON" references (the strip regex required a
  digit after the dot, which ".IRON" fails to match) with hardware-
  grounded language about the well-trodden cascade-stream path.
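
The resolve() fix in the first item follows a common guard shape, sketched below. `ToyCascadeFifo` and its `emit` callback are hypothetical; only the ordering (preconditions first, flag cleared in `finally`, early return once `_op` is set) is taken from the commit message.

```python
class ToyCascadeFifo:
    """Hypothetical model of the resolve() guard; not the real CascadeFifo."""

    def __init__(self, src_tile_op, dst_tile_op, emit):
        self._src = src_tile_op
        self._dst = dst_tile_op
        self._emit = emit        # callable standing in for op emission
        self._resolving = False
        self._op = None

    def resolve(self):
        if self._op is not None:
            return self._op      # already resolved: re-entry is a no-op
        # Preconditions run BEFORE the flag is set, so a failure here
        # cannot leave the instance marked as resolving.
        if self._src is None or self._dst is None:
            raise ValueError("both tile ops must exist before resolve()")
        if self._resolving:
            raise RuntimeError("re-entrant resolve()")
        self._resolving = True
        try:
            self._op = self._emit(self._src, self._dst)
        finally:
            self._resolving = False  # cleared on success and on exception
        return self._op
```

With this shape, an exception during emission leaves the object in a retryable state, and a second successful call does not emit a second op.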
matteius pushed a commit to opensensor/bionpu that referenced this pull request Apr 27, 2026
Three more subsystems promoted to the public bionpu package:

src/bionpu/data/:
- canonical_sites.py (276 LOC): Cas-OFFinder TSV normaliser. Now the
  canonical home; bionpu.verify.crispr imports from here. Adds
  serialize_canonical() helper so verify can compute SHA-256 over
  the canonical wire form without touching the filesystem.
- fetchers/{doench_2016,guide_seq,pod5_hg002,reference_genomes}.py
  + fetchers/__init__.py (495 LOC framework): per-dataset public
  fetchers with SHA-pinning, framework documents the requirements
  every dataset entry must satisfy.
- load_smoke.py (113 LOC): in-repo smoke fixture loaders.

src/bionpu/quant/:
- calibrate.py (215 LOC): ONNX quantization calibration driver
  (thin wrapper around onnxruntime.quantization).
- passport.py (196 LOC): quantization passport — every quantized
  model in the repo carries one (calibration source, op recipe,
  reproducibility hash).
- peano_export.py (102 LOC): quantized ONNX -> MLIR-AIE -> xclbin
  lowering hook.

src/bionpu/iron_extensions/:
- cascade_stream.py (387 LOC): cascade-chain IRON helper. Largely
  superseded by mlir-aie's CascadeFifo (Xilinx/mlir-aie#3039) but
  still consumed by 5 genetics files.

Internal bits NOT migrated (kept in genetics-private):
- bionpu/report/* — gaps-yaml aggregator + writeup pipeline; tightly
  coupled to internal task-tracking format.

Refactor:
- src/bionpu/verify/_crispr_canonical.py removed; bionpu.verify.crispr
  now imports from bionpu.data.canonical_sites (the public canonical
  home).

License: Apache-2.0 + LLVM exception → GPL-3.0 across all migrated
files. Project-internal task IDs / outer-repo paths scrubbed.

All 18 verify-harness tests still pass; 71/71 Python files in the
public package parse.
matteius pushed a commit to opensensor/bionpu that referenced this pull request Apr 27, 2026
…le findings

The scrubber on the initial migrations missed several patterns:
- Hyphenated task IDs (T7-IRON, T1-swarm) — regex required \.\d after T<n>
- Investigation-phase labels (Followup A/B/C/E/G, Stage \d)
- Tilde-rooted absolute paths (~/xdna-bringup/...)
- Project-internal filename refs (gaps.yaml, umbrella PRD, gap-id)
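
The first miss in the list is easy to reproduce. The patterns below are assumed reconstructions for illustration (the actual scrubber regexes are not shown in this thread); they demonstrate why a pattern requiring `.` plus a digit skips hyphenated IDs.

```python
import re

# Assumed shape of the original pattern: T<n> followed by ".<digit(s)>".
DOTTED_ID = re.compile(r"\bT\d+\.\d+\b")

# Hypothetical widened pattern that also accepts "-<word>" suffixes.
DOTTED_OR_HYPHENATED_ID = re.compile(r"\bT\d+(?:\.\d+|-[A-Za-z]+)\b")


def find_ids(pattern, text):
    """Return every task-ID-like token that `pattern` finds in `text`."""
    return pattern.findall(text)
```

Running both patterns over a sample such as `"see T7.3, T7-IRON and T1-swarm"` shows the narrow pattern catching only `T7.3` while the widened one catches all three.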

Pass:

- 47 files mass-scrubbed of the missed patterns (34 code + 13 docs).
- src/bionpu/iron_extensions/INVENTORY.md: rewritten from a 405-line
  internal investigation report into a brief module note. The
  cascade-stream feasibility outcome is now upstream
  aie.iron.CascadeFifo (Xilinx/mlir-aie#3039); this module is
  documented as a back-compat shim that will be removed once the
  upstream merge is the mlir-aie floor.
- src/bionpu/kernels/basecalling/lstm_cell_bf16_acc_cascade/DESIGN.md:
  rewritten from a 406-line investigation log (Followup A-G stages,
  WEDGE-vs-PASS status flips, hypothesis falsification entries)
  into a 60-line design note covering the architectural rationale
  (FP32-cascade vs bf16-writeback precision wall, AM020 references,
  topology diagram, file inventory, known limitations).
- README.md: dropped the dead 'v0.1 release notes' link to a tag
  that doesn't exist; rewrote the Status section to point at the
  new docs/STATUS.md per-subsystem inventory.
- docs/STATUS.md: NEW — explicit per-subsystem table of what works
  end-to-end vs what's an extracted module that needs a v0.2 driver
  to drive on hardware. Distinguishes ✅ working / ✅ extracted /
  ⚠️ deprecated / ⚠️ v0.2 scope.
- docs/REPRODUCE.md: rewrote from placeholder text into a real
  reproduction recipe — hardware prerequisites, software install,
  pip-install bionpu, byte-equality smoke check + negative control,
  manual kernel build, host-runner invocation, energy-methodology
  pointer, sanity-log discipline.
- CHANGELOG.md: NEW — v0.1 release notes covering what landed and
  what's deferred to v0.2.

Other per-kernel DESIGN.md files were checked: only the one cascade
file carried a substantive investigation log; the others were already
clean after the earlier scrub passes.

All 18 verify-harness tests still pass; 70/70 Python files parse.
@hunhoffe
Collaborator

Hi @matteius -- this is neat work you've started! I was hoping to get a high-level overview of:

  1. What concrete problems/limitations require these changes?
  2. Is there a way to stage this work into smaller pieces?

We typically discuss larger design decisions in an issue and/or discussion before considering integration, so we'd like to engage in some thought before considering ObjectFIFO redesigns.

@matteius
Author

Hi @hunhoffe — context first: I set out to implement genomics algorithms on the AIE2P NPU in my Ryzen AI laptop, working backward from AMD's documentation (AM020 / AM029) on what the hardware supports. Working PoCs are at https://github.com/opensensor/bionpu — the primitives in this PR fell out of repeatedly re-discovering the same composition decisions across those workloads. High-level answers below.

Concrete limitations behind each primitive

Each one exposes an AM020 / AIE2P hardware feature that doesn't currently have a first-class IRON surface; workloads re-glue dialect ops + ad-hoc Python per use site:

| Primitive | Limitation removed | Hardware ref |
| --- | --- | --- |
| CascadeFifo | cascade_flow is a raw dialect op without an IRON producer/consumer surface | AM020 Ch. 4 |
| PacketFifo | packetflow + TLAST + out-of-order BD reimplemented per workload | AM020 Ch. 2 Fig. 17 + Ch. 5 p. 74 |
| AccumFifo | 512-bit BM-register accumulator state across timesteps and tiles via cascade-stream BM transfer | AM020 Ch. 4 p. 67 |
| SparseFifo | BD Enable_Compression bit unreachable from IRON; needs DMA-pass plumbing through AIEDmaToNpu / AIEDMATasksToNPU | AM029 N:M decomp |
| MemtileAggregator | 4-into-1 memtile fan-in topology re-derived per workload | AM020 Ch. 5 p. 74 |
| VariableRateFifo | Producer-side discard for sparse-emit ring buffers; needs unrollForLoops skip + split-fifo attr propagation | mlir-aie split-fifo / unroll |

The bundling reflects how they compose in the bionpu workloads: CascadeFifo + AccumFifo + SparseFifo together for LSTM cascade chains (basecalling), PacketFifo + VariableRateFifo + MemtileAggregator together for sparse-emit scan patterns (CRISPR genome scan).

Staging

One way to split, in increasing review surface:

  1. Foundation. Worker.fn_args FifoHandle registry + two *FifoHandle-subclass-contract broadenings in Program._walk_object_fifos and iron/program.py. Pure broadening; existing ObjectFifoHandle paths unchanged.
  2. Pure-Python wrappers. CascadeFifo, PacketFifo, MemtileAggregator. Existing dialect ops behind the IRON producer/consumer surface; no MLIR pass changes.
  3. Pass-side primitives. SparseFifo (BD Enable_Compression through three passes), VariableRateFifo (unroll skip + split-fifo attr propagation).
  4. AccumFifo. FP32 inter-tile state + cascade-stream BM-register transfer; the largest abstraction-shape question.

Does that map to what you'd find reviewable, or would you split it differently?

Process

Would issues for the Tier 3 + Tier 4 design surfaces (SparseFifo, VariableRateFifo, AccumFifo) be the right starting point? That's where pass-side and abstraction-shape decisions benefit most from discussion before code review. For Tier 1 + Tier 2 (API-shaped, no pass changes) — would you want issues for those too, or are small per-primitive PRs OK once the foundation broadening is in?

@hunhoffe
Collaborator

hunhoffe commented Apr 29, 2026

Hello @matteius! Thanks for your response. We are eager to ensure mlir-aie + IRON can support use cases for application development, and it seems you have found a few gaps in feature coverage. We'd like to work with you to get those gaps filled.

The path forward that makes the most sense to me is to:

  1. Create a top-level issue to keep track of associated PRs and encapsulate the overall goals
  2. Break into a series of PRs based on priorities that are sequentially submitted

For the breakdown of capabilities/PRs, I think the clearest path to success would be to:

  1. Start with any needed extensions to mlir-aie. This is highest priority for us as this represents a concrete expressibility gap between hardware features and the core mlir-aie dialect. Make sure each PR only contains one feature and adequate test coverage of that feature.
  2. Break each top-level IRON feature into a separate PR. That PR description needs to include how you would express the pattern without the feature (e.g., baseline) vs how you would express the pattern with the new feature in order to demonstrate the clear benefit of each proposed change. Each feature will also need clear testing and framing within repo documentation and/or programming examples.

Let us know if you have further questions; I'm excited to see what comes of your work so far!

I also wanted to ask -- in opensensor/bionpu, do you have a specific applications/algorithm focus, or are you just generally exploring the space?

@matteius
Author

Thanks @hunhoffe, I appreciate the staging proposal. It's a reasonable path for upstream integration.

> do you have a specific applications/algorithm focus, or are you just generally exploring the space?

On that question, I recommend reading through https://github.com/opensensor/bionpu/blob/main/docs/AIE2P-CRISPR-shape-validation.md -- by selectively engaging the NPU for parts of the CRISPR search path, I arrived at a 4-5x speedup on a CRISPR editor I was building.

I'm going to step back from upstreaming this work for now. The repackaging effort (top-level issue, per-feature PRs with baseline-vs-new framing, and the iterative review cycles for each) is more bandwidth than I wish to commit to right now alongside other things I have going on.

I'll continue developing on my fork and will keep it rebased on Xilinx/mlir-aie:main periodically so the bionpu workloads stay current. If any of the primitives turn out to be useful to the broader community and someone wants to drive an upstream effort using this branch as a reference, I'm happy to support that: I can answer questions, clarify design decisions, and point at the AM020 / AM029 sections each primitive is grounded in.

For anyone landing on this thread looking for the working code: the primitives live on opensensor:feature/iron-fifo-primitives and are exercised end-to-end in the bionpu research repo here: https://github.com/opensensor/bionpu

Thanks again for engaging with it -- no hard feelings on the process ask; it's the right ask for a project of this scope. It's just not the right time on my end, as I'm always hustling to make ends meet.

@hunhoffe
Collaborator

hunhoffe commented May 6, 2026

Hi @matteius! I understand. I've created an issue so this PR doesn't get too lost: #3050

Feel free to propose additional contributions or file issues in the future!
