Skip to content

Port MobileNet to the IRON API#3059

Merged
hunhoffe merged 101 commits into
mainfrom
mobilenet-py
May 30, 2026
Merged

Port MobileNet to the IRON API#3059
hunhoffe merged 101 commits into
mainfrom
mobilenet-py

Conversation

@hunhoffe
Copy link
Copy Markdown
Collaborator

@hunhoffe hunhoffe commented May 8, 2026

Adds an IRON implementation of MobileNet (aie2_mobilenet_iron.py) that produces a functionally-equivalent xclbin for Strix. The original placed-API design (aie2_mobilenet.py) did the hard work of validating the network topology, kernel selection, and tile placement on Strix; this PR ports those proven decisions into a more declarative form using IRON.

Summary

  • IRON design: bottleneck families expressed as reusable per-family helpers (bottleneck/{regular,cascade,pipeline}.py) driven by a single NETWORK spec; scale factors carried as a dict.
  • StaticWeightStream (lowlevel_dma.py): MemTile→compute weight DMA with optional ping-pong, for layers where ObjectFifo's semantics weren't a fit.
  • CascadeFlow in IRON: first-class CascadeFlow(src, dst) that lowers to aie.cascade_flow after placement.
  • ObjectFifo additions: disable_synchronization, allocate_on.
  • Validation: mobilenet_numpy.py bit-exact reference, per-block IRON e2e + numpy lit suites.

Test plan

  • make build/final_mobilenet.xclbin + test_mobilenet.py pass on Strix against data/aie_output.txt.
  • run_e2e.lit and run_numpy_per_bn.lit green.
  • test/python/npu-xrt/test_cascade_flow.py passes.

Note for reviewers

AIE output has a known max-delta=9 vs the brevitas golden (mlir-aie #3009); test_mobilenet.py follows the existing convention of atol=9 against data/golden_output.txt.

hunhoffe and others added 30 commits April 13, 2026 15:25
- Add CascadeFlow class to python/iron/dataflow for cascade stream connections
- Add PersistentBuffer stub to python/iron/dataflow for static weight DMA
- Add Runtime.cascade_flow() method for registering cascade connections
- Resolve CascadeFlow ops in Program.resolve_program() after worker resolution
- Export CascadeFlow and PersistentBuffer from python/iron top-level
- Account for PersistentBufferHandle DMA channels in SequentialPlacer
- Add test_cascade_flow.py hardware test for new cascade API
- Add bottleneck/ directory with IRON rewrites of bn0-bn14 blocks:
  - regular.py: bn0-bn9 (single tile per block)
  - pipeline.py: bn10-bn12 (one tile per layer pipeline)
  - cascade.py: bn13-bn14 (5-tile cascade stream blocks)

Scale factors are compile-time Python int constants in Worker fn_args;
no RTP buffers or NpuWriteRTPOp calls needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- ObjectFifo.set_repeat_count(): store pre-resolve, emit during resolve()
  as attribute on the MLIR op; distinct from set_iter_count()
- ObjectFifoHandle.set_repeat_count(): delegates to underlying ObjectFifo
- cascade.py: replace per-tile weight ObjectFifos with full-weight fifos
  that split at MemTile using cons().split() — matches object_fifo_link
  with offsets pattern from original placed dialect code
- cascade.py: replace element-wise skip copy workers with forward()
  (MemTile DMA forwarding, no extra compute tile needed)
- cascade.py: set_repeat_count(InH=7) on cascade weight sub-fifos
- pipeline.py: fix weight filenames to match bn10_1_chain.txt convention
- pipeline.py: fix Kernel .o paths (remove erroneous data_dir prefix)
- pipeline.py: replace element-wise bn11 skip copy with forward()
- aie2_mobilenet_iron.py: initial orchestrator (cascades wts_fifos)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- pipeline.py: replace self-loop ObjectFifo (bn12 DW->PW) with tile-local
  Buffer; both are tile-SRAM but Buffer avoids lock/BD overhead for same-tile
  sequential ops where no sync is needed
- pipeline.py: remove extra bn11 skip copy Worker; use forward() instead
- cascade.py: fix wts_fifos return to use full-weight fifos (host fills these,
  MemTile splits to put/get via cons().split())
- aie2_mobilenet_iron.py: add cascade weight DMA in runtime sequence;
  fix act_in depth to 5 (allow 3-row window); fix cascade 4-tuple unpack;
  fix bn0 scale factor count (no sfAdd for 2-layer block)
- objectfifo.py: add set_repeat_count() with pre-resolve storage pattern,
  matching how set_iter_count() works; apply during resolve()

NOTE: regular.py has structural issue - ObjectFifos declared inside worker
function bodies; needs to be moved outside and passed as fn_args per IRON API.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
IRON framework (python/helpers/dialects/func.py):
- Auto-cast index-typed args to expected IntegerType in func.call()
  so loop induction vars from range_() work without manual IndexCastOp

regular.py:
- Move internal ObjectFifos outside worker function bodies; pass both
  .prod() and .cons() to the same Worker (IRON self-loop ObjectFifos)
- Fix all Kernel .o paths (remove os.path.join(data_dir, ...) prefix)

pipeline.py:
- Fix bn12 DW/PW weight buffers: split combined buffer into separate
  bn12_dw_wts and bn12_pw_wts (matching separate chain files)
- Fix bn11 skip: use forward() instead of copy Worker
- Fix weight filenames to *_chain.txt convention

cascade.py:
- Fix L3 split weight size: 480*80=38400 (not 480*40=19200)
- Remove copy Workers; use forward() for skip connections
- Use cons().split() + set_repeat_count() for cascade weight fifos

aie2_mobilenet_iron.py:
- Fix runtime sequence to fill cascade weight fifos
- Fix act_in depth and post_l1 worker pattern
- Fix FC weight buffer size in Kernel signature

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Each worker function in regular.py now uses memref_view(wts_buf.op, ...)
to extract per-layer weight sub-views from the combined weight Buffer.
This matches the original placed-dialect code's memref_view pattern and
produces correct kernel argument types.

All blocks fixed: bn0, bn1, bn2, bn3, bn4+5, bn6, bn7, bn8+9.

The IRON design now generates valid MLIR (2438 lines).
Comparison with original:
  - cascade_flow ops: identical (4 ops, same tile pairs)
  - aie.core count: 32 (matches)
  - aie.objectfifo count: 68 IRON vs 70 original (2 fewer)
  - aie.tile count: 40 IRON vs 48 original (8 fewer shim/memtile)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Switch MLIR_FILE from aie2_mobilenet to aie2_mobilenet_iron.
The IRON design generates valid MLIR (2438 lines, verified).

Key stats vs original:
  - cascade_flow ops: identical (4 ops, same tile pairs)
  - aie.core count: 32 (matches)
  - Code: ~3500 lines vs ~7000 original (2x reduction)

Framework changes also included:
  - python/helpers/dialects/func.py: auto index->int cast in func.call
  - python/iron/dataflow/objectfifo.py: set_repeat_count() pre-resolve storage

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
bn13 and bn14 L3 weight fifos (put/get) now have repeat_count=7,
matching the original design's set_repeat_count(RepeatChannels=7) calls.

Structural comparison with original:
  aie.core:        32 = 32  ✓
  cascade_flow:     4 = 4   ✓
  repeat_count:     8 = 8   ✓
  runtime_sequence: 1 = 1   ✓
  aie.objectfifo:  68 ≈ 70 (2 fewer: cascade L2 DW as Buffer not fifo)
  aie.buffer:      26 < 59 (missing: PostL2 FC MemTile buffers, PersistentBuffer TODO)
  aie.tile:        40 < 48 (IRON declares tiles on-demand, not all upfront)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Without sep=',', np.fromfile reads the text file as raw binary bytes,
producing an array of ASCII character codes (135983 bytes) instead of
the actual int8 weights (38400 elements). This caused aiecc to fail
with 'elements hex data size is invalid' when parsing the generated MLIR.

All other bottleneck modules (regular.py, cascade.py) already had sep=','
set correctly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The original design places bn14 L2 DW at tile(6,2) which is the same row
as L3 GET at tile(5,2), allowing shared-memory (streaming) access without
consuming a DMA channel. Tile(7,4) was incorrect from the plan document
and caused 'number of input DMA channel exceeded' in aiecc.

Also fix pipeline.py _load_weights: add sep=',' to np.fromfile so comma-
separated weight chain files are parsed correctly (not as raw binary).
Without sep=',' the ASCII text was being read as raw bytes, producing wrong
weight data and invalid hex encoding in the MLIR dense attribute.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The original design places bn14_tile_layer2 at tile(6,2), adjacent to the
bn14_l3_get tile at tile(5,2). This adjacency allows aiecc to use
shared local memory instead of DMA for bn14_of_l2_l3_second, avoiding
the S2MM DMA channel overflow on tile(5,2).

Also corrects the comment in the docstring (plan had Tile(7,4) which was
based on an incorrect reading of the original code).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Distribute cascade weight and skip ObjectFifos across dedicated MemTiles
matching the original design, preventing BD exhaustion on any single channel.

Original MemTile assignments:
  bn13 L1 weights → MemTile(0,1)
  bn13 L3 weights → MemTile(1,1)
  bn14 L1 weights → MemTile(2,1)
  bn14 L3 weights → MemTile(3,1)
  bn13 skip       → MemTile(5,1)
  bn14 skip       → MemTile(7,1)

Before: mem_tile_0_1 had 8 fifos (exceeded 4 MM2S/S2MM DMA channels).
After:  max 5 fifos per memtile (within NPU2 resource limits).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ObjectFifo.__init__: accept disable_synchronization=True parameter,
passed through to the underlying aie.objectfifo MLIR op during resolve().

ObjectFifo.allocate_on(tile): redirect buffer allocation to a neighboring
tile using aie.objectfifo.allocate. Can be called before resolve() (tile
stored in _alloc_tile_ops and applied during f.resolve()) or after.

These two together allow fused bottleneck pairs (bn4+5, bn8+9) to keep
their internal self-loop ObjectFifos on the compute tile logically but
store their buffer memory on an adjacent tile's SRAM — avoiding the 64KB
tile memory overflow caused by 62KB weights + ~23KB internal fifo buffers.

This mirrors the original placed-dialect pattern:
  object_fifo("name", tile, {tile}, depth) {disable_synchronization=true}
  fifo.allocate(L1_tile)

Also wires disable_synchronization into objectfifo resolve() and applies
pending _alloc_tile_ops list during resolve().

Next steps:
- Call allocate_on() in regular.py for bn45 and bn89 internal fifos,
  targeting adjacent tiles (Tile(0,2) for bn45, Tile(3,4) for bn89)
- This matches original: bn4+5 allocates on init tile, bn8+9 on pipeline tile

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…airs

Apply disable_synchronization=True + allocate_on(adjacent_tile) to the
internal self-loop ObjectFifos in the bn4+5 and bn8+9 fused bottleneck
workers. This exactly matches the original placed-dialect design:

  aie.objectfifo @bn8_bn9_act_bn8_1_2(%tile, {%tile}, 3) {disable_synchronization=true}
  aie.objectfifo.allocate @bn8_bn9_act_bn8_1_2(%adj_tile)

Original allocation targets (from original MLIR analysis):
  bn4+5 internal fifos → Tile(0,2) (init tile, same as original)
  bn8+9 internal fifos → Tile(3,4) (pipeline bn11_l1 tile, same as original)

This redirects the ~23KB of intermediate activation fifo buffers off the
compute tile's 64KB SRAM, allowing the 62KB weight buffer to fit.

Also fix program.py to include _alloc_tile_ops in the tile resolution list
so hardcoded Tile(col,row) objects are resolved before fifo resolution.

Generated MLIR now contains:
  - 10x {disable_synchronization = true} attributes
  - 10x aie.objectfifo.allocate ops on correct neighboring tiles
  Matches original MLIR structure exactly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ocate_on targets

The objectfifo.allocate op requires memory-adjacent tiles. Added explicit
placement to the bn4+5 and bn8+9 fused workers to match the original design:

  bn45_worker: placement=Tile(1,2)  — original: bn4_5_tile = tile(1,2)
    allocate_on(Tile(0,2)) — init tile, adjacent (same row, col 1→0) ✓

  bn89_worker: placement=Tile(3,3)  — original: bn8_9_tile = tile(3,3)
    allocate_on(Tile(3,4)) — bn11_l1 tile, adjacent (same col, row 3→4) ✓

After AIEObjectFifoStatefulTransform, bn89 internal fifo buffers now appear
on tile(3,4) matching the original's memory layout exactly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…output

The bn7→bn8+9 boundary fifo uses shared local memory (not DMA) when tiles
are adjacent. This eliminates the consumer buffer from tile(3,3)'s SRAM.
  bn7_worker: placement=Tile(2,3) — adjacent to bn89_worker at Tile(3,3)

Split-depth output fifo act_bn9_out (bn8+9→pipeline): [1,2] matches the
original's [1,2] depth, allocating only 1 buffer on tile(3,3) instead of 2.

Memory layout for tile(3,3) now matches original:
  stack:          1024 bytes
  buf_7 (wts):  62192 bytes
  act_bn9_out:   1120 bytes (depth=1 producer side)
  Total:        64336 bytes < 65536 (64KB) ✓

Also fix ObjectFifo/ObjectFifoHandle to handle list (split) depths:
  - prod() extracts depth[0] for producer handle
  - cons() extracts depth[-1] for consumer handle
  - Handle.__init__: guard depth<1 with isinstance check

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds Worker placement=Tile() to all 8 pipeline workers to match the
original's tile(n,m) assignments. This ensures adjacent-tile pairs
use shared local memory (not DMA) for inter-worker ObjectFifos,
eliminating consumer buffer allocation on downstream tiles.

Original assignments (tileColIndex=0):
  bn10 L1: Tile(1,5)  bn10 L2: Tile(2,4)  bn10 L3: Tile(2,5)
  bn11 L1: Tile(3,2)  bn11 L2: Tile(3,4)  bn11 L3: Tile(2,2)
  bn12 L1: Tile(3,5)  bn12 L2+L3: Tile(4,4)

Memory after transform (all < 64KB):
  tile(2,5) bn10_L3: ~54784 bytes  tile(2,2) bn11_L3: ~38656 bytes
  tile(3,2) bn11_L1: ~38656 bytes  tile(3,4) bn11_L2:  ~4048 bytes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…treaming

Replaces the stub NotImplementedError with a working implementation that
matches the original placed-dialect pattern:
  buffer(memtile, large_type, name, initial_value=arr) + 4x lock() +
  flow(memtile, DMA, mm2s, compute, DMA, s2mm) + @memtile_dma + @mem

Use in orchestrator for post-processing weights:
  post-L1 (76,800 bytes): MemTile(4,1) → PostL1Tile(6,4), chunk=9,600
  post-L2 FC (819,200 bytes): 4x MemTile → 4x PostL2Tile, chunk=10,240

Also update program.py to:
  1. Resolve PersistentBuffer when found in fn_args (via PersistentBufferHandle)
  2. Include PersistentBuffer tiles in the tile resolution list

Structural comparison with original now matches on:
  aie.core(32), cascade_flow(4), repeat_count(8),
  aie.flow, memtile_dma, aie.mem

Remaining differences:
  - aie.objectfifo: 68 vs 70 (cascade L2 DW as Buffer)
  - aie.buffer: 31 vs 59 (no RTP buffers — replaced by compile-time constants)
  - aie.lock: 20 vs 28 (same reason)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… design

PersistentBuffer now supports two-buffer ping-pong via ping_pong_buf param:
  DMA BD chain: BD1(FC2 on MemTile_N+1) → BD2(FC1 on MemTile_N) → BD1 ...

This matches the original placed-dialect @memtile_dma pattern for PostL2:
  MemTile(1,1): mem_01_buff (FC1_0, 409,600B) + mem_11_buff (FC2_0, 409,600B)
  ping-pong DMA: BD1→BD2→BD1→... one cycle per PostOutputSplitL2=40 iteration

Structural comparison with original now matches on ALL key ops:
  aie.core(32) cascade_flow(4) repeat_count(8)
  aie.flow(5) memtile_dma(10) aie.mem(5) ← ALL IDENTICAL

Also: add ping_pong_memtile to tile resolution in program.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- init_conv2dk3.o: fix kernel filename (was init_conv2dk3_stride2.o)
- post-L1 kernel: use conv2dk1_xy_pool_fused_relu_large_padded_i8_ui8
  with link post_conv2dk1_relu_xy_pool_padded_i8_ui8.o (matches Makefile)
- post-L2 kernel: use post_L2_conv2dk1_relu_i16_ui16_pad
  with link post_L2_conv2dk1_relu_ui16_ui16_pad.o (matches Makefile)
- Output type: uint16 (not uint8) — FC output is 16-bit quantized
- PersistentBuffer: use unplaced locks (lock_id=None → AIEAssignLockIDs)
  eliminates manual lock ID tracking and conflict management

Build status: resource allocation passes, routing passes, Peano compiles
core (0,2) OK. Only remaining issue was missing kernel .o files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The AIEObjectFifoStatefulTransform pass has two modes:
- Without dynamic_objfifo_lowering: calls unrollForLoops() which unrolls
  inner scf.for loops by LCM(objectfifo depths). Combined with Peano's O1
  optimizer further unrolling constant-trip loops, this generates thousands
  of kernel call sites per core (~3360 for the init core), causing ELF text
  section overflow of the AIE tile program memory limit.
- With dynamic_objfifo_lowering=True: uses dynamicGlobalObjectFifos() which
  preserves loops and uses runtime acquire/release. This matches how the
  original placed-dialect design compiles with Peano, producing compact ELFs.

Also revert aiecc.cpp changes (globaldce + unroll-max-count flags) — we use
the same flags as the original CI build: default<O1> + -inline-threshold=10.

Also revert AIECoreToStandard.cpp buffer visibility change (was incorrect).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…t.py

The test expects (input, combined_cascade_wts, output) — 3 buffers.
Combined weight layout:
  bn13_L1 (76800) | bn13_L3 (76800) | bn14_L1 (76800) | bn14_L3 (76800)
  = 307200 bytes total

Use TensorAccessPattern with offsets to slice each sub-weight from
the combined buffer via separate DMA BDs, matching the original design's
npu_dma_memcpy_nd calls with offsets into weightsFromL3.

Generated MLIR confirms correct offset-based DMA slicing:
  @bn13_wts_l1_full: offset=0,      size=76800
  @bn13_wts_l3_full: offset=76800,  size=76800
  @bn14_wts_l1_full: offset=153600, size=76800
  @bn14_wts_l3_full: offset=230400, size=76800

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The MemTile cons_lock initialization was wrong:
- For ping-pong buffers: should be init=1 (one buffer ready at a time)
  Original: mem_11_cons_lock = lock(MemTile11, init=1)
- For single-buffer (post-L1): should be init=repeat_count=7
  Original: post_L1_wts_cons_lock = lock(MemTile41, init=PostRepeatChannels=7)

Using init=repeat_count for ping-pong (40) was causing the hardware deadlock
since the lock protocol expects only 1 unit to be released per DMA transfer.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…sform

The AIEObjectFifoStatefulTransform runs before AIEAssignLockIDs and requires
all lock_id attributes to be explicitly set. Using lock_id=None (auto-assigned)
caused an assertion failure: LockOp::getLockIDValue() requires lock IDs to be set.

This was the root cause of the hardware timeout — the MLIR pipeline failed
silently with incorrect code that deadlocked on hardware.

Lock IDs match the original placed-dialect design:
  post-L1: MemTile(4,1) lock_id=2,3; PostL1Tile(6,4) lock_id=0,1
  FC tiles: MemTile lock_id=0,1; compute tile lock_id=2,3

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The original placed-dialect design does NOT use dynamic_objfifo_lowering.
Using True caused the dynamicGlobalObjectFifos() path which:
  1. Does NOT generate blocking locks (non-blocking index-based access)
  2. Prevents loop unrolling by MLIR's unrollForLoops() pass
  3. BUT Peano then fully unrolls constant-trip loops (3360 calls → 77KB ELF)

With False (original behavior):
  1. unrollForLoops() does LCM-based partial unrolling (21 calls → compact ELF)
  2. Locks are generated for proper blocking synchronization
  3. Peano preserves the loop structure → correct code size

Also adds dynamic_objfifo_lowering parameter to Worker for future use.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ndex loop

Move act_out objectfifo acquire/release inside the FC weight loop so
unrollForLoops sees the acquire op and preserves the loop (LCM=1),
keeping ELF sizes compact and avoiding CDO program memory overflow.

Also fix join() element type: use co=8 channels per element (not
fc_out_per_tile=320) to match the original design's
ty_post_Layer2_out_split = memref<8 x uint16>.

Minor cleanup in persistentbuffer.py: rename redundant docstring params,
improve lock init comments for ping-pong vs single-buffer cases.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Library refactor (python/iron):
- Remove PersistentBuffer; demonstrate placed-API-style escape hatch lives in
  user code via Resolvable subclass (mobilenet/lowlevel_dma.py)
- Add generic tiles() method on Resolvable so user-side Resolvables can declare
  their tile dependencies; replace PB-specific hooks in program.py + placers.py
- ObjectFifo: add disable_synchronization, repeat_count, delegate_tile, via_DMA
  kwargs; repeat_counts list-kwarg on split()/join()/forward() (replaces
  set_repeat_count method); revert split-depth list (use existing per-handle
  override); generic tiles() pickup for delegate_tile
- CascadeFlow: standalone construct that self-registers on its source Worker
  (no global registry); replaces Runtime.cascade_flow method
- Worker: add dynamic_objfifo_lowering kwarg (default None = compiler chooses)
- aiecc.cpp: remove stale comment about reverted Peano flags

Mobilenet IRON user code (programming_examples/ml/mobilenet):
- New lowlevel_dma.py: StaticWeightStream Resolvable demonstrates inline
  placed-dialect lock/BD/buffer/DMA programming inside a high-level IRON design
- aie2_mobilenet_iron.py + bottleneck/{regular,pipeline,cascade}.py: use new
  ObjectFifo kwargs (delegate_tile, repeat_counts, via_DMA, named fifos);
  CascadeFlow(src, dst) replaces rt.cascade_flow loop; bn0..bn9 workers use
  while_true=False for compact ELFs; depth/cons-tile-ordering matched to
  placed-API original to keep LCM unrolling small enough to fit AIE tile
  program memory
- Makefile: switch to IRON variant; pass --dynamic-objFifos=false to opt out
  of new aiecc default

Build status: hardware xclbin builds successfully (all per-core ELFs <16KB,
fitting in tile program memory). Runtime kernel timed out in initial test —
likely needs further investigation of multi-frame iteration semantics with
while_true=False. Library refactor itself is verified MLIR-equivalent through
the entire reorganization.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
No semantic changes.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
…rker

When Worker(while_true=False), use the IRON range_ helper so an scf.for(0, 1)
wrapper is emitted around the worker body — matching the placed-API pattern.
Python range(1) emits the body inline with no scf.for wrapper, which the
dataflow lowerer treats differently and can cause runtime hangs on data flow
that crosses single-shot worker boundaries.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
@jgmelber jgmelber requested a review from Copilot May 11, 2026 14:33
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

@jackl-xilinx
Copy link
Copy Markdown
Collaborator

@hunhoffe This is a nice refactor and I like how the different common patterns are abstracted so that we could, for example, make a different choice to merge two other bottleneck blocks with the fused wrapper and generate a different design. I'm still digging through the source to check a few more things but so far, I like how it's structured.

I did wonder if the A,B,C now called regular, pipeline, and cascade should stay separated. With the way things are wrapped, it might be possible to merge them into one file so you could, for example, choose a "regular" pattern for a block in the "cascade" section if you wanted to (though you can't really because weights buffer size is too big. I also wonder if the init and L1, L2 blocks should also be in another wrapper python class not in the top design file. It was done this way in the placed design to quickly build the design but it seems having all blocks outisde the top level python wrapper would be more consistent.

So just to double check, the iron version will replace the placed version, but some of the design files like the gen_golden*py will remain.

@hunhoffe
Copy link
Copy Markdown
Collaborator Author

@jackl-xilinx Thanks for taking a look! I'm very open to suggestions. I don't have strong feelings towards the specific organization of the code, although -- as you might be able to tell from the structure of the code -- I was wondering if perhaps we can eventually put some bottleneck patterns in algorithms and share them amongst ML examples as an example of how to reuse mlir-aie patterns. That was some of the intent behind this, but I'm still pretty new to this design so that may/may not be a good goal.

I'll look into your suggestions for:

  • A, B, C separation
  • Wrappers for init and L1, L2 blocks in separate python wrappers

I'm open to keeping both designs in the repo; I thought replacement helped to make the diffs more obvious for the purpose of reviewing. I do worry about the maintenance effort of keeping both versions consistent and functional, so I have a preference towards just having one design long-term. I would be interested to hear your thoughts on this matter.

hunhoffe added a commit that referenced this pull request May 12, 2026
Adds aie.iron.CascadeFlow — a directed cascade-stream edge between two
Workers.  Lowers to aie.cascade_flow(src.tile, dst.tile) after worker
placement.  The kernel functions on each tile drive the cascade via
put_mcd / get_scd intrinsics; CascadeFlow only declares the directed
topology edge.

Worker gains an _outgoing_cascades list, populated by CascadeFlow's
constructor.  Program.resolve walks each worker's outgoing cascades
after worker.resolve() so both tiles are placed first.

Source: #3059 (mobilenet IRON
port).  Cherry-picked the minimum needed to enable cascade matmul:
cascadeflow.py + worker._outgoing_cascades + program cascade resolve
loop + dataflow/iron exports.  Other PR #3059 changes
(dynamic_objfifo_lowering, delegate_tile, Resolvable.tiles) are not
included here.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Without this, lit's test subprocess sees no NPU2=1 (env_setup.sh sets it
in the shell, but lit's config.environment doesn't propagate it), so the
Makefile falls through to devicename=npu and Peano builds kernel .o files
with --target=aie2-none-unknown-elf. The IRON design still lowers for
aie.device(npu2), the xclbin links the wrong-arch .o files, and the NPU
hangs at dispatch (ERT_CMD_STATE_TIMEOUT) on the first block.

run_strix_makefile.lit already follows this pattern.

Verified locally on Strix Halo (NPU2): all 9 sub-targets (bn1..bn8 +
regular/pipeline/cascade chains) PASS bit-exact in 124s.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
@hunhoffe hunhoffe marked this pull request as ready for review May 18, 2026 19:37
@hunhoffe hunhoffe changed the title [WIP] Port MobileNet to the IRON API Port MobileNet to the IRON API May 18, 2026
hunhoffe added 3 commits May 19, 2026 13:45
…lings

The bottleneck block families (regular / pipeline / cascade) already live
in their own modules under bottleneck/; init conv, post-L1 (avgpool +
1x1 expand), and post-L2 (4-tile FC1+FC2) were still inline in the top
file. @jackl-xilinx noted in review that the symmetry would read better
if they joined their siblings. Each new builder mirrors the existing
*_bottlenecks(act_in, sf, *, placement, data_dir) shape: it owns its
fifos, kernels, weights, and workers, and returns the workers list plus
the activation handoff for the next stage.

  bottleneck/init.py    -- init_conv(...)    -> (workers, act_in, act_init_out)
  bottleneck/post_l1.py -- post_l1(...)      -> (workers, act_out_post_avgpool_shim)
  bottleneck/post_l2.py -- post_l2(...)      -> (workers, act_out_of)

aie2_mobilenet_iron.py drops from 590 to 250 lines and reads as
declarations -> block chain -> unchanged runtime sequence -> Program.
The post_l1 -> post_l2 host-scratch bridge fifo stays in the top file
since both endpoints are runtime-sequence concerns and neither builder
is the natural owner.

Pure source-code reorganization. Verified by:
- byte-for-byte MLIR diff (before/after sha256 match,
  a0995693100173508eda150ff9cf187b2d53e9a3f393d525d1d6bca49bd0da76,
  2913 lines)
- test_numpy_per_bn.py: all 8 verified blocks BIT-EXACT
- make run_py on Strix Halo (NPU Strix Halo, NPU2=1): PASS,
  max_difference=9 (matches the existing #3009 envelope used by
  test_mobilenet.py with atol=9)
- run_e2e.sh on Strix Halo for all 9 lit targets (block:bn1, bn2,
  bn3, bn6, bn7, bn8, chain:regular, chain:pipeline, chain:cascade):
  every per-block + chain:pipeline + chain:cascade is 100% bit-exact;
  chain:regular falls within its pre-existing atol=14 envelope.
Two layered cleanups, validated by mobilenet + resnet sharing the same
underlying patterns:

1. bottleneck/_common.py gains sf_key / layer_sf / skip_sf / wts_buffer.
   The first three were duplicated verbatim across regular.py,
   pipeline.py, and cascade.py; the fourth unifies _wts_buffer (regular)
   and _wts_buf (pipeline), which were the same construction under two
   names. cascade's _make_static_wts stays put -- it's a different beast
   (StaticWeightStream, not Buffer).

2. python/iron/algorithms/conv_pipeline.py adds three control-flow
   templates that callers invoke from inside a Worker fn body:

     row_at_a_time(of_in, of_out, *, n_rows, do_kernel, of_wts=None)
     sliding_3row(of_in, of_out, *, n_out_rows, do_kernel, of_wts=None)
     row_at_a_time_with_skip(of_in, of_out, of_skip, *, n_rows,
                             do_kernel, of_wts=None)

   The caller supplies a `do_kernel` closure that handles the
   design-specific kernel signature; the helper owns the acquire/release
   plumbing and the preamble/middle/postamble dance for the 3-row
   sliding window. of_wts=None handles the static-Buffer case
   (mobilenet); of_wts=ObjectFifo handles the per-frame acquire case
   (resnet).

Co-port mobilenet's build_3tile_pipeline (l1_fn / l2_fn / l3_fn, both
skip and no-skip variants) and resnet's conv1_fn / conv2_fn /
conv1_skip_fn so the helper API is shaped by two real consumers in the
same change.

Net diff: -221 lines / +102 lines across 6 files, plus a new ~110-line
helper module.

The MLIR output is not byte-identical with this refactor (it was for
the earlier extract-init/L1/L2 commit); two harmless reorderings appear:
  - the skip-variant release order swaps to in/out/skip (helper order)
    from mobilenet's original skip/in/out
  - resnet's worker fns now load their RTP scale before acquiring the
    weight fifo, where the original did weight-acquire then RTP-load
Both are independent ops; the orderings are semantically commutative.

Verified on Strix Halo (NPU Strix Halo, NPU2=1):
  - Mobilenet full e2e (make run_py): PASS, max_difference=9
    (matches the existing #3009 envelope used by test_mobilenet.py
    with atol=9; identical to pre-refactor result)
  - Resnet full e2e (make + make run_py): PASS, 4107us NPU time
…lpers

Validates the row_at_a_time / sliding_3row / row_at_a_time_with_skip
helpers (introduced in 0ff172e) across a broader set of consumers,
and adds row_at_a_time_tiled for designs that split each output row
into spatial tiles. After this change the helpers cover:

  - mobilenet 3-tile pipeline (l1/l2/l3 ± skip)        [phase 2]
  - resnet conv2_x (conv1/conv2/conv1_skip)            [phase 2]
  - ml/bottleneck single-block bottleneck (all three)
  - ml/conv2d                                          (row_at_a_time)
  - ml/conv2d_fused_relu                               (row_at_a_time)
  - ml/conv2d_14x14                                    (row_at_a_time_tiled)
  - vision/edge_detect                                 (sliding_3row)

Two consumer modes are exercised: of_wts=ObjectFifo (the four conv2d-
family + bottleneck + resnet) and of_wts=None (mobilenet pipeline,
edge_detect — weights live in a static Buffer captured by the closure).

The edge_detect port doubles as a small cleanup: its filter_fn used to
write a manual `for _ in range_(sys.maxsize):` outer loop with
`Worker(while_true=False)` to stream. The other workers in that file
already use the default `while_true=True`; this normalizes filter_fn
to match and drops the redundant outer loop. Net result is one fewer
scf.for level in the emitted MLIR; the iteration count is unchanged.

MLIR for three of the five ports (conv2d, conv2d_fused_relu,
conv2d_14x14) is byte-identical pre/post-refactor. ml/bottleneck and
edge_detect have small expected reorderings (RTP-load order, loop-
nesting flatten) — same kinds of harmless commutativity changes phase
2 already accepted.

Verified end-to-end on Strix Halo (NPU2):
  - ml/bottleneck:        PASS, 2323us NPU time
  - ml/conv2d:            PASS, 805us NPU time
  - ml/conv2d_fused_relu: PASS, 955us NPU time
  - ml/conv2d_14x14:      PASS, 18785us NPU time, max_diff=1
  - vision/edge_detect:   builds + MLIR regenerates; NPU run not
    attempted (design targets NPU1 colors + needs opencv host deps
    that aren't on this machine)

Phase-2 mobilenet + resnet MLIR shas unchanged (regression-free for
the unmodified callers of the existing helpers).
@hunhoffe
Copy link
Copy Markdown
Collaborator Author

Hi @jackl-xilinx — it took me a bit to explore your feedback so far.

On wrapping init / L1 / L2: ✅ done in 11f7169 — they're now bottleneck/init.py / post_l1.py / post_l2.py mirroring the existing sibling shape. Top file drops 590 → 250 lines and MLIR is unchanged, so this was a simple refactor.

On "should A / B / C stay separated?": I was approaching the desire for cross-pattern reuse in a different way. Commits 0ff172e and 22d1c4d extract common algorithms — 1x1 row driver, 3x3 sliding window with replicate borders, skip-add row driver — into python/iron/algorithms/conv_pipeline.py as four small composable helpers. I kept the per-file structure because regular/pipeline/cascade encode real hardware-strategy decisions (1-tile vs 3-tile vs 5-tile cascade) and flattening them doesn't seem to reduce code complexity much. By focusing on the algorithm itself, I was able to extract helpers applicable to mobilenet, resnet's conv2_x, ml/bottleneck, the conv2d family, and vision/edge_detect.

Let me know if any of this feels like the wrong direction; happy to revisit either the extraction shape or the per-file factoring.

@hunhoffe hunhoffe added this pull request to the merge queue May 30, 2026
Merged via the queue into main with commit ef2f78f May 30, 2026
71 checks passed
@hunhoffe hunhoffe deleted the mobilenet-py branch May 30, 2026 17:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants