Port MobileNet to the IRON API#3059
Conversation
- Add CascadeFlow class to python/iron/dataflow for cascade stream connections - Add PersistentBuffer stub to python/iron/dataflow for static weight DMA - Add Runtime.cascade_flow() method for registering cascade connections - Resolve CascadeFlow ops in Program.resolve_program() after worker resolution - Export CascadeFlow and PersistentBuffer from python/iron top-level - Account for PersistentBufferHandle DMA channels in SequentialPlacer - Add test_cascade_flow.py hardware test for new cascade API - Add bottleneck/ directory with IRON rewrites of bn0-bn14 blocks: - regular.py: bn0-bn9 (single tile per block) - pipeline.py: bn10-bn12 (one tile per layer pipeline) - cascade.py: bn13-bn14 (5-tile cascade stream blocks) Scale factors are compile-time Python int constants in Worker fn_args; no RTP buffers or NpuWriteRTPOp calls needed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- ObjectFifo.set_repeat_count(): store pre-resolve, emit during resolve() as attribute on the MLIR op; distinct from set_iter_count() - ObjectFifoHandle.set_repeat_count(): delegates to underlying ObjectFifo - cascade.py: replace per-tile weight ObjectFifos with full-weight fifos that split at MemTile using cons().split() — matches object_fifo_link with offsets pattern from original placed dialect code - cascade.py: replace element-wise skip copy workers with forward() (MemTile DMA forwarding, no extra compute tile needed) - cascade.py: set_repeat_count(InH=7) on cascade weight sub-fifos - pipeline.py: fix weight filenames to match bn10_1_chain.txt convention - pipeline.py: fix Kernel .o paths (remove erroneous data_dir prefix) - pipeline.py: replace element-wise bn11 skip copy with forward() - aie2_mobilenet_iron.py: initial orchestrator (cascades wts_fifos) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- pipeline.py: replace self-loop ObjectFifo (bn12 DW->PW) with tile-local Buffer; both are tile-SRAM but Buffer avoids lock/BD overhead for same-tile sequential ops where no sync is needed - pipeline.py: remove extra bn11 skip copy Worker; use forward() instead - cascade.py: fix wts_fifos return to use full-weight fifos (host fills these, MemTile splits to put/get via cons().split()) - aie2_mobilenet_iron.py: add cascade weight DMA in runtime sequence; fix act_in depth to 5 (allow 3-row window); fix cascade 4-tuple unpack; fix bn0 scale factor count (no sfAdd for 2-layer block) - objectfifo.py: add set_repeat_count() with pre-resolve storage pattern, matching how set_iter_count() works; apply during resolve() NOTE: regular.py has structural issue - ObjectFifos declared inside worker function bodies; needs to be moved outside and passed as fn_args per IRON API. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
IRON framework (python/helpers/dialects/func.py): - Auto-cast index-typed args to expected IntegerType in func.call() so loop induction vars from range_() work without manual IndexCastOp regular.py: - Move internal ObjectFifos outside worker function bodies; pass both .prod() and .cons() to the same Worker (IRON self-loop ObjectFifos) - Fix all Kernel .o paths (remove os.path.join(data_dir, ...) prefix) pipeline.py: - Fix bn12 DW/PW weight buffers: split combined buffer into separate bn12_dw_wts and bn12_pw_wts (matching separate chain files) - Fix bn11 skip: use forward() instead of copy Worker - Fix weight filenames to *_chain.txt convention cascade.py: - Fix L3 split weight size: 480*80=38400 (not 480*40=19200) - Remove copy Workers; use forward() for skip connections - Use cons().split() + set_repeat_count() for cascade weight fifos aie2_mobilenet_iron.py: - Fix runtime sequence to fill cascade weight fifos - Fix act_in depth and post_l1 worker pattern - Fix FC weight buffer size in Kernel signature Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Each worker function in regular.py now uses memref_view(wts_buf.op, ...) to extract per-layer weight sub-views from the combined weight Buffer. This matches the original placed-dialect code's memref_view pattern and produces correct kernel argument types. All blocks fixed: bn0, bn1, bn2, bn3, bn4+5, bn6, bn7, bn8+9. The IRON design now generates valid MLIR (2438 lines). Comparison with original: - cascade_flow ops: identical (4 ops, same tile pairs) - aie.core count: 32 (matches) - aie.objectfifo count: 68 IRON vs 70 original (2 fewer) - aie.tile count: 40 IRON vs 48 original (8 fewer shim/memtile) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Switch MLIR_FILE from aie2_mobilenet to aie2_mobilenet_iron. The IRON design generates valid MLIR (2438 lines, verified). Key stats vs original: - cascade_flow ops: identical (4 ops, same tile pairs) - aie.core count: 32 (matches) - Code: ~3500 lines vs ~7000 original (2x reduction) Framework changes also included: - python/helpers/dialects/func.py: auto index->int cast in func.call - python/iron/dataflow/objectfifo.py: set_repeat_count() pre-resolve storage Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
bn13 and bn14 L3 weight fifos (put/get) now have repeat_count=7, matching the original design's set_repeat_count(RepeatChannels=7) calls. Structural comparison with original: aie.core: 32 = 32 ✓ cascade_flow: 4 = 4 ✓ repeat_count: 8 = 8 ✓ runtime_sequence: 1 = 1 ✓ aie.objectfifo: 68 ≈ 70 (2 fewer: cascade L2 DW as Buffer not fifo) aie.buffer: 26 < 59 (missing: PostL2 FC MemTile buffers, PersistentBuffer TODO) aie.tile: 40 < 48 (IRON declares tiles on-demand, not all upfront) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Without sep=',', np.fromfile reads the text file as raw binary bytes, producing an array of ASCII character codes (135983 bytes) instead of the actual int8 weights (38400 elements). This caused aiecc to fail with 'elements hex data size is invalid' when parsing the generated MLIR. All other bottleneck modules (regular.py, cascade.py) already had sep=',' set correctly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The original design places bn14 L2 DW at tile(6,2) which is the same row as L3 GET at tile(5,2), allowing shared-memory (streaming) access without consuming a DMA channel. Tile(7,4) was incorrect from the plan document and caused 'number of input DMA channel exceeded' in aiecc. Also fix pipeline.py _load_weights: add sep=',' to np.fromfile so comma- separated weight chain files are parsed correctly (not as raw binary). Without sep=',' the ASCII text was being read as raw bytes, producing wrong weight data and invalid hex encoding in the MLIR dense attribute. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The original design places bn14_tile_layer2 at tile(6,2), adjacent to the bn14_l3_get tile at tile(5,2). This adjacency allows aiecc to use shared local memory instead of DMA for bn14_of_l2_l3_second, avoiding the S2MM DMA channel overflow on tile(5,2). Also corrects the comment in the docstring (plan had Tile(7,4) which was based on an incorrect reading of the original code). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Distribute cascade weight and skip ObjectFifos across dedicated MemTiles matching the original design, preventing BD exhaustion on any single channel. Original MemTile assignments: bn13 L1 weights → MemTile(0,1) bn13 L3 weights → MemTile(1,1) bn14 L1 weights → MemTile(2,1) bn14 L3 weights → MemTile(3,1) bn13 skip → MemTile(5,1) bn14 skip → MemTile(7,1) Before: mem_tile_0_1 had 8 fifos (exceeded 4 MM2S/S2MM DMA channels). After: max 5 fifos per memtile (within NPU2 resource limits). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ObjectFifo.__init__: accept disable_synchronization=True parameter,
passed through to the underlying aie.objectfifo MLIR op during resolve().
ObjectFifo.allocate_on(tile): redirect buffer allocation to a neighboring
tile using aie.objectfifo.allocate. Can be called before resolve() (tile
stored in _alloc_tile_ops and applied during f.resolve()) or after.
These two together allow fused bottleneck pairs (bn4+5, bn8+9) to keep
their internal self-loop ObjectFifos on the compute tile logically but
store their buffer memory on an adjacent tile's SRAM — avoiding the 64KB
tile memory overflow caused by 62KB weights + ~23KB internal fifo buffers.
This mirrors the original placed-dialect pattern:
object_fifo("name", tile, {tile}, depth) {disable_synchronization=true}
fifo.allocate(L1_tile)
Also wires disable_synchronization into objectfifo resolve() and applies
pending _alloc_tile_ops list during resolve().
Next steps:
- Call allocate_on() in regular.py for bn45 and bn89 internal fifos,
targeting adjacent tiles (Tile(0,2) for bn45, Tile(3,4) for bn89)
- This matches original: bn4+5 allocates on init tile, bn8+9 on pipeline tile
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…airs
Apply disable_synchronization=True + allocate_on(adjacent_tile) to the
internal self-loop ObjectFifos in the bn4+5 and bn8+9 fused bottleneck
workers. This exactly matches the original placed-dialect design:
aie.objectfifo @bn8_bn9_act_bn8_1_2(%tile, {%tile}, 3) {disable_synchronization=true}
aie.objectfifo.allocate @bn8_bn9_act_bn8_1_2(%adj_tile)
Original allocation targets (from original MLIR analysis):
bn4+5 internal fifos → Tile(0,2) (init tile, same as original)
bn8+9 internal fifos → Tile(3,4) (pipeline bn11_l1 tile, same as original)
This redirects the ~23KB of intermediate activation fifo buffers off the
compute tile's 64KB SRAM, allowing the 62KB weight buffer to fit.
Also fix program.py to include _alloc_tile_ops in the tile resolution list
so hardcoded Tile(col,row) objects are resolved before fifo resolution.
Generated MLIR now contains:
- 10x {disable_synchronization = true} attributes
- 10x aie.objectfifo.allocate ops on correct neighboring tiles
Matches original MLIR structure exactly.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ocate_on targets
The objectfifo.allocate op requires memory-adjacent tiles. Added explicit
placement to the bn4+5 and bn8+9 fused workers to match the original design:
bn45_worker: placement=Tile(1,2) — original: bn4_5_tile = tile(1,2)
allocate_on(Tile(0,2)) — init tile, adjacent (same row, col 1→0) ✓
bn89_worker: placement=Tile(3,3) — original: bn8_9_tile = tile(3,3)
allocate_on(Tile(3,4)) — bn11_l1 tile, adjacent (same col, row 3→4) ✓
After AIEObjectFifoStatefulTransform, bn89 internal fifo buffers now appear
on tile(3,4) matching the original's memory layout exactly.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…output The bn7→bn8+9 boundary fifo uses shared local memory (not DMA) when tiles are adjacent. This eliminates the consumer buffer from tile(3,3)'s SRAM. bn7_worker: placement=Tile(2,3) — adjacent to bn89_worker at Tile(3,3) Split-depth output fifo act_bn9_out (bn8+9→pipeline): [1,2] matches the original's [1,2] depth, allocating only 1 buffer on tile(3,3) instead of 2. Memory layout for tile(3,3) now matches original: stack: 1024 bytes buf_7 (wts): 62192 bytes act_bn9_out: 1120 bytes (depth=1 producer side) Total: 64336 bytes < 65536 (64KB) ✓ Also fix ObjectFifo/ObjectFifoHandle to handle list (split) depths: - prod() extracts depth[0] for producer handle - cons() extracts depth[-1] for consumer handle - Handle.__init__: guard depth<1 with isinstance check Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds Worker placement=Tile() to all 8 pipeline workers to match the original's tile(n,m) assignments. This ensures adjacent-tile pairs use shared local memory (not DMA) for inter-worker ObjectFifos, eliminating consumer buffer allocation on downstream tiles. Original assignments (tileColIndex=0): bn10 L1: Tile(1,5) bn10 L2: Tile(2,4) bn10 L3: Tile(2,5) bn11 L1: Tile(3,2) bn11 L2: Tile(3,4) bn11 L3: Tile(2,2) bn12 L1: Tile(3,5) bn12 L2+L3: Tile(4,4) Memory after transform (all < 64KB): tile(2,5) bn10_L3: ~54784 bytes tile(2,2) bn11_L3: ~38656 bytes tile(3,2) bn11_L1: ~38656 bytes tile(3,4) bn11_L2: ~4048 bytes Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…treaming Replaces the stub NotImplementedError with a working implementation that matches the original placed-dialect pattern: buffer(memtile, large_type, name, initial_value=arr) + 4x lock() + flow(memtile, DMA, mm2s, compute, DMA, s2mm) + @memtile_dma + @mem Use in orchestrator for post-processing weights: post-L1 (76,800 bytes): MemTile(4,1) → PostL1Tile(6,4), chunk=9,600 post-L2 FC (819,200 bytes): 4x MemTile → 4x PostL2Tile, chunk=10,240 Also update program.py to: 1. Resolve PersistentBuffer when found in fn_args (via PersistentBufferHandle) 2. Include PersistentBuffer tiles in the tile resolution list Structural comparison with original now matches on: aie.core(32), cascade_flow(4), repeat_count(8), aie.flow, memtile_dma, aie.mem Remaining differences: - aie.objectfifo: 68 vs 70 (cascade L2 DW as Buffer) - aie.buffer: 31 vs 59 (no RTP buffers — replaced by compile-time constants) - aie.lock: 20 vs 28 (same reason) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… design PersistentBuffer now supports two-buffer ping-pong via ping_pong_buf param: DMA BD chain: BD1(FC2 on MemTile_N+1) → BD2(FC1 on MemTile_N) → BD1 ... This matches the original placed-dialect @memtile_dma pattern for PostL2: MemTile(1,1): mem_01_buff (FC1_0, 409,600B) + mem_11_buff (FC2_0, 409,600B) ping-pong DMA: BD1→BD2→BD1→... one cycle per PostOutputSplitL2=40 iteration Structural comparison with original now matches on ALL key ops: aie.core(32) cascade_flow(4) repeat_count(8) aie.flow(5) memtile_dma(10) aie.mem(5) ← ALL IDENTICAL Also: add ping_pong_memtile to tile resolution in program.py. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- init_conv2dk3.o: fix kernel filename (was init_conv2dk3_stride2.o) - post-L1 kernel: use conv2dk1_xy_pool_fused_relu_large_padded_i8_ui8 with link post_conv2dk1_relu_xy_pool_padded_i8_ui8.o (matches Makefile) - post-L2 kernel: use post_L2_conv2dk1_relu_i16_ui16_pad with link post_L2_conv2dk1_relu_ui16_ui16_pad.o (matches Makefile) - Output type: uint16 (not uint8) — FC output is 16-bit quantized - PersistentBuffer: use unplaced locks (lock_id=None → AIEAssignLockIDs) eliminates manual lock ID tracking and conflict management Build status: resource allocation passes, routing passes, Peano compiles core (0,2) OK. Only remaining issue was missing kernel .o files. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The AIEObjectFifoStatefulTransform pass has two modes: - Without dynamic_objfifo_lowering: calls unrollForLoops() which unrolls inner scf.for loops by LCM(objectfifo depths). Combined with Peano's O1 optimizer further unrolling constant-trip loops, this generates thousands of kernel call sites per core (~3360 for the init core), causing ELF text section overflow of the AIE tile program memory limit. - With dynamic_objfifo_lowering=True: uses dynamicGlobalObjectFifos() which preserves loops and uses runtime acquire/release. This matches how the original placed-dialect design compiles with Peano, producing compact ELFs. Also revert aiecc.cpp changes (globaldce + unroll-max-count flags) — we use the same flags as the original CI build: default<O1> + -inline-threshold=10. Also revert AIECoreToStandard.cpp buffer visibility change (was incorrect). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…t.py The test expects (input, combined_cascade_wts, output) — 3 buffers. Combined weight layout: bn13_L1 (76800) | bn13_L3 (76800) | bn14_L1 (76800) | bn14_L3 (76800) = 307200 bytes total Use TensorAccessPattern with offsets to slice each sub-weight from the combined buffer via separate DMA BDs, matching the original design's npu_dma_memcpy_nd calls with offsets into weightsFromL3. Generated MLIR confirms correct offset-based DMA slicing: @bn13_wts_l1_full: offset=0, size=76800 @bn13_wts_l3_full: offset=76800, size=76800 @bn14_wts_l1_full: offset=153600, size=76800 @bn14_wts_l3_full: offset=230400, size=76800 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The MemTile cons_lock initialization was wrong: - For ping-pong buffers: should be init=1 (one buffer ready at a time) Original: mem_11_cons_lock = lock(MemTile11, init=1) - For single-buffer (post-L1): should be init=repeat_count=7 Original: post_L1_wts_cons_lock = lock(MemTile41, init=PostRepeatChannels=7) Using init=repeat_count for ping-pong (40) was causing the hardware deadlock since the lock protocol expects only 1 unit to be released per DMA transfer. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…sform The AIEObjectFifoStatefulTransform runs before AIEAssignLockIDs and requires all lock_id attributes to be explicitly set. Using lock_id=None (auto-assigned) caused an assertion failure: LockOp::getLockIDValue() requires lock IDs to be set. This was the root cause of the hardware timeout — the MLIR pipeline failed silently with incorrect code that deadlocked on hardware. Lock IDs match the original placed-dialect design: post-L1: MemTile(4,1) lock_id=2,3; PostL1Tile(6,4) lock_id=0,1 FC tiles: MemTile lock_id=0,1; compute tile lock_id=2,3 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The original placed-dialect design does NOT use dynamic_objfifo_lowering. Using True caused the dynamicGlobalObjectFifos() path which: 1. Does NOT generate blocking locks (non-blocking index-based access) 2. Prevents loop unrolling by MLIR's unrollForLoops() pass 3. BUT Peano then fully unrolls constant-trip loops (3360 calls → 77KB ELF) With False (original behavior): 1. unrollForLoops() does LCM-based partial unrolling (21 calls → compact ELF) 2. Locks are generated for proper blocking synchronization 3. Peano preserves the loop structure → correct code size Also adds dynamic_objfifo_lowering parameter to Worker for future use. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ndex loop Move act_out objectfifo acquire/release inside the FC weight loop so unrollForLoops sees the acquire op and preserves the loop (LCM=1), keeping ELF sizes compact and avoiding CDO program memory overflow. Also fix join() element type: use co=8 channels per element (not fc_out_per_tile=320) to match the original design's ty_post_Layer2_out_split = memref<8 x uint16>. Minor cleanup in persistentbuffer.py: rename redundant docstring params, improve lock init comments for ping-pong vs single-buffer cases. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Library refactor (python/iron):
- Remove PersistentBuffer; demonstrate placed-API-style escape hatch lives in
user code via Resolvable subclass (mobilenet/lowlevel_dma.py)
- Add generic tiles() method on Resolvable so user-side Resolvables can declare
their tile dependencies; replace PB-specific hooks in program.py + placers.py
- ObjectFifo: add disable_synchronization, repeat_count, delegate_tile, via_DMA
kwargs; repeat_counts list-kwarg on split()/join()/forward() (replaces
set_repeat_count method); revert split-depth list (use existing per-handle
override); generic tiles() pickup for delegate_tile
- CascadeFlow: standalone construct that self-registers on its source Worker
(no global registry); replaces Runtime.cascade_flow method
- Worker: add dynamic_objfifo_lowering kwarg (default None = compiler chooses)
- aiecc.cpp: remove stale comment about reverted Peano flags
Mobilenet IRON user code (programming_examples/ml/mobilenet):
- New lowlevel_dma.py: StaticWeightStream Resolvable demonstrates inline
placed-dialect lock/BD/buffer/DMA programming inside a high-level IRON design
- aie2_mobilenet_iron.py + bottleneck/{regular,pipeline,cascade}.py: use new
ObjectFifo kwargs (delegate_tile, repeat_counts, via_DMA, named fifos);
CascadeFlow(src, dst) replaces rt.cascade_flow loop; bn0..bn9 workers use
while_true=False for compact ELFs; depth/cons-tile-ordering matched to
placed-API original to keep LCM unrolling small enough to fit AIE tile
program memory
- Makefile: switch to IRON variant; pass --dynamic-objFifos=false to opt out
of new aiecc default
Build status: hardware xclbin builds successfully (all per-core ELFs <16KB,
fitting in tile program memory). Runtime kernel timed out in initial test —
likely needs further investigation of multi-frame iteration semantics with
while_true=False. Library refactor itself is verified MLIR-equivalent through
the entire reorganization.
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
No semantic changes. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
…rker When Worker(while_true=False), use the IRON range_ helper so an scf.for(0, 1) wrapper is emitted around the worker body — matching the placed-API pattern. Python range(1) emits the body inline with no scf.for wrapper, which the dataflow lowerer treats differently and can cause runtime hangs on data flow that crosses single-shot worker boundaries. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
|
@hunhoffe This is a nice refactor and I like how the different common patterns are abstracted so that we could, for example, make a different choice to merge two other bottleneck blocks with the fused wrapper and generate a different design. I'm still digging through the source to check a few more things but so far, I like how it's structured. I did wonder if the A,B,C now called regular, pipeline, and cascade should stay separated. With the way things are wrapped, it might be possible to merge them into one file so you could, for example, choose a "regular" pattern for a block in the "cascade" section if you wanted to (though you can't really because weights buffer size is too big. I also wonder if the init and L1, L2 blocks should also be in another wrapper python class not in the top design file. It was done this way in the placed design to quickly build the design but it seems having all blocks outisde the top level python wrapper would be more consistent. So just to double check, the iron version will replace the placed version, but some of the design files like the gen_golden*py will remain. |
|
@jackl-xilinx Thanks for taking a look! I'm very open to suggestions. I don't have strong feelings towards the specific organization of the code, although -- as you might be able to tell from the structure of the code -- I was wondering if perhaps we can eventually put some bottleneck patterns in algorithms and share them amongst ML examples as an example of how to reuse mlir-aie patterns. That was some of the intent behind this, but I'm still pretty new to this design so that may/may not be a good goal. I'll look into your suggestions for:
I'm open to keeping both designs in the repo; I thought replacement helped to make the diffs more obvious for the purpose of reviewing. I do worry about the maintenance effort of keeping both versions consistent and functional, so I have a preference towards just having one design long-term. I would be interested to hear your thoughts on this matter. |
…our design" This reverts commit 9bd00f3.
Adds aie.iron.CascadeFlow — a directed cascade-stream edge between two Workers. Lowers to aie.cascade_flow(src.tile, dst.tile) after worker placement. The kernel functions on each tile drive the cascade via put_mcd / get_scd intrinsics; CascadeFlow only declares the directed topology edge. Worker gains an _outgoing_cascades list, populated by CascadeFlow's constructor. Program.resolve walks each worker's outgoing cascades after worker.resolve() so both tiles are placed first. Source: #3059 (mobilenet IRON port). Cherry-picked the minimum needed to enable cascade matmul: cascadeflow.py + worker._outgoing_cascades + program cascade resolve loop + dataflow/iron exports. Other PR #3059 changes (dynamic_objfifo_lowering, delegate_tile, Resolvable.tiles) are not included here. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Without this, lit's test subprocess sees no NPU2=1 (env_setup.sh sets it in the shell, but lit's config.environment doesn't propagate it), so the Makefile falls through to devicename=npu and Peano builds kernel .o files with --target=aie2-none-unknown-elf. The IRON design still lowers for aie.device(npu2), the xclbin links the wrong-arch .o files, and the NPU hangs at dispatch (ERT_CMD_STATE_TIMEOUT) on the first block. run_strix_makefile.lit already follows this pattern. Verified locally on Strix Halo (NPU2): all 9 sub-targets (bn1..bn8 + regular/pipeline/cascade chains) PASS bit-exact in 124s. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
…lings The bottleneck block families (regular / pipeline / cascade) already live in their own modules under bottleneck/; init conv, post-L1 (avgpool + 1x1 expand), and post-L2 (4-tile FC1+FC2) were still inline in the top file. @jackl-xilinx noted in review that the symmetry would read better if they joined their siblings. Each new builder mirrors the existing *_bottlenecks(act_in, sf, *, placement, data_dir) shape: it owns its fifos, kernels, weights, and workers, and returns the workers list plus the activation handoff for the next stage. bottleneck/init.py -- init_conv(...) -> (workers, act_in, act_init_out) bottleneck/post_l1.py -- post_l1(...) -> (workers, act_out_post_avgpool_shim) bottleneck/post_l2.py -- post_l2(...) -> (workers, act_out_of) aie2_mobilenet_iron.py drops from 590 to 250 lines and reads as declarations -> block chain -> unchanged runtime sequence -> Program. The post_l1 -> post_l2 host-scratch bridge fifo stays in the top file since both endpoints are runtime-sequence concerns and neither builder is the natural owner. Pure source-code reorganization. Verified by: - byte-for-byte MLIR diff (before/after sha256 match, a0995693100173508eda150ff9cf187b2d53e9a3f393d525d1d6bca49bd0da76, 2913 lines) - test_numpy_per_bn.py: all 8 verified blocks BIT-EXACT - make run_py on Strix Halo (NPU Strix Halo, NPU2=1): PASS, max_difference=9 (matches the existing #3009 envelope used by test_mobilenet.py with atol=9) - run_e2e.sh on Strix Halo for all 9 lit targets (block:bn1, bn2, bn3, bn6, bn7, bn8, chain:regular, chain:pipeline, chain:cascade): every per-block + chain:pipeline + chain:cascade is 100% bit-exact; chain:regular falls within its pre-existing atol=14 envelope.
Two layered cleanups, validated by mobilenet + resnet sharing the same
underlying patterns:
1. bottleneck/_common.py gains sf_key / layer_sf / skip_sf / wts_buffer.
The first three were duplicated verbatim across regular.py,
pipeline.py, and cascade.py; the fourth unifies _wts_buffer (regular)
and _wts_buf (pipeline), which were the same construction under two
names. cascade's _make_static_wts stays put -- it's a different beast
(StaticWeightStream, not Buffer).
2. python/iron/algorithms/conv_pipeline.py adds three control-flow
templates that callers invoke from inside a Worker fn body:
row_at_a_time(of_in, of_out, *, n_rows, do_kernel, of_wts=None)
sliding_3row(of_in, of_out, *, n_out_rows, do_kernel, of_wts=None)
row_at_a_time_with_skip(of_in, of_out, of_skip, *, n_rows,
do_kernel, of_wts=None)
The caller supplies a `do_kernel` closure that handles the
design-specific kernel signature; the helper owns the acquire/release
plumbing and the preamble/middle/postamble dance for the 3-row
sliding window. of_wts=None handles the static-Buffer case
(mobilenet); of_wts=ObjectFifo handles the per-frame acquire case
(resnet).
Co-port mobilenet's build_3tile_pipeline (l1_fn / l2_fn / l3_fn, both
skip and no-skip variants) and resnet's conv1_fn / conv2_fn /
conv1_skip_fn so the helper API is shaped by two real consumers in the
same change.
Net diff: -221 lines / +102 lines across 6 files, plus a new ~110-line
helper module.
The MLIR output is not byte-identical with this refactor (it was for
the earlier extract-init/L1/L2 commit); two harmless reorderings appear:
- the skip-variant release order swaps to in/out/skip (helper order)
from mobilenet's original skip/in/out
- resnet's worker fns now load their RTP scale before acquiring the
weight fifo, where the original did weight-acquire then RTP-load
Both are independent ops; the orderings are semantically commutative.
Verified on Strix Halo (NPU Strix Halo, NPU2=1):
- Mobilenet full e2e (make run_py): PASS, max_difference=9
(matches the existing #3009 envelope used by test_mobilenet.py
with atol=9; identical to pre-refactor result)
- Resnet full e2e (make + make run_py): PASS, 4107us NPU time
…lpers Validates the row_at_a_time / sliding_3row / row_at_a_time_with_skip helpers (introduced in 0ff172e) across a broader set of consumers, and adds row_at_a_time_tiled for designs that split each output row into spatial tiles. After this change the helpers cover: - mobilenet 3-tile pipeline (l1/l2/l3 ± skip) [phase 2] - resnet conv2_x (conv1/conv2/conv1_skip) [phase 2] - ml/bottleneck single-block bottleneck (all three) - ml/conv2d (row_at_a_time) - ml/conv2d_fused_relu (row_at_a_time) - ml/conv2d_14x14 (row_at_a_time_tiled) - vision/edge_detect (sliding_3row) Two consumer modes are exercised: of_wts=ObjectFifo (the four conv2d- family + bottleneck + resnet) and of_wts=None (mobilenet pipeline, edge_detect — weights live in a static Buffer captured by the closure). The edge_detect port doubles as a small cleanup: its filter_fn used to write a manual `for _ in range_(sys.maxsize):` outer loop with `Worker(while_true=False)` to stream. The other workers in that file already use the default `while_true=True`; this normalizes filter_fn to match and drops the redundant outer loop. Net result is one fewer scf.for level in the emitted MLIR; the iteration count is unchanged. MLIR for three of the five ports (conv2d, conv2d_fused_relu, conv2d_14x14) is byte-identical pre/post-refactor. ml/bottleneck and edge_detect have small expected reorderings (RTP-load order, loop- nesting flatten) — same kinds of harmless commutativity changes phase 2 already accepted. Verified end-to-end on Strix Halo (NPU2): - ml/bottleneck: PASS, 2323us NPU time - ml/conv2d: PASS, 805us NPU time - ml/conv2d_fused_relu: PASS, 955us NPU time - ml/conv2d_14x14: PASS, 18785us NPU time, max_diff=1 - vision/edge_detect: builds + MLIR regenerates; NPU run not attempted (design targets NPU1 colors + needs opencv host deps that aren't on this machine) Phase-2 mobilenet + resnet MLIR shas unchanged (regression-free for the unmodified callers of the existing helpers).
|
Hi @jackl-xilinx — it took me a bit to explore your feedback so far. On wrapping init / L1 / L2: ✅ done in 11f7169 — they're now bottleneck/init.py / post_l1.py / post_l2.py mirroring the existing sibling shape. Top file drops 590 → 250 lines and MLIR is unchanged, so this was a simple refactor. On "should A / B / C stay separated?": I was approaching the desire for cross-pattern reuse in a different way. Commits 0ff172e and 22d1c4d extract common algorithms — 1x1 row driver, 3x3 sliding window with replicate borders, skip-add row driver — into python/iron/algorithms/conv_pipeline.py as four small composable helpers. I kept the per-file structure because regular/pipeline/cascade encode real hardware-strategy decisions (1-tile vs 3-tile vs 5-tile cascade) and flattening them doesn't seem to reduce code complexity much. By focusing on the algorithm itself, I was able to extract helpers applicable to mobilenet, resnet's conv2_x, ml/bottleneck, the conv2d family, and vision/edge_detect. Let me know if any of this feels like the wrong direction; happy to revisit either the extraction shape or the per-file factoring. |
# Conflicts: # programming_examples/ml/bottleneck/bottleneck.py
Adds an IRON implementation of MobileNet (
aie2_mobilenet_iron.py) that produces a functionally-equivalent xclbin for Strix. The original placed-API design (aie2_mobilenet.py) did the hard work of validating the network topology, kernel selection, and tile placement on Strix; this PR ports those proven decisions into a more declarative form using IRON.Summary
bottleneck/{regular,cascade,pipeline}.py) driven by a singleNETWORKspec; scale factors carried as a dict.StaticWeightStream(lowlevel_dma.py): MemTile→compute weight DMA with optional ping-pong, for layers where ObjectFifo's semantics weren't a fit.CascadeFlowin IRON: first-classCascadeFlow(src, dst)that lowers toaie.cascade_flowafter placement.disable_synchronization,allocate_on.mobilenet_numpy.pybit-exact reference, per-block IRON e2e + numpy lit suites.Test plan
make build/final_mobilenet.xclbin+test_mobilenet.pypass on Strix againstdata/aie_output.txt.run_e2e.litandrun_numpy_per_bn.litgreen.test/python/npu-xrt/test_cascade_flow.pypasses.Note for reviewers
AIE output has a known max-delta=9 vs the brevitas golden (mlir-aie #3009);
test_mobilenet.pyfollows the existing convention ofatol=9againstdata/golden_output.txt.