[WIP] Fix objectfifo declared depth as minimum by hunhoffe · Pull Request #3096 · Xilinx/mlir-aie

hunhoffe · 2026-05-18T23:16:34Z

No description provided.

printObjectFifoInitValues unconditionally cast $elemNumber to IntegerAttr, which is null when $elemNumber is the ArrayAttr form (per-endpoint depths). The subsequent .getInt() then segfaults during str() of any objectfifo op that has both list-shaped depth AND initValues. The parser already handles both shapes; the printer now mirrors it. Add a roundtrip lit test. Reduced repro (Python): object_fifo("test", memtile, compute, [2, 1], np.ndarray[(16,), np.dtype[np.int32]], initValues=[np.zeros(16, np.int32), np.ones(16, np.int32)]) # str(ctx.module) -> Segmentation fault (was) Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>

Plumb init_values through IRON ObjectFifo to the underlying aie.dialects.aie.object_fifo's `initValues` attribute, so designs that want statically-initialized buffers (e.g. weight tensors held on a MemTile, no shim load) can express it at the IRON level rather than dropping to the low-level op. Producer endpoint pinning is the user's responsibility — when no Worker or rt.fill targets the producer, set it explicitly: of = ObjectFifo(ty, depth=2, init_values=[arr0, arr1]) of.prod().endpoint = ObjectFifoEndpoint(AnyMemTile) (The op verifier rejects init_values on shim producers.) Add a unit test that builds an IRON ObjectFifo with init_values, pins the producer to AnyMemTile, and FileChecks the lowered MLIR contains the expected `= [dense<0>..., dense<1>...]` initial values. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>

…tic-init producer When an aie.objectfifo has init_values on a MemTile producer with no upstream link and iter_count > 1, the DMA channel's task_count (= iter_count - 1) restarts the BD chain after each pass. The per-BD use_lock acquires/releases consume the source-side lock state during the first pass, and nothing replenishes it (there is no upstream S2MM filling the buffers) — the chain deadlocks on the second pass. For this configuration the locks are not needed for correctness: the static buffers are always available, and back-pressure to the downstream consumer is handled by the DMA stream's flow control. Detect the (MM2S + initValues + iter_count > 1 + no link) case in createBdBlock and omit the lock ops on the source side. Add a lit roundtrip test in bd_chain_on_memtile/ that checks the memtile_dma BD chain has no use_lock ops in this configuration. Verified end-to-end on NPU Strix: prior to this change, a design that cycles a static-init MemTile buffer N times (via iter_count) returns ert_cmd_state.ERT_CMD_STATE_TIMEOUT; with the fix the same design runs to completion and the host receives the cycled data. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>

`_get_depths()` collapsed all-equal per-handle depths to a single int. The stateful-transform's lowering interprets a single-int elemNumber as "producer depth only" and auto-sizes each consumer-side ping-pong from max-acquire (via findObjectFifoSize), silently dropping below the user's declared depth. For multi-consumer fanout with uneven acquire patterns — one consumer needs to buffer ahead while peers wait on upstream data — this back-pressures the producer DMA and deadlocks at runtime. Fix: delete _get_depths and inline the per-handle list [prod_depth, *cons_depths] at the resolve() call site. Uniform handling for 1-cons and N-cons; lowering always goes through the ArrayAttr branch that honors each declared depth directly. Regression: test/python/objFifo_iron_multi_cons_depth.py — multi-cons fanout with all depths equal must emit [4 : i32, 4 : i32, 4 : i32], not a single 4 : i32.

Catches the PR up to upstream and reconciles two parallel ObjectFifo init_values implementations: - main (PR Xilinx#3077) added an early-return in createObjectFifoLocks that skips lock allocation for static-init no-link producers cycled via iter_count, plus a stricter test asserting no memtile locks are emitted at all. - this branch added an isCycledStaticInitProducer guard in createUseLocks that skips emitting use_lock ops in the BD chain under the same conditions. The auto-merge took both paths, which is consistent: the early-return makes locksPerFifo[op] empty, so the createUseLocks guard becomes a no-op in the common case. Both safety nets are retained for now. The conflicting test test/objectFifo-stateful-transform/bd_chain_on_memtile/ init_values_no_link_skips_source_locks.mlir is resolved by adopting main's stricter version (its additional CHECK-NOT lines on memtile lock allocation are satisfied by the merged transform).

…pong With the depth-as-minimum contract from this branch, the stateful transform now honors the declared depth instead of silently auto-minimizing equal-depth ObjectFifos to max-acquire. The two vision designs were declaring forward(depth=7) / forward(depth=6) to mirror the upstream memtile depth set by inOF_L3L2.cons(N), but the L1 consumer only ever acquires 1 element at a time, so depth=2 (standard ping-pong) is the correct L1 buffering. The memtile-side allocation is unaffected because the .link makes those buffers shared with the upstream fifo. Without this change the new contract allocates 6-7 cons_buff_* on a single core tile on Strix and aiecc's basic-sequential allocator runs out of memory.

… collapsing equal depths Cherry-picks the python-side change from upstream PR Xilinx#3096 ("Fix objectfifo declared depth as minimum"). IRON's ObjectFifo previously collapsed `[prod_depth, *cons_depths]` to a single int when all values were equal, which the `aie-objectFifo-stateful-transform` lowering interpreted as "producer depth only" and silently auto-minimized each consumer's ping-pong from max-acquire -- dropping below the user's declared depth. For multi-consumer fanout with uneven acquire patterns (one consumer must buffer ahead of a peer that's waiting on upstream data), the auto-minimize sized every pool to 2 (ping-pong) regardless of what .cons(depth=N) requested, deadlocking at runtime. Concrete symptom hit in this branch: yolo26n m6 (c3k2_heavy) bot_fifo fanout to m_0_split AND cv3_cv2. ObjectFifo declared depth=6, .cons(depth=6) on both consumers, but lowered IR allocated 2 buffers per pool and m6 standalone TIMEOUT'd. After the fix the producer pool gets 6 buffers, each consumer pool gets 6 buffers, and m6 runs bit- exact at 66.9 fps. m2 (c3k2_small, same fanout pattern, also depth=6 + .cons(depth=6)x2) was getting prod=6/cons=4 pre-fix (some other auto-sizing path); post-fix it gets the full 6 everywhere and m2 perf goes 62.4 -> 69.7 fps as a side effect. Test included upstream: `test/python/objFifo_iron_multi_cons_depth.py`. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>

…e comment + README distance-to-60 refresh Audit pass on the m4 (e7d0d89) + m6 (51c068e) commits and the README chain refresh. Two issues found, both comment-only: - kernels/yolo_c3k2_small_m0_cv1_vec.cc line 282: comment said "For m2: 2 tail tiles (13, 14)" but those are m4's tail-tile indices (kXTiles=16, kXTailStart=13). m2's are tiles 29, 30 (kXTiles=32, kXTailStart=29). Rewrote to cover both blocks explicitly. - README "Distance to 60 fps target" paragraph still said chain N=15 was 36.3 ms and "re-measurement pending". With the m2 + m4 + m6 fixes, the partial m0..m6 chain is now 14.23 ms / 70.26 fps at N=15 -- the target is crossed for everything we've measured. Refreshed the framing to call out the remaining gap as m8 / m9 / infrastructure (the only chain-non-overlapped tiles per the prior NOOP_BLOCK attribution). Verification done: - m6 lowered IR (build/aie_m6.mlir.prj/input_with_addresses.mlir): every fifo's buffer-pool size now matches its declared depth post- IRON-fix. of2 (bot_fifo, depth=6, 2 cons): 6/6/6; of0 (act_in, d=5): 5; of1 (top_fifo, d=4): 4/4; of4 (split_b, d=in_h=32): 32; of5 (inner_0_out, d=4): 4/4; of6 (inner_1_out, d=2): 2; of10 (f_cv3_cv2, d=2): 2. No latent depth mismatches in the c3k2_heavy graph. - M4_{CV1,CV2,M0CV1,M0CV2_SKIP}_SHAPE macros eyeballed against yolo_spec.py m4 Block shapes: all four match. - m4 + m6 standalone bit-exact vs ORT in the most recent clean-build sweep (mismatches=0/* both). - IRON fix (db8c0fc) matches the Python portion of upstream PR Xilinx#3096 exactly (the C++ portion of that PR is for a separate static-init- producer-locks bug; not cherry-picked here). Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>

hunhoffe and others added 4 commits May 18, 2026 18:53

hunhoffe force-pushed the fix-objectfifo-declared-depth-as-minimum branch from 6151669 to fc6772e Compare May 19, 2026 01:05

hunhoffe added 2 commits May 21, 2026 14:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Fix objectfifo declared depth as minimum#3096

[WIP] Fix objectfifo declared depth as minimum#3096
hunhoffe wants to merge 6 commits into
Xilinx:mainfrom
hunhoffe:fix-objectfifo-declared-depth-as-minimum

hunhoffe commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

hunhoffe commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant