[WIP] Fix objectfifo declared depth as minimum#3096
Draft
hunhoffe wants to merge 6 commits into
Draft
Conversation
printObjectFifoInitValues unconditionally cast $elemNumber to IntegerAttr,
which is null when $elemNumber is the ArrayAttr form (per-endpoint depths).
The subsequent .getInt() then segfaults during str() of any objectfifo op
that has both list-shaped depth AND initValues. The parser already handles
both shapes; the printer now mirrors it.
Add a roundtrip lit test.
Reduced repro (Python):
object_fifo("test", memtile, compute, [2, 1],
np.ndarray[(16,), np.dtype[np.int32]],
initValues=[np.zeros(16, np.int32), np.ones(16, np.int32)])
# str(ctx.module) -> Segmentation fault (was)
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Plumb init_values through IRON ObjectFifo to the underlying
aie.dialects.aie.object_fifo's `initValues` attribute, so designs that
want statically-initialized buffers (e.g. weight tensors held on a
MemTile, no shim load) can express it at the IRON level rather than
dropping to the low-level op.
Producer endpoint pinning is the user's responsibility — when no Worker
or rt.fill targets the producer, set it explicitly:
of = ObjectFifo(ty, depth=2, init_values=[arr0, arr1])
of.prod().endpoint = ObjectFifoEndpoint(AnyMemTile)
(The op verifier rejects init_values on shim producers.)
Add a unit test that builds an IRON ObjectFifo with init_values, pins
the producer to AnyMemTile, and FileChecks the lowered MLIR contains
the expected `= [dense<0>..., dense<1>...]` initial values.
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
…tic-init producer When an aie.objectfifo has init_values on a MemTile producer with no upstream link and iter_count > 1, the DMA channel's task_count (= iter_count - 1) restarts the BD chain after each pass. The per-BD use_lock acquires/releases consume the source-side lock state during the first pass, and nothing replenishes it (there is no upstream S2MM filling the buffers) — the chain deadlocks on the second pass. For this configuration the locks are not needed for correctness: the static buffers are always available, and back-pressure to the downstream consumer is handled by the DMA stream's flow control. Detect the (MM2S + initValues + iter_count > 1 + no link) case in createBdBlock and omit the lock ops on the source side. Add a lit roundtrip test in bd_chain_on_memtile/ that checks the memtile_dma BD chain has no use_lock ops in this configuration. Verified end-to-end on NPU Strix: prior to this change, a design that cycles a static-init MemTile buffer N times (via iter_count) returns ert_cmd_state.ERT_CMD_STATE_TIMEOUT; with the fix the same design runs to completion and the host receives the cycled data. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
`_get_depths()` collapsed all-equal per-handle depths to a single int. The stateful-transform's lowering interprets a single-int elemNumber as "producer depth only" and auto-sizes each consumer-side ping-pong from max-acquire (via findObjectFifoSize), silently dropping below the user's declared depth. For multi-consumer fanout with uneven acquire patterns — one consumer needs to buffer ahead while peers wait on upstream data — this back-pressures the producer DMA and deadlocks at runtime. Fix: delete _get_depths and inline the per-handle list [prod_depth, *cons_depths] at the resolve() call site. Uniform handling for 1-cons and N-cons; lowering always goes through the ArrayAttr branch that honors each declared depth directly. Regression: test/python/objFifo_iron_multi_cons_depth.py — multi-cons fanout with all depths equal must emit [4 : i32, 4 : i32, 4 : i32], not a single 4 : i32.
6151669 to
fc6772e
Compare
Catches the PR up to upstream and reconciles two parallel ObjectFifo init_values implementations: - main (PR Xilinx#3077) added an early-return in createObjectFifoLocks that skips lock allocation for static-init no-link producers cycled via iter_count, plus a stricter test asserting no memtile locks are emitted at all. - this branch added an isCycledStaticInitProducer guard in createUseLocks that skips emitting use_lock ops in the BD chain under the same conditions. The auto-merge took both paths, which is consistent: the early-return makes locksPerFifo[op] empty, so the createUseLocks guard becomes a no-op in the common case. Both safety nets are retained for now. The conflicting test test/objectFifo-stateful-transform/bd_chain_on_memtile/ init_values_no_link_skips_source_locks.mlir is resolved by adopting main's stricter version (its additional CHECK-NOT lines on memtile lock allocation are satisfied by the merged transform).
…pong With the depth-as-minimum contract from this branch, the stateful transform now honors the declared depth instead of silently auto-minimizing equal-depth ObjectFifos to max-acquire. The two vision designs were declaring forward(depth=7) / forward(depth=6) to mirror the upstream memtile depth set by inOF_L3L2.cons(N), but the L1 consumer only ever acquires 1 element at a time, so depth=2 (standard ping-pong) is the correct L1 buffering. The memtile-side allocation is unaffected because the .link makes those buffers shared with the upstream fifo. Without this change the new contract allocates 6-7 cons_buff_* on a single core tile on Strix and aiecc's basic-sequential allocator runs out of memory.
hunhoffe
added a commit
to hunhoffe/mlir-aie
that referenced
this pull request
May 24, 2026
… collapsing equal depths Cherry-picks the python-side change from upstream PR Xilinx#3096 ("Fix objectfifo declared depth as minimum"). IRON's ObjectFifo previously collapsed `[prod_depth, *cons_depths]` to a single int when all values were equal, which the `aie-objectFifo-stateful-transform` lowering interpreted as "producer depth only" and silently auto-minimized each consumer's ping-pong from max-acquire -- dropping below the user's declared depth. For multi-consumer fanout with uneven acquire patterns (one consumer must buffer ahead of a peer that's waiting on upstream data), the auto-minimize sized every pool to 2 (ping-pong) regardless of what .cons(depth=N) requested, deadlocking at runtime. Concrete symptom hit in this branch: yolo26n m6 (c3k2_heavy) bot_fifo fanout to m_0_split AND cv3_cv2. ObjectFifo declared depth=6, .cons(depth=6) on both consumers, but lowered IR allocated 2 buffers per pool and m6 standalone TIMEOUT'd. After the fix the producer pool gets 6 buffers, each consumer pool gets 6 buffers, and m6 runs bit- exact at 66.9 fps. m2 (c3k2_small, same fanout pattern, also depth=6 + .cons(depth=6)x2) was getting prod=6/cons=4 pre-fix (some other auto-sizing path); post-fix it gets the full 6 everywhere and m2 perf goes 62.4 -> 69.7 fps as a side effect. Test included upstream: `test/python/objFifo_iron_multi_cons_depth.py`. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
hunhoffe
added a commit
to hunhoffe/mlir-aie
that referenced
this pull request
May 24, 2026
…e comment + README distance-to-60 refresh Audit pass on the m4 (e7d0d89) + m6 (51c068e) commits and the README chain refresh. Two issues found, both comment-only: - kernels/yolo_c3k2_small_m0_cv1_vec.cc line 282: comment said "For m2: 2 tail tiles (13, 14)" but those are m4's tail-tile indices (kXTiles=16, kXTailStart=13). m2's are tiles 29, 30 (kXTiles=32, kXTailStart=29). Rewrote to cover both blocks explicitly. - README "Distance to 60 fps target" paragraph still said chain N=15 was 36.3 ms and "re-measurement pending". With the m2 + m4 + m6 fixes, the partial m0..m6 chain is now 14.23 ms / 70.26 fps at N=15 -- the target is crossed for everything we've measured. Refreshed the framing to call out the remaining gap as m8 / m9 / infrastructure (the only chain-non-overlapped tiles per the prior NOOP_BLOCK attribution). Verification done: - m6 lowered IR (build/aie_m6.mlir.prj/input_with_addresses.mlir): every fifo's buffer-pool size now matches its declared depth post- IRON-fix. of2 (bot_fifo, depth=6, 2 cons): 6/6/6; of0 (act_in, d=5): 5; of1 (top_fifo, d=4): 4/4; of4 (split_b, d=in_h=32): 32; of5 (inner_0_out, d=4): 4/4; of6 (inner_1_out, d=2): 2; of10 (f_cv3_cv2, d=2): 2. No latent depth mismatches in the c3k2_heavy graph. - M4_{CV1,CV2,M0CV1,M0CV2_SKIP}_SHAPE macros eyeballed against yolo_spec.py m4 Block shapes: all four match. - m4 + m6 standalone bit-exact vs ORT in the most recent clean-build sweep (mismatches=0/* both). - IRON fix (db8c0fc) matches the Python portion of upstream PR Xilinx#3096 exactly (the C++ portion of that PR is for a separate static-init- producer-locks bug; not cherry-picked here). Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.