Skip to content

[WIP] Fix objectfifo declared depth as minimum#3096

Draft
hunhoffe wants to merge 6 commits into
Xilinx:mainfrom
hunhoffe:fix-objectfifo-declared-depth-as-minimum
Draft

[WIP] Fix objectfifo declared depth as minimum#3096
hunhoffe wants to merge 6 commits into
Xilinx:mainfrom
hunhoffe:fix-objectfifo-declared-depth-as-minimum

Conversation

@hunhoffe
Copy link
Copy Markdown
Collaborator

No description provided.

hunhoffe and others added 4 commits May 18, 2026 18:53
printObjectFifoInitValues unconditionally cast $elemNumber to IntegerAttr,
which is null when $elemNumber is the ArrayAttr form (per-endpoint depths).
The subsequent .getInt() then segfaults during str() of any objectfifo op
that has both list-shaped depth AND initValues. The parser already handles
both shapes; the printer now mirrors it.

Add a roundtrip lit test.

Reduced repro (Python):

    object_fifo("test", memtile, compute, [2, 1],
                np.ndarray[(16,), np.dtype[np.int32]],
                initValues=[np.zeros(16, np.int32), np.ones(16, np.int32)])
    # str(ctx.module) -> Segmentation fault (was)

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Plumb init_values through IRON ObjectFifo to the underlying
aie.dialects.aie.object_fifo's `initValues` attribute, so designs that
want statically-initialized buffers (e.g. weight tensors held on a
MemTile, no shim load) can express it at the IRON level rather than
dropping to the low-level op.

Producer endpoint pinning is the user's responsibility — when no Worker
or rt.fill targets the producer, set it explicitly:

    of = ObjectFifo(ty, depth=2, init_values=[arr0, arr1])
    of.prod().endpoint = ObjectFifoEndpoint(AnyMemTile)

(The op verifier rejects init_values on shim producers.)

Add a unit test that builds an IRON ObjectFifo with init_values, pins
the producer to AnyMemTile, and FileChecks the lowered MLIR contains
the expected `= [dense<0>..., dense<1>...]` initial values.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
…tic-init producer

When an aie.objectfifo has init_values on a MemTile producer with no
upstream link and iter_count > 1, the DMA channel's task_count
(= iter_count - 1) restarts the BD chain after each pass. The
per-BD use_lock acquires/releases consume the source-side lock state
during the first pass, and nothing replenishes it (there is no
upstream S2MM filling the buffers) — the chain deadlocks on the
second pass.

For this configuration the locks are not needed for correctness: the
static buffers are always available, and back-pressure to the
downstream consumer is handled by the DMA stream's flow control.
Detect the (MM2S + initValues + iter_count > 1 + no link) case in
createBdBlock and omit the lock ops on the source side.

Add a lit roundtrip test in bd_chain_on_memtile/ that checks the
memtile_dma BD chain has no use_lock ops in this configuration.

Verified end-to-end on NPU Strix: prior to this change, a design that
cycles a static-init MemTile buffer N times (via iter_count) returns
ert_cmd_state.ERT_CMD_STATE_TIMEOUT; with the fix the same design
runs to completion and the host receives the cycled data.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
`_get_depths()` collapsed all-equal per-handle depths to a single int.
The stateful-transform's lowering interprets a single-int elemNumber as
"producer depth only" and auto-sizes each consumer-side ping-pong from
max-acquire (via findObjectFifoSize), silently dropping below the user's
declared depth. For multi-consumer fanout with uneven acquire patterns
— one consumer needs to buffer ahead while peers wait on upstream data
— this back-pressures the producer DMA and deadlocks at runtime.

Fix: delete _get_depths and inline the per-handle list
[prod_depth, *cons_depths] at the resolve() call site. Uniform handling
for 1-cons and N-cons; lowering always goes through the ArrayAttr
branch that honors each declared depth directly.

Regression: test/python/objFifo_iron_multi_cons_depth.py — multi-cons
fanout with all depths equal must emit [4 : i32, 4 : i32, 4 : i32], not
a single 4 : i32.
@hunhoffe hunhoffe force-pushed the fix-objectfifo-declared-depth-as-minimum branch from 6151669 to fc6772e Compare May 19, 2026 01:05
hunhoffe added 2 commits May 21, 2026 14:39
Catches the PR up to upstream and reconciles two parallel ObjectFifo
init_values implementations:

  - main (PR Xilinx#3077) added an early-return in createObjectFifoLocks that
    skips lock allocation for static-init no-link producers cycled via
    iter_count, plus a stricter test asserting no memtile locks are
    emitted at all.
  - this branch added an isCycledStaticInitProducer guard in
    createUseLocks that skips emitting use_lock ops in the BD chain
    under the same conditions.

The auto-merge took both paths, which is consistent: the early-return
makes locksPerFifo[op] empty, so the createUseLocks guard becomes a
no-op in the common case. Both safety nets are retained for now.

The conflicting test
test/objectFifo-stateful-transform/bd_chain_on_memtile/
init_values_no_link_skips_source_locks.mlir is resolved by adopting
main's stricter version (its additional CHECK-NOT lines on memtile
lock allocation are satisfied by the merged transform).
…pong

With the depth-as-minimum contract from this branch, the stateful
transform now honors the declared depth instead of silently
auto-minimizing equal-depth ObjectFifos to max-acquire. The two vision
designs were declaring forward(depth=7) / forward(depth=6) to mirror
the upstream memtile depth set by inOF_L3L2.cons(N), but the L1
consumer only ever acquires 1 element at a time, so depth=2 (standard
ping-pong) is the correct L1 buffering. The memtile-side allocation is
unaffected because the .link makes those buffers shared with the
upstream fifo.

Without this change the new contract allocates 6-7 cons_buff_* on a
single core tile on Strix and aiecc's basic-sequential allocator runs
out of memory.
hunhoffe added a commit to hunhoffe/mlir-aie that referenced this pull request May 24, 2026
… collapsing equal depths

Cherry-picks the python-side change from upstream PR Xilinx#3096 ("Fix
objectfifo declared depth as minimum"). IRON's ObjectFifo previously
collapsed `[prod_depth, *cons_depths]` to a single int when all values
were equal, which the `aie-objectFifo-stateful-transform` lowering
interpreted as "producer depth only" and silently auto-minimized each
consumer's ping-pong from max-acquire -- dropping below the user's
declared depth.

For multi-consumer fanout with uneven acquire patterns (one consumer
must buffer ahead of a peer that's waiting on upstream data), the
auto-minimize sized every pool to 2 (ping-pong) regardless of what
.cons(depth=N) requested, deadlocking at runtime.

Concrete symptom hit in this branch: yolo26n m6 (c3k2_heavy) bot_fifo
fanout to m_0_split AND cv3_cv2. ObjectFifo declared depth=6,
.cons(depth=6) on both consumers, but lowered IR allocated 2 buffers
per pool and m6 standalone TIMEOUT'd. After the fix the producer pool
gets 6 buffers, each consumer pool gets 6 buffers, and m6 runs bit-
exact at 66.9 fps. m2 (c3k2_small, same fanout pattern, also depth=6
+ .cons(depth=6)x2) was getting prod=6/cons=4 pre-fix (some other
auto-sizing path); post-fix it gets the full 6 everywhere and m2 perf
goes 62.4 -> 69.7 fps as a side effect.

Test included upstream: `test/python/objFifo_iron_multi_cons_depth.py`.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
hunhoffe added a commit to hunhoffe/mlir-aie that referenced this pull request May 24, 2026
…e comment + README distance-to-60 refresh

Audit pass on the m4 (e7d0d89) + m6 (51c068e) commits and the README
chain refresh. Two issues found, both comment-only:

- kernels/yolo_c3k2_small_m0_cv1_vec.cc line 282: comment said "For m2:
  2 tail tiles (13, 14)" but those are m4's tail-tile indices (kXTiles=16,
  kXTailStart=13). m2's are tiles 29, 30 (kXTiles=32, kXTailStart=29).
  Rewrote to cover both blocks explicitly.

- README "Distance to 60 fps target" paragraph still said chain N=15 was
  36.3 ms and "re-measurement pending". With the m2 + m4 + m6 fixes,
  the partial m0..m6 chain is now 14.23 ms / 70.26 fps at N=15 -- the
  target is crossed for everything we've measured. Refreshed the
  framing to call out the remaining gap as m8 / m9 / infrastructure
  (the only chain-non-overlapped tiles per the prior NOOP_BLOCK
  attribution).

Verification done:
- m6 lowered IR (build/aie_m6.mlir.prj/input_with_addresses.mlir):
  every fifo's buffer-pool size now matches its declared depth post-
  IRON-fix. of2 (bot_fifo, depth=6, 2 cons): 6/6/6; of0 (act_in, d=5):
  5; of1 (top_fifo, d=4): 4/4; of4 (split_b, d=in_h=32): 32; of5
  (inner_0_out, d=4): 4/4; of6 (inner_1_out, d=2): 2; of10
  (f_cv3_cv2, d=2): 2. No latent depth mismatches in the c3k2_heavy
  graph.
- M4_{CV1,CV2,M0CV1,M0CV2_SKIP}_SHAPE macros eyeballed against
  yolo_spec.py m4 Block shapes: all four match.
- m4 + m6 standalone bit-exact vs ORT in the most recent clean-build
  sweep (mismatches=0/* both).
- IRON fix (db8c0fc) matches the Python portion of upstream PR Xilinx#3096
  exactly (the C++ portion of that PR is for a separate static-init-
  producer-locks bug; not cherry-picked here).

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant