Problem
The air-to-aie pass generates a shared lock pair for MemTile gather patterns when use_lock_race_condition_fix=false (the current default). When multiple S2MM channels write to the same L2 buffer at different offsets, this shared lock causes hardware deadlock.
The fix exists (--use-lock-race-condition-fix / use_lock_race_condition_fix=True in XRTRunner), but it is off by default. Any multi-tile design using L2 staging with gathered outputs through MemTile will deadlock without it.
Root Cause
In AIRToAIESchedulingUtils.cpp, DMAAllocator::getLockForDMA() (lines 675–684): when lockRaceConditionFix=false, the code matches locks by buffer identity only, ignoring which DMA channel or offset is involved. Both S2MM channels get the same lock pair, and the MM2S BD acquires the shared lock with acquire >= N. This deadlocks on hardware.
With lockRaceConditionFix=true, the code uses per-channel lock matching (lines 685–736), generating separate lock pairs per S2MM channel and separate MM2S BDs. This is the correct pattern: it matches ObjectFIFO link/join semantics.
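The difference between the two code paths comes down to which key is used to look up a lock pair. The toy model below sketches that idea only; it is not the logic in getLockForDMA(). The S2mmWriter record, the assign_lock_pairs helper, and the buffer name are hypothetical, and the model does not cover acquire/release values or the hardware behavior itself.

```python
from collections import namedtuple

# Hypothetical record for one S2MM channel writing into an L2 buffer.
# Not a type from mlir-air; it exists only for this illustration.
S2mmWriter = namedtuple("S2mmWriter", ["buffer", "channel", "offset"])


def assign_lock_pairs(writers, per_channel):
    """Map each writer to a lock-pair id, creating a new pair per unseen key.

    per_channel=False keys on buffer identity only (the shared-lock case).
    per_channel=True keys on (buffer, channel) (the separate-lock case).
    """
    pair_ids = {}
    assignment = {}
    for writer in writers:
        key = (writer.buffer, writer.channel) if per_channel else writer.buffer
        if key not in pair_ids:
            pair_ids[key] = len(pair_ids)
        assignment[writer] = pair_ids[key]
    return assignment


# Two S2MM channels gathering into the same L2 buffer at different offsets.
writers = [
    S2mmWriter(buffer="l2_out", channel=0, offset=0),
    S2mmWriter(buffer="l2_out", channel=1, offset=64),
]

shared = assign_lock_pairs(writers, per_channel=False)
separate = assign_lock_pairs(writers, per_channel=True)

print(len(set(shared.values())))    # 1: both channels map to one lock pair
print(len(set(separate.values())))  # 2: each channel gets its own lock pair
```

With buffer-only keying, the offset and channel fields never influence the result, which mirrors the description of the lockRaceConditionFix=false path above.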
Evidence
Standalone hardware test (NPU2/Strix):
Single tile → MemTile → Shim: PASS (both lock schemes)
Two tiles → MemTile gather → Shim, separate locks: PASS
Suggestion
Enable use_lock_race_condition_fix by default. However, there are potential concerns:
Ping-pong interaction: The fix inserts extra dummy DMA BDs. With ping-pong buffering enabled, this may exceed available BD slots or MemTile memory on some designs. The bf16 GEMV example currently uses omit_pingpong=True to avoid L1/L2 memory pressure with large K dimensions — it's unclear whether the lock fix + ping-pong would cause issues on other designs.
Existing test failures: The fix may change lock/BD allocation patterns enough to break designs that were tuned for the current (incorrect) behavior. The existing test suite should be run with the fix enabled to identify any regressions.
BD count overhead: The fix generates separate MM2S BDs per S2MM source channel instead of one combined BD. For designs with many gathered channels, this could exhaust MemTile BD slots.
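The BD count concern can be written as simple arithmetic. This is a rough sketch that assumes exactly what is stated above: one MM2S BD per gathered S2MM channel with the fix, one combined BD without it. The helper names are hypothetical, the number of free BD slots is a caller-supplied assumption rather than a hardware figure, and the extra dummy BDs are not counted because their number is not given here.

```python
def mm2s_bd_count(num_gathered_channels, lock_fix_enabled):
    """MM2S BDs needed for one gathered L2 buffer under the stated assumption."""
    if num_gathered_channels < 1:
        raise ValueError("need at least one gathered S2MM channel")
    # With the fix: one MM2S BD per S2MM source channel.
    # Without the fix: a single combined MM2S BD.
    return num_gathered_channels if lock_fix_enabled else 1


def fits_bd_budget(num_gathered_channels, lock_fix_enabled, free_bd_slots):
    """True if the MM2S BDs fit in the slots the caller says are still free."""
    needed = mm2s_bd_count(num_gathered_channels, lock_fix_enabled)
    return needed <= free_bd_slots


# Example: 4 gathered channels, with a hypothetical 3 BD slots left over.
print(fits_bd_budget(4, lock_fix_enabled=False, free_bd_slots=3))  # True
print(fits_bd_budget(4, lock_fix_enabled=True, free_bd_slots=3))   # False
```

A check like this only bounds the MM2S side; the actual slot usage of a design would have to be read from the generated AIE output.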
Reproducer
cd programming_examples/matrix_vector_multiplication/bf16
# Deadlocks without fix:
make run M=256 K=64 TILE_M_L2=64 M_INPUT=16 HERD_M=2
# Passes with fix (use_lock_race_condition_fix=True in XRTRunner):
# See PR #1477
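When running the reproducer unattended, it can help to wrap the command in a timeout so that a hang is reported instead of blocking the session. The sketch below assumes the deadlock shows up as a process that never exits, which is not confirmed here; the run_with_timeout helper and the 120-second limit are hypothetical choices.

```python
import subprocess


def run_with_timeout(cmd, timeout_s, cwd=None):
    """Run cmd and classify the outcome as 'PASS', 'FAIL', or 'TIMEOUT'."""
    try:
        proc = subprocess.run(
            cmd,
            cwd=cwd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child on timeout; processes spawned
        # by that child (for example by make) may need separate cleanup.
        return "TIMEOUT"
    return "PASS" if proc.returncode == 0 else "FAIL"


# Example use with the reproducer above (run from the repository root):
#
# status = run_with_timeout(
#     ["make", "run", "M=256", "K=64", "TILE_M_L2=64", "M_INPUT=16", "HERD_M=2"],
#     timeout_s=120,
#     cwd="programming_examples/matrix_vector_multiplication/bf16",
# )
# print(status)
```

A 'TIMEOUT' result is only a hint of a deadlock; a slow build or a busy device would look the same, so the limit should be set well above a normal passing run.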
References
mlir/lib/Conversion/AIRToAIESchedulingUtils.cpp:675-684 (buggy code path)
mlir/lib/Conversion/AIRToAIESchedulingUtils.cpp:685-736 (fix code path)
mlir-aie/test/objectFifo-stateful-transform/repeat_count/link_join_repeat_count_test.mlir (known-working separate-lock gather pattern)