use_lock_race_condition_fix should be enabled by default for multi-tile MemTile designs #1478

@erwei-xilinx

Problem

The air-to-aie pass generates a shared lock pair for MemTile gather patterns when use_lock_race_condition_fix=false (the current default). When multiple S2MM channels write to the same L2 buffer at different offsets, this shared lock causes hardware deadlock.

The fix exists (--use-lock-race-condition-fix / use_lock_race_condition_fix=True in XRTRunner), but it is off by default. Any multi-tile design using L2 staging with gathered outputs through MemTile will deadlock without it.

Root Cause

In AIRToAIESchedulingUtils.cpp, DMAAllocator::getLockForDMA() (lines 675–684): when lockRaceConditionFix=false, the code matches locks by buffer identity only, ignoring which DMA channel or offset is involved. Both S2MM channels get the same lock pair, and the MM2S BD acquires the shared lock with acquire >= N. This deadlocks on hardware.

With lockRaceConditionFix=true, the code uses per-channel lock matching (lines 685–736), generating a separate lock pair and a separate MM2S BD for each S2MM channel. This is the correct pattern, and it matches ObjectFIFO link/join semantics.
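The difference between the two lookup modes can be illustrated with a toy model. This is only a sketch of the keying behavior described above, not the actual C++ in AIRToAIESchedulingUtils.cpp: ToyLockAllocator, LockPair, get_lock_for_dma, and the buffer name "l2_buf" are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class LockPair:
    """Hypothetical stand-in for a (producer, consumer) lock pair."""
    prod_id: int
    cons_id: int


class ToyLockAllocator:
    """Toy model of lock lookup; NOT the real DMAAllocator."""

    def __init__(self, per_channel: bool):
        self.per_channel = per_channel  # models lockRaceConditionFix
        self._table = {}
        self._ids = count()

    def get_lock_for_dma(self, buffer: str, channel: int) -> LockPair:
        # Fix off: key on buffer identity only, so every channel writing
        # to the same buffer is handed the same lock pair.
        # Fix on: key on (buffer, channel), so each channel gets its own.
        key = (buffer, channel) if self.per_channel else (buffer,)
        if key not in self._table:
            self._table[key] = LockPair(next(self._ids), next(self._ids))
        return self._table[key]


def locks_for_gather(per_channel: bool, channels=(0, 1)):
    """Return the lock pair each S2MM channel receives for one L2 buffer."""
    alloc = ToyLockAllocator(per_channel)
    return [alloc.get_lock_for_dma("l2_buf", ch) for ch in channels]


shared = locks_for_gather(per_channel=False)
separate = locks_for_gather(per_channel=True)
print(shared[0] == shared[1])      # True: both channels share one pair
print(separate[0] == separate[1])  # False: one pair per channel
```

The model only captures which lock pair each channel is given. It does not simulate the hardware semaphore behavior that leads to the deadlock.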

Evidence

Standalone hardware test (NPU2/Strix):

  • Single tile → MemTile → Shim: PASS (both lock schemes)
  • Two tiles → MemTile gather → Shim, shared lock: DEADLOCK (timeout)
  • Two tiles → MemTile gather → Shim, separate locks: PASS

Suggestion

Enable use_lock_race_condition_fix by default. However, there are potential concerns:

  1. Ping-pong interaction: The fix inserts extra dummy DMA BDs. With ping-pong buffering enabled, this may exceed available BD slots or MemTile memory on some designs. The bf16 GEMV example currently uses omit_pingpong=True to avoid L1/L2 memory pressure with large K dimensions — it's unclear whether the lock fix + ping-pong would cause issues on other designs.

  2. Existing test failures: The fix may change lock/BD allocation patterns enough to break designs that were tuned for the current (incorrect) behavior. The existing test suite should be run with the fix enabled to identify any regressions.

  3. BD count overhead: The fix generates separate MM2S BDs per S2MM source channel instead of one combined BD. For designs with many gathered channels, this could exhaust MemTile BD slots.
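To get a rough feel for concern 3, the BD count can be estimated with a few lines of arithmetic. This is a back-of-the-envelope sketch under stated assumptions, not what the pass actually emits: it assumes one S2MM BD per gathered channel, one combined MM2S BD without the fix versus one per channel with it, that ping-pong doubles the count, and that any extra dummy BDs are supplied by the caller (the real number is not given here). The function name and the budget of 48 in the usage lines are hypothetical and should be checked against the device documentation.

```python
def memtile_bd_estimate(n_channels, fix_enabled, pingpong=False, dummy_bds=0):
    """Rough MemTile BD count for one gathered L2 buffer (assumptions only)."""
    s2mm = n_channels                          # assumed: one S2MM BD per channel
    mm2s = n_channels if fix_enabled else 1    # per-channel vs combined MM2S BD
    total = s2mm + mm2s
    if pingpong:
        total *= 2                             # assumed: ping-pong doubles BDs
    if fix_enabled:
        total += dummy_bds                     # caller-supplied; real count unknown
    return total


ASSUMED_BD_BUDGET = 48  # hypothetical budget; verify for the target device

for n in (2, 4, 8, 16):
    used = memtile_bd_estimate(n, fix_enabled=True, pingpong=True)
    print(n, used, "fits" if used <= ASSUMED_BD_BUDGET else "exceeds budget")
```

Under these assumptions the fix roughly doubles the MM2S share of the BD count, so the overhead only matters once the number of gathered channels per buffer is large.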

Reproducer

cd programming_examples/matrix_vector_multiplication/bf16
# Deadlocks without fix:
make run M=256 K=64 TILE_M_L2=64 M_INPUT=16 HERD_M=2
# Passes with fix (use_lock_race_condition_fix=True in XRTRunner):
# See PR #1477

References

  • PR #1477 (Scale bf16 GEMV example to multi-column L2 design): enables the fix for the bf16 GEMV example
  • mlir/lib/Conversion/AIRToAIESchedulingUtils.cpp:675-684 — buggy code path
  • mlir/lib/Conversion/AIRToAIESchedulingUtils.cpp:685-736 — fix code path
  • mlir-aie/test/objectFifo-stateful-transform/repeat_count/link_join_repeat_count_test.mlir — known-working separate-lock gather pattern
