use_lock_race_condition_fix should be enabled by default for multi-tile MemTile designs #1478

## Problem

The `air-to-aie` pass generates a **shared lock pair** for MemTile gather patterns when `use_lock_race_condition_fix=false` (the current default). When multiple S2MM channels write to the same L2 buffer at different offsets, this shared lock causes a **hardware deadlock**.

The fix exists (`--use-lock-race-condition-fix` / `use_lock_race_condition_fix=True` in XRTRunner), but it is off by default. Any multi-tile design using L2 staging with gathered outputs through MemTile will deadlock without it.

## Root Cause

In `AIRToAIESchedulingUtils.cpp`, `DMAAllocator::getLockForDMA()` (lines 675–684): when `lockRaceConditionFix=false`, the code matches locks by **buffer identity only**, ignoring which DMA channel or offset is involved. Both S2MM channels get the same lock pair, and the MM2S BD acquires the shared lock with `acquire >= N`. This deadlocks on hardware.

With `lockRaceConditionFix=true`, the code uses per-channel lock matching (lines 685–736), generating **separate lock pairs per S2MM channel** and separate MM2S BDs. This is the correct pattern and matches ObjectFIFO link/join semantics.
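The difference between the two lookup strategies can be illustrated with a toy model. This is only a sketch: `LockAllocator`, `get_lock_pair`, the buffer name `l2_buf`, and the channel labels are hypothetical stand-ins, not the actual `DMAAllocator` code, and the model shows only how the lookup key determines lock sharing, not the hardware behavior.

```python
from itertools import count


class LockAllocator:
    """Toy stand-in for lock lookup (hypothetical; not the real DMAAllocator)."""

    def __init__(self, per_channel):
        self.per_channel = per_channel  # models lockRaceConditionFix
        self._ids = count()
        self._table = {}

    def get_lock_pair(self, buffer, channel):
        # Without the fix, the key is the buffer alone, so every channel
        # writing to that buffer receives the same lock pair.
        key = (buffer, channel) if self.per_channel else buffer
        if key not in self._table:
            self._table[key] = (next(self._ids), next(self._ids))
        return self._table[key]


def assign(per_channel):
    """Return the lock pairs handed to two S2MM channels on one L2 buffer."""
    alloc = LockAllocator(per_channel)
    return [alloc.get_lock_pair("l2_buf", ch) for ch in ("S2MM0", "S2MM1")]


shared = assign(per_channel=False)    # both channels get the same pair
separate = assign(per_channel=True)   # each channel gets its own pair
print("shared:", shared)
print("separate:", separate)
```

With buffer-only matching both channels end up on one lock pair, which is the configuration reported to deadlock in the Evidence section; with per-channel matching each channel has its own pair.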

## Evidence

Standalone hardware test (NPU2/Strix):
- Single tile → MemTile → Shim: **PASS** (both lock schemes)
- Two tiles → MemTile gather → Shim, shared lock: **DEADLOCK** (timeout)
- Two tiles → MemTile gather → Shim, separate locks: **PASS**

## Suggestion

Enable `use_lock_race_condition_fix` by default. However, there are potential concerns:

1. **Ping-pong interaction**: The fix inserts extra dummy DMA BDs. With ping-pong buffering enabled, this may exceed available BD slots or MemTile memory on some designs. The bf16 GEMV example currently uses `omit_pingpong=True` to avoid L1/L2 memory pressure with large K dimensions. It is unclear whether combining the lock fix with ping-pong would cause issues on other designs.

2. **Existing test failures**: The fix may change lock/BD allocation patterns enough to break designs that were tuned for the current (incorrect) behavior. The existing test suite should be run with the fix enabled to identify any regressions.

3. **BD count overhead**: The fix generates separate MM2S BDs per S2MM source channel instead of one combined BD. For designs with many gathered channels, this could exhaust MemTile BD slots.
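To get a rough feel for concern 3, the MM2S BD growth can be written down as simple arithmetic. This is a sketch under stated assumptions: it counts only the MM2S BDs described above (one combined BD without the fix, one per S2MM source channel with it), ignores the extra dummy BDs and any ping-pong duplication, and treats `bd_budget` and `other_bds` as hypothetical parameters to be filled in from the actual device and design.

```python
def mm2s_bd_count(num_s2mm_channels, lock_fix):
    """MM2S BDs for one gather: one combined BD, or one per source channel."""
    return num_s2mm_channels if lock_fix else 1


def fits(num_s2mm_channels, lock_fix, bd_budget, other_bds=0):
    """Check the count against a caller-supplied BD budget (hypothetical)."""
    return mm2s_bd_count(num_s2mm_channels, lock_fix) + other_bds <= bd_budget


# Compare BD counts with and without the fix for a few gather widths.
for n in (2, 4, 8):
    print(n, mm2s_bd_count(n, lock_fix=False), mm2s_bd_count(n, lock_fix=True))
```

The point is only that the MM2S BD count scales linearly with the number of gathered channels once the fix is on, so wide gathers are the designs to check first when running the test suite with the fix enabled.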

## Reproducer

```bash
cd programming_examples/matrix_vector_multiplication/bf16
# Deadlocks without fix:
make run M=256 K=64 TILE_M_L2=64 M_INPUT=16 HERD_M=2
# Passes with fix (use_lock_race_condition_fix=True in XRTRunner):
# See PR #1477
```

## References

- PR #1477: enables the fix for the bf16 GEMV example
- `mlir/lib/Conversion/AIRToAIESchedulingUtils.cpp:675-684` — buggy code path
- `mlir/lib/Conversion/AIRToAIESchedulingUtils.cpp:685-736` — fix code path
- `mlir-aie/test/objectFifo-stateful-transform/repeat_count/link_join_repeat_count_test.mlir` — known-working separate-lock gather pattern
