test/npu-xrt/memtile_dmas/* tests TDR on Phoenix (NPU1) — do these actually pass in CI? #3062

@FIM43-Redeye


All five tests under test/npu-xrt/memtile_dmas/ reproducibly TDR on Phoenix
(NPU1) hardware running the latest mlir-aie HEAD. The same tests pass under
the in-emulator run path (xdna-emu), so the test logic itself is correct;
the hang is at the firmware/DMA-execution layer.

I'm filing this to ask whether these tests actually pass in upstream CI on
the amd7940hs runner today, since the upstream lit summary in CI logs
only enumerates failures/skips, not passes — so I can't tell from the
public log whether they ran successfully or were silently Unsupported.
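One way to get per-test evidence instead of an aggregate count is to have lit write its results to a file with -o/--output and then filter that file. The sketch below assumes the file is JSON with a top-level "tests" list whose entries carry "name" and "code" fields; the helper name results_for and the file name results.json are made up for illustration.

```python
import json
import sys


def results_for(lit_json_text, needle="memtile_dmas"):
    """Map test name -> lit result code for tests whose name contains `needle`.

    Assumes the layout {"tests": [{"name": ..., "code": ...}, ...]}.
    """
    report = json.loads(lit_json_text)
    return {
        test["name"]: test["code"]
        for test in report.get("tests", [])
        if needle in test["name"]
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python lit_results.py results.json
    with open(sys.argv[1]) as handle:
        for name, code in sorted(results_for(handle.read()).items()):
            print(f"{code:12s} {name}")
```

Run against a CI artifact, this would show directly whether each of the five tests was recorded as PASS or UNSUPPORTED, provided the workflow keeps the lit output file.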

Affected tests

test/npu-xrt/memtile_dmas/blockwrite_using_locks
test/npu-xrt/memtile_dmas/dma_configure_task_lock
test/npu-xrt/memtile_dmas/dma_configure_task_token
test/npu-xrt/memtile_dmas/writebd
test/npu-xrt/memtile_dmas/writebd_tokens

All five fail identically. The simplest standalone repro is writebd.

Reproducer

Run via native lit (no custom test infrastructure):

$ /path/to/ironenv/bin/lit -v --filter "memtile_dmas/writebd/run.lit" build/test/

test.exe blocks indefinitely in drm_syncobj_array_wait_timeout. The
runtime sequence completes (every aiex.npu.writebd and aiex.npu.write32
dispatches successfully), but the final aiex.npu.sync never receives its
TCT from shim S2MM. dmesg confirms the kernel context never makes
forward progress:

amdxdna 0000:c6:00.1: aie2_tdr_work: Device isn't making progress... Count N timeout 2 dump_only 1
amdxdna 0000:c6:00.1: aie2_dump_ctx: Dumping ctx ...
        op: 0x0
        msg: 0x1d000001
        fence: unsignaled
        out_fence: unsignaled

Removing the runtime-sequence sync lets the kernel return cleanly (test
fails on data check rather than TDR), confirming the hang is exclusively
at sync time waiting for shim S2MM completion.
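For anyone comparing runs, a small script can count TDR reports per device in a saved dmesg capture. This is only a sketch: the pattern is derived from the single message format quoted above, and other driver versions may word it differently. The function name count_tdr_reports is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
#   amdxdna 0000:c6:00.1: aie2_tdr_work: Device isn't making progress...
TDR_PATTERN = re.compile(
    r"amdxdna (?P<bdf>[0-9a-fA-F:.]+): aie2_tdr_work: "
    r"Device isn't making progress"
)


def count_tdr_reports(dmesg_text):
    """Return a Counter mapping PCI address -> number of TDR reports seen."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = TDR_PATTERN.search(line)
        if match:
            counts[match.group("bdf")] += 1
    return counts
```

A non-empty result after running a test indicates the driver's TDR worker fired at least once for that device during the capture window.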

What we ruled out empirically

We constructed a series of minimal MLIR variants to isolate the trigger.
All were run on the same Phoenix hardware.

Variant                                                        Result
Single aiex.npu.write32 to memtile (col 0, row 1) LOCK0_VALUE  PASS
Single aiex.npu.writebd to memtile (program BD, no exec)       PASS
writebd + push to memtile S2MM TASK_QUEUE (channel start)      PASS
Full writebd test with locks and chains                        TDR
Full writebd with use_next_bd=0 (no self-loop)                 TDR
Full writebd with all lock_acq_enable=0 (no locks)             TDR
add_one_using_dma (static aie.memtile_dma block)               PASS

So:

  • Runtime-sequence writes to memtile registers work fine on Phoenix.
  • Single-BD memtile programming + channel start works.
  • Static (CDO-time) memtile programming works.
  • Multi-channel runtime-programmed memtile DMA flow (shim → memtile → shim)
    never delivers data to the shim S2MM receiver.

The bug is independent of self-looping next_bd and independent of locks.
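The elimination argument can be written down as a set intersection over the failing variants. The feature labels below are our own shorthand for the three TDR rows, not names from mlir-aie, and the sketch assumes the no-locks variant kept the self-loop; the intersection is the same either way.

```python
# Features present in each variant that TDRs (labels are illustrative).
TDR_VARIANTS = {
    "full writebd": {
        "runtime_bd_programming", "shim_memtile_shim_flow",
        "locks", "self_loop",
    },
    "use_next_bd=0": {
        "runtime_bd_programming", "shim_memtile_shim_flow", "locks",
    },
    "lock_acq_enable=0": {
        "runtime_bd_programming", "shim_memtile_shim_flow", "self_loop",
    },
}


def common_features(variants):
    """Features shared by every failing variant: the remaining suspects."""
    return set.intersection(*variants.values())


def ruled_out(variants):
    """Features absent from at least one failing variant: not required."""
    return set.union(*variants.values()) - common_features(variants)


if __name__ == "__main__":
    print("still suspect:", sorted(common_features(TDR_VARIANTS)))
    print("ruled out:    ", sorted(ruled_out(TDR_VARIANTS)))
```

Only runtime BD programming and the full shim → memtile → shim flow survive the intersection, which matches the summary above.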

Environment

Component         Version
CPU               AMD Ryzen 9 7940HS (same family as the upstream amd7940hs runner)
NPU               Phoenix (NPU Phoenix, aie2, 6×5 topology)
NPU Firmware      1.5.5.391 (per xrt-smi examine)
XRT               2.23.0
amdxdna driver    2.23.0_20260509 (xdna-driver HEAD c347d62)
mlir-aie          HEAD b37dc33d41
llvm-aie / Peano  latest (compile path used: chess)
aietools          RyzenAI 2025.2 / Vitis AIE Essentials

The upstream CI Phoenix runner (amd7940hs) appears to use
/opt/ryzen_ai-1.3.0.1/vitis_aie_essentials per the workflow logs.
We're a few minor versions ahead on RyzenAI.

Confidence the bug is HW/FW-specific

  • xdna-emu (in-process emulator) runs the same xclbin and runtime sequence
    through its own DMA model and reports PASS!. So the lowering and
    test logic are sound; only the on-silicon execution diverges.
  • add_one_using_dma, which exercises the same shim ↔ memtile ↔ shim flow
    but programs the memtile DMA statically via the aie.memtile_dma
    block (encoded into CDO at xclbin-load time), passes on the same
    hardware. Only the runtime-sequence-programmed path TDRs.
  • xrt-smi validate passes; the device is healthy at SMI level.

Questions

  1. Do these five tests currently pass on amd7940hs in upstream CI? The
    visible log lists only Unsupported and Failed tests; if these five
    are included in the Passed total, that aggregate count is the only
    evidence, and it does not identify individual tests.
  2. If they pass in CI: what NPU firmware version is the runner using?
    We suspect a regression between the firmware on the CI runner
    (RyzenAI 1.3.0.1 install) and our firmware 1.5.x, since firmware is
    the only obvious environmental delta.
  3. If they fail (or are silently Unsupported) in CI too: would you
    accept a PR marking these as XFAIL on ryzen_ai_npu1 with a
    reference to this issue, until the underlying firmware/runtime
    issue is resolved?

Related

  • The recent fix #2893 ("Fix memtile DMA BD address missing base offset")
    loosened these tests' REQUIRES from ryzen_ai_npu1 to ryzen_ai,
    but the lowering change in that PR explicitly only applies to
    AIE2p; the AIE2/NPU1 path is unchanged. So if the tests ever passed
    on Phoenix, that wasn't via #2893.
  • Background context on Phoenix firmware limitations we've documented
    separately: xrt::hw_context::read_aie_reg returns successfully for
    compute-tile reads but never responds for memtile reads on the same
    firmware version. The driver kills the user-context mailbox on the
    resulting 5s timeout, which then cascades to a drm_dev_unplug
    wedge during modprobe -r. We can include details if that's
    potentially related — it suggests memtile runtime access in
    general is incompletely supported in Phoenix firmware 1.5.x.

Happy to provide the full lit/dmesg traces or any additional test
variations on request.

Labels: triaged (This has been looked at and triaged)
