Trace timing mismatch with NPU time #2785

yashpalyv · 2025-12-25T12:50:42Z

A simple pass-through test case is implemented in which 256 kilobytes move from L3 -> shimDMA(0,0) -> Core(0,2) and back in chunks of 4096 bytes. The data is verified after the pass-through. Trace is enabled on both Core(0,2) and the shim tile shimDMA(0,0).
Pass through code

v64uint8 *restrict outPtr = (v64uint8 *)out;
v64uint8 *restrict inPtr = (v64uint8 *)in;

AIE_PREPARE_FOR_PIPELINING
AIE_LOOP_MIN_ITERATION_COUNT(6)
for (int j = 0; j < (height * width); j += N) // Nx samples per loop
{
*outPtr++ = *inPtr++;
}
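For checking the data path independently of the hardware, the loop above can be modelled on the host. The following is only a sketch under stated assumptions: it assumes one `v64uint8` vector holds 64 bytes, it uses `std::memcpy` on 64-byte blocks in place of the AIE vector load/store, and the names `kVectorBytes`, `passthrough_reference`, and `verify_passthrough` are hypothetical and do not come from the original test code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Assumption: one v64uint8 vector holds 64 bytes.
constexpr std::size_t kVectorBytes = 64;

// Hypothetical host-side model of the kernel loop: copies `bytes` bytes from
// `in` to `out` one 64-byte block at a time.
inline void passthrough_reference(const std::uint8_t *in, std::uint8_t *out,
                                  std::size_t bytes) {
  assert(bytes % kVectorBytes == 0);  // the model only handles whole vectors
  for (std::size_t j = 0; j < bytes; j += kVectorBytes) {
    std::memcpy(out + j, in + j, kVectorBytes);
  }
}

// Hypothetical check: pushes `total` bytes through the model in `chunk`-sized
// pieces and reports whether the output matches the input.
inline bool verify_passthrough(std::size_t total, std::size_t chunk) {
  assert(chunk != 0 && total % chunk == 0);
  std::vector<std::uint8_t> src(total);
  std::vector<std::uint8_t> dst(total, 0);
  for (std::size_t i = 0; i < total; ++i) {
    src[i] = static_cast<std::uint8_t>(i * 31 + 7);  // arbitrary test pattern
  }
  for (std::size_t off = 0; off < total; off += chunk) {
    passthrough_reference(src.data() + off, dst.data() + off, chunk);
  }
  return src == dst;
}
```

Calling `verify_passthrough(256 * 1024, 4096)` mirrors the sizes described in the post (256 KB in 4096-byte chunks). It says nothing about timing; it only gives a reference result to compare the NPU output against.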

Observations from the trace for the transfer of one 4096-byte block

Core(0,2) lock stall duration: ~850 us

Core(0,2) running state duration: ~155 us

Overall NPU execution time: ~650 us

Concerns

 The overall execution time for transferring 256 KB is around 650 us, yet in the trace a single 4096-byte transfer appears to take around 1000 us (lock stall plus running time).

 In the trace, the vector copy of 4096 bytes from the input buffer to the output buffer takes around 155 us, which is very long for an AIE core.
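The size of the mismatch can be made concrete with a little arithmetic. The sketch below uses only the figures reported above, plus two labelled assumptions that are not from the post: 64 bytes per vector copy and a hypothetical 1 GHz core clock. The function names are illustrative. It does not explain the gap; it only quantifies it, which may help when checking which time unit the trace viewer is actually reporting.

```cpp
#include <cassert>
#include <cstdint>

// Figures reported in the post.
constexpr std::uint64_t kTotalBytes = 256 * 1024;  // whole transfer
constexpr std::uint64_t kChunkBytes = 4096;        // one block
constexpr double kTotalTimeUs = 650.0;             // overall NPU time
constexpr double kTraceRunningUs = 155.0;          // core running, per block

// Assumptions, NOT from the post.
constexpr std::uint64_t kVectorBytes = 64;   // bytes moved per loop iteration
constexpr double kAssumedClockMHz = 1000.0;  // hypothetical 1 GHz core clock

// Number of blocks needed to move `total` bytes in `chunk`-byte pieces.
inline std::uint64_t chunk_count(std::uint64_t total, std::uint64_t chunk) {
  assert(chunk != 0 && total % chunk == 0);
  return total / chunk;
}

// Average time available per block if the whole transfer takes `total_us`.
inline double per_chunk_budget_us(double total_us, std::uint64_t chunks) {
  return total_us / static_cast<double>(chunks);
}

// Cycles per vector copy implied by a running time of `running_us` per block
// (MHz is the same as cycles per microsecond).
inline double cycles_per_vector(double running_us, double clock_mhz,
                                std::uint64_t vectors) {
  return running_us * clock_mhz / static_cast<double>(vectors);
}
```

With the reported numbers, `chunk_count(kTotalBytes, kChunkBytes)` is 64, so the 650 us total leaves roughly 10 us per block on average, about two orders of magnitude less than the ~1000 us per block read from the trace. Under the assumed clock, 155 us of running time for 64 vector copies would mean over 2400 cycles per copy.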

hunhoffe (Maintainer) · 2026-01-08T20:17:29Z

Can you please include the test code used to generate these results?
