I'm analyzing the NPU inference behavior for a Qwen2-0.5B model (24 layers) and have observed a fascinating compilation and runtime pattern.
The model is compiled into 6 separate IRs (3 for Prefill, 3 for Decode). During Prefill, each of the 3 IRs runs once. However, during the Decode phase (per token), I observe a 1 + 22 + 1 execution pattern: the first IR runs once, the second IR runs 22 times, and the third IR runs once.
This 1 + 22 + 1 pattern exactly matches the 24 layers, suggesting an automatic chunking strategy (e.g., Layer 0, Layers 1-22, Layer 23), with the middle IR re-executed once per layer. My mental model of the decode-phase behavior is sketched below.
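For concreteness, here is a minimal, runnable Python sketch of what I *believe* happens per decoded token. The `ir0`/`ir1`/`ir2` callables and `decode_one_token` are my own placeholders (simple invocation counters), not the actual runtime API:

```python
# Hypothetical sketch of the observed 1 + 22 + 1 decode pattern.
# Each "IR" is a stand-in callable that just counts invocations.

NUM_LAYERS = 24
call_counts = {"ir0": 0, "ir1": 0, "ir2": 0}

def ir0(hidden):
    # First IR: presumably embedding + Layer 0 (runs once per token)
    call_counts["ir0"] += 1
    return hidden

def ir1(hidden, layer):
    # Second IR: one generic middle-layer graph, re-bound with the
    # weights of layers 1..22 on each invocation (runs 22 times per token)
    call_counts["ir1"] += 1
    return hidden

def ir2(hidden):
    # Third IR: presumably Layer 23 + final norm + lm_head (runs once per token)
    call_counts["ir2"] += 1
    return hidden

def decode_one_token(hidden):
    hidden = ir0(hidden)                    # 1 invocation
    for layer in range(1, NUM_LAYERS - 1):  # layers 1..22
        hidden = ir1(hidden, layer)         # 22 invocations
    return ir2(hidden)                      # 1 invocation

decode_one_token(hidden=None)
print(call_counts)  # {'ir0': 1, 'ir1': 22, 'ir2': 1}
```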
Questions:
1. What is the reason for this 3-part split? Is the chunking an automatic optimization to reduce NPU compilation time and memory consumption, or a necessary workaround for dynamic-shape handling of the KV cache on the NPU?
2. Why does the compilation strategy produce a single fused IR for the middle layers in Prefill (run once per prompt), but a re-used single-layer IR for Decode (run 22 times per token)?
Thank you!