[Question] Understanding NPU Compilation: Model Chunking, IR Re-use, and Dynamic Shapes #2913

@Kepontry

Description

I'm analyzing the NPU inference behavior for a Qwen2-0.5B model (24 layers) and have observed a fascinating compilation and runtime pattern.

The model is compiled into 6 separate IRs (3 for Prefill, 3 for Decode). During Prefill, each of the 3 IRs runs once. However, during the Decode phase (per token), I observe a 1 + 22 + 1 execution pattern: the first IR runs once, the second IR runs 22 times, and the third IR runs once.

This 1 + 22 + 1 pattern matches the 24 layers exactly, which suggests an automatic chunking strategy (e.g., Layer 0 | Layers 1-22 | Layer 23); a sketch of the hypothesized decode loop follows below.
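
To make the hypothesis concrete, here is a minimal sketch of the decode-phase execution I believe I am observing. The `FakeIR` class and `decode_one_token` function are placeholders of my own for illustration, not the actual runtime API:

```python
NUM_LAYERS = 24  # Qwen2-0.5B has 24 decoder layers

class FakeIR:
    """Stand-in for one compiled NPU IR; run() only counts invocations."""
    def __init__(self, name):
        self.name = name
        self.calls = 0

    def run(self, hidden):
        self.calls += 1
        return hidden  # pass-through; a real IR would execute one or more decoder layers

ir_first, ir_middle, ir_last = FakeIR("first"), FakeIR("middle"), FakeIR("last")

def decode_one_token(hidden):
    hidden = ir_first.run(hidden)        # hypothesized: layer 0          -> 1 call
    for _ in range(1, NUM_LAYERS - 1):   # hypothesized: layers 1..22 share one IR -> 22 calls
        hidden = ir_middle.run(hidden)
    return ir_last.run(hidden)           # hypothesized: layer 23 (+ head) -> 1 call

decode_one_token(hidden=[0.0])
print(ir_first.calls, ir_middle.calls, ir_last.calls)  # prints: 1 22 1
```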

Questions:

1. What is the reason for this 3-part split? Is the chunking an automatic optimization to reduce NPU compilation time and memory consumption, or a necessary workaround for handling the dynamic shapes of the KV cache on the NPU?
2. Why does the NPU compilation strategy produce a single fused IR for the middle layers in Prefill (which runs once), but a re-used, single-layer IR for Decode (which runs 22 times)?

Thank you!
