I'm analyzing the NPU inference behavior for a Qwen2-0.5B model (24 layers) and have observed a fascinating compilation and runtime pattern.
The model is compiled into 6 separate IRs (3 for Prefill, 3 for Decode). During Prefill, each of the 3 IRs runs once. However, during the Decode phase (per token), I observe a 1 + 22 + 1 execution pattern: the first IR runs once, the second IR runs 22 times, and the third IR runs once.
This 1 + 22 + 1 pattern exactly matches the 24 layers, suggesting an automatic chunking strategy (e.g., Layer 0, Layers 1-22, Layer 23), with the middle IR re-executed once per layer. My mental model of the decode-phase behavior is sketched below.
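For concreteness, here is a minimal, runnable Python sketch of what I *believe* happens per decoded token. The `ir0`/`ir1`/`ir2` callables and `decode_one_token` are my own placeholders (simple invocation counters), not the actual runtime API:

```python
# Hypothetical sketch of the observed 1 + 22 + 1 decode pattern.
# Each "IR" is a stand-in callable that just counts invocations.

NUM_LAYERS = 24
call_counts = {"ir0": 0, "ir1": 0, "ir2": 0}

def ir0(hidden):
    # First IR: presumably embedding + Layer 0 (runs once per token)
    call_counts["ir0"] += 1
    return hidden

def ir1(hidden, layer):
    # Second IR: one generic middle-layer graph, re-bound with the
    # weights of layers 1..22 on each invocation (runs 22 times per token)
    call_counts["ir1"] += 1
    return hidden

def ir2(hidden):
    # Third IR: presumably Layer 23 + final norm + lm_head (runs once per token)
    call_counts["ir2"] += 1
    return hidden

def decode_one_token(hidden):
    hidden = ir0(hidden)                    # 1 invocation
    for layer in range(1, NUM_LAYERS - 1):  # layers 1..22
        hidden = ir1(hidden, layer)         # 22 invocations
    return ir2(hidden)                      # 1 invocation

decode_one_token(hidden=None)
print(call_counts)  # {'ir0': 1, 'ir1': 22, 'ir2': 1}
```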
Questions:
1. What is the reason for this 3-part split? Is the chunking an automatic optimization to reduce NPU compilation time and memory consumption, or a necessary workaround for dynamic-shape handling of the KV cache on the NPU?
2. Why does the compilation strategy produce a single fused IR for the middle layers in Prefill (run once per prompt), but a re-used single-layer IR for Decode (run 22 times per token)?
Thank you!