
[NPUW] 1st token generation profiling#34147

Open
dylanneve1 wants to merge 13 commits into openvinotoolkit:master from dylanneve1:EISW-202385

Conversation

@dylanneve1
Member

Details:

  • Adds LLM-level execution profiling to LLMInferRequest for 1st-token generation analysis, enabled with OPENVINO_NPUW_PROF=1. Prefill is broken down into prepare (fill inputs, clear kvcache, ...), infer, and LM-head phases; generate is broken down into copy-kvcache, initialize-inputs, prepare, infer, and post phases. Metrics are printed in chronological execution order.
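For illustration, the OPENVINO_NPUW_PROF=1 switch could be backed by an environment-variable gate along these lines (a hypothetical standalone sketch; the actual ov::npuw::profiling_enabled() implementation in the plugin may differ — only the variable name comes from this PR):

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper: returns true only when OPENVINO_NPUW_PROF is set to "1".
// Illustrative sketch, not the real ov::npuw::profiling_enabled() code.
static bool profiling_enabled_from_env() {
    const char* v = std::getenv("OPENVINO_NPUW_PROF");
    return v != nullptr && std::strcmp(v, "1") == 0;
}
```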

Tickets:

@dylanneve1 dylanneve1 requested review from a team as code owners February 16, 2026 10:51
@github-actions github-actions bot added category: NPU OpenVINO NPU plugin category: NPUW NPUW plugin labels Feb 16, 2026
@dmatveev dmatveev added this to the 2026.1 milestone Feb 20, 2026
@dmatveev
Contributor

dmatveev commented Feb 20, 2026

Now looking at the profiling structure, what we get is:

LLM/execution:
  1/prefill:1.prepare
  1/prefill:1a.fill_inputs
  1/prefill:1b.clear_kvcache
  1/prefill:1c.select_gen_req
  1/prefill:1d.apply_lora
  1/prefill:2.infer
  1/prefill:2a.npu_infer
  1/prefill:3.lm_head
  N/generate:1.copy_kvcache
  N/generate:2.init_inputs
  N/generate:3.prepare
  N/generate:4.infer
  N/generate:5.post

Looking at the actual timings, I think

  1. prefill:1.prepare is way too detailed for the time spent - can we leave it as a single entry? Applying LoRA adapters may be the only costly thing
  2. npu_infer may be misleading as, technically, the underlying model can run on any IP. Also there's a subtle delta between prefill:2 and prefill:2a - should we see it on longer contexts?
  3. lm_head is shown in the prefill stage only, but in fact it runs in both - I think it should be possible to have two regions for the same model?
  4. do you have the kvcache update instrumented in the 2nd token? Or is it asynchronous?

@dylanneve1 dylanneve1 force-pushed the EISW-202385 branch 3 times, most recently from ca07b34 to f9de002 Compare February 26, 2026 15:20
@dylanneve1 dylanneve1 force-pushed the EISW-202385 branch 6 times, most recently from 01ea8ef to 4592f1b Compare March 10, 2026 16:23
@dylanneve1
Member Author

Thank you, I believe these should be addressed now. The current profiling format is below:

LLM/execution:
  1/prefill:1.prepare_for_new_conversation[ ... ]
  1/prefill:2.apply_lora[ ... ]
  1/prefill:3.infer   [ ... ]
  1/prefill:4.lm_head [ ... ]
  N/generate:1.prepare[ ... ]
  N/generate:2.infer  [ ... ]
  N/generate:3.update_kvcache[ ... ]
  N/generate:4.lm_head[ ... ]

The fill_in/copy_n you mentioned are now included in the top-level profiling as well: the infer measurement in infer_prefill is now a top-level one covering the whole phase, instead of only timing m_prefill_request->infer().

[image: profiling output screenshot]

@dylanneve1 dylanneve1 force-pushed the EISW-202385 branch 2 times, most recently from a8f643d to e5720d6 Compare March 12, 2026 11:38
Copilot AI left a comment

Pull request overview

Adds request-level execution profiling for NPUW LLM inference to help analyze first-token generation behavior when OPENVINO_NPUW_PROF=1 is enabled.

Changes:

  • Introduces an ov::npuw::perf::Profile member on LLMInferRequest for LLM-level timing metrics.
  • Instruments infer_prefill() and infer_generate() with phase-level timing buckets (prepare/apply_lora/infer/lm_head, etc.).
  • Configures profiling output area/tagging and enables reporting based on ov::npuw::profiling_enabled().
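For readers unfamiliar with the record() pattern visible in the diff excerpts, a minimal standalone sketch of such a tag-to-timing profile might look like this (illustrative only; the real ov::npuw::perf::Profile API in perf.hpp may differ — the tag naming scheme is the one used in this PR):

```cpp
#include <chrono>
#include <cstddef>
#include <iostream>
#include <map>
#include <string>
#include <utility>

// Minimal sketch of a tag -> accumulated-time profile, loosely modeled on
// the m_llm_profile["..."].record([&]{ ... }) pattern from the diffs.
// Names and layout are illustrative, not the actual ov::npuw::perf::Profile.
class Profile {
public:
    struct Metric {
        std::chrono::microseconds total{0};
        std::size_t count = 0;

        // Time the callable and accumulate the elapsed duration.
        template <typename F>
        void record(F&& f) {
            const auto start = std::chrono::steady_clock::now();
            std::forward<F>(f)();
            const auto end = std::chrono::steady_clock::now();
            total += std::chrono::duration_cast<std::chrono::microseconds>(end - start);
            ++count;
        }
    };

    Metric& operator[](const std::string& tag) { return m_metrics[tag]; }

    // Print metrics in key order; tags like "1/prefill:1.prepare" are
    // prefixed so that lexicographic order matches execution order.
    void report(std::ostream& os) const {
        for (const auto& [tag, m] : m_metrics) {
            os << tag << ": " << m.total.count() << " us over " << m.count << " call(s)\n";
        }
    }

private:
    std::map<std::string, Metric> m_metrics;
};
```

Keying the std::map by tags such as "1/prefill:1.prepare" and "N/generate:2.infer" is what lets a plain ordered traversal print the phases in chronological execution order, as the PR description states.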

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Changed files:

  • src/plugins/intel_npu/src/plugin/npuw/llm_infer_request.hpp: Adds perf.hpp include and a Profile member to store LLM-level timing metrics.
  • src/plugins/intel_npu/src/plugin/npuw/llm_infer_request.cpp: Enables the profile in the ctor and wraps key prefill/generate phases with profiling timers and tagged metrics.


Comment on lines +829 to +833
process_longrope(m_prefill_request, m_prefill_in_ports, position_ids);

m_llm_profile["1/prefill:1.prepare_for_new_conversation"].record([&]() {
prepare_for_new_conversation(prompt_length);
});

Order may be important here

@dylanneve1 dylanneve1 force-pushed the EISW-202385 branch 2 times, most recently from 3370c5a to ccb484c Compare March 13, 2026 09:48
@dylanneve1 dylanneve1 force-pushed the EISW-202385 branch 2 times, most recently from 8c9ad7b to 2c4bb15 Compare March 13, 2026 09:52
Cherry-pick llm_infer_request.cpp changes from a8f643d to add
sub-step profiling for chunked/whole prefill and generate lm_head.
Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.



Comment on lines 902 to +905
// Note: m_kvcache_request, m_kvcache_in_ports, and m_kvcache_out_ports are selected in
// prepare_for_new_conversation()
m_llm_profile["N/generate:1.prepare"].record([&]() {
if (!m_generate_initialized) {