
[NPUW] 1st token generation profiling#34147

Open
dylanneve1 wants to merge 13 commits into openvinotoolkit:master from dylanneve1:EISW-202385

Conversation

@dylanneve1
Member

Details:

  • Adds LLM-level execution profiling to LLMInferRequest for 1st-token generation analysis, enabled with OPENVINO_NPUW_PROF=1. Prefill is broken down into prepare (fill inputs, clear kvcache, ...), infer, and LM-head phases; generate is broken down into copy-kvcache, initialize-inputs, prepare, infer, and post phases. Metrics are printed in chronological execution order.
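For illustration, the OPENVINO_NPUW_PROF=1 switch could be backed by an environment-variable gate along these lines (a hypothetical standalone sketch; the actual ov::npuw::profiling_enabled() implementation in the plugin may differ — only the variable name comes from this PR):

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper: returns true only when OPENVINO_NPUW_PROF is set to "1".
// Illustrative sketch, not the real ov::npuw::profiling_enabled() code.
static bool profiling_enabled_from_env() {
    const char* v = std::getenv("OPENVINO_NPUW_PROF");
    return v != nullptr && std::strcmp(v, "1") == 0;
}
```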

Tickets:

@dylanneve1 dylanneve1 requested review from a team as code owners February 16, 2026 10:51
@github-actions github-actions bot added category: NPU OpenVINO NPU plugin category: NPUW NPUW plugin labels Feb 16, 2026
@dmatveev dmatveev added this to the 2026.1 milestone Feb 20, 2026
@dmatveev
Contributor

dmatveev commented Feb 20, 2026

Now looking at the profiling structure, what we get is:

LLM/execution:
  1/prefill:1.prepare
  1/prefill:1a.fill_inputs
  1/prefill:1b.clear_kvcache
  1/prefill:1c.select_gen_req
  1/prefill:1d.apply_lora
  1/prefill:2.infer
  1/prefill:2a.npu_infer
  1/prefill:3.lm_head
  N/generate:1.copy_kvcache
  N/generate:2.init_inputs
  N/generate:3.prepare
  N/generate:4.infer
  N/generate:5.post

Looking at the actual timings, I think

  1. prefill:1.prepare is way too detailed for the time spent - can we leave it as a single entry? Applying LoRA adapters may be the only costly thing
  2. npu_infer may be misleading as, technically, the underlying model can run on any IP. Also there's a subtle delta between prefill:2 and prefill:2a - should we see it on longer contexts?
  3. lm_head is shown in the prefill stage only, but in fact it runs in both - I think it should be possible to have two regions for the same model?
  4. do you have the kvcache update instrumented in the 2nd token? Or is it asynchronous?

@dylanneve1 dylanneve1 force-pushed the EISW-202385 branch 3 times, most recently from ca07b34 to f9de002 Compare February 26, 2026 15:20
@dylanneve1 dylanneve1 force-pushed the EISW-202385 branch 6 times, most recently from 01ea8ef to 4592f1b Compare March 10, 2026 16:23
@dylanneve1
Member Author

Thank you, I believe these should be addressed now. The current profiling format is below:

LLM/execution:
  1/prefill:1.prepare_for_new_conversation[ ... ]
  1/prefill:2.apply_lora[ ... ]
  1/prefill:3.infer   [ ... ]
  1/prefill:4.lm_head [ ... ]
  N/generate:1.prepare[ ... ]
  N/generate:2.infer  [ ... ]
  N/generate:3.update_kvcache[ ... ]
  N/generate:4.lm_head[ ... ]

The fill_in/copy_n you mentioned are now included in the top-level profiling as well: the infer measurement in infer_prefill is now a top-level one covering the whole phase, instead of only timing m_prefill_request->infer().

[image: profiling output screenshot]

@dylanneve1 dylanneve1 force-pushed the EISW-202385 branch 2 times, most recently from a8f643d to e5720d6 Compare March 12, 2026 11:38
Copilot AI left a comment

Pull request overview

Adds request-level execution profiling for NPUW LLM inference to help analyze first-token generation behavior when OPENVINO_NPUW_PROF=1 is enabled.

Changes:

  • Introduces an ov::npuw::perf::Profile member on LLMInferRequest for LLM-level timing metrics.
  • Instruments infer_prefill() and infer_generate() with phase-level timing buckets (prepare/apply_lora/infer/lm_head, etc.).
  • Configures profiling output area/tagging and enables reporting based on ov::npuw::profiling_enabled().
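For readers unfamiliar with the record() pattern visible in the diff excerpts, a minimal standalone sketch of such a tag-to-timing profile might look like this (illustrative only; the real ov::npuw::perf::Profile API in perf.hpp may differ — the tag naming scheme is the one used in this PR):

```cpp
#include <chrono>
#include <cstddef>
#include <iostream>
#include <map>
#include <string>
#include <utility>

// Minimal sketch of a tag -> accumulated-time profile, loosely modeled on
// the m_llm_profile["..."].record([&]{ ... }) pattern from the diffs.
// Names and layout are illustrative, not the actual ov::npuw::perf::Profile.
class Profile {
public:
    struct Metric {
        std::chrono::microseconds total{0};
        std::size_t count = 0;

        // Time the callable and accumulate the elapsed duration.
        template <typename F>
        void record(F&& f) {
            const auto start = std::chrono::steady_clock::now();
            std::forward<F>(f)();
            const auto end = std::chrono::steady_clock::now();
            total += std::chrono::duration_cast<std::chrono::microseconds>(end - start);
            ++count;
        }
    };

    Metric& operator[](const std::string& tag) { return m_metrics[tag]; }

    // Print metrics in key order; tags like "1/prefill:1.prepare" are
    // prefixed so that lexicographic order matches execution order.
    void report(std::ostream& os) const {
        for (const auto& [tag, m] : m_metrics) {
            os << tag << ": " << m.total.count() << " us over " << m.count << " call(s)\n";
        }
    }

private:
    std::map<std::string, Metric> m_metrics;
};
```

Keying the std::map by tags such as "1/prefill:1.prepare" and "N/generate:2.infer" is what lets a plain ordered traversal print the phases in chronological execution order, as the PR description states.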

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Changed files:

  • src/plugins/intel_npu/src/plugin/npuw/llm_infer_request.hpp: Adds perf.hpp include and a Profile member to store LLM-level timing metrics.
  • src/plugins/intel_npu/src/plugin/npuw/llm_infer_request.cpp: Enables the profile in the ctor and wraps key prefill/generate phases with profiling timers and tagged metrics.


Comment on lines +829 to +833
process_longrope(m_prefill_request, m_prefill_in_ports, position_ids);

m_llm_profile["1/prefill:1.prepare_for_new_conversation"].record([&]() {
prepare_for_new_conversation(prompt_length);
});

Order may be important here

@dylanneve1 dylanneve1 force-pushed the EISW-202385 branch 2 times, most recently from 3370c5a to ccb484c Compare March 13, 2026 09:48
@dylanneve1 dylanneve1 force-pushed the EISW-202385 branch 2 times, most recently from 8c9ad7b to 2c4bb15 Compare March 13, 2026 09:52
Cherry-pick llm_infer_request.cpp changes from a8f643d to add
sub-step profiling for chunked/whole prefill and generate lm_head.
Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.



Comment on lines 902 to +905
// Note: m_kvcache_request, m_kvcache_in_ports, and m_kvcache_out_ports are selected in
// prepare_for_new_conversation()
m_llm_profile["N/generate:1.prepare"].record([&]() {
if (!m_generate_initialized) {