[NPUW] 1st token generation profiling #34147
Open
dylanneve1 wants to merge 13 commits into openvinotoolkit:master from
Conversation
Force-pushed from f28539a to 4d5bedd
dmatveev
reviewed
Feb 20, 2026
Contributor
Now looking at the profiling structure, what we get is: Looking at the actual timings, I think
Force-pushed from ca07b34 to f9de002
Force-pushed from 01ea8ef to 4592f1b
Force-pushed from a8f643d to e5720d6
dmatveev
reviewed
Mar 12, 2026
Contributor
Pull request overview
Adds request-level execution profiling for NPUW LLM inference to help analyze first-token generation behavior when `OPENVINO_NPUW_PROF=1` is enabled.
Changes:
- Introduces an `ov::npuw::perf::Profile` member on `LLMInferRequest` for LLM-level timing metrics.
- Instruments `infer_prefill()` and `infer_generate()` with phase-level timing buckets (prepare/apply_lora/infer/lm_head, etc.).
- Configures profiling output area/tagging and enables reporting based on `ov::npuw::profiling_enabled()`.
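As the description notes, the profiling report is gated on the `OPENVINO_NPUW_PROF` environment variable. A minimal shell sketch; the application binary name is a placeholder, not part of this PR:

```shell
# Enable the NPUW profiling report (env var taken from the PR description).
export OPENVINO_NPUW_PROF=1
echo "OPENVINO_NPUW_PROF=$OPENVINO_NPUW_PROF"
# ./your_llm_app   # placeholder: run the instrumented application here
```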
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `src/plugins/intel_npu/src/plugin/npuw/llm_infer_request.hpp` | Adds `perf.hpp` include and a `Profile` member to store LLM-level timing metrics. |
| `src/plugins/intel_npu/src/plugin/npuw/llm_infer_request.cpp` | Enables the profile in the ctor and wraps key prefill/generate phases with profiling timers and tagged metrics. |
dmatveev
reviewed
Mar 12, 2026
Comment on lines +829 to +833

```cpp
process_longrope(m_prefill_request, m_prefill_in_ports, position_ids);

m_llm_profile["1/prefill:1.prepare_for_new_conversation"].record([&]() {
    prepare_for_new_conversation(prompt_length);
});
```
Contributor
Order may be important here
Force-pushed from 3370c5a to ccb484c
Force-pushed from 8c9ad7b to 2c4bb15
Cherry-pick llm_infer_request.cpp changes from a8f643d to add sub-step profiling for chunked/whole prefill and generate lm_head.
Contributor
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
Comment on lines 902 to +905

```cpp
// Note: m_kvcache_request, m_kvcache_in_ports, and m_kvcache_out_ports are selected in
// prepare_for_new_conversation()
m_llm_profile["N/generate:1.prepare"].record([&]() {
    if (!m_generate_initialized) {
```
Details:
Tickets: