[GPU][NPU] Resolve excessive memory usage in Eagle3 #3394
EfimovIlia merged 4 commits into openvinotoolkit:master
Conversation
Pull request overview
This PR changes how Eagle3 speculative decoding handles embedding weights between the main and draft OpenVINO models, aiming to avoid problematic cross-model references that can contribute to excessive memory usage on GPU/NPU.
Changes:
- Replaces direct cross-model output replacement with a recursive clone of the embedding-weight subgraph from the main model into the draft model.
- Updates logging to reflect “copying” rather than “sharing” embedding weights.
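The recursive clone described above can be sketched in plain C++ (illustrative only — `Node`, `clone_subgraph`, and the memo map are stand-ins, not the actual OpenVINO types or the PR's implementation). The key detail is the memo map: a node reachable through several paths (a diamond in the DAG) is cloned exactly once, so the cloned subgraph keeps the same sharing structure as the original instead of duplicating shared sub-nodes.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Minimal stand-in for a model node: a name plus input edges.
struct Node {
    std::string name;
    std::vector<std::shared_ptr<Node>> inputs;
};

// Recursively deep-copy the subgraph rooted at `node`. The memo map ensures
// each original node is cloned exactly once, so a node shared by several
// consumers stays shared inside the clone rather than being duplicated.
std::shared_ptr<Node> clone_subgraph(
    const std::shared_ptr<Node>& node,
    std::unordered_map<Node*, std::shared_ptr<Node>>& memo) {
    auto it = memo.find(node.get());
    if (it != memo.end())
        return it->second;
    auto cloned = std::make_shared<Node>();
    cloned->name = node->name + "_cloned_for_draft";
    memo[node.get()] = cloned;  // record before recursing (handles sharing)
    for (const auto& in : node->inputs)
        cloned->inputs.push_back(clone_subgraph(in, memo));
    return cloned;
}
```

The clone holds no pointers into the original graph, which is the property the PR relies on to break cross-model references.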
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
src/cpp/src/speculative_decoding/eagle3_model_transforms.cpp:149
- The suffix '_cloned_for_draft' is hardcoded. If the friendly name is already suffixed (e.g., during multiple cloning operations), this could create names like 'name_cloned_for_draft_cloned_for_draft'. Consider checking if the suffix already exists or using a more unique naming scheme.
cloned->set_friendly_name(node->get_friendly_name() + "_cloned_for_draft");
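One way to address this concern is an idempotent suffix helper — a hedged sketch, not part of the PR; `append_suffix_once` is a hypothetical name:

```cpp
#include <cassert>
#include <string>

// Append `suffix` only if `name` does not already end with it, preventing
// repeated cloning from producing "name_cloned_for_draft_cloned_for_draft".
std::string append_suffix_once(const std::string& name, const std::string& suffix) {
    if (name.size() >= suffix.size() &&
        name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0)
        return name;
    return name + suffix;
}
```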
// Replace draft model's weight node with the cloned subgraph
// This avoids cross-model references by duplicating the vocabulary weights
draft_weight_node->output(0).replace(cloned_weight_node->output(0));
}
Self note: Why no test added?
Below are the system memory changes with/without this PR (Model: Qwen3-8B int4 + Eagle3 int8)
| Device | Before | After |
|---|---|---|
| GPU | 11.14 GB | 7.03 GB |
| NPU | 14.81 GB | 11.21 GB |
| CPU | 9.10 GB | 10.21 GB |
This issue only manifests on GPU and NPU — CPU behavior looks correct. Since the CI environment appears to have only CPU devices (no GPU/NPU available), I don't have a good way to add a test for this.
@songbell @peterchen-intel Could you please review this fix first, as it also affects the GPU pipeline?
if (auto constant = ov::as_type_ptr<ov::op::v0::Constant>(node)) {
    // For Constant nodes, create a deep copy with new data
    cloned = std::make_shared<ov::op::v0::Constant>(constant->get_element_type(),
Maybe we need to understand why the extra ~4 GB on GPU and ~3 GB on NPU is consumed or not released?
For the shared output, the connection to the shared node should be cut when the ov::Model is released. And for the Constant node, if we use the shared data instead of a deep copy, is the issue still there?
> maybe we need to understand why the extra 4G on GPU and ~3G on NPU is consumed or not released?

Based on debug information, I have confirmed that the ov::Model is not being fully released. This suggests that sharing nodes across different models may create reference dependencies that prevent the reference count from reaching zero. The root cause needs more investigation in OpenVINO.

> if we use the shared data instead of deep copy, the issue is still there?

Yes, the issue persists even when using shared data.
@GuoliangShiIntel the extra memory on GPU/NPU should be due to the mmap of the model. let's merge the fix first.
Description
Quick background:
In the Eagle3 speculative decoding pipeline, the draft model's embedding weights are currently derived from the target model.
The share_vocabulary function does this by sharing embedding nodes directly between both models, which interferes with the shared-memory optimization and prevents the ov::Model from being properly released after compilation.
Fix:
Embedding weights are now cloned into the draft model instead of shared, allowing each model's memory to be released independently and reducing memory consumption by ~4 GB for an 8B (4-bit quantized) model.
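The ownership change behind the fix can be illustrated with a minimal sketch — plain C++ `shared_ptr` ownership, not the actual OpenVINO types; `ToyModel` is a hypothetical stand-in. When two "models" share a buffer, the buffer stays alive until both are destroyed; with a deep copy, each model's memory is released independently:

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Toy "model" that owns its weight buffers via shared_ptr, mimicking how a
// node shared between two ov::Model instances keeps the underlying storage
// alive until *both* models are destroyed.
struct ToyModel {
    std::vector<std::shared_ptr<std::vector<float>>> weights;
};
```

In the shared case, releasing the main model does not free the buffer because the draft model still references it; cloning removes that coupling.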
Tickets: CVS-181133
Checklist: