
[GPU][NPU] Resolve excessive memory usage in Eagle3#3394

Merged
EfimovIlia merged 4 commits into openvinotoolkit:master from GuoliangShiIntel:sgl/fix_eagle_3_memory_issue
Mar 5, 2026

Conversation

@GuoliangShiIntel
Contributor

@GuoliangShiIntel GuoliangShiIntel commented Feb 26, 2026

Description

Quick background:
In the Eagle3 speculative decoding pipeline, the draft model's embedding weights are currently derived from the target model.

The share_vocabulary function does this by sharing embedding nodes directly between the two models, which interferes with the shared-memory optimization and prevents the ov::Model from being properly released after compilation.

Fix:
Embedding weights are now cloned into the draft model instead of shared, allowing each model's memory to be released independently and reducing memory consumption by ~4 GB for an 8B (4-bit quantized) model.

Tickets: CVS-181133

Checklist:

  • This PR follows GenAI Contributing guidelines.
  • Tests have been updated or added to cover the new code.
  • This PR fully addresses the ticket.
  • I have made corresponding changes to the documentation.

Copilot AI review requested due to automatic review settings February 26, 2026 05:58
@github-actions github-actions bot added the category: speculative decoding Speculative decoding label Feb 26, 2026
@GuoliangShiIntel GuoliangShiIntel force-pushed the sgl/fix_eagle_3_memory_issue branch from 1a5ad6c to f6661fd Compare February 26, 2026 06:00
Contributor

Copilot AI left a comment


Pull request overview

This PR changes how Eagle3 speculative decoding handles embedding weights between the main and draft OpenVINO models, aiming to avoid problematic cross-model references that can contribute to excessive memory usage on GPU/NPU.

Changes:

  • Replaces direct cross-model output replacement with a recursive clone of the embedding-weight subgraph from the main model into the draft model.
  • Updates logging to reflect “copying” rather than “sharing” embedding weights.

@GuoliangShiIntel GuoliangShiIntel force-pushed the sgl/fix_eagle_3_memory_issue branch from f6661fd to a4d120e Compare February 26, 2026 07:53
Copilot AI review requested due to automatic review settings February 26, 2026 08:19
@GuoliangShiIntel GuoliangShiIntel force-pushed the sgl/fix_eagle_3_memory_issue branch 2 times, most recently from 1be6f7d to 702bb91 Compare February 26, 2026 08:21
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

src/cpp/src/speculative_decoding/eagle3_model_transforms.cpp:149

  • The suffix '_cloned_for_draft' is hardcoded. If the friendly name is already suffixed (e.g., during multiple cloning operations), this could create names like 'name_cloned_for_draft_cloned_for_draft'. Consider checking if the suffix already exists or using a more unique naming scheme.
        cloned->set_friendly_name(node->get_friendly_name() + "_cloned_for_draft");

@GuoliangShiIntel GuoliangShiIntel marked this pull request as ready for review February 26, 2026 08:31
// Replace draft model's weight node with the cloned subgraph
// This avoids cross-model references by duplicating the vocabulary weights
draft_weight_node->output(0).replace(cloned_weight_node->output(0));
}
Contributor Author


Self note: Why no test added?

Below are the system memory changes with/without this PR (Model: Qwen3-8B int4 + Eagle3 int8)

| Device | Before | After |
| ------ | -------- | -------- |
| GPU | 11.14 GB | 7.03 GB |
| NPU | 14.81 GB | 11.21 GB |
| CPU | 9.10 GB | 10.21 GB |

This issue only manifests on GPU and NPU; CPU behavior looks correct. Since the CI environment appears to have only CPU devices (no GPU/NPU available), I do not have a good way to add a regression test for this.

@GuoliangShiIntel GuoliangShiIntel force-pushed the sgl/fix_eagle_3_memory_issue branch from 702bb91 to 7204ebb Compare February 27, 2026 02:51
@GuoliangShiIntel
Contributor Author

@songbell @peterchen-intel Could you please review this fix first, as it also affects the GPU pipeline?


if (auto constant = ov::as_type_ptr<ov::op::v0::Constant>(node)) {
// For Constant nodes, create a deep copy with new data
cloned = std::make_shared<ov::op::v0::Constant>(constant->get_element_type(),
Contributor


Maybe we need to understand why the extra ~4 GB on GPU and ~3 GB on NPU is consumed or not released?
For the shared output, the connection to the shared node should be cut when the ov::Model is released. And for the Constant node, if we use the shared data instead of a deep copy, is the issue still there?

Contributor Author


> Maybe we need to understand why the extra ~4 GB on GPU and ~3 GB on NPU is consumed or not released?

Based on debug information, I have confirmed that the ov::Model is not being fully released. This suggests that sharing nodes across different models may create reference dependencies that prevent the reference count from reaching zero. The root cause needs further investigation in OpenVINO.

> If we use the shared data instead of a deep copy, is the issue still there?

Yes, the issue persists even when using shared data.

Contributor


@GuoliangShiIntel the extra memory on GPU/NPU should be due to the mmap of the model. let's merge the fix first.



@EfimovIlia EfimovIlia added this pull request to the merge queue Mar 5, 2026
Merged via the queue into openvinotoolkit:master with commit fe495d0 Mar 5, 2026
118 checks passed
