[GPU][NPU] Resolve excessive memory usage in Eagle3 #3394
EfimovIlia merged 4 commits into openvinotoolkit:master
Conversation
Pull request overview
This PR changes how Eagle3 speculative decoding handles embedding weights between the main and draft OpenVINO models, aiming to avoid problematic cross-model references that can contribute to excessive memory usage on GPU/NPU.
Changes:
- Replaces direct cross-model output replacement with a recursive clone of the embedding-weight subgraph from the main model into the draft model.
- Updates logging to reflect “copying” rather than “sharing” embedding weights.
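The recursive clone described above can be sketched in plain C++ (illustrative only — `Node`, `clone_subgraph`, and the memo map are stand-ins, not the actual OpenVINO types or the PR's implementation). The key detail is the memo map: a node reachable through several paths (a diamond in the DAG) is cloned exactly once, so the cloned subgraph keeps the same sharing structure as the original instead of duplicating shared sub-nodes.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

// Minimal stand-in for a model node: a name plus input edges.
struct Node {
    std::string name;
    std::vector<std::shared_ptr<Node>> inputs;
};

// Recursively deep-copy the subgraph rooted at `node`. The memo map ensures
// each original node is cloned exactly once, so a node shared by several
// consumers stays shared inside the clone rather than being duplicated.
std::shared_ptr<Node> clone_subgraph(
    const std::shared_ptr<Node>& node,
    std::unordered_map<Node*, std::shared_ptr<Node>>& memo) {
    auto it = memo.find(node.get());
    if (it != memo.end())
        return it->second;
    auto cloned = std::make_shared<Node>();
    cloned->name = node->name + "_cloned_for_draft";
    memo[node.get()] = cloned;  // record before recursing (handles sharing)
    for (const auto& in : node->inputs)
        cloned->inputs.push_back(clone_subgraph(in, memo));
    return cloned;
}
```

The clone holds no pointers into the original graph, which is the property the PR relies on to break cross-model references.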
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
src/cpp/src/speculative_decoding/eagle3_model_transforms.cpp:149
- The suffix '_cloned_for_draft' is hardcoded. If the friendly name is already suffixed (e.g., during multiple cloning operations), this could create names like 'name_cloned_for_draft_cloned_for_draft'. Consider checking if the suffix already exists or using a more unique naming scheme.
cloned->set_friendly_name(node->get_friendly_name() + "_cloned_for_draft");
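One way to address this concern is an idempotent suffix helper — a hedged sketch, not part of the PR; `append_suffix_once` is a hypothetical name:

```cpp
#include <cassert>
#include <string>

// Append `suffix` only if `name` does not already end with it, preventing
// repeated cloning from producing "name_cloned_for_draft_cloned_for_draft".
std::string append_suffix_once(const std::string& name, const std::string& suffix) {
    if (name.size() >= suffix.size() &&
        name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0)
        return name;
    return name + suffix;
}
```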
// Replace draft model's weight node with the cloned subgraph
// This avoids cross-model references by duplicating the vocabulary weights
draft_weight_node->output(0).replace(cloned_weight_node->output(0));
}
Self note: Why no test added?
Below are the system memory changes with/without this PR (Model: Qwen3-8B int4 + Eagle3 int8)
| Device | Before | After |
|---|---|---|
| GPU | 11.14 GB | 7.03 GB |
| NPU | 14.81 GB | 11.21 GB |
| CPU | 9.10 GB | 10.21 GB |
This issue only manifests on GPU and NPU — CPU behavior looks correct. Since the CI environment appears to have only CPU devices (no GPU/NPU available), I don't have a good way to add a test for this.
@songbell @peterchen-intel Could you please review this fix first, as it also affects the GPU pipeline?
if (auto constant = ov::as_type_ptr<ov::op::v0::Constant>(node)) {
    // For Constant nodes, create a deep copy with new data
    cloned = std::make_shared<ov::op::v0::Constant>(constant->get_element_type(),
Maybe we need to understand why the extra ~4 GB on GPU and ~3 GB on NPU is consumed or not released?
For the shared output, the connection to the shared node should be cut when the ov::Model is released. And for the Constant node, if we use the shared data instead of a deep copy, is the issue still there?
> maybe we need to understand why the extra 4G on GPU and ~3G on NPU is consumed or not released?

Based on debug information, I have confirmed that the ov::Model is not being fully released. This suggests that sharing nodes across different models may create reference dependencies that prevent the reference count from reaching zero. The root cause needs more investigation in OpenVINO.

> if we use the shared data instead of deep copy, the issue is still there?

Yes, the issue persists even when using shared data.
@GuoliangShiIntel the extra memory on GPU/NPU should be due to the mmap of the model. let's merge the fix first.
Description
Quick background:
In the Eagle3 speculative decoding pipeline, the draft model's embedding weights are currently derived from the target model.
The share_vocabulary function does this by sharing embedding nodes directly between both models, which interferes with the shared-memory optimization and prevents the ov::Model from being properly released after compilation.
Fix:
Embedding weights are now cloned into the draft model instead of shared, allowing each model's memory to be released independently and reducing memory consumption by ~4 GB for an 8B (4-bit quantized) model.
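The ownership change behind the fix can be illustrated with a minimal sketch — plain C++ `shared_ptr` ownership, not the actual OpenVINO types; `ToyModel` is a hypothetical stand-in. When two "models" share a buffer, the buffer stays alive until both are destroyed; with a deep copy, each model's memory is released independently:

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Toy "model" that owns its weight buffers via shared_ptr, mimicking how a
// node shared between two ov::Model instances keeps the underlying storage
// alive until *both* models are destroyed.
struct ToyModel {
    std::vector<std::shared_ptr<std::vector<float>>> weights;
};
```

In the shared case, releasing the main model does not free the buffer because the draft model still references it; cloning removes that coupling.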
Tickets: CVS-181133
Checklist: