[OV] Fix model inference not being affected by static quantization #1461
Conversation
```python
@property
def ov_submodels(self) -> Dict[str, openvino.Model]:
    return {submodel_name: getattr(self, submodel_name) for submodel_name in self._ov_submodel_names}
```
Breaking change?
Are you sure we need this refactoring for the fix?

Sure, I can create a fix for the VLM case
```diff
 def forward(self, input_ids, **kwargs):
-    self._compile()
+    self.compile()
```
Now the `compile()` method becomes a sort of external method, but do we really need it? I am not sure the user API of optimum-intel should expose it. Compilation happens under the hood and belongs to the dev API only. So the `_` is still needed.
As of now, for some classes this method is named `compile()` and for some `_compile()`. I've decided to align it by making all of them called `compile()`. According to the documentation, `compile()` is part of our public API: https://huggingface.co/docs/optimum/en/intel/openvino/inference#compilation
@nikita-savelyevv, could you please share a snippet of code that demonstrates what we can't do with the current API? Thanks
The purpose of this PR is to properly fix the scenario below. A hot-fix was provided in #1464.

```python
from optimum.intel import OVModelForVisualCausalLM, OVQuantizationConfig

q_model = OVModelForVisualCausalLM.from_pretrained(
    "HuggingFaceTB/SmolVLM2-256M-Video-Instruct",
    quantization_config=OVQuantizationConfig(bits=8, dataset="contextual", num_samples=50),
)

# If called here, q_model.generate() will run the original non-quantized models
```

The main thing added here that makes it work is
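A plain-Python toy of the failure mode (not the actual optimum-intel classes; all names here are illustrative): the parent pipeline swaps its own reference to the quantized model, but the wrapper component keeps a separate reference to the original model, so inference still runs the non-quantized one:

```python
class VisionEmbedding:
    """Toy stand-in for a component that holds its own model reference."""
    def __init__(self, model):
        self.model = model

    def infer(self):
        return self.model


class VLMPipeline:
    """Toy stand-in for the parent model class."""
    def __init__(self, vision_model):
        self.vision_model = vision_model
        self.vision_embeddings = VisionEmbedding(vision_model)

    def quantize(self, quantized_model):
        # Bug: the model is replaced at the pipeline level only...
        self.vision_model = quantized_model
        # ...but self.vision_embeddings.model still points at the original.


original, quantized = object(), object()
pipeline = VLMPipeline(original)
pipeline.quantize(quantized)

print(pipeline.vision_model is quantized)               # True
print(pipeline.vision_embeddings.infer() is quantized)  # False: stale reference
```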
Great work, thanks a lot @nikita-savelyevv, left a couple of comments.
```python
    return self.ov_models

@property
def _ov_model_names(self) -> List[str]:
```
From my understanding, `_ov_model_names` will differ from `_component_names` for encoder-only or decoder-only models? I'm wondering if we should modify both so that they have the same behavior as other models (having only one component), so that we can always iterate over the model's "components" to compile / clear requests / set device. This would also remove the distinction between `_ov_model_names` and `_component_names`. wdyt @nikita-savelyevv?
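A minimal sketch of what such a unified traversal could look like (illustrative names, not the actual optimum-intel implementation): every host exposes a `components` mapping, defaulting to empty, so operations like compilation can recurse identically whether the model has zero or many components:

```python
class Host:
    """Toy host base class: uniform recursion over `components`."""

    @property
    def components(self):
        return {}  # leaf classes have no components

    def compile(self):
        self._do_compile()
        # Same traversal works for single- and multi-component models
        for component in self.components.values():
            component.compile()

    def _do_compile(self):
        pass  # subclasses compile their own openvino.Model here


class Leaf(Host):
    def __init__(self):
        self.compiled = False

    def _do_compile(self):
        self.compiled = True


class Parent(Host):
    def __init__(self):
        self.encoder = Leaf()
        self.decoder = Leaf()

    @property
    def components(self):
        return {"encoder": self.encoder, "decoder": self.decoder}


m = Parent()
m.compile()
print(m.encoder.compiled and m.decoder.compiled)  # True
```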
```diff
-class OVEncoder:
+class OVEncoder(OVModelHostMixin):
```
Why is this needed? (My understanding is that `OVModelHostMixin` was providing functionality for OVModels, not for the components themselves.)
My idea was that every entity that has an `openvino.Model` instance inside of it should derive from `OVModelHostMixin`. This includes components such as `OVEncoder` and `OVDecoder` too. In terms of this particular PR, this is needed for `OVModelForSeq2SeqLM` to call `replace_ov_model()` on its components, which happen to be `OVEncoder` and `OVDecoder`.
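A toy sketch of the recursive replacement this refers to (illustrative names and structure, not the actual optimum-intel code): models are matched by Python object identity, and the call recurses into component hosts so a quantized model replaces the original at every level:

```python
class ModelHost:
    """Toy mixin: replace a held model by object identity, recursively."""

    @property
    def ov_models(self):
        return {}  # attribute name -> model object

    @property
    def components(self):
        return {}  # nested hosts

    def replace_ov_model(self, current_model, new_model):
        for name, model in self.ov_models.items():
            if model is current_model:  # match by object identity (id)
                setattr(self, name, new_model)
        for component in self.components.values():
            component.replace_ov_model(current_model, new_model)


class Encoder(ModelHost):
    def __init__(self, model):
        self.model = model

    @property
    def ov_models(self):
        return {"model": self.model}


class Seq2Seq(ModelHost):
    def __init__(self, enc_model, dec_model):
        self.encoder = Encoder(enc_model)
        self.decoder = Encoder(dec_model)
        self.encoder_model = enc_model  # parent-level reference to the same object

    @property
    def ov_models(self):
        return {"encoder_model": self.encoder_model}

    @property
    def components(self):
        return {"encoder": self.encoder, "decoder": self.decoder}


old, new = object(), object()
m = Seq2Seq(old, object())
m.replace_ov_model(old, new)
print(m.encoder_model is new)  # True
print(m.encoder.model is new)  # True: replaced at the component level too
```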
```diff
-class OVDecoder:
+class OVDecoder(OVModelHostMixin):
```
Why not inherit from `OVModelPart` instead?
That's a good idea, I think it's possible 👍
```diff
 def _component_names(self) -> List[str]:
     base_components = ["language_model", "vision_embeddings"]
-    additional_components = [part for part in self.additional_parts if getattr(self, part, None) is not None]
+    additional_components = [part for part in self.additional_parts if hasattr(self, part)]
     return base_components + additional_components

 @property
 def components(self):
     return {component_name: getattr(self, component_name) for component_name in self._component_names}

 @property
-def _ov_submodel_names(self):
+def _ov_model_names(self):
```
Why is it needed to make a distinction between `_component_names` and `_ov_model_names` here?
What does this PR do?

Reason for changes

Executing multi-model pipelines after static quantization but before serialization leads to the original non-quantized models being inferred instead. For example, in the case of the vision embeddings model within a VLM pipeline, this is because the model is replaced at the `OVModelForVisualCausalLM` level, but not at the `OVVisionEmbedding` level.

Changes

- Added `OVModelHostMixin` to unify handling of `openvino.Model` instances for all model classes containing them. This class implements:
  - `ov_models` property: returns named instances of `openvino.Model` that the model contains (renamed from `ov_submodels`).
  - `components` property: returns named instances of `OVModelHostMixin` that the model contains.
  - `replace_ov_model(current_model: openvino.Model, new_model: openvino.Model)` method: used to perform recursive replacement. OV models are matched by Python object id.
  - `compile()` method: self-explanatory. Previously, for some classes it was called `_compile()`, which is now renamed for consistency.
  - `clear_requests()` method: self-explanatory.
- `test_ovmodel_pipeline_quantization` now checks the quantized model both before and after serialization.

Before submitting
-- now checks quantized model both before and after serialization.Before submitting