
Conversation

nikita-savelyevv
Collaborator

@nikita-savelyevv commented Oct 7, 2025

What does this PR do?

Reason for changes
Executing multi-model pipelines after static quantization but before serialization leads to the original non-quantized models being inferred instead. For example, in the case of the vision embeddings model within a VLM pipeline, this happens because the model is replaced at the OVModelForVisualCausalLM level, but not at the OVVisionEmbedding level.

Changes

  • Introduced OVModelHostMixin to unify handling of openvino.Model instances across all model classes containing them (a minimal illustrative sketch of this interface follows the list). This class implements:
    • ov_models property: returns the named openvino.Model instances that the model contains (renamed from ov_submodels).
    • components property: returns the named OVModelHostMixin instances that the model contains.
    • replace_ov_model(current_model: openvino.Model, new_model: openvino.Model) method: performs the recursive replacement. OV models are matched by Python object id.
    • compile() method: self-explanatory. Previously, for some classes this method was called _compile(); it is now renamed for consistency.
    • clear_requests() method: self-explanatory.
  • test_ovmodel_pipeline_quantization now checks the quantized model both before and after serialization.
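
For illustration, a minimal sketch of the mixin interface described above (not the actual optimum-intel implementation; the _ov_model_names / _component_names class attributes and the placeholder bodies of compile() / clear_requests() are assumptions):

from typing import Dict, List

import openvino


class OVModelHostMixin:
    # Subclasses are assumed to list the attribute names of the openvino.Model
    # instances and of the nested hosts they contain.
    _ov_model_names: List[str] = []
    _component_names: List[str] = []

    @property
    def ov_models(self) -> Dict[str, openvino.Model]:
        # openvino.Model instances held directly by this object.
        return {name: getattr(self, name) for name in self._ov_model_names}

    @property
    def components(self) -> Dict[str, "OVModelHostMixin"]:
        # Nested hosts, e.g. the vision embeddings part of a VLM.
        return {name: getattr(self, name) for name in self._component_names}

    def replace_ov_model(self, current_model: openvino.Model, new_model: openvino.Model) -> None:
        # Match by Python object identity and recurse into components, so a
        # quantized graph also reaches parts such as OVVisionEmbedding.
        for name, model in self.ov_models.items():
            if model is current_model:
                setattr(self, name, new_model)
        for component in self.components.values():
            component.replace_ov_model(current_model, new_model)

    def compile(self) -> None:
        # Placeholder: the real classes would (re)create compiled models and
        # inference requests here.
        for component in self.components.values():
            component.compile()

    def clear_requests(self) -> None:
        # Placeholder: the real classes would drop compiled requests here.
        for component in self.components.values():
            component.clear_requests()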

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@nikita-savelyevv changed the title from "Fix model inference not being affected by static quantization" to "[OV] Fix model inference not being affected by static quantization" on Oct 7, 2025
@nikita-savelyevv marked this pull request as ready for review on October 7, 2025 16:04
Comment on lines -213 to -215
-    @property
-    def ov_submodels(self) -> Dict[str, openvino.Model]:
-        return {submodel_name: getattr(self, submodel_name) for submodel_name in self._ov_submodel_names}
Member

breaking change?

@IlyasMoutawwakil
Member

Are you sure we need the refactoring for the fix?
I understand the incentive of the refactoring, but I would suggest fixing in one PR and refactoring in another,
because reviewing the refactoring will definitely take longer than the fix (which I assume is meant for the blog's benchmark).

@nikita-savelyevv
Collaborator Author

Are you sure we need the refactoring for the fix? I understand the incentive of the refactoring, but I would suggest fixing in one PR and refactoring in another, because reviewing the refactoring will definitely take longer than the fix (which I assume is meant for the blog's benchmark).

Sure, I can create a fix for the VLM case


     def forward(self, input_ids, **kwargs):
-        self._compile()
+        self.compile()
Collaborator

Now the compile() method becomes a sort of external method, but do we really need it? I am not sure the user API of optimum-intel should expose it. Compilation happens under the hood and belongs to the dev API only, so the underscore is still needed.

Collaborator Author

As of now, for some classes this method is named compile() and for some _compile(). I've decided to align this by making all of them called compile(). According to the documentation, compile() is part of our public API: https://huggingface.co/docs/optimum/en/intel/openvino/inference#compilation
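
For reference, usage along the lines of that documentation page (a sketch; the model checkpoint is an arbitrary example):

from optimum.intel import OVModelForSequenceClassification

# Defer compilation at load time, then trigger it explicitly via the public API.
model = OVModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english", export=True, compile=False
)
model.to("cpu")
model.compile()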

@rkazants
Collaborator

@nikita-savelyevv, could you please share a snippet of code that demonstrates what we can't do with the current API? Thanks

@nikita-savelyevv
Collaborator Author

nikita-savelyevv commented Oct 13, 2025

@nikita-savelyevv, could you please share a snippet of code that demonstrates what we can't do with the current API? Thanks

The purpose of this PR is to properly fix the scenario below. A hot-fix was provided in #1464.

from optimum.intel import OVModelForVisualCausalLM, OVQuantizationConfig

q_model = OVModelForVisualCausalLM.from_pretrained(
    "HuggingFaceTB/SmolVLM2-256M-Video-Instruct",
    quantization_config=OVQuantizationConfig(bits=8, dataset="contextual", num_samples=50))

# If called at this point, q_model.generate() will run the original non-quantized models

The main addition that makes this work is the replace_ov_model() method, which is now used during quantization.
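
As an illustration of how the replacement propagates (a hedged sketch, not the code in this PR; the helper name and the NNCF calibration-dataset wiring are assumptions):

import nncf
import openvino


def quantize_ov_model_in_place(host, current_model: openvino.Model, calibration_dataset: nncf.Dataset) -> None:
    # Produce a quantized copy of the raw openvino.Model with NNCF post-training quantization.
    quantized_model = nncf.quantize(current_model, calibration_dataset)
    # Propagate the swap through the host and all nested components, so parts
    # like the vision embeddings no longer reference the original graph.
    host.replace_ov_model(current_model, quantized_model)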

Collaborator

@echarlaix left a comment

Great work, thanks a lot @nikita-savelyevv, left a couple of comments

        return self.ov_models

    @property
    def _ov_model_names(self) -> List[str]:
Collaborator

From my understanding, _ov_model_names will differ from _component_names for encoder-only or decoder-only models? I'm wondering if we should modify both so that they have the same behavior as other models (having only one component), so that we can always iterate over the model's "components" to compile / clear requests / set the device. That would also remove the distinction between _ov_model_names and _component_names. wdyt @nikita-savelyevv?
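
A rough sketch of the iteration pattern suggested above (hypothetical helper names, assuming every model exposes its parts uniformly via a components mapping):

def compile_all(model) -> None:
    # With a uniform `components` mapping, per-part operations become one loop.
    for component in model.components.values():
        component.compile()


def clear_all_requests(model) -> None:
    for component in model.components.values():
        component.clear_requests()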



-class OVEncoder:
+class OVEncoder(OVModelHostMixin):
Collaborator

Why is this needed? (My understanding was that OVModelHostMixin provides functionality for OVModels and not for the components themselves.)

Collaborator Author

My idea was that every entity that has an openvino.Model instance inside of it should inherit from OVModelHostMixin. This includes components such as OVEncoder and OVDecoder too. In terms of this particular PR, this is needed for OVModelForSeq2SeqLM to call replace_ov_model() on its components, which happen to be OVEncoder and OVDecoder.



-class OVDecoder:
+class OVDecoder(OVModelHostMixin):
Collaborator

Why not inherit from OVModelPart instead?

Collaborator Author

That's a good idea, I think it's possible 👍

Comment on lines +694 to +700
     def _component_names(self) -> List[str]:
         base_components = ["language_model", "vision_embeddings"]
-        additional_components = [part for part in self.additional_parts if getattr(self, part, None) is not None]
+        additional_components = [part for part in self.additional_parts if hasattr(self, part)]
         return base_components + additional_components

     @property
     def components(self):
         return {component_name: getattr(self, component_name) for component_name in self._component_names}

     @property
-    def _ov_submodel_names(self):
+    def _ov_model_names(self):
Collaborator

Why is it needed to make a distinction between _component_names and _ov_model_names here?
