[OV] Fix model inference not being affected by static quantization #1461
Conversation
```python
@property
def ov_submodels(self) -> Dict[str, openvino.Model]:
    return {submodel_name: getattr(self, submodel_name) for submodel_name in self._ov_submodel_names}
```
Breaking change?
Are you sure we need this refactoring for the fix?

Sure, I can create a fix for the VLM case
```diff
 def forward(self, input_ids, **kwargs):
-    self._compile()
+    self.compile()
```
Now the `compile()` method becomes a sort of external method, but do we really need it? I am not sure the user API of optimum-intel should expose it. Compilation happens under the hood and belongs to the dev API only. So the `_` is still needed.
As of now, for some classes this method is named `compile()` and for some `_compile()`. I've decided to align it by making all of them called `compile()`. According to the documentation, `compile()` is part of our public API: https://huggingface.co/docs/optimum/en/intel/openvino/inference#compilation
@nikita-savelyevv, could you please share a snippet of code that demonstrates what we can't do with the current API? Thanks
The purpose of this PR is to properly fix the scenario below. A hot-fix was provided in #1464.

```python
from optimum.intel import OVModelForVisualCausalLM, OVQuantizationConfig

q_model = OVModelForVisualCausalLM.from_pretrained(
    "HuggingFaceTB/SmolVLM2-256M-Video-Instruct",
    quantization_config=OVQuantizationConfig(bits=8, dataset="contextual", num_samples=50),
)

# If called here, q_model.generate() will run the original non-quantized models
```

The main thing added here that makes it work is
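A plain-Python toy of the failure mode (not the actual optimum-intel classes; all names here are illustrative): the parent pipeline swaps its own reference to the quantized model, but the wrapper component keeps a separate reference to the original model, so inference still runs the non-quantized one:

```python
class VisionEmbedding:
    """Toy stand-in for a component that holds its own model reference."""
    def __init__(self, model):
        self.model = model

    def infer(self):
        return self.model


class VLMPipeline:
    """Toy stand-in for the parent model class."""
    def __init__(self, vision_model):
        self.vision_model = vision_model
        self.vision_embeddings = VisionEmbedding(vision_model)

    def quantize(self, quantized_model):
        # Bug: the model is replaced at the pipeline level only...
        self.vision_model = quantized_model
        # ...but self.vision_embeddings.model still points at the original.


original, quantized = object(), object()
pipeline = VLMPipeline(original)
pipeline.quantize(quantized)

print(pipeline.vision_model is quantized)               # True
print(pipeline.vision_embeddings.infer() is quantized)  # False: stale reference
```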
Great work, thanks a lot @nikita-savelyevv, left a couple of comments.
```python
    return self.ov_models

@property
def _ov_model_names(self) -> List[str]:
```
From my understanding, `_ov_model_names` will differ from `_component_names` for encoder-only or decoder-only models? I'm wondering if we should modify both so that they have the same behavior as other models (having only one component), so that we can always iterate over the model's "components" to compile / clear requests / set device. This would also remove the distinction between `_ov_model_names` and `_component_names`. wdyt @nikita-savelyevv?
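A minimal sketch of what such a unified traversal could look like (illustrative names, not the actual optimum-intel implementation): every host exposes a `components` mapping, defaulting to empty, so operations like compilation can recurse identically whether the model has zero or many components:

```python
class Host:
    """Toy host base class: uniform recursion over `components`."""

    @property
    def components(self):
        return {}  # leaf classes have no components

    def compile(self):
        self._do_compile()
        # Same traversal works for single- and multi-component models
        for component in self.components.values():
            component.compile()

    def _do_compile(self):
        pass  # subclasses compile their own openvino.Model here


class Leaf(Host):
    def __init__(self):
        self.compiled = False

    def _do_compile(self):
        self.compiled = True


class Parent(Host):
    def __init__(self):
        self.encoder = Leaf()
        self.decoder = Leaf()

    @property
    def components(self):
        return {"encoder": self.encoder, "decoder": self.decoder}


m = Parent()
m.compile()
print(m.encoder.compiled and m.decoder.compiled)  # True
```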
```diff
-class OVEncoder:
+class OVEncoder(OVModelHostMixin):
```
Why is this needed? (My understanding is that `OVModelHostMixin` was providing functionality for OVModels, not for the components themselves.)
My idea was that every entity that has an `openvino.Model` instance inside of it should derive from `OVModelHostMixin`. This includes components such as `OVEncoder` and `OVDecoder` too. In terms of this particular PR, this is needed for `OVModelForSeq2SeqLM` to call `replace_ov_model()` on its components, which happen to be `OVEncoder` and `OVDecoder`.
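A toy sketch of the recursive replacement this refers to (illustrative names and structure, not the actual optimum-intel code): models are matched by Python object identity, and the call recurses into component hosts so a quantized model replaces the original at every level:

```python
class ModelHost:
    """Toy mixin: replace a held model by object identity, recursively."""

    @property
    def ov_models(self):
        return {}  # attribute name -> model object

    @property
    def components(self):
        return {}  # nested hosts

    def replace_ov_model(self, current_model, new_model):
        for name, model in self.ov_models.items():
            if model is current_model:  # match by object identity (id)
                setattr(self, name, new_model)
        for component in self.components.values():
            component.replace_ov_model(current_model, new_model)


class Encoder(ModelHost):
    def __init__(self, model):
        self.model = model

    @property
    def ov_models(self):
        return {"model": self.model}


class Seq2Seq(ModelHost):
    def __init__(self, enc_model, dec_model):
        self.encoder = Encoder(enc_model)
        self.decoder = Encoder(dec_model)
        self.encoder_model = enc_model  # parent-level reference to the same object

    @property
    def ov_models(self):
        return {"encoder_model": self.encoder_model}

    @property
    def components(self):
        return {"encoder": self.encoder, "decoder": self.decoder}


old, new = object(), object()
m = Seq2Seq(old, object())
m.replace_ov_model(old, new)
print(m.encoder_model is new)  # True
print(m.encoder.model is new)  # True: replaced at the component level too
```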
```diff
-class OVDecoder:
+class OVDecoder(OVModelHostMixin):
```
Why not inherit from `OVModelPart` instead?
That's a good idea, I think it's possible 👍
```diff
 def _component_names(self) -> List[str]:
     base_components = ["language_model", "vision_embeddings"]
-    additional_components = [part for part in self.additional_parts if getattr(self, part, None) is not None]
+    additional_components = [part for part in self.additional_parts if hasattr(self, part)]
     return base_components + additional_components

 @property
 def components(self):
     return {component_name: getattr(self, component_name) for component_name in self._component_names}

 @property
-def _ov_submodel_names(self):
+def _ov_model_names(self):
```
Why is it needed to make a distinction between `_component_names` and `_ov_model_names` here?
What does this PR do?

Reason for changes

Executing multi-model pipelines after static quantization but before serialization leads to the original non-quantized models being inferred instead. For example, in the case of the vision embeddings model within a VLM pipeline, this is because the model is replaced at the `OVModelForVisualCausalLM` level, but not at the `OVVisionEmbedding` level.

Changes

- Added `OVModelHostMixin` to unify handling of `openvino.Model` instances for all model classes containing them. This class implements:
  - `ov_models` property: returns named instances of `openvino.Model` that the model contains (renamed from `ov_submodels`).
  - `components` property: returns named instances of `OVModelHostMixin` that the model contains.
  - `replace_ov_model(current_model: openvino.Model, new_model: openvino.Model)` method: used to perform recursive replacement. OV models are matched by Python object id.
  - `compile()` method: self-explanatory. Previously, for some classes it was called `_compile()`, which is now renamed for consistency.
  - `clear_requests()` method: self-explanatory.
- `test_ovmodel_pipeline_quantization` now checks the quantized model both before and after serialization.

Before submitting
-- now checks quantized model both before and after serialization.Before submitting