Include output embedding as well with include_embedding flag #37935
Conversation
Summary: att
Test Plan: python tests/quantization/torchao_integration/test_torchao.py -k test_include_embedding
Reviewers:
Subscribers:
Tasks:
Tags:
Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the "Ready for review" button (at the bottom of the PR page).
cc @MekkCyber @SunMarc please take a look, just a small change to include_embedding
MekkCyber left a comment
Thanks for the PR @jerryzh168, I have a small concern with the output embeddings quantization.
```python
output_emb = model.get_output_embeddings()
output_emb_names = [name for name, module in model.named_modules() if id(module) == id(output_emb)]
self.modules_to_not_convert = [
    x for x in self.modules_to_not_convert if x not in input_emb_names + output_emb_names
]
```
I'm not sure if it's a good idea to quantize the lm_head when the flag include_embedding is set 🤔, it's a bit misleading.
lm_head is the output embedding right?
e.g. https://github.com/vllm-project/vllm/blob/aea302be6c3c323207502a973fe341c3bcf7288f/vllm/model_executor/models/llama.py#L457
also:
transformers/src/transformers/models/llama/modeling_llama.py
Lines 736 to 743 in 46c0e1f
```python
def get_input_embeddings(self):
    return self.model.embed_tokens

def set_input_embeddings(self, value):
    self.model.embed_tokens = value

def get_output_embeddings(self):
    return self.lm_head
```
Yes it is, but it's still an nn.Linear, not an nn.Embedding.
About embeddings and lm_head, there are some edge cases we need to be aware of.
If they are tied:
1) if we quantize the embeddings, the lm_head will also be quantized unless we break the tied weights. This reduces memory consumption, but quality will also be reduced.
2) if we decide to break the tied weights and quantize the embeddings while keeping the lm_head as is, memory consumption will increase (because of the lm_head), but maybe we get a latency improvement? Maybe you also want to quantize the lm_head differently?
Do we have a specific use case for 2)? I think this is what you wanted to do @jerryzh168.
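For illustration (not part of this PR), a minimal sketch of the tied-weight situation described above; the checkpoint name is a placeholder for any model with tie_word_embeddings=True:

```python
from transformers import AutoModelForCausalLM

# Placeholder checkpoint; any model whose config ties word embeddings behaves this way.
model = AutoModelForCausalLM.from_pretrained("some-org/some-tied-checkpoint")

embed = model.get_input_embeddings()      # nn.Embedding
lm_head = model.get_output_embeddings()   # nn.Linear (the lm_head)

# With tied word embeddings, both modules point to the same weight tensor,
# so quantizing the embedding weight also quantizes the lm_head (case 1 above).
print(model.config.tie_word_embeddings)
print(lm_head.weight is embed.weight)
```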
Yeah, we have a use case in ExecuTorch where we quantize both the input embedding and the lm_head, and we quantize them differently. The way we are doing it right now is:
(1) manually break the ties
(2) quantize the input embedding and lm_head separately
See details in https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w#quantization-recipe

```python
quant_config = AOPerModuleConfig({"_default": linear_config, "model.embed_tokens": embedding_config})
quantization_config = TorchAoConfig(quant_type=quant_config, include_embedding=True, untie_embedding_weights=True, modules_to_not_convert=[])
```

Right now we need to set modules_to_not_convert, and this PR will allow us to remove it.
Also, I feel we might be able to remove the untie_embedding_weights flag now, since we have an alternative solution.
Please also take a look at our solution for manually untying the weights; it might be useful to have some API for it as well.
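For reference, a hedged sketch of step (1), manually breaking the tie, along the lines of the recipe linked above; the helper name is illustrative and not an existing transformers API:

```python
import torch

def untie_word_embeddings(model):
    """Give lm_head its own copy of the embedding weight so the two modules
    can be quantized with different configs (step (1) above)."""
    embed = model.get_input_embeddings()
    lm_head = model.get_output_embeddings()
    if lm_head.weight is embed.weight:
        # Clone the shared tensor so lm_head owns an independent parameter.
        lm_head.weight = torch.nn.Parameter(embed.weight.detach().clone())
    # Keep the config consistent so save/reload does not re-tie the weights.
    model.config.tie_word_embeddings = False
    return model
```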
@MekkCyber how about changing the name to include_input_output_embeddings to be more specific about what we are referring to?
Yes, I think it’s fine as long as the user is aware that they’re quantizing the lm_head.
makes sense, just updated
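For reference, a sketch of what the recipe above looks like with the renamed flag and without modules_to_not_convert (linear_config and embedding_config are the per-module torchao configs from the linked recipe; the AOPerModuleConfig import path is assumed and may differ across torchao releases):

```python
from transformers import TorchAoConfig
from torchao.quantization import AOPerModuleConfig  # import path assumed, see note above

# Per-module mapping: a default config for Linear layers and a separate config
# for the (untied) embedding, both taken from the linked recipe.
quant_config = AOPerModuleConfig({"_default": linear_config, "model.embed_tokens": embedding_config})

# With this PR the embeddings no longer need to be stripped from
# modules_to_not_convert by hand, and the flag carries the clearer name.
quantization_config = TorchAoConfig(
    quant_type=quant_config,
    include_input_output_embeddings=True,
)
```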
SunMarc left a comment
Left some feedback !
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
MekkCyber left a comment
Thanks for adding this !
@MekkCyber @SunMarc can you merge this?
Done! Sorry for the delay.
Summary:
att
Test Plan:
python tests/quantization/torchao_integration/test_torchao.py -k test_include_input_output_embeddings
Reviewers:
Subscribers:
Tasks:
Tags: