Include output embedding as well with include_embedding flag #37935
Conversation
Summary: att
Test Plan: python tests/quantization/torchao_integration/test_torchao.py -k test_include_embedding
Reviewers:
Subscribers:
Tasks:
Tags:
Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the "Ready for review" button (at the bottom of the PR page).
cc @MekkCyber @SunMarc please take a look, just a small change to include_embedding
MekkCyber left a comment
Thanks for the PR @jerryzh168, I have a small concern with the output embeddings quantization.
```python
output_emb = model.get_output_embeddings()
output_emb_names = [name for name, module in model.named_modules() if id(module) == id(output_emb)]
self.modules_to_not_convert = [
    x for x in self.modules_to_not_convert if x not in input_emb_names + output_emb_names
]
```
I'm not sure if it's a good idea to quantize the lm_head when the flag include_embedding is set 🤔, it's a bit misleading.
lm_head is the output embedding right?
e.g. https://github.com/vllm-project/vllm/blob/aea302be6c3c323207502a973fe341c3bcf7288f/vllm/model_executor/models/llama.py#L457
also:
transformers/src/transformers/models/llama/modeling_llama.py
Lines 736 to 743 in 46c0e1f
```python
def get_input_embeddings(self):
    return self.model.embed_tokens

def set_input_embeddings(self, value):
    self.model.embed_tokens = value

def get_output_embeddings(self):
    return self.lm_head
```
Yes it is, but it's still an nn.Linear, not an nn.Embedding.
About embeddings and lm_head, there are some edge cases we need to be aware of.
If they are tied:
1) if we quantize the embeddings, the lm_head will also be quantized unless we break the tied weights. This reduces memory consumption, but quality will also be reduced.
2) if we decide to break the tied weights and quantize the embeddings while keeping the lm_head as is, memory consumption will increase (because of the lm_head), but maybe we get a latency improvement? Maybe you also want to quantize the lm_head differently?
Do we have a specific use case for 2)? I think this is what you wanted to do @jerryzh168.
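For illustration (not part of this PR), a minimal sketch of the tied-weight situation described above; the checkpoint name is a placeholder for any model with tie_word_embeddings=True:

```python
from transformers import AutoModelForCausalLM

# Placeholder checkpoint; any model whose config ties word embeddings behaves this way.
model = AutoModelForCausalLM.from_pretrained("some-org/some-tied-checkpoint")

embed = model.get_input_embeddings()      # nn.Embedding
lm_head = model.get_output_embeddings()   # nn.Linear (the lm_head)

# With tied word embeddings, both modules point to the same weight tensor,
# so quantizing the embedding weight also quantizes the lm_head (case 1 above).
print(model.config.tie_word_embeddings)
print(lm_head.weight is embed.weight)
```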
Yeah, we have a use case in ExecuTorch where we quantize both the input embedding and the lm_head, and we quantize them differently. The way we are doing it right now is:
(1) manually break the ties
(2) quantize the input embedding and lm_head separately
See details in https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w#quantization-recipe

```python
quant_config = AOPerModuleConfig({"_default": linear_config, "model.embed_tokens": embedding_config})
quantization_config = TorchAoConfig(quant_type=quant_config, include_embedding=True, untie_embedding_weights=True, modules_to_not_convert=[])
```

Right now we need to set modules_to_not_convert, and this PR will allow us to remove it.
Also, I feel we might be able to remove the untie_embedding_weights flag now, since we have an alternative solution.
Please also take a look at our solution for manually untying the weights; it might be useful to have some API for it as well.
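For reference, a hedged sketch of step (1), manually breaking the tie, along the lines of the recipe linked above; the helper name is illustrative and not an existing transformers API:

```python
import torch

def untie_word_embeddings(model):
    """Give lm_head its own copy of the embedding weight so the two modules
    can be quantized with different configs (step (1) above)."""
    embed = model.get_input_embeddings()
    lm_head = model.get_output_embeddings()
    if lm_head.weight is embed.weight:
        # Clone the shared tensor so lm_head owns an independent parameter.
        lm_head.weight = torch.nn.Parameter(embed.weight.detach().clone())
    # Keep the config consistent so save/reload does not re-tie the weights.
    model.config.tie_word_embeddings = False
    return model
```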
@MekkCyber how about changing the name to include_input_output_embeddings to be more specific about what we are referring to?
Yes, I think it’s fine as long as the user is aware that they’re quantizing the lm_head.
makes sense, just updated
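For reference, a sketch of what the recipe above looks like with the renamed flag and without modules_to_not_convert (linear_config and embedding_config are the per-module torchao configs from the linked recipe; the AOPerModuleConfig import path is assumed and may differ across torchao releases):

```python
from transformers import TorchAoConfig
from torchao.quantization import AOPerModuleConfig  # import path assumed, see note above

# Per-module mapping: a default config for Linear layers and a separate config
# for the (untied) embedding, both taken from the linked recipe.
quant_config = AOPerModuleConfig({"_default": linear_config, "model.embed_tokens": embedding_config})

# With this PR the embeddings no longer need to be stripped from
# modules_to_not_convert by hand, and the flag carries the clearer name.
quantization_config = TorchAoConfig(
    quant_type=quant_config,
    include_input_output_embeddings=True,
)
```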
SunMarc left a comment
Left some feedback !
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
MekkCyber left a comment
Thanks for adding this !
@MekkCyber @SunMarc can you merge this?
Done! Sorry for the delay.
Summary:
att
Test Plan:
python tests/quantization/torchao_integration/test_torchao.py -k test_include_input_output_embeddings
Reviewers:
Subscribers:
Tasks:
Tags: