6 changes: 5 additions & 1 deletion src/transformers/quantizers/quantizer_torchao.py
@@ -188,7 +188,11 @@ def _process_model_before_weight_loading(
         if self.quantization_config.include_embedding:
             input_emb = model.get_input_embeddings()
             input_emb_names = [name for name, module in model.named_modules() if id(module) == id(input_emb)]
-            self.modules_to_not_convert = [x for x in self.modules_to_not_convert if x not in input_emb_names]
+            output_emb = model.get_output_embeddings()
+            output_emb_names = [name for name, module in model.named_modules() if id(module) == id(output_emb)]
+            self.modules_to_not_convert = [
+                x for x in self.modules_to_not_convert if x not in input_emb_names + output_emb_names
+            ]
Comment on lines +191 to +195
Contributor:
I'm not sure it's a good idea to quantize the lm_head when the include_embedding flag is set 🤔; it's a bit misleading.

Contributor Author:

also:

def get_input_embeddings(self):
    return self.model.embed_tokens

def set_input_embeddings(self, value):
    self.model.embed_tokens = value

def get_output_embeddings(self):
    return self.lm_head

Contributor:
Yes it is, but it's still an nn.Linear, not an nn.Embedding.
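
For illustration, a quick way to confirm the module types (a minimal sketch, assuming `model` is an already-loaded causal LM; nothing here is specific to this PR):

import torch.nn as nn

# embed_tokens is an nn.Embedding, while lm_head is an nn.Linear projecting
# hidden states back to vocabulary logits
print(isinstance(model.get_input_embeddings(), nn.Embedding))  # expected: True
print(isinstance(model.get_output_embeddings(), nn.Linear))    # expected: True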

Member:
About embeddings and lm_head, there are some edge cases we need to be aware of. If they are tied:

  1. If we quantize the embeddings, the lm_head will also be quantized unless we break the tied weights. This reduces memory consumption, but quality is reduced as well.
  2. If we untie the weights, quantize the embeddings, and keep the lm_head as is, memory consumption will increase (because of the full-precision lm_head), but maybe we get a latency improvement? Maybe you also want to quantize the lm_head differently?

Do we have a specific use case for 2)? I think this is what you wanted to do, @jerryzh168.
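
A quick way to check whether a given checkpoint falls into the tied case (a minimal sketch using standard transformers accessors; the checkpoint name is only a placeholder):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("org/causal-lm-checkpoint")  # placeholder checkpoint
emb = model.get_input_embeddings()
head = model.get_output_embeddings()
# If tied, both modules hold the same Parameter object, so quantizing the
# embedding also quantizes the lm_head unless the tie is broken first.
print(getattr(model.config, "tie_word_embeddings", None))
print(emb.weight is head.weight)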

Contributor Author (@jerryzh168, May 6, 2025):
Yeah, we have a use case in ExecuTorch where we quantize both the input embedding and the lm_head, and we quantize them differently. The way we are doing it right now is:

(1) manually break the ties
(2) quantize the input embedding and lm_head separately

See details in https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w#quantization-recipe

quant_config = AOPerModuleConfig({"_default": linear_config, "model.embed_tokens": embedding_config})
quantization_config = TorchAoConfig(quant_type=quant_config, include_embedding=True, untie_embedding_weights=True, modules_to_not_convert=[])

Right now we need to set modules_to_not_convert, and this PR will allow us to remove it.
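
For comparison, a sketch of what that recipe could reduce to once modules_to_not_convert no longer needs to be passed (same linear_config / embedding_config as above, flag names as written in this thread):

quant_config = AOPerModuleConfig({"_default": linear_config, "model.embed_tokens": embedding_config})
quantization_config = TorchAoConfig(quant_type=quant_config, include_embedding=True, untie_embedding_weights=True)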

Also, I feel we might be able to remove the untie_embedding_weights flag now, since we have an alternative solution.

Please also take a look at our solution for manually untying the weights; it might be useful to have an API for that as well.
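
For reference, manually breaking the tie usually amounts to giving lm_head its own copy of the shared weight before quantizing, so the two modules can receive different configs. A minimal sketch of that idea (an assumption about the general approach, not the exact ExecuTorch recipe linked above):

import torch

# assumes `model` is an already-loaded causal LM with tied embeddings
head = model.get_output_embeddings()
head.weight = torch.nn.Parameter(model.get_input_embeddings().weight.detach().clone())
# keep later tie_weights() calls from re-tying the two modules
model.config.tie_word_embeddings = False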

Contributor Author:

@MekkCyber how about changing the name to include_input_output_embeddings to be more specific about what we are referring to?

Contributor:

Yes, I think it’s fine as long as the user is aware that they’re quantizing the lm_head.

Contributor Author:

makes sense, just updated

return

def check_quantized_param(
5 changes: 4 additions & 1 deletion tests/quantization/torchao_integration/test_torchao.py
@@ -210,7 +210,9 @@ def test_include_embedding(self):
             granularity=granularity,
             mapping_type=mapping_type,
         )
-        config = AOPerModuleConfig({"_default": None, "model.embed_tokens": embedding_config})
+        config = AOPerModuleConfig(
+            {"_default": None, "model.embed_tokens": embedding_config, "lm_head": embedding_config}
+        )
         # need set `include_embedding` to True
         quant_config = TorchAoConfig(quant_type=config, include_embedding=True)
         quantized_model = AutoModelForCausalLM.from_pretrained(
@@ -220,6 +222,7 @@
         )
         # making sure embedding is quantized
         self.assertTrue(isinstance(quantized_model.model.embed_tokens.weight, AffineQuantizedTensor))
+        self.assertTrue(isinstance(quantized_model.lm_head.weight, AffineQuantizedTensor))
         tokenizer = AutoTokenizer.from_pretrained(self.model_name)

         input_ids = tokenizer(self.input_text, return_tensors="pt").to(self.device)