[quantization] Add QuantGemma4TextScaledWordEmbedding PTQ wrapper #788
Merged
1ed9d8d to 3ef7fa3
Contributor
|
The test command is |
mhs4670go reviewed Jun 22, 2026
Comment on lines +55 to +57
self.register_buffer(
    "embed_scale", torch.tensor(fp.embed_scale), persistent=False
)
Contributor
How about just using the original scale instead of copying it into a separate buffer?
def enable_calibration(self) -> None:
    super().enable_calibration()
    self.obs_weight.collect(self.module.weight)
    self.obs_embed_scale.collect(self.module.embed_scale)

# forward
scale = self.module.embed_scale
if self._mode is Mode.QUANT:
    scale = self.obs_embed_scale.fake_quant(scale)

Add complete PTQ quantization support for Gemma4TextScaledWordEmbedding.
TICO-DCO-1.0-Signed-off-by: d.savchenkov <d.savchenkov@partner.samsung.com>
What
This PR adds complete Post-Training Quantization (PTQ) support for the Gemma4TextScaledWordEmbedding module, a key component of the Gemma4 multimodal model family. The implementation includes a comprehensive PTQ wrapper with per-channel weight quantization, embed_scale fake quantization, full test coverage (14 unit tests + 3 internal tests), and an example script demonstrating the complete quantization flow with Circle format export.
Why
The Gemma4TextScaledWordEmbedding module extends standard embedding layers by multiplying token embeddings with a scalar scale factor (embed_scale). This scaling operation is critical for Gemma4's numerical stability and must be properly quantized to maintain accuracy in the static-shape NPU inference flow.
Prior to this change, the wrapper existed as a skeleton with only basic weight observation. This PR completes the implementation by:
QuantEmbedding)
This change is part of the broader Gemma4 E2B static PTQ skeleton effort, moving individual wrappers from skeleton status to fully functional quantization modules.
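The scaled embedding described above can be pictured with a small stand-in module. This is only a sketch in plain PyTorch: `ScaledWordEmbedding` is a hypothetical class, not the real Gemma4TextScaledWordEmbedding, and the choice of `embed_scale = sqrt(dim)` is an assumption made for illustration.

```python
import torch
import torch.nn as nn


class ScaledWordEmbedding(nn.Embedding):
    """Hypothetical stand-in for a scaled word embedding (not the real Gemma4 class)."""

    def __init__(self, num_embeddings: int, embedding_dim: int, embed_scale: float = 1.0):
        super().__init__(num_embeddings, embedding_dim)
        # Non-persistent buffer: moves with the module but is not stored in state_dict.
        self.register_buffer("embed_scale", torch.tensor(embed_scale), persistent=False)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Ordinary embedding lookup followed by one scalar multiplication.
        return super().forward(input_ids) * self.embed_scale.to(self.weight.dtype)


torch.manual_seed(0)
# vocab=1000, dim=64 mirror the sizes used by the example script; the scale is assumed.
emb = ScaledWordEmbedding(num_embeddings=1000, embedding_dim=64, embed_scale=64 ** 0.5)
ids = torch.randint(0, 1000, (2, 8))
out = emb(ids)
```

The two operations in `forward` (lookup, then multiply) are the points a PTQ wrapper has to observe and fake-quantize.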
Key Design Decisions
1. Per-Channel Asymmetric Weight Quantization
Decision: Use QScheme.PER_CHANNEL_ASYMM with channel_axis=0 for the weight observer.
Rationale: This matches the pattern established in QuantEmbedding (tico/quantization/wrapq/wrappers/nn/quant_embedding.py). Per-channel quantization provides better accuracy for embedding tables because each embedding vector can have its own scale/zero-point, accommodating varying value ranges across the vocabulary.
2. Four-Observer Architecture
Decision: Add 4 observers: obs_weight, obs_embedding, obs_embed_scale, obs_act_out.
Rationale:
obs_weight: Quantizes the embedding weight matrix (per-channel)
obs_embedding: Quantizes the raw embedding output before scaling
obs_embed_scale: Quantizes the scalar scale factor itself
obs_act_out: Quantizes the final scaled output
This granular observation ensures all intermediate tensors are properly quantized, maintaining numerical consistency through the embedding → scale → output chain.
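The arithmetic behind these four quantization points can be sketched in dependency-free Python. This is an illustration under stated assumptions, not the wrapper's code: the helper names (`asymm_qparams`, `fake_quant`, `fq_tensor`), the unsigned 8-bit range, and the toy sizes are all made up here, and the observers are collapsed into direct min/max computations.

```python
import random

QMIN, QMAX = 0, 255  # unsigned 8-bit range, assumed only for this sketch


def asymm_qparams(values):
    """Scale and zero-point for asymmetric quantization of a flat list of floats."""
    lo = min(min(values), 0.0)  # widen the range so 0.0 is exactly representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (QMAX - QMIN) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = max(QMIN, min(QMAX, round(QMIN - lo / scale)))
    return scale, zero_point


def fake_quant(values, scale, zero_point):
    """Quantize to integers, clamp, then dequantize back to floats."""
    out = []
    for v in values:
        q = max(QMIN, min(QMAX, round(v / scale) + zero_point))
        out.append((q - zero_point) * scale)
    return out


def fq_tensor(rows):
    """Per-tensor fake quantization: one (scale, zero_point) pair for all rows."""
    scale, zp = asymm_qparams([v for row in rows for v in row])
    return [fake_quant(row, scale, zp) for row in rows]


rng = random.Random(0)
vocab, dim, embed_scale = 6, 4, 2.0
# Rows are given different value ranges so per-row qparams make a difference.
weight = [[rng.uniform(-1, 1) * (r + 1) for _ in range(dim)] for r in range(vocab)]

# 1) weight: per-channel along axis 0, one (scale, zero_point) per vocabulary row
row_qparams = [asymm_qparams(row) for row in weight]
w_q = [fake_quant(row, s, zp) for row, (s, zp) in zip(weight, row_qparams)]

token_ids = [3, 0, 5]
# 2) embedding: the raw lookup result, before scaling
hidden = fq_tensor([w_q[i] for i in token_ids])
# 3) embed_scale: the scalar scale factor itself
s_scale, s_zp = asymm_qparams([embed_scale])
scale_q = fake_quant([embed_scale], s_scale, s_zp)[0]
# 4) output: the final scaled activations
out = fq_tensor([[v * scale_q for v in row] for row in hidden])
```

Because every row carries its own scale, the rounding error of a row is bounded by half of that row's step size rather than by the step size of the widest row in the table.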
3. Fake Quantization on embed_scale
Decision: Apply fake quantization to the scale factor in QUANT mode.
Rationale: While embed_scale is a scalar, quantizing it ensures the multiplication hidden_states * scale operates on quantized values, maintaining consistency with the fake quantization paradigm. The quantization error on a scalar is negligible but including it ensures the graph accurately represents the quantized inference behavior.
4. as_export_module Returns Self
Decision: The as_export_module() method returns self after asserting QUANT mode.
Rationale: The wrapper is already exportable: its forward method uses only torch.export-compatible operations (embedding lookup, multiplication, fake_quant). No additional adaptation is needed, following the pattern used in other simple wrappers like QuantGemma4VisionPooler.
Changes
tico/quantization/wrapq/wrappers/gemma4/quant_text_scaled_word_embedding.py: QScheme import; changed weight observer to per-channel asymmetric (channel_axis=0); added obs_embedding and obs_embed_scale observers; added embed_scale collection in enable_calibration(); added fake quantization for embedding output and scale in forward(); added as_export_module() method; updated _all_observers() to return all 4 observers
test/quantization/wrapq/wrappers/gemma4/test_quant_text_scaled_word_embedding.py
test/quantization/wrapq/wrappers/gemma4/test_quantize_text_scaled_word_embedding.py: RUN_INTERNAL_TESTS=1)
tico/quantization/wrapq/examples/gemma4/quantize_text_scaled_word_embedding.py
tico/quantization/wrapq/wrappers/registry.py: quant_text_scaled_word_embedding module registration, enabling automatic wrapper discovery
tico/quantization/recipes/debug/wrapper_smoke/cases/gemma4.py
Tests
Unit Tests (14 tests)
File: test/quantization/wrapq/wrappers/gemma4/test_quant_text_scaled_word_embedding.py
test_no_quant_forward_matches_fp
test_no_quant_output_shape (batch, seq, dim)
test_mode_transitions
test_observers_are_collected
test_weight_is_observed_in_calib_mode
test_embed_scale_is_observed_in_calib_mode
test_output_is_fake_quantized_in_quant_mode
test_quant_mode_output_is_finite
test_dtype_override
test_weight_uses_per_channel_asymm_by_default
test_as_export_module_requires_quant_mode
test_as_export_module_returns_self
Internal Tests (3 tests)
File: test/quantization/wrapq/wrappers/gemma4/test_quantize_text_scaled_word_embedding.py
test_no_quant_embedding_matches_reference
test_prepare_convert_embedding_flow
test_as_export_module_flow
Smoke Test
Example Script
File: tico/quantization/wrapq/examples/gemma4/quantize_text_scaled_word_embedding.py
The example script demonstrates the complete PTQ workflow:
Gemma4TextScaledWordEmbedding(vocab=1000, dim=64) without downloading pretrained weights
tico.convert()
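The calibrate-then-quantize pattern that such a workflow relies on can be sketched generically. Everything below is hypothetical and independent of TICO: `MinMaxObserver`, its `freeze()` method, and the three-member `Mode` enum are illustrative names, not the library's API, and the 8-bit range is an assumption.

```python
from enum import Enum, auto


class Mode(Enum):
    NO_QUANT = auto()  # pass values through untouched
    CALIB = auto()     # record value ranges, still pass through
    QUANT = auto()     # apply fake quantization with frozen parameters


class MinMaxObserver:
    """Hypothetical observer: tracks a running min/max, then fake-quantizes."""

    def __init__(self, bits=8):
        self.qmax = 2 ** bits - 1
        self.lo, self.hi = 0.0, 0.0  # the tracked range always contains 0.0
        self.scale, self.zero_point = 1.0, 0
        self.mode = Mode.NO_QUANT

    def __call__(self, values):
        if self.mode is Mode.CALIB:
            self.lo = min(self.lo, min(values))
            self.hi = max(self.hi, max(values))
            return values
        if self.mode is Mode.QUANT:
            return [self._fake_quant(v) for v in values]
        return values

    def freeze(self):
        """Turn the collected range into fixed qparams and switch to QUANT mode."""
        self.scale = (self.hi - self.lo) / self.qmax or 1.0
        self.zero_point = round(-self.lo / self.scale)
        self.mode = Mode.QUANT

    def _fake_quant(self, v):
        q = max(0, min(self.qmax, round(v / self.scale) + self.zero_point))
        return (q - self.zero_point) * self.scale


obs = MinMaxObserver()
x = [-0.5, 0.25, 1.5]
passthrough = obs(x)                      # NO_QUANT: identical to the input

obs.mode = Mode.CALIB
for batch in ([-1.0, 0.5], [0.2, 2.0]):   # calibration data widens the range
    obs(batch)

obs.freeze()
y = obs(x)                                # QUANT: values snapped to the 8-bit grid
```

Keeping collection and quantization in separate modes means the quantization parameters are fixed before export, which is what a static-shape inference flow needs.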