
Conversation

@jackzhxng (Collaborator) commented on Sep 24, 2025

Big quantization improvements for Gemma3 4B vision (7.4 GB -> 3.0 GB)

  • Quantize the encoder with 8da4w, group size 32, where possible; otherwise fall back to 8da8w per-token (this applies to the fc2 layers)
  • Quantize the LM head
```
optimum-cli export executorch \
    --model google/gemma-3-4b-it \
    --task "multimodal-text-to-text" \
    --max_seq_len 1024 \
    --recipe "xnnpack" \
    --use_custom_sdpa \
    --use_custom_kv_cache \
    --qlinear 8da4w \
    --qlinear_group_size 32 \
    --qlinear_encoder 8da4w,8da8w \
    --qlinear_encoder_group_size 32 \
    --qembedding 8w \
    --output_dir="gemma3_vision"
```
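Under the hood, this amounts to picking a quantization config per linear layer based on its shape. Below is a minimal sketch of that selection logic, assuming torchao's `quantize_` API with its `Int8DynamicActivationInt4WeightConfig` (8da4w) and `Int8DynamicActivationInt8WeightConfig` (8da8w) configs; the toy `encoder` module and the filter functions are illustrative stand-ins, not the PR's actual code.

```python
import torch
from torchao.quantization import (
    quantize_,
    Int8DynamicActivationInt4WeightConfig,  # "8da4w": int8 dynamic activations, int4 groupwise weights
    Int8DynamicActivationInt8WeightConfig,  # "8da8w": int8 dynamic per-token activations, int8 weights
)

GROUP_SIZE = 32

# Stand-in encoder: one layer whose in_features is a multiple of 32, one whose isn't.
encoder = torch.nn.Sequential(
    torch.nn.Linear(1152, 4304),  # 1152 % 32 == 0 -> eligible for groupwise 8da4w
    torch.nn.Linear(4304, 1152),  # 4304 % 32 != 0 -> needs the 8da8w fallback
)

def fits_groupwise(module, fqn):
    # Groupwise int4 weight quantization needs in_features divisible by the group size.
    return isinstance(module, torch.nn.Linear) and module.in_features % GROUP_SIZE == 0

def needs_fallback(module, fqn):
    return isinstance(module, torch.nn.Linear) and module.in_features % GROUP_SIZE != 0

# 8da4w with group size 32 wherever the layer shape allows it...
quantize_(encoder, Int8DynamicActivationInt4WeightConfig(group_size=GROUP_SIZE), filter_fn=fits_groupwise)
# ...and 8da8w per-token for the remaining linears (e.g. the fc2 layers mentioned above).
quantize_(encoder, Int8DynamicActivationInt8WeightConfig(), filter_fn=needs_fallback)
```

Presumably the encoder's fc2 layers hit the fallback because their input width is not a multiple of 32; the per-token 8da8w config has no group-size constraint.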

@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

```diff
     fallback_linear_config_key = None
 else:
-    assert qlinear_group_size % 2 == 0, "Linear quantization group size must be a multiple of 2."
+    assert qlinear_group_size % 2 == 0, f"Linear quantization group size must be a multiple of 2, got {qlinear_group_size}."
```
A contributor commented:

Why is group size a multiple of 2? Shouldn't it be a multiple of 32?
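For context on the constraint being discussed: beyond the evenness check in the assert, groupwise weight quantization also needs the group size to evenly divide each linear layer's in_features, which is what pushes some layers onto the 8da8w fallback. A hypothetical helper (not the PR's code) showing the two conditions side by side:

```python
def supports_group_size(in_features: int, group_size: int) -> bool:
    # The assert above only enforces an even group size; groupwise quantization
    # of a particular layer additionally needs in_features % group_size == 0.
    return group_size % 2 == 0 and in_features % group_size == 0

print(supports_group_size(4096, 32))  # True: eligible for 8da4w with group size 32
print(supports_group_size(4304, 32))  # False: a layer like this would take the 8da8w fallback
```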

@jackzhxng force-pushed the jz/quantize-fallback branch 2 times, most recently from 9238ad0 to 3b3ae50, on October 7, 2025 20:42
@jackzhxng changed the title from "Implement quantization fallback to 8w per channel" to "Implement quantization fallback to 8w per channel + other quant improvements for multimodal" on Oct 7, 2025
@jackzhxng changed the title from "Implement quantization fallback to 8w per channel + other quant improvements for multimodal" to "Quant fallback to 8w per token + other quant improvements for multimodal" on Oct 7, 2025
@jackzhxng marked this pull request as ready for review on October 7, 2025 21:05
@jackzhxng force-pushed the jz/quantize-fallback branch from 3b3ae50 to d2f238e on October 8, 2025 17:06
@jackzhxng force-pushed the jz/quantize-fallback branch from d2f238e to a872c53 on October 8, 2025 18:02
Comment on lines +211 to +214
```python
quantize_lm_head_kwargs = {
    "eager_model": eager_model.lm_head,
    "qlinear_config": qlinear_config,
}
```
A collaborator commented:

Can you guard this by whether eager_model has lm_head?

@jackzhxng (author) replied:

Sure. Curious though, is there a model without an lm_head?
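A minimal sketch of what such a guard could look like, assuming eager_model follows the usual transformers convention where lm_head, when present, is a torch.nn.Linear; eager_model and qlinear_config refer to the snippet above, and torchao's quantize_ stands in for whatever helper actually consumes quantize_lm_head_kwargs (illustrative only, not the PR's code):

```python
import torch
from torchao.quantization import quantize_

lm_head = getattr(eager_model, "lm_head", None)
if isinstance(lm_head, torch.nn.Linear):
    # Only quantize the LM head when the model actually exposes one.
    quantize_(lm_head, qlinear_config)
```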
