[quantization] Microscaling (MX) Quantization for LayerNorm in Qwen3-vl #723

Draft: Torrero wants to merge 1 commit into Samsung:main from Torrero:mx_for_layernorm_qwen

Torrero commented May 22, 2026

What

Evaluate microscaling (MX) quantization for LayerNorm in the Qwen3-VL vision model.

Why

Microscaling quantization can improve LayerNorm quantization accuracy when applied selectively to the right observers with appropriate axis configuration. The best results were achieved with:

Observers: act_in, centered, square, inv_std, norm, act_outs
Axis: 1 (channel dimension)
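
To make the observer list concrete, here is a minimal NumPy sketch of a decomposed LayerNorm with an MX-style fake quantization at each point. It is an illustration only, not the TICO implementation: the helper names `mx_fake_quant` and `layernorm_mx`, the block size of 32, the 8-bit integer element format, and the mapping of observer names to intermediate tensors are all assumptions. Activations are assumed to be 2-D with shape (tokens, channels), so that axis 1 is the channel dimension.

```python
import numpy as np

def mx_fake_quant(x, axis=-1, block_size=32, elem_bits=8):
    """Quantize-dequantize x with one shared power-of-two scale per block.

    Blocks of `block_size` elements are taken along `axis` (assumed layout).
    """
    x = np.moveaxis(np.asarray(x, dtype=np.float32), axis, -1)
    n = x.shape[-1]
    pad = (-n) % block_size  # zero-pad so the axis splits into whole blocks
    xp = np.pad(x, [(0, 0)] * (x.ndim - 1) + [(0, pad)])
    blocks = xp.reshape(*xp.shape[:-1], -1, block_size)

    # Shared exponent per block, derived from the block's largest magnitude.
    amax = np.max(np.abs(blocks), axis=-1, keepdims=True)
    safe = np.where(amax > 0, amax, 1.0)  # avoid log2(0) for all-zero blocks
    scale = np.exp2(np.floor(np.log2(safe)) - (elem_bits - 2))

    qmax = 2 ** (elem_bits - 1) - 1
    q = np.clip(np.round(blocks / scale), -qmax, qmax)
    deq = (q * scale).reshape(xp.shape)[..., :n]
    return np.moveaxis(deq, -1, axis)

def layernorm_mx(x, weight, bias, eps=1e-6, axis=1):
    """LayerNorm over the last dim with fake quantization at each observer.

    The observer-to-tensor mapping below is an assumption for illustration.
    """
    fq = lambda t: mx_fake_quant(t, axis=axis)
    act_in = fq(x)                                   # observer: act_in
    mean = act_in.mean(axis=-1, keepdims=True)
    centered = fq(act_in - mean)                     # observer: centered
    square = fq(centered ** 2)                       # observer: square
    var = square.mean(axis=-1, keepdims=True)
    inv_std = fq(1.0 / np.sqrt(var + eps))           # observer: inv_std
    norm = fq(centered * inv_std)                    # observer: norm
    return fq(norm * weight + bias)                  # observer: act_out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 64))
    y = layernorm_mx(x, np.ones(64), np.zeros(64))
    ref = (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + 1e-6)
    print("max abs error vs. float LayerNorm:", np.max(np.abs(y - ref)))
```

Each block of 32 values shares a single power-of-two scale, so an outlier only coarsens the resolution of its own block rather than of the whole tensor.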

| Mode | MX Observers | MX Axes | PPL | VQA2 | MMLU | COCO (CIDEr/Bleu_4) | MMMU_pro (vision) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| original | - | - | 10.54 | 0.895 | 0.735 | 0.361/0.025 | 0.286 |
| GPTQ_MSE_w4A16; token embedding, lm_head: 4bit; patch embedding (Conv3D): 4bit | act_in, centered, square, inv_std, norm, act_outs | 1 | 14.59 | 0.837 | - | - | - |
| degradation % | - | - | 38% | 6% | - | - | - |
| GPTQ_MSE_w4A16; token embedding, lm_head: 8bit; patch embedding (Conv3D): 8bit | act_in, centered, square, inv_std, norm, act_outs | 1 | 13.68 | 0.829 | - | - | - |
| degradation % | - | - | 29% | 6% | - | - | - |
| GPTQ_MSE_spinquant_w4A16; token embedding, lm_head: 8bit; patch embedding (Conv3D): 8bit | act_in, centered, square, inv_std, norm, act_outs | 1 | 12.13 | 0.878 | 0.702 | 0.328/0.024 | 0.249 |
| degradation % | - | - | 15% | 2% | 3% | 9%/4% | 4% |
| GPTQ_MSE_spinquant_smoothquant_vision_w4A16; token embedding, lm_head: 8bit; patch embedding (Conv3D): 8bit | act_in, centered, square, inv_std, norm, act_outs | 1 | 12.00 | 0.888 | 0.706 | 0.317/0.021 | 0.262 |
| degradation % | - | - | 14% | 1% | 3% | 12%/16% | 4% |
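
The PR does not state how the degradation rows were computed. As a rough sketch, assuming they are relative changes against the `original` row, the first PPL entry can be reproduced as follows (the helper name `relative_change_pct` is hypothetical):

```python
def relative_change_pct(baseline: float, value: float) -> float:
    """Relative change of `value` against `baseline`, in percent (assumed formula)."""
    return 100.0 * (value - baseline) / baseline

# PPL of the original model vs. GPTQ_MSE_w4A16 with 4-bit embeddings (from the table).
ppl_degradation = relative_change_pct(10.54, 14.59)
print(f"PPL degradation: {ppl_degradation:.1f}%")
```

For PPL a positive value means worse; for accuracy-style metrics such as VQA2 the sign flips, so the magnitude is what the table reports.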

Note: using MX axis 1 may incur additional computational cost.

Run commands:

# GPTQ_MSE_spinquant_w4A16
python tico/quantization/wrapq/examples/quantize_qwen3_vl_with_gptq.py \
  --model=Qwen/Qwen3-VL-4B-Instruct \
  --trust-remote-code \
  --calib_seq_len=2048 \
  --max_seq_len=2048 \
  --eval_tasks=vqav2,coco \
  --gptq_mse=mse \
  --nsamples_for_evaluation=1000 \
  --nsamples_for_qcalibration=128 \
  --embedding_weight_bits=8 \
  --vision_patch_embed_weight_bits=8 \
  --linear_weight_bits=4 \
  --lm_head_weight_bits=8 \
  --spinquant \
  --spinquant_init_method=random \
  --ppl_dataset=wikitext2 \
  --ppl_stride=2048 \
  --mmmu_dataset=MMMU/MMMU_Pro \
  --mmmu_subjects=vision \
  --mmmu_n_shots=0 \
  --mmmu_n_samples=-1 \
  --mmlu_subjects=mmlu \
  --mmlu_n_samples=1000

# GPTQ_MSE_spinquant_smoothquant_vision_w4A16
python tico/quantization/wrapq/examples/quantize_qwen3_vl_with_gptq.py \
  --model=Qwen/Qwen3-VL-4B-Instruct \
  --trust-remote-code \
  --calib_seq_len=2048 \
  --max_seq_len=2048 \
  --eval_tasks=vqav2,coco \
  --gptq_mse=mse \
  --nsamples_for_evaluation=1000 \
  --nsamples_for_qcalibration=128 \
  --embedding_weight_bits=8 \
  --vision_patch_embed_weight_bits=8 \
  --linear_weight_bits=4 \
  --lm_head_weight_bits=8 \
  --spinquant \
  --spinquant_init_method=random \
  --ppl_dataset=wikitext2 \
  --ppl_stride=2048 \
  --mmmu_dataset=MMMU/MMMU_Pro \
  --mmmu_subjects=vision \
  --mmmu_n_shots=0 \
  --mmmu_n_samples=-1 \
  --mmlu_subjects=mmlu \
  --mmlu_n_samples=1000

TICO-DCO-1.0-Signed-off-by: Evgenii Maltsev e.maltsev@samsung.com

Evaluation of microscaling (MX) Quantization for LayerNorm in Qwen3-VL Vision Model

TICO-DCO-1.0-Signed-off-by:  Evgenii Maltsev <e.maltsev@samsung.com>
@Torrero Torrero force-pushed the mx_for_layernorm_qwen branch from 28c7da3 to 51d366a Compare May 22, 2026 18:20