Checklist
- If this is not a feature request but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
- Please use English. Otherwise, it will be closed.
Motivation
It is well documented that quantization affects different layers of a model to different degrees, depending on the scheme used.
Compressed-tensors allows non-uniform quantization with recipes available here: https://github.com/vllm-project/llm-compressor/tree/0.9.0/examples/quantization_non_uniform
I tried this with a Qwen3-4B model and the following quantization config; each scheme is supported by SGLang individually:
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      targets: r"re:.*self_attn\.(k|q|o|v)_proj.*"
      scheme: FP8_BLOCK
    AWQModifier:
      mlp_experts_projections:
        group_0:
          targets: ["re:.*(down|gate|up)_proj.*"]
          weights:
            num_bits: 4
            type: int
            symmetric: true
            group_size: 32
            strategy: group
            dynamic: false
            # actorder: group
            observer: memoryless_minmax
      # Layers to exclude from quantization
      ignore:
        - "lm_head"
      # Scaling options
      duo_scaling: true
      mappings:
        - smooth_layer: re:.*post_attention_layernorm$
          balance_layers: ["re:.*gate_proj$", "re:.*up_proj$"]
        - smooth_layer: re:.*up_proj$
          balance_layers: ["re:.*down_proj$"]
Research
According to [1], spending more bits on attention layers yields a larger gain than spending them on FFN layers.
According to [2], on 2-bit quantization:
- quantizing expert FFN layers does not seriously impact model quality
- quantizing cross-attention has some impact
- quantizing self-attention has a large impact
- quantizing dense FFN layers has a very significant impact
- [1] Pingzhi Li, Xiaolong Jin, Yu Cheng, Tianlong Chen. Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark (2024). https://arxiv.org/pdf/2406.08155v1
- [2] Young Jin Kim, Raffy Fahim, Hany Hassan Awadalla. Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness (2023). https://arxiv.org/pdf/2310.02410
Related resources
vLLM can load this model. I also passed the question to DeepWiki, which gave me a well-rounded and sourced answer; the AI answer is below.
The quantization config is failing because SGLang's compressed-tensors implementation doesn't properly support the "mixed-precision" format. Here's what's happening:
Root Cause
Your config has a top-level "format": "mixed-precision" with two config groups that each have their own format:
group_0:"format": "float-quantized"(FP8 W8A8 for attention layers)group_1:"format": "pack-quantized"(INT4 W4A16 for MLP layers)
However, when SGLang parses the config, it only checks the top-level format to decide whether to parse input_activations, not the per-group format: [Citation 1]
At line 269, it checks is_activation_quantization_format(quant_format), where quant_format is the top-level "mixed-precision". Since "mixed-precision" is not in the list of supported activation quantization formats [Citation 2], the check returns False, and input_activations are never parsed for any target, even though group_0 explicitly defines them.
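The mismatch is easy to reproduce with the helper quoted in the citations below:

from sglang.srt.layers.quantization.compressed_tensors.utils import (
    is_activation_quantization_format,
)

# The top-level format is rejected, while the per-group formats behave as intended.
print(is_activation_quantization_format("mixed-precision"))  # False -> activations never parsed
print(is_activation_quantization_format("float-quantized"))  # True  -> what group_0 needs
print(is_activation_quantization_format("pack-quantized"))   # False -> fine, W4A16 has no activation quant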
Why It Fails
Later, when trying to determine the quantization scheme for attention layers (qkv_proj), the code checks the format again: [Citation 3]
Since self.quant_format is "mixed-precision" (not an activation quantization format), the code skips all FP8 scheme checks (lines 469-519) and falls through to line 521, raising the NotImplementedError.
Notes
- The per-group "format" field is available in each quant_config but is ignored by the current implementation
- The code should use quant_config.get("format") at line 269 instead of the top-level quant_format to properly handle mixed-precision configs (see the sketch below)
- Similarly, the per-group format should be stored in target_scheme_map and used in _get_scheme_from_parts instead of self.quant_format
- This appears to be a limitation/bug in SGLang's implementation - there are no existing tests or examples using mixed-precision format in the codebase
- The individual quantization schemes (FP8 W8A8 for group_0, INT4 W4A16 for group_1) are supported by SGLang, but not when combined in a mixed-precision config
As a workaround, you may need to wait for SGLang to add proper mixed-precision support, or you could try separating the model into single-format quantization (though this would lose the mixed-precision benefits).
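A rough, untested sketch of the change described in the notes above, written as a fragment of the parsing loop from Citation 1 (the w8a16fp8 assert branch is omitted for brevity):

config_groups = config.get("config_groups", dict())
for _, quant_config in config_groups.items():
    # Prefer the per-group format and fall back to the top-level one.
    group_format = quant_config.get("format", quant_format)
    for target in quant_config.get("targets"):
        target_scheme_map[target] = {
            "weights": QuantizationArgs.model_validate(quant_config.get("weights")),
            "input_activations": None,
            # Carry the per-group format so _get_scheme_from_parts can use it
            # instead of self.quant_format.
            "format": group_format,
        }
        if is_activation_quantization_format(group_format):
            input_activations = quant_config.get("input_activations")
            if input_activations:
                target_scheme_map[target]["input_activations"] = (
                    QuantizationArgs.model_validate(input_activations)
                )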
Citations
File: python/sglang/srt/layers/quantization/compressed_tensors/compressed_tensors.py (L259-285)
config_groups = config.get("config_groups", dict())
for _, quant_config in config_groups.items():
    targets = quant_config.get("targets")
    for target in targets:
        target_scheme_map[target] = {}
        target_scheme_map[target]["weights"] = QuantizationArgs.model_validate(
            quant_config.get("weights")
        )

        target_scheme_map[target]["input_activations"] = None
        if is_activation_quantization_format(quant_format):
            input_activations = quant_config.get("input_activations")
            # The only case where we have activation quant supported
            # but no input_activations provided in the config
            # should be w8a16fp8 w8a16fp8 can also run for cases where
            # there is an input_quant but it is ignored
            if not input_activations:
                assert (
                    target_scheme_map[target]["weights"].type
                    == QuantizationType.FLOAT
                )
            else:
                target_scheme_map[target]["input_activations"] = (
                    QuantizationArgs.model_validate(  # noqa: E501
                        quant_config.get("input_activations")
                    )
                )

File: python/sglang/srt/layers/quantization/compressed_tensors/compressed_tensors.py (L468-521)
if is_activation_quantization_format(self.quant_format):
    if self._is_fp4a4_nvfp4(weight_quant, input_quant):
        is_fp4a4_nvfp4_supported = self._check_scheme_supported(
            CompressedTensorsW4A4Fp4.get_min_capability(), error=False
        )
        if is_fp4a4_nvfp4_supported:
            return CompressedTensorsW4A4Fp4()
        else:
            raise NotImplementedError(
                "Current platform does not support w4a4 nvfp4 quantization."
            )

    if self._is_fp8_w8a8(weight_quant, input_quant):
        is_fp8_w8a8_supported = self._check_scheme_supported(
            CompressedTensorsW8A8Fp8.get_min_capability(), error=False
        )
        if is_fp8_w8a8_supported:
            return CompressedTensorsW8A8Fp8(
                weight_quant=weight_quant,
                is_static_input_scheme=(
                    input_quant and not input_quant.dynamic
                ),
            )
        else:
            # note: input_quant will be present for converted models;
            # will be ignored during inference post loading
            return CompressedTensorsW8A16Fp8(
                strategy=weight_quant.strategy,
                is_static_input_scheme=not input_quant.dynamic,
            )

    # note: input_quant can be None
    if self._is_fp8_w8a16(weight_quant, input_quant):
        is_static_input_scheme = input_quant and not input_quant.dynamic
        return CompressedTensorsW8A16Fp8(
            strategy=weight_quant.strategy,
            is_static_input_scheme=is_static_input_scheme,
        )

    if self._is_static_tensor_w8a8(weight_quant, input_quant):
        return CompressedTensorsW8A8Int8(
            strategy=weight_quant.strategy,
            is_static_input_scheme=True,
            input_symmetric=input_quant.symmetric,
        )

    if self._is_dynamic_token_w8a8(weight_quant, input_quant):
        return CompressedTensorsW8A8Int8(
            strategy=weight_quant.strategy,
            is_static_input_scheme=False,
            input_symmetric=input_quant.symmetric,
        )

raise NotImplementedError("No compressed-tensors compatible scheme was found.")

File: python/sglang/srt/layers/quantization/compressed_tensors/utils.py (L12-19)
def is_activation_quantization_format(format: str) -> bool:
    _ACTIVATION_QUANTIZATION_FORMATS = [
        CompressionFormat.naive_quantized.value,
        CompressionFormat.int_quantized.value,
        CompressionFormat.float_quantized.value,
        CompressionFormat.nvfp4_pack_quantized.value,
    ]
    return format in _ACTIVATION_QUANTIZATION_FORMATS