
[Feature] Support Mixed-Precision models #16276

@mratsim

Description


Checklist

Motivation

It's well documented that different layers of a model are affected differently by quantization.
Compressed-tensors allows non-uniform quantization with recipes available here: https://github.com/vllm-project/llm-compressor/tree/0.9.0/examples/quantization_non_uniform

I tried this with a Qwen3-4B model using the quantization config below; each scheme is supported by SGLang when used on its own:

quant_stage:
  quant_modifiers:
    QuantizationModifier:
      targets: r"re:.*self_attn\.(k|q|o|v)_proj.*"
      scheme: FP8_BLOCK
    AWQModifier:
      mlp_experts_projections:
        group_0:
          targets: ["re:.*(down|gate|up)_proj.*"]
          weights:
            num_bits: 4
            type: int
            symmetric: true
            group_size: 32
            strategy: group
            dynamic: false
            # actorder: group
            observer: memoryless_minmax

      # Layers to exclude from quantization
      ignore:
        - "lm_head"
      # Scaling options
      duo_scaling: true

      mappings:
        - smooth_layer: re:.*post_attention_layernorm$
          balance_layers: ["re:.*gate_proj$", "re:.*up_proj$"]
        - smooth_layer: re:.*up_proj$
          balance_layers: ["re:.*down_proj$"]

Research

According to [1], spending more bits on attention layers yields larger gains than spending them on FFN layers.
According to [2] on 2-bit quantization:

  • quantizing expert FFN layers does not seriously impact model quality
  • quantizing cross-attention has some impact
  • quantizing self-attention has a large impact
  • quantizing dense FFN has a very significant impact
  1. Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark (2024)
    Pingzhi Li, Xiaolong Jin, Yu Cheng, Tianlong Chen
    https://arxiv.org/pdf/2406.08155v1
  2. Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness (2023)
    Young Jin Kim, Raffy Fahim, Hany Hassan Awadalla
    https://arxiv.org/pdf/2310.02410

Related resources

vLLM can load this model. I also passed the question to DeepWiki, which gave a well-rounded and well-sourced answer; the AI answer is reproduced below.


The quantization config is failing because SGLang's compressed-tensors implementation doesn't properly support the "mixed-precision" format. Here's what's happening:

Root Cause

Your config has a top-level "format": "mixed-precision" with two config groups that each have their own format:

  • group_0: "format": "float-quantized" (FP8 W8A8 for attention layers)
  • group_1: "format": "pack-quantized" (INT4 W4A16 for MLP layers)
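
Concretely, the quantization_config that llm-compressor writes into the checkpoint's config.json would look roughly like the sketch below. Field values are illustrative, reconstructed from the recipe and the two groups above; only the fields relevant to the format check are shown.

quantization_config = {
    "quant_method": "compressed-tensors",
    "format": "mixed-precision",        # top-level format, the only one SGLang checks today
    "config_groups": {
        "group_0": {                    # FP8 W8A8 for the attention projections
            "format": "float-quantized",
            "targets": ["re:.*self_attn\\.(k|q|o|v)_proj.*"],
            "weights": {"num_bits": 8, "type": "float", "strategy": "block"},
            "input_activations": {"num_bits": 8, "type": "float", "dynamic": True},
        },
        "group_1": {                    # INT4 W4A16 for the MLP projections
            "format": "pack-quantized",
            "targets": ["re:.*(down|gate|up)_proj.*"],
            "weights": {"num_bits": 4, "type": "int", "group_size": 32,
                        "strategy": "group", "symmetric": True},
            "input_activations": None,
        },
    },
    "ignore": ["lm_head"],
}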

However, when SGLang parses the config, it only checks the top-level format to decide whether to parse input_activations, not the per-group format: [Citation 1]

At line 269, it checks is_activation_quantization_format(quant_format) where quant_format is the top-level "mixed-precision". Since "mixed-precision" is not in the list of supported activation quantization formats: [Citation 2]

The check returns False, and input_activations are never parsed for any target, even though group_0 explicitly defines them.
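
To make the failure concrete, the membership test reduces to the standalone snippet below; the format strings are assumed to mirror the CompressionFormat enum values cited further down, and this is not SGLang code, just the check it performs.

# Formats SGLang treats as activation-quantized (see the utils.py citation below)
_ACTIVATION_QUANTIZATION_FORMATS = [
    "naive-quantized",
    "int-quantized",
    "float-quantized",
    "nvfp4-pack-quantized",
]

# Top-level format of a mixed-precision checkpoint -> input_activations are never parsed
print("mixed-precision" in _ACTIVATION_QUANTIZATION_FORMATS)   # False
# Per-group format of group_0, which is what should drive the decision
print("float-quantized" in _ACTIVATION_QUANTIZATION_FORMATS)   # True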

Why It Fails

Later, when trying to determine the quantization scheme for attention layers (qkv_proj), the code checks the format again: [Citation 3]

Since self.quant_format is "mixed-precision" (not an activation quantization format), the code skips the entire block of scheme checks (lines 469-519) and falls through to line 521, raising the NotImplementedError.

Notes

  • The per-group "format" field is available in each quant_config but is ignored by the current implementation
  • The code should use quant_config.get("format") at line 269 instead of the top-level quant_format to properly handle mixed-precision configs (see the sketch after this list)
  • Similarly, the per-group format should be stored in target_scheme_map and used in _get_scheme_from_parts instead of self.quant_format
  • This appears to be a limitation/bug in SGLang's implementation; there are no existing tests or examples using the mixed-precision format in the codebase
  • The individual quantization schemes (FP8 W8A8 for group_0, INT4 W4A16 for group_1) are supported by SGLang, but not when combined in a mixed-precision config
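
A standalone sketch of that direction (illustrative only; resolve_group_format is a hypothetical helper, not an SGLang function):

from typing import Any, Dict

def resolve_group_format(group_config: Dict[str, Any], top_level_format: str) -> str:
    # Prefer the per-group "format"; with mixed-precision checkpoints the top-level
    # value is "mixed-precision" and never matches an activation quantization
    # format, so only the per-group value identifies the actual scheme.
    return group_config.get("format") or top_level_format

group_0 = {"format": "float-quantized"}   # FP8 W8A8 attention group
group_1 = {"format": "pack-quantized"}    # INT4 W4A16 MLP group

print(resolve_group_format(group_0, "mixed-precision"))  # -> "float-quantized"
print(resolve_group_format(group_1, "mixed-precision"))  # -> "pack-quantized"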

As a workaround, you may need to wait for SGLang to add proper mixed-precision support, or fall back to a single uniform quantization format for the whole model (though this loses the mixed-precision benefits).

Citations

File: python/sglang/srt/layers/quantization/compressed_tensors/compressed_tensors.py (L259-285)

        config_groups = config.get("config_groups", dict())
        for _, quant_config in config_groups.items():
            targets = quant_config.get("targets")
            for target in targets:
                target_scheme_map[target] = {}
                target_scheme_map[target]["weights"] = QuantizationArgs.model_validate(
                    quant_config.get("weights")
                )

                target_scheme_map[target]["input_activations"] = None
                if is_activation_quantization_format(quant_format):
                    input_activations = quant_config.get("input_activations")
                    # The only case where we have activation quant supported
                    # but no input_activations provided in the config
                    # should be w8a16fp8 w8a16fp8 can also run for cases where
                    # there is an input_quant but it is ignored
                    if not input_activations:
                        assert (
                            target_scheme_map[target]["weights"].type
                            == QuantizationType.FLOAT
                        )
                    else:
                        target_scheme_map[target]["input_activations"] = (
                            QuantizationArgs.model_validate(  # noqa: E501
                                quant_config.get("input_activations")
                            )
                        )

File: python/sglang/srt/layers/quantization/compressed_tensors/compressed_tensors.py (L468-521)

        if is_activation_quantization_format(self.quant_format):
            if self._is_fp4a4_nvfp4(weight_quant, input_quant):
                is_fp4a4_nvfp4_supported = self._check_scheme_supported(
                    CompressedTensorsW4A4Fp4.get_min_capability(), error=False
                )
                if is_fp4a4_nvfp4_supported:
                    return CompressedTensorsW4A4Fp4()
                else:
                    raise NotImplementedError(
                        "Current platform does not support w4a4 nvfp4 quantization."
                    )

            if self._is_fp8_w8a8(weight_quant, input_quant):
                is_fp8_w8a8_supported = self._check_scheme_supported(
                    CompressedTensorsW8A8Fp8.get_min_capability(), error=False
                )
                if is_fp8_w8a8_supported:
                    return CompressedTensorsW8A8Fp8(
                        weight_quant=weight_quant,
                        is_static_input_scheme=(
                            input_quant and not input_quant.dynamic
                        ),
                    )
                else:
                    # note: input_quant will be present for converted models;
                    # will be ignored during inference post loading
                    return CompressedTensorsW8A16Fp8(
                        strategy=weight_quant.strategy,
                        is_static_input_scheme=not input_quant.dynamic,
                    )

            # note: input_quant can be None
            if self._is_fp8_w8a16(weight_quant, input_quant):
                is_static_input_scheme = input_quant and not input_quant.dynamic
                return CompressedTensorsW8A16Fp8(
                    strategy=weight_quant.strategy,
                    is_static_input_scheme=is_static_input_scheme,
                )

            if self._is_static_tensor_w8a8(weight_quant, input_quant):
                return CompressedTensorsW8A8Int8(
                    strategy=weight_quant.strategy,
                    is_static_input_scheme=True,
                    input_symmetric=input_quant.symmetric,
                )

            if self._is_dynamic_token_w8a8(weight_quant, input_quant):
                return CompressedTensorsW8A8Int8(
                    strategy=weight_quant.strategy,
                    is_static_input_scheme=False,
                    input_symmetric=input_quant.symmetric,
                )

        raise NotImplementedError("No compressed-tensors compatible scheme was found.")

File: python/sglang/srt/layers/quantization/compressed_tensors/utils.py (L12-19)

def is_activation_quantization_format(format: str) -> bool:
    _ACTIVATION_QUANTIZATION_FORMATS = [
        CompressionFormat.naive_quantized.value,
        CompressionFormat.int_quantized.value,
        CompressionFormat.float_quantized.value,
        CompressionFormat.nvfp4_pack_quantized.value,
    ]
    return format in _ACTIVATION_QUANTIZATION_FORMATS
