Checklist
- If this is not a feature request but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
- Please use English. Otherwise, it will be closed.
Motivation
It is well documented that quantization affects different layers of a model to different degrees, depending on the scheme used.
Compressed-tensors allows non-uniform quantization with recipes available here: https://github.com/vllm-project/llm-compressor/tree/0.9.0/examples/quantization_non_uniform
I tried this with a Qwen3-4B model and the following quantization config; each scheme is supported by SGLang individually:
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      targets: r"re:.*self_attn\.(k|q|o|v)_proj.*"
      scheme: FP8_BLOCK
    AWQModifier:
      mlp_experts_projections:
        group_0:
          targets: ["re:.*(down|gate|up)_proj.*"]
          weights:
            num_bits: 4
            type: int
            symmetric: true
            group_size: 32
            strategy: group
            dynamic: false
            # actorder: group
            observer: memoryless_minmax
      # Layers to exclude from quantization
      ignore:
        - "lm_head"
      # Scaling options
      duo_scaling: true
      mappings:
        - smooth_layer: re:.*post_attention_layernorm$
          balance_layers: ["re:.*gate_proj$", "re:.*up_proj$"]
        - smooth_layer: re:.*up_proj$
          balance_layers: ["re:.*down_proj$"]
Research
According to [1], spending more bits on attention layers yields a larger gain than spending them on FFN layers.
According to [2], on 2-bit quantization:
- quantizing expert FFN layers does not seriously impact model quality
- quantizing cross-attention has some impact
- quantizing self-attention has a large impact
- quantizing dense FFN layers has a very significant impact
- [1] Pingzhi Li, Xiaolong Jin, Yu Cheng, Tianlong Chen. Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark (2024). https://arxiv.org/pdf/2406.08155v1
- [2] Young Jin Kim, Raffy Fahim, Hany Hassan Awadalla. Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness (2023). https://arxiv.org/pdf/2310.02410
Related resources
vLLM can load this model. I also passed the question to DeepWiki, which gave me a well-rounded and sourced answer; the AI answer is below.
The quantization config is failing because SGLang's compressed-tensors implementation doesn't properly support the "mixed-precision" format. Here's what's happening:
Root Cause
Your config has a top-level "format": "mixed-precision" with two config groups that each have their own format:
group_0:"format": "float-quantized"(FP8 W8A8 for attention layers)group_1:"format": "pack-quantized"(INT4 W4A16 for MLP layers)
However, when SGLang parses the config, it only checks the top-level format to decide whether to parse input_activations, not the per-group format: [Citation 1]
At line 269, it checks is_activation_quantization_format(quant_format), where quant_format is the top-level "mixed-precision". Since "mixed-precision" is not in the list of supported activation quantization formats [Citation 2], the check returns False, and input_activations are never parsed for any target, even though group_0 explicitly defines them.
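The mismatch is easy to reproduce with the helper quoted in the citations below:

from sglang.srt.layers.quantization.compressed_tensors.utils import (
    is_activation_quantization_format,
)

# The top-level format is rejected, while the per-group formats behave as intended.
print(is_activation_quantization_format("mixed-precision"))  # False -> activations never parsed
print(is_activation_quantization_format("float-quantized"))  # True  -> what group_0 needs
print(is_activation_quantization_format("pack-quantized"))   # False -> fine, W4A16 has no activation quant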
Why It Fails
Later, when trying to determine the quantization scheme for attention layers (qkv_proj), the code checks the format again: [Citation 3]
Since self.quant_format is "mixed-precision" (not an activation quantization format), the code skips all FP8 scheme checks (lines 469-519) and falls through to line 521, raising the NotImplementedError.
Notes
- The per-group "format" field is available in each quant_config but is ignored by the current implementation
- The code should use quant_config.get("format") at line 269 instead of the top-level quant_format to properly handle mixed-precision configs (see the sketch below)
- Similarly, the per-group format should be stored in target_scheme_map and used in _get_scheme_from_parts instead of self.quant_format
- This appears to be a limitation/bug in SGLang's implementation - there are no existing tests or examples using mixed-precision format in the codebase
- The individual quantization schemes (FP8 W8A8 for group_0, INT4 W4A16 for group_1) are supported by SGLang, but not when combined in a mixed-precision config
As a workaround, you may need to wait for SGLang to add proper mixed-precision support, or you could try separating the model into single-format quantization (though this would lose the mixed-precision benefits).
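A rough, untested sketch of the change described in the notes above, written as a fragment of the parsing loop from Citation 1 (the w8a16fp8 assert branch is omitted for brevity):

config_groups = config.get("config_groups", dict())
for _, quant_config in config_groups.items():
    # Prefer the per-group format and fall back to the top-level one.
    group_format = quant_config.get("format", quant_format)
    for target in quant_config.get("targets"):
        target_scheme_map[target] = {
            "weights": QuantizationArgs.model_validate(quant_config.get("weights")),
            "input_activations": None,
            # Carry the per-group format so _get_scheme_from_parts can use it
            # instead of self.quant_format.
            "format": group_format,
        }
        if is_activation_quantization_format(group_format):
            input_activations = quant_config.get("input_activations")
            if input_activations:
                target_scheme_map[target]["input_activations"] = (
                    QuantizationArgs.model_validate(input_activations)
                )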
Citations
File: python/sglang/srt/layers/quantization/compressed_tensors/compressed_tensors.py (L259-285)
config_groups = config.get("config_groups", dict())
for _, quant_config in config_groups.items():
    targets = quant_config.get("targets")
    for target in targets:
        target_scheme_map[target] = {}
        target_scheme_map[target]["weights"] = QuantizationArgs.model_validate(
            quant_config.get("weights")
        )

        target_scheme_map[target]["input_activations"] = None
        if is_activation_quantization_format(quant_format):
            input_activations = quant_config.get("input_activations")
            # The only case where we have activation quant supported
            # but no input_activations provided in the config
            # should be w8a16fp8 w8a16fp8 can also run for cases where
            # there is an input_quant but it is ignored
            if not input_activations:
                assert (
                    target_scheme_map[target]["weights"].type
                    == QuantizationType.FLOAT
                )
            else:
                target_scheme_map[target]["input_activations"] = (
                    QuantizationArgs.model_validate(  # noqa: E501
                        quant_config.get("input_activations")
                    )
                )

File: python/sglang/srt/layers/quantization/compressed_tensors/compressed_tensors.py (L468-521)
if is_activation_quantization_format(self.quant_format):
    if self._is_fp4a4_nvfp4(weight_quant, input_quant):
        is_fp4a4_nvfp4_supported = self._check_scheme_supported(
            CompressedTensorsW4A4Fp4.get_min_capability(), error=False
        )
        if is_fp4a4_nvfp4_supported:
            return CompressedTensorsW4A4Fp4()
        else:
            raise NotImplementedError(
                "Current platform does not support w4a4 nvfp4 quantization."
            )

    if self._is_fp8_w8a8(weight_quant, input_quant):
        is_fp8_w8a8_supported = self._check_scheme_supported(
            CompressedTensorsW8A8Fp8.get_min_capability(), error=False
        )
        if is_fp8_w8a8_supported:
            return CompressedTensorsW8A8Fp8(
                weight_quant=weight_quant,
                is_static_input_scheme=(
                    input_quant and not input_quant.dynamic
                ),
            )
        else:
            # note: input_quant will be present for converted models;
            # will be ignored during inference post loading
            return CompressedTensorsW8A16Fp8(
                strategy=weight_quant.strategy,
                is_static_input_scheme=not input_quant.dynamic,
            )

    # note: input_quant can be None
    if self._is_fp8_w8a16(weight_quant, input_quant):
        is_static_input_scheme = input_quant and not input_quant.dynamic
        return CompressedTensorsW8A16Fp8(
            strategy=weight_quant.strategy,
            is_static_input_scheme=is_static_input_scheme,
        )

    if self._is_static_tensor_w8a8(weight_quant, input_quant):
        return CompressedTensorsW8A8Int8(
            strategy=weight_quant.strategy,
            is_static_input_scheme=True,
            input_symmetric=input_quant.symmetric,
        )

    if self._is_dynamic_token_w8a8(weight_quant, input_quant):
        return CompressedTensorsW8A8Int8(
            strategy=weight_quant.strategy,
            is_static_input_scheme=False,
            input_symmetric=input_quant.symmetric,
        )

raise NotImplementedError("No compressed-tensors compatible scheme was found.")

File: python/sglang/srt/layers/quantization/compressed_tensors/utils.py (L12-19)
def is_activation_quantization_format(format: str) -> bool:
    _ACTIVATION_QUANTIZATION_FORMATS = [
        CompressionFormat.naive_quantized.value,
        CompressionFormat.int_quantized.value,
        CompressionFormat.float_quantized.value,
        CompressionFormat.nvfp4_pack_quantized.value,
    ]
    return format in _ACTIVATION_QUANTIZATION_FORMATS