[Feature]: Support Re‑Quantizing Per‑Tensor FP8 Models to Other Dtypes #1337

@yiliu30

Description

Feature Description

As the title says: support taking models whose weights are already quantized to FP8 with per‑tensor granularity and re‑quantizing them to other dtypes (e.g. W4A16).

Motivation and Use Case

  • Input:
    A model whose weights have already been quantized to FP8 with per‑tensor granularity, e.g. https://huggingface.co/mistralai/Devstral-2-123B-Instruct-2512
  • Output:
    The same model re‑quantized to another dtype, e.g. W4A16.

  • The relevant implementation starting point is here (a per‑tensor sketch follows the snippet):

    if layer.__class__.__name__ == "CompressedLinear":
        dq_weight = layer.compressor.decompress_module(layer)
    else:
        weight_scale = layer.weight_scale if hasattr(layer, "weight_scale") else layer.weight_scale_inv
        data_type = getattr(layer, "data_type", None)
        dq_weight = dequant_block_fp8_weight(layer.weight, weight_scale, layer.block_size, data_type=data_type)
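
    For the per‑tensor FP8 models this issue targets, the dequantization step is simpler than the block‑wise path above, since the whole weight tensor shares a single scale. Below is a minimal sketch of the intended flow (dequantize per‑tensor FP8, then re‑quantize the weights to 4‑bit with group‑wise scales for W4A16); the helpers dequant_per_tensor_fp8_weight and quant_w4a16_sym are hypothetical names used only for illustration, not existing APIs.

    import torch

    def dequant_per_tensor_fp8_weight(weight: torch.Tensor, weight_scale: torch.Tensor) -> torch.Tensor:
        # Hypothetical helper: a per-tensor FP8 weight shares one scalar scale,
        # so dequantization is an upcast followed by a single multiply.
        return weight.to(torch.bfloat16) * weight_scale.to(torch.bfloat16)

    def quant_w4a16_sym(weight: torch.Tensor, group_size: int = 128):
        # Hypothetical helper: symmetric 4-bit group-wise re-quantization
        # (the weight half of W4A16). Returns int4 values stored in int8
        # plus the per-group scales.
        out_features, in_features = weight.shape
        w = weight.float().reshape(out_features, in_features // group_size, group_size)
        scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0  # int4 symmetric range is [-8, 7]
        qweight = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
        return qweight.reshape(out_features, in_features), scale.squeeze(-1)

    # Toy example: per-tensor FP8 in, W4A16-style tensors out.
    fp8_weight = torch.randn(512, 512).to(torch.float8_e4m3fn)
    weight_scale = torch.tensor(0.05)
    dq_weight = dequant_per_tensor_fp8_weight(fp8_weight, weight_scale)
    qweight, scales = quant_w4a16_sym(dq_weight)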

Alternatives Considered

No response

Definition of Done

  • Allow taking mistralai/Devstral-2-123B-Instruct-2512 as input and producing a W4A16 model as output.
  • Unit tests (UT).
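
To make the first acceptance criterion concrete, end‑to‑end usage could look roughly like the sketch below. The AutoRound constructor and quantize_and_save call mirror the existing W4A16 flow, but the exact entry point and arguments for per‑tensor FP8 inputs are assumptions here and are precisely what this issue needs to define (including how the FP8 checkpoint is loaded and dequantized before tuning).

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from auto_round import AutoRound

    # Per-tensor FP8 checkpoint to re-quantize. How the FP8 weights are loaded
    # and dequantized internally is an open question for this feature.
    model_name = "mistralai/Devstral-2-123B-Instruct-2512"
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Assumed flow: dequantize the per-tensor FP8 weights, then re-quantize to
    # W4A16 (4-bit weights with group-wise scales, 16-bit activations).
    autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
    autoround.quantize_and_save("Devstral-2-123B-Instruct-2512-w4a16")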

Additional Context

No response
