
Accuracy issue associated with FineGrainedFP8 #42831

@sunghyuckhong

Description


System Info

Hello,

I am writing to report an issue I found while evaluating the accuracy of a model quantized with FineGrainedFP8 using lm-eval: there are significant accuracy discrepancies between running the quantized model with the HF backend and with the vLLM backend.

[Image: lm-eval accuracy comparison of the FineGrainedFP8-quantized model, HF backend vs. vLLM backend]

Models Used: Qwen/Qwen3-8B-FP8 (see the reproduction command below)

Interestingly, the difference becomes much more pronounced in tasks requiring multi-token generation (e.g., HumanEval, GSM8K) than in tasks that do not (e.g., MMLU). Given that the FineGrainedFP8-quantized model produces results with the vLLM backend that closely match the expected accuracy, while the HF backend does not, I suspect there could be an issue with FP8Linear (e.g., fp8_matmul_triton_kernel).
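A minimal sketch for localizing the error to FP8Linear is below. It feeds the same random activation through one quantized linear layer of the FP8 checkpoint and the corresponding bf16 layer of the base model, then compares the outputs. The bf16 reference checkpoint (Qwen/Qwen3-8B) and the choice of gate_proj in the first layer are illustrative assumptions, not part of my original setup:

  import torch
  from transformers import AutoModelForCausalLM

  # FP8 checkpoint (the one from the reproduction command) and a bf16 reference.
  fp8_model = AutoModelForCausalLM.from_pretrained(
      "Qwen/Qwen3-8B-FP8", torch_dtype=torch.bfloat16, device_map="cuda"
  )
  ref_model = AutoModelForCausalLM.from_pretrained(
      "Qwen/Qwen3-8B", torch_dtype=torch.bfloat16, device_map="cuda"
  )

  # Pick one matching projection; any quantized linear layer can be checked the same way.
  fp8_layer = fp8_model.model.layers[0].mlp.gate_proj   # illustrative layer choice
  ref_layer = ref_model.model.layers[0].mlp.gate_proj

  # Run the same random activation through both layers.
  x = torch.randn(
      4, 128, fp8_model.config.hidden_size, device="cuda", dtype=torch.bfloat16
  )
  with torch.no_grad():
      y_fp8 = fp8_layer(x).float()
      y_ref = ref_layer(x).float()

  # A large relative error here would point at the FP8 matmul path itself,
  # rather than at generation/sampling differences between the two backends.
  rel_err = (y_fp8 - y_ref).norm() / y_ref.norm()
  print(f"relative error vs. bf16 reference: {rel_err.item():.4f}")

If the per-layer error turns out to be small, the discrepancy is more likely accumulated across layers or specific to the decoding path, which would also be consistent with the larger gap on multi-token generation tasks.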

I would appreciate it if you could take a look.

Thanks,
Sung Hyuck Hong

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Example command:

 lm_eval \
  --model hf \
  --model_args pretrained="Qwen/Qwen3-8B-FP8" \
  --tasks humaneval \
  --batch_size auto \
  --confirm_run_unsafe_code
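
For comparison, the equivalent vLLM-backend run can be launched as follows (a sketch; extra model_args such as tensor_parallel_size or gpu_memory_utilization may be needed depending on the setup):

 lm_eval \
  --model vllm \
  --model_args pretrained="Qwen/Qwen3-8B-FP8" \
  --tasks humaneval \
  --batch_size auto \
  --confirm_run_unsafe_code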

Expected behavior

Accuracy results similar to those in the attached picture should be obtained.
