[RFC]: Consolidate Intel Quantization Toolkit Integration in vLLM #30663

@yiliu30

Authors: Intel Neural Compressor Team, Intel vLLM Team
Related: RFC: vLLM Quantization Cleanup

Motivation.

In alignment with vLLM's quantization consolidation effort, we propose to streamline Intel's quantization support, currently fragmented across three implementations:

| File | Scope | Capabilities |
|------|-------|--------------|
| `inc.py` | Intel Gaudi (HPU) | Online W8A8_FP8 quantization |
| `auto_round.py` | Intel CPU/GPU | WnA16 (AutoRound models) |
| `ipex_quant.py` | Intel CPU/GPU | W4A16, W4A8, W8A16_FP8 via IPEX backend |
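For context, each of these backends is reached today through a different value of vLLM's `quantization` argument. A minimal sketch of the current, fragmented entry points (method strings reflect the current registry names and may vary across vLLM versions; model names are placeholders):

```python
from vllm import LLM

# Three Intel backends, three different method strings.
# Model names are placeholders, not real checkpoints.
llm_hpu = LLM(model="<fp8-calibrated-model>", quantization="inc")         # Gaudi W8A8_FP8
llm_ar = LLM(model="<autoround-int4-model>", quantization="auto-round")   # CPU/GPU WnA16
llm_ipex = LLM(model="<w4a16-model>", quantization="ipex")                # CPU/GPU via IPEX
```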

This fragmentation creates maintenance overhead and an inconsistent user experience. Consolidation will:

  • Reduce code duplication and maintenance burden
  • Provide a unified entry point for Intel quantization
  • Enable future formats (MXFP4/MXFP8/NVFP4) and advanced mixed-precision recipes

Proposed Change.

The overall status and proposal are below:

*(Diagram: the three existing Intel quantization backends and their proposed consolidation into inc.py, compressed_tensors, and the vLLM-Gaudi plugin.)*

1. Consolidate Intel CPU/GPU Quantization → inc.py

  • Merge auto_round.py into inc.py as the unified Intel quantization backend
  • Extend inc.py to support:
    • WnA16 inference (from AutoRound)
    • Future Intel formats (MXFP4/MXFP8/NVFP4, advanced mixed-precision)
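To make the intent concrete, here is a minimal, self-contained sketch of the single-dispatch idea; all class names below are hypothetical illustrations, not the actual inc.py interfaces:

```python
# Hypothetical sketch: one config object routes every Intel scheme
# through a single backend instead of three parallel files.
class INCWnA16Method:          # would wrap the AutoRound WnA16 kernels
    def __init__(self, cfg): self.cfg = cfg

class INCFp8Method:            # would wrap the W8A8_FP8 path
    def __init__(self, cfg): self.cfg = cfg

class UnifiedINCConfig:
    """Single entry point; new formats (MXFP4/MXFP8/NVFP4) add one entry."""
    _METHODS = {"wna16": INCWnA16Method, "w8a8_fp8": INCFp8Method}

    def __init__(self, scheme: str):
        if scheme not in self._METHODS:
            raise ValueError(f"unsupported INC scheme: {scheme}")
        self.scheme = scheme

    def get_quant_method(self):
        # One dispatch point instead of three separate backends.
        return self._METHODS[self.scheme](self)

cfg = UnifiedINCConfig("wna16")
print(type(cfg.get_quant_method()).__name__)  # -> INCWnA16Method
```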

2. Migrate CPU/GPU Support → compressed_tensors

  • Move ipex_quant.py kernel support to the compressed_tensors backend, using vllm-xpu-kernel instead of IPEX
  • Ensure Intel CPU/GPU compatibility with standard compressed tensor formats
  • Deprecate ipex_quant.py after successful migration
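Assuming the migration lands as described, serving a standard compressed-tensors checkpoint on Intel CPU/GPU would look the same as on any other vLLM backend. vLLM normally auto-detects the method from the checkpoint's quantization config, so the explicit argument is optional (the model name below is a placeholder):

```python
from vllm import LLM

# Placeholder model name; any standard compressed-tensors checkpoint
# should load the same way once the Intel kernels are wired in.
llm = LLM(
    model="<org>/<model>-W4A16-compressed-tensors",
    quantization="compressed-tensors",  # usually auto-detected
)
print(llm.generate("Hello from an Intel GPU!")[0].outputs[0].text)
```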

3. Finalize Gaudi Migration → vLLM-Gaudi Plugin

  • Verify that W8A8_FP8 capabilities are fully available in the vLLM-Gaudi plugin
  • Update documentation to reference the plugin
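For reference, the existing HPU FP8 flow drives quantization through an INC JSON recipe; something like the following (mirroring the documented HPU flow; the path and model name are placeholders, and exact flags may differ in the vLLM-Gaudi plugin):

```python
import os
from vllm import LLM

# INC measurement/quantization recipe, as in the documented HPU flow.
os.environ["QUANT_CONFIG"] = "/path/to/inc_quant_config.json"  # placeholder path

llm = LLM(
    model="<fp8-calibrated-model>",  # placeholder
    quantization="inc",
    kv_cache_dtype="fp8_inc",
)
```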

Backward Compatibility

This consolidation ensures seamless migration for existing users:

  • Compressed Tensors: Standard compressed tensor models will be deployable on Intel CPU/GPU through the enhanced backend, providing broad compatibility with the vLLM ecosystem.
  • AutoRound Models: Existing AutoRound quantized models on HuggingFace will remain backward compatible via the consolidated inc.py.

Both the compressed tensor format and INC support are essential to serve different user needs: compressed tensors for standardization and interoperability, and INC for Intel-optimized quantization recipes.
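In practice, backward compatibility means an existing AutoRound checkpoint keeps loading with no user-side changes; a sketch (the model name is a placeholder):

```python
from vllm import LLM

# Before and after the merge, this call should behave identically:
# the consolidated inc.py would detect the AutoRound quantization
# config and route it to the WnA16 path automatically.
llm = LLM(model="<hf-org>/<model>-int4-auto-round")  # placeholder name
```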

Estimated Timeline

  • Phase 1 (target: Jan 2026): merge auto_round.py into inc.py
  • Phase 2 (target: Q1 2026): migrate ipex_quant.py to compressed_tensors

Feedback Period.

Please comment on the proposal or suggest alternatives. If there are no strong objections, we will proceed with the timeline above and submit implementation PRs. Thanks!

CC List.

cc @hshen14 @thuang6 @wenhuach21 @jikunshang @kzawora-intel @xuechendi

Any Other Things.

No response

