Authors: Intel Neural Compressor Team, Intel vLLM Team
Related: RFC: vLLM Quantization Cleanup
Motivation.
In alignment with vLLM's quantization consolidation effort, we propose to streamline Intel's quantization support, currently fragmented across three implementations:
| File | Scope | Capabilities |
|---|---|---|
| `inc.py` | Intel Gaudi (HPU) | Online W8A8_FP8 quantization |
| `auto_round.py` | Intel CPU/GPU | WnA16 (AutoRound models) |
| `ipex_quant.py` | Intel CPU/GPU | W4A16, W4A8, W8A16_FP8 via IPEX backend |
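Concretely, this is what the split looks like from the user's side today: a different `--quantization` entry point per platform. A minimal sketch (the method-name strings follow the files above; the CPU/GPU checkpoint names are placeholders, not real models):

```python
# Illustrative sketch of the current fragmentation: three separate
# --quantization entry points depending on the platform.
from vllm import LLM

# Intel Gaudi (HPU): online FP8 quantization via inc.py
llm_hpu = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="inc")

# Intel CPU/GPU: AutoRound WnA16 checkpoints via auto_round.py
llm_autoround = LLM(model="org/llama-3.1-8b-autoround-w4a16",  # placeholder
                    quantization="auto-round")

# Intel CPU/GPU: weight-only formats via ipex_quant.py
llm_ipex = LLM(model="org/llama-3.1-8b-w4a16",  # placeholder
               quantization="ipex")
```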
This fragmentation creates maintenance overhead and inconsistent user experience. Consolidation will:
- Reduce code duplication and maintenance burden
- Provide a unified entry point for Intel quantization
- Enable future formats (`MXFP4`/`MXFP8`/`NVFP4`) and advanced mixed-precision recipes
Proposed Change.
The overall status and proposal are below:
1. Consolidate Intel CPU/GPU Quantization → `inc.py`
   - Merge `auto_round.py` into `inc.py` as the unified Intel quantization backend (a sketch follows this list)
   - Extend `inc.py` to support:
     - `WnA16` inference (from AutoRound)
     - Future Intel formats (`MXFP4`/`MXFP8`/`NVFP4`, advanced mixed-precision)
2. Migrate CPU/GPU Support → `compressed_tensors`
   - Move `ipex_quant.py` kernel support (use vllm-xpu-kernel instead of IPEX) to the `compressed_tensors` backend
   - Ensure Intel CPU/GPU compatibility with standard compressed tensor formats
   - Deprecate `ipex_quant.py` after successful migration
3. Finalize Gaudi Migration → vLLM-Gaudi Plugin
   - Verify `W8A8_FP8` capabilities are fully available in the vLLM-Gaudi plugin
   - Update documentation to reference the plugin
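To make step 1 concrete, below is a minimal sketch, not the actual implementation, of how a consolidated `inc.py` config could dispatch between the merged formats. It follows vLLM's `QuantizationConfig` interface; the class name, the format-detection heuristic, and the stubbed bodies are illustrative assumptions only:

```python
from typing import Any, Optional

import torch

from vllm.model_executor.layers.quantization.base_config import (
    QuantizationConfig, QuantizeMethodBase)


class UnifiedINCConfig(QuantizationConfig):
    """Single Intel backend covering W8A8_FP8 (Gaudi) and WnA16 (AutoRound)."""

    def __init__(self, quant_format: str) -> None:
        super().__init__()
        self.quant_format = quant_format  # e.g. "w8a8_fp8" or "wna16"

    @classmethod
    def get_name(cls) -> str:
        return "inc"

    @classmethod
    def get_supported_act_dtypes(cls) -> list[torch.dtype]:
        return [torch.bfloat16]

    @classmethod
    def get_min_capability(cls) -> int:
        # Not a CUDA backend; platform support is checked elsewhere.
        return -1

    @classmethod
    def get_config_filenames(cls) -> list[str]:
        return []

    @classmethod
    def from_config(cls, config: dict[str, Any]) -> "UnifiedINCConfig":
        # Assumed heuristic: AutoRound checkpoints carry weight-only keys
        # such as "bits", while online FP8 quantization does not.
        fmt = "wna16" if "bits" in config else "w8a8_fp8"
        return cls(quant_format=fmt)

    def get_quant_method(self, layer: torch.nn.Module,
                         prefix: str) -> Optional[QuantizeMethodBase]:
        # Route each layer to the matching kernel path; the real methods
        # would come from the merged auto_round.py / inc.py code.
        if self.quant_format == "wna16":
            return None  # placeholder for a WnA16 linear method
        return None      # placeholder for an FP8 linear method
```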
Backward Compatibility
This consolidation ensures seamless migration for existing users:
- Compressed Tensors: Standard compressed tensor models will be deployable on Intel CPU/GPU through the enhanced backend, providing broad compatibility with the vLLM ecosystem.
- AutoRound Models: Existing AutoRound quantized models on HuggingFace will remain backward compatible via the consolidated `inc.py`.
Both the compressed-tensors format and INC support are essential to serve different user needs: compressed tensors for standardization and interoperability, and INC for Intel-optimized quantization recipes.
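As a usage-level sketch of the intended end state (checkpoint names below are placeholders, and quantization auto-detection is assumed to behave as it does for existing vLLM backends):

```python
from vllm import LLM

# Standard compressed-tensors checkpoints load on Intel CPU/GPU through
# the enhanced compressed_tensors backend (detected from the model config).
llm_ct = LLM(model="org/llama-3.1-8b-w8a8-compressed-tensors")  # placeholder

# Existing AutoRound checkpoints keep working unchanged, now served by
# the consolidated inc.py instead of auto_round.py.
llm_ar = LLM(model="org/llama-3.1-8b-autoround-w4a16")  # placeholder
```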
Estimated Timeline
- Phase 1 (Target: Jan 2025): merge `auto_round.py` into `inc.py`
- Phase 2 (Target: Q1 2025): migrate `ipex_quant.py` to `compressed_tensors`
Feedback Period.
Please comment on the proposal or suggest alternatives. If there are no strong objections, we will proceed with the timeline above and submit implementation PRs. Thanks!
CC List.
cc @hshen14 @thuang6 @wenhuach21 @jikunshang @kzawora-intel @xuechendi
Any Other Things.
No response