[RFC]: Consolidate Intel Quantization Toolkit Integration in vLLM #30663

@yiliu30

Authors: Intel Neural Compressor Team, Intel vLLM Team
Related: RFC: vLLM Quantization Cleanup

Motivation.

In alignment with vLLM's quantization consolidation effort, we propose to streamline Intel's quantization support, currently fragmented across three implementations:

| File | Scope | Capabilities |
|------|-------|--------------|
| `inc.py` | Intel Gaudi (HPU) | Online W8A8_FP8 quantization |
| `auto_round.py` | Intel CPU/GPU | WnA16 (AutoRound models) |
| `ipex_quant.py` | Intel CPU/GPU | W4A16, W4A8, W8A16_FP8 via IPEX backend |
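For context, each of these backends is reached today through a different value of vLLM's `quantization` argument. A minimal sketch of the current, fragmented entry points (method strings reflect the current registry names and may vary across vLLM versions; model names are placeholders):

```python
from vllm import LLM

# Three Intel backends, three different method strings.
# Model names are placeholders, not real checkpoints.
llm_hpu = LLM(model="<fp8-calibrated-model>", quantization="inc")         # Gaudi W8A8_FP8
llm_ar = LLM(model="<autoround-int4-model>", quantization="auto-round")   # CPU/GPU WnA16
llm_ipex = LLM(model="<w4a16-model>", quantization="ipex")                # CPU/GPU via IPEX
```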

This fragmentation creates maintenance overhead and an inconsistent user experience. Consolidation will:

  • Reduce code duplication and maintenance burden
  • Provide a unified entry point for Intel quantization
  • Enable future formats (MXFP4/MXFP8/NVFP4) and advanced mixed-precision recipes

Proposed Change.

The overall status and proposal are below:

*(Diagram: the three existing Intel quantization backends and their proposed consolidation into inc.py, compressed_tensors, and the vLLM-Gaudi plugin.)*

1. Consolidate Intel CPU/GPU Quantization → inc.py

  • Merge auto_round.py into inc.py as the unified Intel quantization backend
  • Extend inc.py to support:
    • WnA16 inference (from AutoRound)
    • Future Intel formats (MXFP4/MXFP8/NVFP4, advanced mixed-precision)
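To make the intent concrete, here is a minimal, self-contained sketch of the single-dispatch idea; all class names below are hypothetical illustrations, not the actual inc.py interfaces:

```python
# Hypothetical sketch: one config object routes every Intel scheme
# through a single backend instead of three parallel files.
class INCWnA16Method:          # would wrap the AutoRound WnA16 kernels
    def __init__(self, cfg): self.cfg = cfg

class INCFp8Method:            # would wrap the W8A8_FP8 path
    def __init__(self, cfg): self.cfg = cfg

class UnifiedINCConfig:
    """Single entry point; new formats (MXFP4/MXFP8/NVFP4) add one entry."""
    _METHODS = {"wna16": INCWnA16Method, "w8a8_fp8": INCFp8Method}

    def __init__(self, scheme: str):
        if scheme not in self._METHODS:
            raise ValueError(f"unsupported INC scheme: {scheme}")
        self.scheme = scheme

    def get_quant_method(self):
        # One dispatch point instead of three separate backends.
        return self._METHODS[self.scheme](self)

cfg = UnifiedINCConfig("wna16")
print(type(cfg.get_quant_method()).__name__)  # -> INCWnA16Method
```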

2. Migrate CPU/GPU Support → compressed_tensors

  • Move ipex_quant.py kernel support to the compressed_tensors backend, using vllm-xpu-kernel instead of IPEX
  • Ensure Intel CPU/GPU compatibility with standard compressed tensor formats
  • Deprecate ipex_quant.py after successful migration
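Assuming the migration lands as described, serving a standard compressed-tensors checkpoint on Intel CPU/GPU would look the same as on any other vLLM backend. vLLM normally auto-detects the method from the checkpoint's quantization config, so the explicit argument is optional (the model name below is a placeholder):

```python
from vllm import LLM

# Placeholder model name; any standard compressed-tensors checkpoint
# should load the same way once the Intel kernels are wired in.
llm = LLM(
    model="<org>/<model>-W4A16-compressed-tensors",
    quantization="compressed-tensors",  # usually auto-detected
)
print(llm.generate("Hello from an Intel GPU!")[0].outputs[0].text)
```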

3. Finalize Gaudi Migration → vLLM-Gaudi Plugin

  • Verify that W8A8_FP8 capabilities are fully available in the vLLM-Gaudi plugin
  • Update documentation to reference the plugin
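For reference, the existing HPU FP8 flow drives quantization through an INC JSON recipe; something like the following (mirroring the documented HPU flow; the path and model name are placeholders, and exact flags may differ in the vLLM-Gaudi plugin):

```python
import os
from vllm import LLM

# INC measurement/quantization recipe, as in the documented HPU flow.
os.environ["QUANT_CONFIG"] = "/path/to/inc_quant_config.json"  # placeholder path

llm = LLM(
    model="<fp8-calibrated-model>",  # placeholder
    quantization="inc",
    kv_cache_dtype="fp8_inc",
)
```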

Backward Compatibility

This consolidation ensures seamless migration for existing users:

  • Compressed Tensors: Standard compressed tensor models will be deployable on Intel CPU/GPU through the enhanced backend, providing broad compatibility with the vLLM ecosystem.
  • AutoRound Models: Existing AutoRound quantized models on HuggingFace will remain backward compatible via the consolidated inc.py.

Both the compressed tensor format and INC support are essential to serve different user needs: compressed tensors for standardization and interoperability, and INC for Intel-optimized quantization recipes.
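In practice, backward compatibility means an existing AutoRound checkpoint keeps loading with no user-side changes; a sketch (the model name is a placeholder):

```python
from vllm import LLM

# Before and after the merge, this call should behave identically:
# the consolidated inc.py would detect the AutoRound quantization
# config and route it to the WnA16 path automatically.
llm = LLM(model="<hf-org>/<model>-int4-auto-round")  # placeholder name
```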

Estimated Timeline

  • Phase 1 (target: Jan 2026): merge auto_round.py into inc.py
  • Phase 2 (target: Q1 2026): migrate ipex_quant.py to compressed_tensors

Feedback Period.

Please comment on the proposal or suggest alternatives. If there are no strong objections, we will proceed with the timeline above and submit implementation PRs. Thanks!

CC List.

cc @hshen14 @thuang6 @wenhuach21 @jikunshang @kzawora-intel @xuechendi

Any Other Things.

No response

