Conversation
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com>
Pull Request Overview
This PR adds oneDNN w8a16 GEMM support to the vLLM XPU kernels by migrating oneDNN-based w8a16 (weight 8-bit, activation 16-bit) GEMM operations and adding corresponding tests. This enables weight-only quantization for efficient inference on Intel XPU devices.
- Adds oneDNN as a third-party dependency via Git submodule
- Implements FP8 linear layer with weight-only quantization support
- Adds oneDNN-based FP8 GEMM kernel implementation for w8a16 operations
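To make the w8a16 scheme concrete, here is a hedged pure-Python reference (not the kernel's actual API; the function names are illustrative): weights are quantized to 8-bit integers with one symmetric scale per output channel, and the GEMM dequantizes them on the fly against 16-bit activations.

```python
# Illustrative reference for w8a16 GEMM: int8 weights with
# per-output-channel scales, floating-point activations.
# This mirrors the quantization scheme, not the oneDNN kernel.

def quantize_per_channel(w, qmax=127):
    """Symmetric per-output-channel int8 quantization of a weight
    matrix given as a list of rows (one row per output channel)."""
    scales, q = [], []
    for row in w:
        s = max(abs(v) for v in row) / qmax or 1.0
        scales.append(s)
        q.append([round(v / s) for v in row])
    return q, scales

def w8a16_gemm(x, qw, scales):
    """y = x @ dequant(qw).T, applying the per-channel scale once
    per output element instead of materializing a float weight."""
    out = []
    for xi in x:
        out.append([
            scales[o] * sum(xi[k] * qrow[k] for k in range(len(xi)))
            for o, qrow in enumerate(qw)
        ])
    return out
```

Because the scale factors out of the inner dot product, the accumulation can stay in integer or low-precision arithmetic, which is what makes weight-only quantization cheap at inference time.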
Reviewed Changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| vllm_xpu_kernels/layers/quantization/utils.py | Defines quantization enums and mappings for different quantization methods and data types |
| vllm_xpu_kernels/layers/quantization/fp8_linear.py | Implements WeightOnlyQuantizedLinear class with FP8 quantization support and GPTQ/AWQ compatibility |
| tests/test_fp8_linear.py | Adds comprehensive tests for FP8 linear layer with various parameter combinations |
| third_party/oneDNN | Adds oneDNN submodule for low-level GEMM operations |
| csrc/xpu/onednn/onednn_ext.h | Provides oneDNN extensions and utilities for XPU kernel integration |
| csrc/xpu/onednn/fp8_gemm_w8a16.h | Header for FP8 GEMM w8a16 implementation |
| csrc/xpu/onednn/fp8_gemm_w8a16.cpp | Core implementation of FP8 GEMM w8a16 operations |
| cmake/Modules/FindoneDNN.cmake | CMake module for finding and configuring oneDNN dependency |
| setup.py | Updates build configuration to use Intel compilers for XPU targets |
| CMakeLists.txt | Integrates oneDNN into the build system |
| csrc/xpu/torch_bindings.cpp | Adds PyTorch bindings for fp8_gemm_w8a16 operation |
| csrc/xpu/ops.h | Declares the fp8_gemm_w8a16 function interface |
| .gitmodules | Configures oneDNN as a Git submodule |
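The FP8 weight format handled by `fp8_gemm_w8a16.cpp` can be illustrated with a small decoder. Assuming the common OCP e4m3fn encoding (1 sign bit, 4 exponent bits, 3 mantissa bits, bias 7, no infinities, all-ones pattern reserved for NaN) — the PR does not state the exact variant, so treat this as a sketch:

```python
def fp8_e4m3fn_to_float(byte):
    """Decode one e4m3fn byte to a Python float.
    Layout: [sign | 4-bit exponent | 3-bit mantissa], bias 7."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return float("nan")          # only NaN encoding, no inf
    if exp == 0:
        return sign * man * 2.0 ** -9  # subnormal: (man/8) * 2**-6
    return sign * (1 + man / 8.0) * 2.0 ** (exp - 7)
```

With this layout the largest finite value is 448.0 (`0x7E`), which is why FP8 weight-only schemes pair each tensor or channel with a float scale, as in the w8a16 GEMM above.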
jikunshang
left a comment
Can you add a section in the docs to illustrate oneDNN compile & link issues and best practices?
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com>
> By linking statically, we avoid potential performance variability introduced by different builds or configurations of DNNL that might be present on the host system.
>
> #### 3. **Avoiding Runtime Errors**
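The static-linking rationale quoted above could be wired up roughly as follows. `DNNL_LIBRARY_TYPE` is a real oneDNN CMake option (it also appears later in this thread); the extension target name `_xpu_C` is an assumption for illustration, not taken from the PR:

```cmake
# Sketch: build the vendored oneDNN submodule as a static library so
# the extension never resolves against a host-installed libdnnl.
set(DNNL_LIBRARY_TYPE STATIC CACHE STRING "" FORCE)
add_subdirectory(third_party/oneDNN)

# _xpu_C is a placeholder for the actual extension target name.
target_link_libraries(_xpu_C PRIVATE dnnl)
```

Static linking trades a larger binary for reproducible behavior and no runtime `libdnnl.so` lookup failures.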
Maybe one more reason: torch-xpu also uses static linking. cc @rogerxfeng8
Force-pushed 9e73b8c to 154a402
Signed-off-by: Zhu, Zufang <zufang.zhu@intel.com>
CMakeLists.txt (outdated)

```cmake
# Import torch cmake configuration.
find_package(Torch REQUIRED)

find_package(oneDNN QUIET)
```
REQUIRED is preferred, since it stops configuration with an error when oneDNN is not found.
Got it, I will change this.
BTW, we also have the env ONEDNN_FOUND to detect whether oneDNN was found.
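The two points in this exchange could look like the following sketch (the source file glob matches the PR's snippet; treat the combination as illustrative):

```cmake
# REQUIRED aborts the configure step with an error if oneDNN is
# missing, instead of silently continuing as QUIET would.
find_package(oneDNN REQUIRED)

# ONEDNN_FOUND can still guard optional sources if the dependency
# is ever made optional again.
if(ONEDNN_FOUND)
  file(GLOB _ONEDNN_SRC csrc/xpu/onednn/*.cpp)
endif()
```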
README.md (outdated)

```diff
@@ -43,3 +43,23 @@ VLLM_TARGET_DEVICE=xpu python3 setup.py bdist_wheel

### how to use in vLLM
```
Fix existing typos:
line 15: Preparation
line 25: Build & installation
line 44: How to
```cmake
if(ONEDNN_FOUND)
  set(_ONEDNN_SRC)
  file(GLOB _ONEDNN_SRC csrc/xpu/onednn/*.cpp)

SET(DNNL_LIBRARY_TYPE STATIC CACHE STRING "" FORCE)

SET(DNNL_CPU_RUNTIME "THREADPOOL" CACHE STRING "oneDNN cpu backend" FORCE)
```
Assuming this is copied from the oneDNN makefile; the CPU runtime setting can be cleaned up.
We need to set this, or oneDNN will default it to OMP, which adds a dependency on libiomp5.so.
rogerxfeng8
left a comment
Code implementation looks good to me.
add third_party/oneDNN
migrating onednn w8a16 gemm and add tests