Note
As8Wu4: a configuration where activations are quantized as signed 8-bit integers and weights as unsigned 4-bit integers
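As a rough illustration of the As8Wu4 scheme described above, the sketch below quantizes an activation vector to signed 8-bit (symmetric) and a weight vector to unsigned 4-bit (asymmetric, with a zero point). The function names, the per-tensor granularity, and the rounding details are illustrative assumptions, not this project's actual API.

```python
# Illustrative sketch of As8Wu4 quantization (assumed helpers, not the real API).

def quantize_act_s8(x):
    """Symmetric per-tensor quantization of activations to int8 [-128, 127]."""
    scale = max(abs(v) for v in x) / 127.0 or 1.0  # fall back to 1.0 for all-zero input
    q = [max(-128, min(127, round(v / scale))) for v in x]
    return q, scale

def quantize_weight_u4(w):
    """Asymmetric per-tensor quantization of weights to uint4 [0, 15]."""
    lo, hi = min(w), max(w)
    scale = (hi - lo) / 15.0 or 1.0
    zero_point = round(-lo / scale)
    q = [max(0, min(15, round(v / scale) + zero_point)) for v in w]
    return q, scale, zero_point

def dequantize(q, scale, zero_point=0):
    """Map quantized integers back to approximate float values."""
    return [(v - zero_point) * scale for v in q]
```

A GEMM in this configuration multiplies the int8 activation by the uint4 weight in integer arithmetic, then applies the scales (and weight zero point) to recover float output.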
Todo
- [issued] Support As8Wu4 for 128-group quantized weight on CPU
- Support As8Wu4 for per-dimension and per-channel quantized data on CPU: [ cpu_backend ] Enable channelwise 4bit quantized GEMM ( A : F32 -> U8 * W : S4 = O : F32) #3485
- Optimize As8Wu4 for per-dimension and per-channel quantized data on CPU with a NEON kernel: [ cpu_backend ] Optimize qai8dxp-qsi4cxp Matmul and qs4cx_f32 quantization with NEON simd #3497
- Apply offline weight packing, classify kernels per GEMM-dimension configuration, and apply multithreading: [ cpu_backend ] Enable practical use of qsi4cxp_qs4cxs1s0 GEMM with openMP multithreading and automatic ukernel candidate selection #3519
- Support As8Wu4 for 128-group quantized weight on GPU with an OpenCL kernel: OpenCL Kernel for 4-Bit Signed Integer GEMV Computation #3480
- Optimize As8Wu4 for per-dimension and per-channel quantized data on CPU with an AVX2 kernel
- Support As8Wu4 GEMM for 64-group quantized data on CPU fallback
- Optimize As8Wu4 GEMM for 64-group quantized data on CPU with AVX2 kernel
- Support As8Wu4 GEMM for 32-group quantized data on CPU fallback
- Optimize As8Wu4 GEMM for 32-group quantized data on CPU with NEON kernel
- Support As8Wu4 for 64-group quantized Weight on CPU
Issues
1. Supporting As8Wu4 for 128-group quantized weight on CPU might be invalid
- Judging from MLAS and KleidiAI, which are among the major CPU computational backends of OpenVINO, a group-quantization size of 128 is tied to the SIMD register size for optimal performance. As a result, neither AVX2 nor NEON has a kernel for GEMM with 128-group quantized weight matrices; only AVX-512 does.
- Implementing such a kernel would not be impossible, but its performance would likely be suboptimal.
- The problem is that 128-group quantized weights then become quite cumbersome to use on the CPU side, and forcibly falling back to inconsistently quantized weights might harm LLM accuracy.
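To make the "N-group quantized weight" terminology in the issue above concrete, the sketch below splits a weight row into contiguous groups (e.g. of 128 values) and gives each group its own scale and zero point, as opposed to per-channel quantization where one scale covers the whole row. The helper name and the asymmetric-uint4 details are assumptions for illustration only.

```python
# Illustrative group-wise (block) quantization: one (scale, zero_point)
# per contiguous group of `group_size` values along a weight row.

GROUP_SIZE = 128  # the group size discussed above; 64 and 32 appear in the Todo list

def quantize_row_groupwise(row, group_size=GROUP_SIZE):
    """Asymmetric uint4 quantization with independent parameters per group."""
    groups = []
    for start in range(0, len(row), group_size):
        g = row[start:start + group_size]
        lo, hi = min(g), max(g)
        scale = (hi - lo) / 15.0 or 1.0  # fall back to 1.0 for a constant group
        zp = round(-lo / scale)
        q = [max(0, min(15, round(v / scale) + zp)) for v in g]
        groups.append((q, scale, zp))
    return groups
```

Because each group carries its own parameters, a SIMD kernel typically wants the group size to match (a multiple of) its register width, which is why the 128-group case maps naturally onto AVX-512 but not onto AVX2 or NEON.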