Commit 7cc28b0

movedancer and 全都做不队 authored
[LARCH64 CPU] Provide inference acceleration optimization for Loongson CPU with 4-bit quantized models (#26280)
### Description

This submission adds a 4-bit quantized matrix multiplication operator for the Loongson platform. It has passed the internal test checks of ONNX and has been deployed for actual inference on the Loongson platform. It includes five modifications:

(1) **sqnbitgemm_kernel_lasx.cpp**: Accelerates inference for 4-bit quantized models on the LoongArch64 architecture, using the lasx/lsx vector instruction sets;
(2) **sqnbitgemm_kernel_lasx_common.h**: Implements the auxiliary functions used by **sqnbitgemm_kernel_lasx.cpp**;
(3) **cmake**: Adds compilation options for **sqnbitgemm_kernel_lasx.cpp** under the LoongArch64 architecture;
(4) **mlasi.h**: Adds the interface for calling the operator in **sqnbitgemm_kernel_lasx.cpp** under the LoongArch64 architecture;
(5) **platform.cpp**: Adds calls to the operators in **sqnbitgemm_kernel_lasx.cpp** under the LoongArch64 architecture.

### Motivation and Context

The Loongson platform lacks key operators for ONNX quantized model inference. This change addresses the poor inference performance of 4-bit quantized models on the Loongson platform. In tests using the Deepseek-R1-1.5B model, our operators increased TPS by more than 7 times, with dequantization of quantized matrices running up to 3 times faster.

### Pictures

Dequantization Acceleration: In the chart, the vertical axis represents time in milliseconds (ms) and the horizontal axis represents the number of test matrices. The size of each quantized matrix is given as rows × columns, for example 1536 × 256.

<img width="4039" height="831" alt="Dequantization acceleration" src="https://github.com/user-attachments/assets/26da1ed9-79ae-4abd-9e6d-cadaea9ee013" />

---------

Co-authored-by: 全都做不队 <t202410611994336@eduxiji.net>
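For readers unfamiliar with what "dequantization" means here, the following is a minimal scalar sketch of blockwise 4-bit dequantization. It is not the LASX kernel from this commit. The function name `DequantizeQ4Blockwise`, the packing order (two values per byte, low nibble first), the per-block float scale, and the fixed zero point of 8 are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar reference, not the vectorized kernel added by this commit.
// Assumed layout: each byte packs two 4-bit values (low nibble first), and every
// block of `blk_len` values shares one float scale and a fixed zero point of 8.
std::vector<float> DequantizeQ4Blockwise(const std::vector<uint8_t>& packed,
                                         const std::vector<float>& scales,
                                         std::size_t blk_len) {
    constexpr int kZeroPoint = 8;  // assumed default zero point for 4-bit values
    const std::size_t count = packed.size() * 2;
    std::vector<float> out(count);
    for (std::size_t i = 0; i < count; ++i) {
        const uint8_t byte = packed[i / 2];
        // Even indices read the low nibble, odd indices read the high nibble.
        const int q = (i % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
        out[i] = scales[i / blk_len] * static_cast<float>(q - kZeroPoint);
    }
    return out;
}
```

A vectorized kernel would typically perform the same nibble extraction and scaling on many values per instruction, which is where a speedup in dequantization would be expected to come from.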
1 parent 7804b5c commit 7cc28b0

5 files changed: +1610 -1 lines changed

cmake/onnxruntime_mlas.cmake

Lines changed: 1 addition & 0 deletions
@@ -777,6 +777,7 @@ endif()
 if(LOONGARCH64 AND MLAS_SOURCE_IS_NOT_SET)
     set(mlas_platform_srcs
       ${MLAS_SRC_DIR}/qgemm_kernel_lsx.cpp
+      ${MLAS_SRC_DIR}/sqnbitgemm_kernel_lasx.cpp
       ${MLAS_SRC_DIR}/loongarch64/SgemmKernelLasx.S
       ${MLAS_SRC_DIR}/loongarch64/DgemmKernelLsx.S
       ${MLAS_SRC_DIR}/loongarch64/DgemmKernelLasx.S

onnxruntime/core/mlas/lib/mlasi.h

Lines changed: 2 additions & 0 deletions
@@ -1234,6 +1234,8 @@ extern const MLAS_QNBIT_GEMM_DISPATCH MlasSQNBitGemmDispatchAvx512;
 
 extern const MLAS_QNBIT_GEMM_DISPATCH MlasSQNBitGemmDispatchAvx512vnni;
 
+extern const MLAS_QNBIT_GEMM_DISPATCH MlasSQNBitGemmDispatchLasx;
+
 //
 // Rotary embedding dispatch structure.
 //

onnxruntime/core/mlas/lib/platform.cpp

Lines changed: 4 additions & 1 deletion
@@ -742,6 +742,9 @@ Return Value:
         this->ComputeLogSoftmaxOutputF32Kernel = MlasComputeLogSoftmaxOutputF32KernelLasx;
         this->TransposePackB16x4Routine = MlasSgemmTransposePackB16x4Lasx;
 
+        // add new sqn-lasx kernel
+        this->QNBitGemmDispatch = &MlasSQNBitGemmDispatchLasx;
+
         this->GemmU8S8Dispatch = &MlasGemmU8X8DispatchLSX;
         this->GemmU8U8Dispatch = &MlasGemmU8X8DispatchLSX;
     }else if( cap_lsx ){
@@ -824,4 +827,4 @@ thread_local size_t ThreadedBufSize = 0;
 thread_local std::unique_ptr<uint8_t, decltype(&_aligned_free)> ThreadedBufHolder(nullptr, &_aligned_free);
 #else
 thread_local std::unique_ptr<uint8_t, decltype(&free)> ThreadedBufHolder(nullptr, &free);
-#endif
+#endif
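The platform.cpp change above registers the new kernel by storing the address of a dispatch table in a platform object when the LASX capability is detected. The sketch below illustrates that general pattern only. `QNBitDispatch`, `PackedSize4Bit`, `kDispatchLasx`, `Platform`, and `HasQNBitKernel` are hypothetical stand-ins, and the real `MLAS_QNBIT_GEMM_DISPATCH` structure has different members.

```cpp
#include <cstddef>

// Hypothetical dispatch table: a struct of function pointers, one per operation.
struct QNBitDispatch {
    const char* name;
    std::size_t (*PackedSize)(std::size_t n, std::size_t k);
};

// Hypothetical helper: bytes needed to store an n x k matrix of 4-bit values,
// with each row padded to a whole byte.
std::size_t PackedSize4Bit(std::size_t n, std::size_t k) {
    return n * ((k + 1) / 2);
}

// One table per instruction set; only a LASX table exists in this sketch.
const QNBitDispatch kDispatchLasx{"lasx", &PackedSize4Bit};

// Hypothetical platform object that selects a table once, at construction.
struct Platform {
    const QNBitDispatch* QNBitGemmDispatch = nullptr;

    explicit Platform(bool cap_lasx) {
        if (cap_lasx) {
            QNBitGemmDispatch = &kDispatchLasx;
        }
        // Without LASX the pointer stays null, so callers must check it.
    }
};

// Callers test for a kernel before using it and fall back otherwise.
bool HasQNBitKernel(const Platform& p) {
    return p.QNBitGemmDispatch != nullptr;
}
```

Selecting the table once keeps per-call overhead to a pointer indirection. As far as this hunk shows, the assignment sits only in the LASX branch, so the sketch leaves the pointer null when that capability is absent.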
