Support direct execution of K-quant GGUF weights on GPU (Q4_K, Q5_K, Q6_K)
Motivation
GPU support for K-quant GGUF models (Q4_K_M, Q5_K_M, Q6_K) is currently implemented by #108 through load-time dequantization to Q8_0. This enables functional GPU execution but increases memory consumption and introduces an additional model-loading step.
To preserve the memory-efficiency benefits of K-quant models, we should add support for executing directly on K-quant weights without converting them to Q8_0.
Current behavior
Q4_K_M, Q5_K_M, and Q6_K tensors are dequantized to Q8_0 during model loading. In ModelLoader.java we see:
/**
 * Dispatcher method for loading a TornadoVM-compatible tensor based on GGML type.
 * Used in the GPU path.
 */
public static TornadoTensor loadTornadoTensor(GGMLTensorEntry entry) {
    GGMLType ggmlType = entry.ggmlType();
    int size = FloatTensor.numberOfElements(entry.shape());
    return switch (ggmlType) {
        case F32 -> FP32TornadoTensor.fromTornadoMemorySegment(entry.memorySegment());
        case F16 -> FP16TornadoTensor.fromTornadoMemorySegment(entry.memorySegment());
        case Q8_0 -> Q8_0TornadoTensor.fromTornadoMemorySegment(entry.memorySegment());
        // K-quant tensors are expanded to Q8_0 at load time.
        case Q4_K, Q5_K, Q6_K -> dequantizeToQ8_0TornadoTensor(entry);
        case Q4_0 -> throw new UnsupportedOperationException("Q4_0 format not supported for TornadoVM yet");
        default -> throw new UnsupportedOperationException("Quantization format " + ggmlType);
    };
}
GPU kernels operate on the resulting Q8_0 representation.
Model execution is correct, but memory usage is significantly higher than the original GGUF quantization format.
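To put a rough number on the overhead, the sketch below compares storage cost per format using the ggml block geometries as I understand them (Q8_0: 34 bytes per 32 weights; Q4_K, Q5_K, Q6_K: 144, 176, and 210 bytes per 256 weights). The class name QuantFootprint and the 4096x4096 matrix are illustrative assumptions, not part of the codebase. Q4_K_M and Q5_K_M files mix tensor types, so the whole-model ratio will differ from these per-tensor figures.

```java
final class QuantFootprint {
    /** Bytes needed to store the given number of weights, rounding up to whole blocks. */
    static long bytesFor(long weights, int weightsPerBlock, int bytesPerBlock) {
        long blocks = (weights + weightsPerBlock - 1) / weightsPerBlock;
        return blocks * bytesPerBlock;
    }

    /** Average storage cost of one weight, in bits. */
    static double bitsPerWeight(int weightsPerBlock, int bytesPerBlock) {
        return 8.0 * bytesPerBlock / weightsPerBlock;
    }

    public static void main(String[] args) {
        long weights = 4096L * 4096L; // hypothetical projection matrix
        String[] names = {"Q8_0", "Q4_K", "Q5_K", "Q6_K"};
        int[] perBlock = {32, 256, 256, 256};   // weights per block
        int[] blockBytes = {34, 144, 176, 210}; // bytes per block (assumed ggml layout)
        long q8 = bytesFor(weights, perBlock[0], blockBytes[0]);
        for (int i = 0; i < names.length; i++) {
            long bytes = bytesFor(weights, perBlock[i], blockBytes[i]);
            System.out.printf("%s: %.4f bits/weight, %d bytes, Q8_0 is %.2fx larger%n",
                    names[i], bitsPerWeight(perBlock[i], blockBytes[i]), bytes, (double) q8 / bytes);
        }
    }
}
```

Under these assumptions a Q4_K tensor held as Q8_0 takes about 1.9 times its original size, which is the cost that direct K-quant execution would avoid.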
Proposed work
Implement direct TornadoVM GPU support for the K-quant tensor formats. This involves:
- Adding a TornadoVM-based quantized tensor class for each format in tensor.tornado, modeled on Q8_0TornadoTensor.
- Updating AbstractModelLoader.effectiveGpuWeightType() to return the corresponding K-quant type (e.g. Q4_K_M) instead of the effective Q8_0 type.
- Implementing the corresponding embeddings copy in InferenceCore.forwardTornadoVM.
- Implementing the corresponding helper dispatch methods in QuantizationPlannerFactory.create.
- Implementing the corresponding TornadoVM kernels in tornadovm.kernels (e.g. see matrixVectorRowMajorOptimizedQ8_0Byte for the existing Q8_0 kernel).
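As a starting point for the kernel work, here is a minimal plain-Java sketch of what decoding one Q4_K super-block involves. It is a CPU reference, not a TornadoVM kernel. It assumes the ggml Q4_K layout as I understand it: 144 bytes per 256 weights, made up of an fp16 scale d, an fp16 min scale dmin, 12 bytes of packed 6-bit sub-block scales and mins, and 128 bytes of 4-bit quants. The class Q4KReference and its method names are hypothetical and do not exist in the codebase.

```java
final class Q4KReference {
    static final int QK_K = 256;        // weights per super-block
    static final int BLOCK_BYTES = 144; // 2 (d) + 2 (dmin) + 12 (scales) + 128 (quants)

    /** Decodes an IEEE 754 half-precision value stored little-endian at block[offset]. */
    static float readHalf(byte[] block, int offset) {
        int bits = (block[offset] & 0xFF) | ((block[offset + 1] & 0xFF) << 8);
        int sign = (bits & 0x8000) << 16;
        int exp = (bits >>> 10) & 0x1F;
        int man = bits & 0x3FF;
        if (exp == 0) { // zero or subnormal: man * 2^-24
            float v = man / 16777216.0f;
            return sign != 0 ? -v : v;
        }
        if (exp == 31) { // infinity or NaN
            return Float.intBitsToFloat(sign | 0x7F800000 | (man << 13));
        }
        return Float.intBitsToFloat(sign | ((exp + 112) << 23) | (man << 13));
    }

    /** Returns {scale, min} (6 bits each) for sub-block j, unpacked from the 12 scale bytes. */
    static int[] scaleMin(byte[] block, int j) {
        int s = 4; // offset of the scale bytes within the block
        if (j < 4) {
            return new int[] {block[s + j] & 63, block[s + j + 4] & 63};
        }
        int sc = (block[s + j + 4] & 0x0F) | (((block[s + j - 4] & 0xFF) >> 6) << 4);
        int mn = ((block[s + j + 4] & 0xFF) >> 4) | (((block[s + j] & 0xFF) >> 6) << 4);
        return new int[] {sc, mn};
    }

    /** Expands one 144-byte super-block into 256 floats. */
    static void dequantizeBlock(byte[] block, float[] out) {
        float d = readHalf(block, 0);
        float dmin = readHalf(block, 2);
        int q = 16; // offset of the packed 4-bit quants
        int y = 0;
        for (int j = 0; j < 8; j += 2) {
            int[] a = scaleMin(block, j);
            int[] b = scaleMin(block, j + 1);
            float d1 = d * a[0], m1 = dmin * a[1];
            float d2 = d * b[0], m2 = dmin * b[1];
            // Low nibbles of these 32 bytes form sub-block j, high nibbles sub-block j + 1.
            for (int l = 0; l < 32; l++) out[y++] = d1 * (block[q + l] & 0x0F) - m1;
            for (int l = 0; l < 32; l++) out[y++] = d2 * ((block[q + l] & 0xFF) >> 4) - m2;
            q += 32;
        }
    }

    public static void main(String[] args) {
        byte[] block = new byte[BLOCK_BYTES];
        block[1] = 0x3C;  // d = 1.0 (fp16 0x3C00)
        block[3] = 0x3C;  // dmin = 1.0
        block[4] = 2;     // scale of sub-block 0
        block[8] = 3;     // min of sub-block 0
        block[16] = 0x5A; // low nibble 10, high nibble 5
        float[] out = new float[QK_K];
        dequantizeBlock(block, out);
        System.out.println(out[0]);
    }
}
```

A GPU kernel would fuse this decoding into the matrix-vector product instead of materializing the floats, but a reference like this can serve as ground truth when testing the kernel output. Q5_K and Q6_K use different bit layouts and would need their own decoders.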