Support direct execution of K-quant GGUF weights on GPU (Q4_K, Q5_K, Q6_K) #118

@orionpapadakis

Motivation

GPU support for K-quant GGUF models (Q4_K_M, Q5_K_M, Q6_K) is currently implemented by #108 through load-time dequantization to Q8_0. This enables functional GPU execution but increases memory consumption and introduces an additional model-loading step.

To preserve the memory-efficiency benefits of K-quant models, we should add support for executing directly on K-quant weights without converting them to Q8_0.

Current behavior

Q4_K, Q5_K, and Q6_K tensors (the tensor types found in Q4_K_M, Q5_K_M, and Q6_K models) are dequantized to Q8_0 during model loading. The dispatcher in ModelLoader.java currently reads:

/**
  * Dispatcher method for loading a TornadoVM-compatible tensor based on GGML type.
  * Used in GPU-path.
  */
public static TornadoTensor loadTornadoTensor(GGMLTensorEntry entry) {
    GGMLType ggmlType = entry.ggmlType();
    int size = FloatTensor.numberOfElements(entry.shape());
    return switch (ggmlType) {
        case F32 -> FP32TornadoTensor.fromTornadoMemorySegment(entry.memorySegment());
        case F16 -> FP16TornadoTensor.fromTornadoMemorySegment(entry.memorySegment());
        case Q8_0 -> Q8_0TornadoTensor.fromTornadoMemorySegment(entry.memorySegment());
        case Q4_K, Q5_K, Q6_K -> dequantizeToQ8_0TornadoTensor(entry);
        case Q4_0 -> throw new UnsupportedOperationException("Q4_0 format not supported for TornadoVM yet");
        default -> throw new UnsupportedOperationException("Quantization format " + ggmlType);
    };
}

GPU kernels operate on the resulting Q8_0 representation. Model execution is correct, but the weights occupy more GPU memory than they do in the original GGUF file, because Q8_0 spends more bits per weight than any of the K-quant formats.
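To put rough numbers on that overhead, here is a minimal sketch that computes the storage cost of a tensor in each format. The block sizes are those of the standard GGML layouts (Q8_0: 34 bytes per 32 weights; Q4_K, Q5_K, Q6_K: 144, 176, and 210 bytes per 256 weights). The `BlockFormat` enum and its methods are hypothetical names used only for illustration and are not part of this repository; the in-memory layout of the project's own tensor classes may differ.

```java
/** Hypothetical helper (not project code): GGML block geometry per format. */
enum BlockFormat {
    Q8_0(32, 34),    // fp16 scale + 32 int8 quants
    Q4_K(256, 144),  // fp16 d + fp16 dmin + 12 scale bytes + 128 quant bytes
    Q5_K(256, 176),  // Q4_K layout + 32 bytes of high bits
    Q6_K(256, 210);  // 128 low-bit bytes + 64 high-bit bytes + 16 scales + fp16 d

    final int weightsPerBlock;
    final int bytesPerBlock;

    BlockFormat(int weightsPerBlock, int bytesPerBlock) {
        this.weightsPerBlock = weightsPerBlock;
        this.bytesPerBlock = bytesPerBlock;
    }

    /** Average number of bits used to store one weight. */
    double bitsPerWeight() {
        return bytesPerBlock * 8.0 / weightsPerBlock;
    }

    /** Total bytes needed for a tensor with the given number of weights. */
    long bytesFor(long weights) {
        if (weights % weightsPerBlock != 0) {
            throw new IllegalArgumentException("weight count must be a multiple of " + weightsPerBlock);
        }
        return weights / weightsPerBlock * bytesPerBlock;
    }
}
```

For example, `BlockFormat.Q4_K.bytesFor(4096L * 4096)` and `BlockFormat.Q8_0.bytesFor(4096L * 4096)` compare the cost of a single 4096 x 4096 weight matrix before and after load-time dequantization.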

Proposed work

Implement direct TornadoVM GPU support for the K-quant tensor formats:

  • Q4_K (used by Q4_K_M models)
  • Q5_K (used by Q5_K_M models)
  • Q6_K

This involves:

  1. Adding a TornadoVM-based quantized tensor class for each format in tensor.tornado, similar to Q8_0TornadoTensor.
  2. Making AbstractModelLoader.effectiveGpuWeightType() return the native quantized type (e.g. Q4_K) instead of the effective, dequantized one.
  3. Implementing the corresponding embeddings copy in InferenceCore.forwardTornadoVM.
  4. Implementing the corresponding helper dispatch methods in QuantizationPlannerFactory.create.
  5. Implementing the corresponding TornadoVM kernels in tornadovm.kernels (see matrixVectorRowMajorOptimizedQ8_0Byte for the existing Q8_0 equivalent).
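As a starting point for the kernel work, the sketch below shows the arithmetic a Q4_K matrix-vector kernel has to perform for one 256-weight super-block, written as plain CPU-side Java. It follows the standard GGML Q4_K layout (little-endian fp16 `d`, fp16 `dmin`, 12 bytes of packed 6-bit scales and mins, 128 bytes of 4-bit quants). The class `Q4KReference` and its method names are hypothetical and not part of this repository. An actual TornadoVM kernel would need to be restructured around TornadoVM's array types and kernel constraints, so treat this only as a reference for validating kernel output.

```java
/** Hypothetical CPU reference (not project code) for one Q4_K super-block. */
final class Q4KReference {
    static final int QK_K = 256;         // weights per super-block
    static final int BLOCK_BYTES = 144;  // 2 (d) + 2 (dmin) + 12 (scales) + 128 (quants)

    /** Decodes an IEEE 754 half-precision value given as 16 raw bits. */
    static float halfToFloat(int bits) {
        int exp = (bits >>> 10) & 0x1F;
        int man = bits & 0x3FF;
        float v;
        if (exp == 0) v = Math.scalb((float) man, -24);                 // zero / subnormal
        else if (exp == 31) v = man == 0 ? Float.POSITIVE_INFINITY : Float.NaN;
        else v = Math.scalb((float) (1024 + man), exp - 25);            // normal
        return (bits & 0x8000) == 0 ? v : -v;
    }

    /** 6-bit scale of sub-block j (0..7); s is the offset of the 12 scale bytes. */
    static int scale(byte[] b, int s, int j) {
        return j < 4 ? (b[s + j] & 63)
                     : (b[s + j + 4] & 0x0F) | (((b[s + j - 4] & 0xFF) >>> 6) << 4);
    }

    /** 6-bit min of sub-block j (0..7); s is the offset of the 12 scale bytes. */
    static int min(byte[] b, int s, int j) {
        return j < 4 ? (b[s + j + 4] & 63)
                     : ((b[s + j + 4] & 0xFF) >>> 4) | (((b[s + j] & 0xFF) >>> 6) << 4);
    }

    /** Dot product of the Q4_K block at w[off..] with x[xOff .. xOff + 255]. */
    static float dotBlock(byte[] w, int off, float[] x, int xOff) {
        float d = halfToFloat((w[off] & 0xFF) | ((w[off + 1] & 0xFF) << 8));
        float dmin = halfToFloat((w[off + 2] & 0xFF) | ((w[off + 3] & 0xFF) << 8));
        int scales = off + 4;
        int qs = off + 16;
        float sum = 0f;
        // Each group of 32 quant bytes holds two 32-weight sub-blocks:
        // low nibbles form the first, high nibbles the second.
        for (int pair = 0; pair < 4; pair++) {
            float d1 = d * scale(w, scales, 2 * pair);
            float m1 = dmin * min(w, scales, 2 * pair);
            float d2 = d * scale(w, scales, 2 * pair + 1);
            float m2 = dmin * min(w, scales, 2 * pair + 1);
            for (int l = 0; l < 32; l++) {
                int q = w[qs + 32 * pair + l] & 0xFF;
                sum += (d1 * (q & 0x0F) - m1) * x[xOff + 64 * pair + l];
                sum += (d2 * (q >>> 4) - m2) * x[xOff + 64 * pair + 32 + l];
            }
        }
        return sum;
    }
}
```

A row of a weight matrix is a sequence of such blocks, so a row-major matrix-vector product sums `dotBlock` over `cols / QK_K` consecutive blocks per row. Fusing dequantization into the dot product this way avoids materializing a Q8_0 or float copy of the weights, which is the memory saving this issue is after.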
