Support direct execution of K-quant GGUF weights on GPU (Q4_K, Q5_K, Q6_K)

### Support direct execution of K-quant GGUF weights on GPU (Q4_K, Q5_K, Q6_K)

#### Motivation

GPU support for K-quant GGUF models (Q4_K_M, Q5_K_M, Q6_K) is currently implemented by #108 through load-time dequantization to Q8_0. This enables functional GPU execution but increases memory consumption and introduces an additional model-loading step.

To preserve the memory-efficiency benefits of K-quant models, we should add support for executing directly on K-quant weights without converting them to Q8_0.

#### Current behavior

Q4_K, Q5_K, and Q6_K tensors (the tensor types found in Q4_K_M, Q5_K_M, and Q6_K models) are dequantized to Q8_0 during model loading. The dispatcher in `ModelLoader.java` shows this:

```java
/**
 * Dispatcher method for loading a TornadoVM-compatible tensor based on GGML type.
 * Used in GPU-path.
 */
public static TornadoTensor loadTornadoTensor(GGMLTensorEntry entry) {
    GGMLType ggmlType = entry.ggmlType();
    int size = FloatTensor.numberOfElements(entry.shape());
    return switch (ggmlType) {
        case F32 -> FP32TornadoTensor.fromTornadoMemorySegment(entry.memorySegment());
        case F16 -> FP16TornadoTensor.fromTornadoMemorySegment(entry.memorySegment());
        case Q8_0 -> Q8_0TornadoTensor.fromTornadoMemorySegment(entry.memorySegment());
        case Q4_K, Q5_K, Q6_K -> dequantizeToQ8_0TornadoTensor(entry);
        case Q4_0 -> throw new UnsupportedOperationException("Q4_0 format not supported for TornadoVM yet");
        default -> throw new UnsupportedOperationException("Quantization format " + ggmlType);
    };
}
```

GPU kernels then operate on the resulting Q8_0 representation.
Model execution is correct, but memory usage is well above that of the original GGUF quantization format: Q8_0 stores 8.5 bits per weight, compared with 4.5 for Q4_K, 5.5 for Q5_K, and 6.5625 for Q6_K.
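
To put numbers on the overhead, the per-format cost can be derived from the ggml block layouts. The following is a minimal sketch only: `GgufFootprint` and its methods are hypothetical names that do not exist in this repository, and the block sizes are assumed from the ggml format definitions.

```java
/** Hypothetical helper: storage cost of ggml block formats (not part of this repository). */
final class GgufFootprint {

    enum Format {
        Q8_0(34, 32),    // fp16 scale + 32 int8 values
        Q4_K(144, 256),  // fp16 d + fp16 dmin + 12 packed scale bytes + 128 nibble bytes
        Q5_K(176, 256),  // Q4_K layout + 32 bytes holding the fifth bit of each weight
        Q6_K(210, 256);  // 128 low-nibble bytes + 64 high-bit bytes + 16 int8 scales + fp16 d

        final int bytesPerBlock;
        final int weightsPerBlock;

        Format(int bytesPerBlock, int weightsPerBlock) {
            this.bytesPerBlock = bytesPerBlock;
            this.weightsPerBlock = weightsPerBlock;
        }

        /** Average storage cost of a single weight, in bits. */
        double bitsPerWeight() {
            return 8.0 * bytesPerBlock / weightsPerBlock;
        }

        /** Bytes needed for a tensor whose element count is a multiple of the block size. */
        long bytesFor(long weights) {
            if (weights % weightsPerBlock != 0) {
                throw new IllegalArgumentException("element count must be a multiple of " + weightsPerBlock);
            }
            return weights / weightsPerBlock * bytesPerBlock;
        }
    }

    /** How many times larger a tensor becomes when a native format is re-encoded as Q8_0. */
    static double q8Overhead(Format nativeFormat) {
        return Format.Q8_0.bitsPerWeight() / nativeFormat.bitsPerWeight();
    }
}
```

Under these assumptions, `GgufFootprint.q8Overhead(Format.Q4_K)` comes out at roughly 1.9, meaning a Q4_K tensor nearly doubles in size once it is converted to Q8_0.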

#### Proposed work

Implement direct TornadoVM GPU support for the K-quant tensor formats:

- Q4_K
- Q5_K
- Q6_K

This requires the following changes:

1) Add a Tornado-based quantized tensor class for each format in `tensor.tornado`, modeled on `Q8_0TornadoTensor`.
2) Make `AbstractModelLoader.effectiveGpuWeightType()` return the corresponding native quantized type (e.g. Q4_K) instead of the current effective type.
3) Implement the corresponding embeddings copy in `InferenceCore.forwardTornadoVM`.
4) Implement the corresponding helper dispatch methods in `QuantizationPlannerFactory.create`.
5) Implement the corresponding TornadoVM kernels in `tornadovm.kernels` (see `matrixVectorRowMajorOptimizedQ8_0Byte` for the Q8_0 equivalent).
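
As a starting point for the kernel work, here is a plain-Java reference for decoding one Q4_K super-block, plus a fused dot product that never materializes the dequantized weights. It is a sketch under stated assumptions, not TornadoVM kernel code: `Q4KBlockReference` and all of its methods are hypothetical names, and the 144-byte layout (fp16 `d`, fp16 `dmin`, 12 packed scale bytes, 128 nibble bytes) is assumed from the ggml `block_q4_K` definition.

```java
/** Hypothetical CPU reference for one Q4_K super-block (not part of this repository). */
final class Q4KBlockReference {
    static final int QK_K = 256;         // weights per super-block
    static final int BLOCK_BYTES = 144;  // 2 (d) + 2 (dmin) + 12 (scales) + 128 (nibbles)

    /** Reads a little-endian IEEE 754 half-precision value as a float. */
    static float readHalf(byte[] b, int off) {
        int bits = (b[off] & 0xFF) | ((b[off + 1] & 0xFF) << 8);
        int exp = (bits >>> 10) & 0x1F;
        int mant = bits & 0x3FF;
        float v;
        if (exp == 0) v = mant * 0x1p-24f;                                    // zero or subnormal
        else if (exp == 31) v = (mant == 0) ? Float.POSITIVE_INFINITY : Float.NaN;
        else v = Math.scalb(1f + mant / 1024f, exp - 15);                     // normal number
        return (bits & 0x8000) != 0 ? -v : v;
    }

    /** Unpacks the 6-bit scale ([0]) and 6-bit min ([1]) of sub-block j (0..7). */
    static int[] scaleMin(byte[] b, int scalesOff, int j) {
        if (j < 4) {
            return new int[] {b[scalesOff + j] & 63, b[scalesOff + j + 4] & 63};
        }
        int packed = b[scalesOff + j + 4] & 0xFF;
        int sc = (packed & 0x0F) | (((b[scalesOff + j - 4] & 0xFF) >>> 6) << 4);
        int m = (packed >>> 4) | (((b[scalesOff + j] & 0xFF) >>> 6) << 4);
        return new int[] {sc, m};
    }

    /** Decodes the 256 weights of the block starting at byte offset off into out. */
    static void dequantizeBlock(byte[] b, int off, float[] out, int outOff) {
        float d = readHalf(b, off);
        float dmin = readHalf(b, off + 2);
        for (int pair = 0; pair < 4; pair++) {
            int[] lo = scaleMin(b, off + 4, 2 * pair);      // sub-block stored in low nibbles
            int[] hi = scaleMin(b, off + 4, 2 * pair + 1);  // sub-block stored in high nibbles
            for (int l = 0; l < 32; l++) {
                int q = b[off + 16 + 32 * pair + l] & 0xFF;
                out[outOff + 64 * pair + l] = d * lo[0] * (q & 0x0F) - dmin * lo[1];
                out[outOff + 64 * pair + 32 + l] = d * hi[0] * (q >>> 4) - dmin * hi[1];
            }
        }
    }

    /** Dot product of one block with 256 activations, without storing decoded weights. */
    static float dotBlock(byte[] b, int off, float[] x, int xOff) {
        float d = readHalf(b, off);
        float dmin = readHalf(b, off + 2);
        float acc = 0f;
        for (int j = 0; j < 8; j++) {
            int[] sm = scaleMin(b, off + 4, j);
            int qBase = off + 16 + 32 * (j / 2);
            int shift = (j % 2) * 4;                         // even j: low nibble, odd j: high nibble
            float qx = 0f;                                   // sum of quant * activation
            float sx = 0f;                                   // sum of activations, for the min term
            for (int l = 0; l < 32; l++) {
                float xv = x[xOff + 32 * j + l];
                qx += (((b[qBase + l] & 0xFF) >>> shift) & 0x0F) * xv;
                sx += xv;
            }
            acc += d * sm[0] * qx - dmin * sm[1] * sx;
        }
        return acc;
    }
}
```

The fused form applies the scale and min once per 32-weight sub-block instead of once per weight. A GPU kernel could follow the same structure, though the memory access pattern and work-group mapping would still need to be designed and measured on TornadoVM.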

Support direct execution of K-quant GGUF weights on GPU (Q4_K, Q5_K, Q6_K) #118