[Opt] Manipulation of Q8_0 tensors with Tornado ByteArrays #79
mikepapadim merged 30 commits into beehive-lab:main

Conversation
Commits:
- …n in TornadoVM acceleration.
- …el loaders for consistent tensor loading.
  # Conflicts: # src/main/java/org/beehive/gpullama3/model/loader/ModelLoader.java
- …ray.fromSegmentShallow`
- …0Byte` kernels for Q8_0 matrix-vector computations
- …trix-vector computations
- …thSiLUAndGLUActivationQ8_0Byte` kernels for byte-based Q8_0 computations
- … compute kernels
Pull request overview
This work-in-progress PR refactors Q8_0 quantized tensor handling to use Tornado's ByteArray type instead of separate arrays for quantized values and scales. The new approach stores Q8_0 blocks (2-byte FP16 scale + 32-byte quantized values) contiguously in ByteArrays, with new kernels that dequantize on-the-fly during computation. The changes are currently functional for Llama models, with other models still under development.
Key Changes:
- New Q8_0 kernel implementations using ByteArray format with inline dequantization
- Addition of `modelType()` to the Configuration interface to distinguish FP16 vs Q8_0 models
- New activation conversion layer supporting FP16-to-FP32 and Q8_0-to-FP32 transformations
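The on-the-fly dequantization the overview describes can be sketched as follows. This is an illustrative model only: a plain `byte[]` stands in for Tornado's `ByteArray`, and the helper names (`dequantize`, `halfToFloat`) are hypothetical, not the PR's actual kernel code.

```java
// Illustrative sketch of the Q8_0 block layout: each 34-byte block holds a
// 2-byte FP16 scale followed by 32 signed int8 quantized values.
// Plain byte[] stands in for Tornado's ByteArray; names are hypothetical.
public class Q8_0Sketch {
    static final int BLOCK_SIZE = 32;          // quantized values per block
    static final int BYTES_PER_BLOCK = 2 + 32; // FP16 scale + 32 int8 values

    // Decode IEEE 754 half-precision bits to float (normal/subnormal/inf/NaN).
    static float halfToFloat(short h) {
        int sign = (h >> 15) & 1, exp = (h >> 10) & 0x1F, mant = h & 0x3FF;
        if (exp == 0) return (sign == 1 ? -1f : 1f) * mant * 0x1p-24f; // subnormal
        if (exp == 31) return mant == 0
                ? (sign == 1 ? Float.NEGATIVE_INFINITY : Float.POSITIVE_INFINITY)
                : Float.NaN;
        // Rebias exponent (15 -> 127) and widen mantissa (10 -> 23 bits).
        return Float.intBitsToFloat((sign << 31) | ((exp + 112) << 23) | (mant << 13));
    }

    // Dequantize element i: locate its block, read the scale, scale the int8 value.
    static float dequantize(byte[] q8, int i) {
        int block = i / BLOCK_SIZE, offset = block * BYTES_PER_BLOCK;
        // FP16 scale is stored little-endian in the first two bytes of the block.
        short scaleBits = (short) ((q8[offset] & 0xFF) | (q8[offset + 1] << 8));
        return halfToFloat(scaleBits) * q8[offset + 2 + (i % BLOCK_SIZE)];
    }
}
```

Because the scale sits adjacent to the values it applies to, a kernel touches one contiguous 34-byte region per block instead of gathering from two separate arrays.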
Reviewed changes
Copilot reviewed 29 out of 29 changed files in this pull request and generated 32 comments.
| File | Description |
|---|---|
| TransformerComputeKernelsLayered.java | Adds new Q8_0Byte kernel variants for matrix operations with inline dequantization |
| TransformerComputeKernels.java | Implements conversion kernels for FP16 and Q8_0 to FP32 format |
| Q8_0TornadoTensor.java | Adds ByteArray constructor and factory method; removes old unpacking methods |
| TornadoTensor.java | Adds asByteArray() method for Q8_0 tensor access |
| Configuration.java + implementations | Adds modelType() method to distinguish FP16 vs Q8_0 models |
| AbstractModelLoader.java | Implements readModelType() to map GGUF file types to model type strings |
| ModelLoader.java | Simplifies tensor loading by removing FP32 conversion helper |
| State.java + implementations | Adds embeddingX field and buffer allocation methods for quantized embeddings |
| Activation.java | Refactors to perform format conversion based on model type |
| InferenceCore.java | Updates token embedding copying to handle FP16 and Q8_0 formats |
| Various FFN layer files | Updates to use new ByteArray-based kernel APIs |
| LogitsQ8_0Layer.java | Updates to use new ByteArray-based kernel API |
| Various loader files | Removes loadTornadoTensorAsFP32 usage in favor of unified loading |
Comments suppressed due to low confidence (1)
src/main/java/org/beehive/gpullama3/tensor/tornado/Q8_0TornadoTensor.java:49
- The method `getSize()` returns `size`, which will be `-1` if the tensor was created using the new `Q8_0TornadoTensor(ByteArray)` constructor. This will cause incorrect behavior for any code calling this method. The size should be calculated from the ByteArray if `tornadoNativeArray` is not null.
```java
public int getSize() {
    return size;
}
```
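Following the reviewer's suggestion, the element count can be derived from the backing buffer's byte length using the 34-byte block layout (2-byte FP16 scale + 32 int8 values). This is a hypothetical sketch; the class and method names are illustrative, not the PR's actual code.

```java
// Hypothetical helper: compute the logical element count of a Q8_0 tensor
// from its backing buffer's byte length. Each 34-byte block (2-byte FP16
// scale + 32 int8 values) contributes 32 logical elements.
public final class Q8_0SizeSketch {
    static final int BYTES_PER_BLOCK = 34;
    static final int VALUES_PER_BLOCK = 32;

    static int elementCount(long byteLength) {
        return (int) (byteLength / BYTES_PER_BLOCK) * VALUES_PER_BLOCK;
    }
}
```

In `getSize()` itself, a guard along the lines of "if `size == -1` and `tornadoNativeArray != null`, return this computed count" would remove the `-1` sentinel from callers' view.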
/rerun all
🚀 Workflow rerun started
✅ Workflow rerun success
… associated usages.
fc1cc89 to 6b66a59
```java
return switch (modelQuantizationAsInt) {
    case 1 -> "FP16";
    case 7 -> "Q8_0";
```
what are these magic numbers 1 & 7?
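One way to address the review comment is to name the constants. The values follow GGUF's `general.file_type` convention (1 = mostly F16, 7 = mostly Q8_0); the constant and class names below are hypothetical, not the PR's actual code.

```java
// Sketch: replace the magic numbers 1 and 7 with named constants matching
// GGUF's general.file_type values. Names are illustrative, not the PR's code.
public final class GGUFFileType {
    static final int MOSTLY_F16 = 1;
    static final int MOSTLY_Q8_0 = 7;

    static String modelType(int fileType) {
        return switch (fileType) {
            case MOSTLY_F16 -> "FP16";
            case MOSTLY_Q8_0 -> "Q8_0";
            default -> throw new IllegalArgumentException("Unsupported file type: " + fileType);
        };
    }
}
```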
Description

This update integrates the new TornadoVM Q8_0 `ByteArray` kernels into GPULlama3, enabling a unified memory layout for quantized transformer inference. The changes replace separate scale and quantized value arrays with a single `ByteArray` representation that matches the GGUF Q8_0 format, improving memory efficiency and performance.

Key Features:
- `ByteArray` Support: Integration of TornadoVM's new `ByteArray` and `HalfFloat` methods for Q8_0 quantized weights
- `convertQ8_0toFP32()` kernel for efficient Q8_0 → FP32 dequantization
- …`ByteArray` Q8_0 format
- …`ByteArray` mapping without intermediate conversions

Problem Description
The previous implementation required converting GGUF Q8_0 data into separate `Int8Array` (quantized values) and `HalfFloatArray` (scales) structures, causing:

The GGUF Q8_0 format stores data as interleaved blocks (a 2-byte HalfFloat scale followed by 32 quantized bytes), which maps naturally to a single `ByteArray`.
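The move from two arrays to one interleaved buffer can be illustrated as below. Plain Java arrays stand in for Tornado's `Int8Array`/`HalfFloatArray`/`ByteArray`, and the method name is hypothetical; this is not the PR's actual loading code.

```java
// Illustrative packing of separate scale/value arrays into the interleaved
// GGUF Q8_0 layout: per block, a 2-byte little-endian FP16 scale followed
// by its 32 int8 values. Names and types are stand-ins, not the PR's code.
public final class Q8_0PackSketch {
    static byte[] packQ8_0(short[] scaleBits, byte[] values) {
        int blocks = scaleBits.length;       // values.length must be blocks * 32
        byte[] out = new byte[blocks * 34];
        for (int b = 0; b < blocks; b++) {
            int off = b * 34;
            out[off]     = (byte) (scaleBits[b] & 0xFF);        // scale, low byte
            out[off + 1] = (byte) ((scaleBits[b] >> 8) & 0xFF); // scale, high byte
            System.arraycopy(values, b * 32, out, off + 2, 32); // 32 quantized values
        }
        return out;
    }
}
```

Since GGUF already stores Q8_0 this way on disk, the single-`ByteArray` representation lets the loader map the tensor data directly instead of performing this repacking (or its inverse) at load time.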