[Opt] Manipulation of Q8_0 tensors with Tornado ByteArrays #79
mikepapadim merged 30 commits into beehive-lab:main

Conversation
Commits:
- …n in TornadoVM acceleration.
- …el loaders for consistent tensor loading.
  # Conflicts: # src/main/java/org/beehive/gpullama3/model/loader/ModelLoader.java
- …ray.fromSegmentShallow`
- …0Byte` kernels for Q8_0 matrix-vector computations
- …trix-vector computations
- …thSiLUAndGLUActivationQ8_0Byte` kernels for byte-based Q8_0 computations
- … compute kernels
Pull request overview
This work-in-progress PR refactors Q8_0 quantized tensor handling to use Tornado's ByteArray type instead of separate arrays for quantized values and scales. The new approach stores Q8_0 blocks (2-byte FP16 scale + 32-byte quantized values) contiguously in ByteArrays, with new kernels that dequantize on-the-fly during computation. The changes are currently functional for Llama models, with other models still under development.
Key Changes:
- New Q8_0 kernel implementations using ByteArray format with inline dequantization
- Addition of `modelType()` to the Configuration interface to distinguish FP16 vs Q8_0 models
- New activation conversion layer supporting FP16-to-FP32 and Q8_0-to-FP32 transformations
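The on-the-fly dequantization the overview describes can be sketched as follows. This is an illustrative model only: a plain `byte[]` stands in for Tornado's `ByteArray`, and the helper names (`dequantize`, `halfToFloat`) are hypothetical, not the PR's actual kernel code.

```java
// Illustrative sketch of the Q8_0 block layout: each 34-byte block holds a
// 2-byte FP16 scale followed by 32 signed int8 quantized values.
// Plain byte[] stands in for Tornado's ByteArray; names are hypothetical.
public class Q8_0Sketch {
    static final int BLOCK_SIZE = 32;          // quantized values per block
    static final int BYTES_PER_BLOCK = 2 + 32; // FP16 scale + 32 int8 values

    // Decode IEEE 754 half-precision bits to float (normal/subnormal/inf/NaN).
    static float halfToFloat(short h) {
        int sign = (h >> 15) & 1, exp = (h >> 10) & 0x1F, mant = h & 0x3FF;
        if (exp == 0) return (sign == 1 ? -1f : 1f) * mant * 0x1p-24f; // subnormal
        if (exp == 31) return mant == 0
                ? (sign == 1 ? Float.NEGATIVE_INFINITY : Float.POSITIVE_INFINITY)
                : Float.NaN;
        // Rebias exponent (15 -> 127) and widen mantissa (10 -> 23 bits).
        return Float.intBitsToFloat((sign << 31) | ((exp + 112) << 23) | (mant << 13));
    }

    // Dequantize element i: locate its block, read the scale, scale the int8 value.
    static float dequantize(byte[] q8, int i) {
        int block = i / BLOCK_SIZE, offset = block * BYTES_PER_BLOCK;
        // FP16 scale is stored little-endian in the first two bytes of the block.
        short scaleBits = (short) ((q8[offset] & 0xFF) | (q8[offset + 1] << 8));
        return halfToFloat(scaleBits) * q8[offset + 2 + (i % BLOCK_SIZE)];
    }
}
```

Because the scale sits adjacent to the values it applies to, a kernel touches one contiguous 34-byte region per block instead of gathering from two separate arrays.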
Reviewed changes
Copilot reviewed 29 out of 29 changed files in this pull request and generated 32 comments.
| File | Description |
|---|---|
| TransformerComputeKernelsLayered.java | Adds new Q8_0Byte kernel variants for matrix operations with inline dequantization |
| TransformerComputeKernels.java | Implements conversion kernels for FP16 and Q8_0 to FP32 format |
| Q8_0TornadoTensor.java | Adds ByteArray constructor and factory method; removes old unpacking methods |
| TornadoTensor.java | Adds asByteArray() method for Q8_0 tensor access |
| Configuration.java + implementations | Adds modelType() method to distinguish FP16 vs Q8_0 models |
| AbstractModelLoader.java | Implements readModelType() to map GGUF file types to model type strings |
| ModelLoader.java | Simplifies tensor loading by removing FP32 conversion helper |
| State.java + implementations | Adds embeddingX field and buffer allocation methods for quantized embeddings |
| Activation.java | Refactors to perform format conversion based on model type |
| InferenceCore.java | Updates token embedding copying to handle FP16 and Q8_0 formats |
| Various FFN layer files | Updates to use new ByteArray-based kernel APIs |
| LogitsQ8_0Layer.java | Updates to use new ByteArray-based kernel API |
| Various loader files | Removes loadTornadoTensorAsFP32 usage in favor of unified loading |
Comments suppressed due to low confidence (1)
src/main/java/org/beehive/gpullama3/tensor/tornado/Q8_0TornadoTensor.java:49
- The method `getSize()` returns `size`, which will be `-1` if the tensor was created using the new `Q8_0TornadoTensor(ByteArray)` constructor. This will cause incorrect behavior for any code calling this method. The size should be calculated from the ByteArray if `tornadoNativeArray` is not null.
```java
public int getSize() {
    return size;
}
```
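Following the reviewer's suggestion, the element count can be derived from the backing buffer's byte length using the 34-byte block layout (2-byte FP16 scale + 32 int8 values). This is a hypothetical sketch; the class and method names are illustrative, not the PR's actual code.

```java
// Hypothetical helper: compute the logical element count of a Q8_0 tensor
// from its backing buffer's byte length. Each 34-byte block (2-byte FP16
// scale + 32 int8 values) contributes 32 logical elements.
public final class Q8_0SizeSketch {
    static final int BYTES_PER_BLOCK = 34;
    static final int VALUES_PER_BLOCK = 32;

    static int elementCount(long byteLength) {
        return (int) (byteLength / BYTES_PER_BLOCK) * VALUES_PER_BLOCK;
    }
}
```

In `getSize()` itself, a guard along the lines of "if `size == -1` and `tornadoNativeArray != null`, return this computed count" would remove the `-1` sentinel from callers' view.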
/rerun all
🚀 Workflow rerun started
✅ Workflow rerun success
… associated usages.
fc1cc89 to 6b66a59
```java
return switch (modelQuantizationAsInt) {
    case 1 -> "FP16";
    case 7 -> "Q8_0";
```
what are these magic numbers 1 & 7?
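One way to address the review comment is to name the constants. The values follow GGUF's `general.file_type` convention (1 = mostly F16, 7 = mostly Q8_0); the constant and class names below are hypothetical, not the PR's actual code.

```java
// Sketch: replace the magic numbers 1 and 7 with named constants matching
// GGUF's general.file_type values. Names are illustrative, not the PR's code.
public final class GGUFFileType {
    static final int MOSTLY_F16 = 1;
    static final int MOSTLY_Q8_0 = 7;

    static String modelType(int fileType) {
        return switch (fileType) {
            case MOSTLY_F16 -> "FP16";
            case MOSTLY_Q8_0 -> "Q8_0";
            default -> throw new IllegalArgumentException("Unsupported file type: " + fileType);
        };
    }
}
```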
Description

This update integrates the new TornadoVM Q8_0 `ByteArray` kernels into GPULlama3, enabling a unified memory layout for quantized transformer inference. The changes replace separate scale and quantized value arrays with a single `ByteArray` representation that matches the GGUF Q8_0 format, improving memory efficiency and performance.

Key Features:
- `ByteArray` Support: Integration of TornadoVM's new `ByteArray` and `HalfFloat` methods for Q8_0 quantized weights
- `convertQ8_0toFP32()` kernel for efficient Q8_0 → FP32 dequantization
- …`ByteArray` Q8_0 format
- …`ByteArray` mapping without intermediate conversions

Problem Description
The previous implementation required converting GGUF Q8_0 data into separate `Int8Array` (quantized values) and `HalfFloatArray` (scales) structures, causing:

The GGUF Q8_0 format stores data as interleaved blocks (a 2-byte HalfFloat scale followed by 32 quantized bytes), which maps naturally to a single `ByteArray`.
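The move from two arrays to one interleaved buffer can be illustrated as below. Plain Java arrays stand in for Tornado's `Int8Array`/`HalfFloatArray`/`ByteArray`, and the method name is hypothetical; this is not the PR's actual loading code.

```java
// Illustrative packing of separate scale/value arrays into the interleaved
// GGUF Q8_0 layout: per block, a 2-byte little-endian FP16 scale followed
// by its 32 int8 values. Names and types are stand-ins, not the PR's code.
public final class Q8_0PackSketch {
    static byte[] packQ8_0(short[] scaleBits, byte[] values) {
        int blocks = scaleBits.length;       // values.length must be blocks * 32
        byte[] out = new byte[blocks * 34];
        for (int b = 0; b < blocks; b++) {
            int off = b * 34;
            out[off]     = (byte) (scaleBits[b] & 0xFF);        // scale, low byte
            out[off + 1] = (byte) ((scaleBits[b] >> 8) & 0xFF); // scale, high byte
            System.arraycopy(values, b * 32, out, off + 2, 32); // 32 quantized values
        }
        return out;
    }
}
```

Since GGUF already stores Q8_0 this way on disk, the single-`ByteArray` representation lets the loader map the tensor data directly instead of performing this repacking (or its inverse) at load time.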