Improve collection of performance/throughput metrics by orionpapadakis · Pull Request #113 · beehive-lab/GPULlama3.java

orionpapadakis · 2026-05-05T12:47:21Z

This PR addresses #104.

Introduces a RunMetrics singleton that accumulates timing data across the full inference stack and prints a unified performance summary, replacing the previous ad-hoc LastRunMetrics record.
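As a rough illustration of the accumulate-then-report idea, a process-wide singleton could look like the sketch below. The class name `RunMetricsSketch` and all of its methods are hypothetical and are not taken from the PR; the actual `RunMetrics` class may be structured quite differently.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch only: names and methods are assumptions, not the PR's RunMetrics API.
final class RunMetricsSketch {
    private static final RunMetricsSketch INSTANCE = new RunMetricsSketch();

    // Counters and durations (nanoseconds) accumulated across the run.
    private final AtomicLong promptEvalCount = new AtomicLong();
    private final AtomicLong promptEvalDurationNanos = new AtomicLong();
    private final AtomicLong evalCount = new AtomicLong();
    private final AtomicLong evalDurationNanos = new AtomicLong();

    private RunMetricsSketch() {}

    static RunMetricsSketch get() {
        return INSTANCE;
    }

    // Record one prefill step: tokens processed and elapsed nanoseconds.
    void addPrefill(long tokens, long nanos) {
        promptEvalCount.addAndGet(tokens);
        promptEvalDurationNanos.addAndGet(nanos);
    }

    // Record one decode step: tokens generated and elapsed nanoseconds.
    void addDecode(long tokens, long nanos) {
        evalCount.addAndGet(tokens);
        evalDurationNanos.addAndGet(nanos);
    }

    long promptEvalCount() { return promptEvalCount.get(); }
    long promptEvalDurationNanos() { return promptEvalDurationNanos.get(); }
    long evalCount() { return evalCount.get(); }
    long evalDurationNanos() { return evalDurationNanos.get(); }

    // Clear all accumulated values, e.g. between runs.
    void reset() {
        promptEvalCount.set(0);
        promptEvalDurationNanos.set(0);
        evalCount.set(0);
        evalDurationNanos.set(0);
    }
}
```

Any layer of the stack can then call `RunMetricsSketch.get()` and add its own measurements without passing a metrics object through every method signature, which is the usual motivation for a singleton here.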

Metrics collected

Core (aligned with Ollama-style reporting):

load_duration — model file load time, measured in ModelLoader (both overloads)
prompt_eval_count / prompt_eval_duration — prefill token count and wall-clock time
eval_count / eval_duration — decode token count and wall-clock time
total_duration — full inference wall-clock time

TornadoVM-specific:

tornado_task_graph_compile_duration — plan/graph construction time
tornado_task_graph_warmup_duration — JIT compilation (withPreCompilation())
read_only_weights_copy_in_duration — first-execution weight upload (forceCopyInReadOnlyData())

All values are stored in nanoseconds.
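One way to capture such durations is to wrap each phase in a small timing helper built on `System.nanoTime()`. The helper below is a sketch under that assumption; `PhaseTimerSketch` and `timed` are hypothetical names, and the PR may simply take timestamps inline.

```java
import java.util.function.LongConsumer;
import java.util.function.Supplier;

// Hypothetical helper, not part of the PR: times a phase and reports elapsed nanoseconds.
final class PhaseTimerSketch {
    private PhaseTimerSketch() {}

    // Runs the phase, then passes the elapsed wall-clock time (ns) to the sink,
    // even if the phase throws.
    static <T> T timed(Supplier<T> phase, LongConsumer sinkNanos) {
        long start = System.nanoTime();
        try {
            return phase.get();
        } finally {
            sinkNanos.accept(System.nanoTime() - start);
        }
    }
}
```

A caller might wrap a model load as `timed(() -> load(path), metrics::addLoadNanos)`, where both `load` and `addLoadNanos` are placeholders for whatever the surrounding code provides.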

Derived throughput printed at end of run

Standard engine: achieved tok/s: X. Tokens: N, seconds: T
Prefill-decode engines: per-phase rates (prefill tok/s, decode tok/s) in addition to total
TornadoVM init breakdown: printed only when --verbose-init is set (no new flags introduced)
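The derived rates follow directly from a token count and a nanosecond duration: tok/s = count / (duration_ns / 1e9). The sketch below shows that arithmetic; the class and method names are hypothetical, and the two-decimal formatting is an assumption, since the PR description only gives the line's general shape.

```java
import java.util.Locale;

// Hypothetical sketch of the throughput arithmetic; formatting precision is an assumption.
final class ThroughputSketch {
    private ThroughputSketch() {}

    // Tokens per second from a token count and a duration in nanoseconds.
    // Returns 0.0 for a non-positive duration to avoid dividing by zero.
    static double tokensPerSecond(long tokens, long durationNanos) {
        if (durationNanos <= 0) {
            return 0.0;
        }
        return tokens / (durationNanos / 1_000_000_000.0);
    }

    // Builds a line in the shape "achieved tok/s: X. Tokens: N, seconds: T".
    static String summaryLine(long tokens, long durationNanos) {
        double seconds = durationNanos / 1_000_000_000.0;
        return String.format(Locale.ROOT, "achieved tok/s: %.2f. Tokens: %d, seconds: %.2f",
                tokensPerSecond(tokens, durationNanos), tokens, seconds);
    }
}
```

For example, 100 tokens in 2,000,000,000 ns gives 50 tok/s. The same function applied to prompt_eval_count / prompt_eval_duration and eval_count / eval_duration would yield the per-phase prefill and decode rates.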

How to test

All commands below use Llama-3.2-1B-Instruct-F16.gguf. Each should print a ==== Performance Metrics ==== block at the end.

Standard engine — CPU

./llama-tornado --model <model> --prompt "Tell me a joke"
Expected: single "achieved tok/s" line (no prefill/decode split)

Standard engine — GPU

./llama-tornado --gpu --ptx --model <model> --prompt "Tell me a joke"
Expected: single "achieved tok/s" line

Standard engine — GPU + --verbose-init

./llama-tornado --gpu --ptx --model <model> --prompt "Tell me a joke" --verbose-init
Expected: throughput line + "Compilation & CodeGen / Warmup / Read-only weights Copy-in" lines

Prefill-decode — CPU

./llama-tornado --model <model> --prompt "Tell me a joke" --with-prefill-decode
Expected: Total + Prefill + Decode tok/s lines

Prefill-decode — GPU

./llama-tornado --gpu --ptx --model <model> --prompt "Tell me a joke" --with-prefill-decode --verbose-init
Expected: per-phase throughput + TornadoVM init breakdown

Batch prefill-decode — GPU + CUDA graphs

./llama-tornado --gpu --ptx --model <model> --prompt "Tell me a joke" \
  --with-prefill-decode --batch-prefill-size 32 --cuda-graphs --verbose-init
Expected: per-phase throughput + TornadoVM init breakdown

orionpapadakis added 4 commits May 5, 2026 13:40

Introduce RunMetrics singleton to track and report fine-grained performance metrics across inference stages

45ada2f

Refactor TornadoVM initialization metrics tracking to include finer granularity across plan creation, JIT, and weight transfer stages

83998cc

Track model loading duration in RunMetrics and include it in timing reports

623f613

Remove LastRunMetrics as it is replaced by RunMetrics

964927b

orionpapadakis requested a review from mikepapadim May 5, 2026 12:47

orionpapadakis added the enhancement New feature or request label May 5, 2026

mikepapadim approved these changes May 5, 2026

Add RunMetricsRenderer abstraction with support for human-readable, JSON, and GitHub Step Summary formats

e966ed1
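A renderer abstraction of this kind typically separates the collected values from their presentation. The sketch below illustrates the idea with three output formats; the `Renderer` interface, its `render` signature, and the exact output layouts are all hypothetical and should not be read as the PR's `RunMetricsRenderer` API.

```java
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: interface, names, and layouts are assumptions, not the PR's code.
final class RendererSketch {
    private RendererSketch() {}

    interface Renderer {
        String render(Map<String, Long> metrics);
    }

    // Plain-text block for terminal output.
    static final Renderer HUMAN = metrics -> metrics.entrySet().stream()
            .map(e -> e.getKey() + ": " + e.getValue())
            .collect(Collectors.joining("\n", "==== Performance Metrics ====\n", "\n"));

    // Flat JSON object; keys are assumed to be simple identifiers needing no escaping.
    static final Renderer JSON = metrics -> metrics.entrySet().stream()
            .map(e -> "\"" + e.getKey() + "\":" + e.getValue())
            .collect(Collectors.joining(",", "{", "}"));

    // Markdown table, suitable for appending to the file named by $GITHUB_STEP_SUMMARY.
    static final Renderer STEP_SUMMARY = metrics -> metrics.entrySet().stream()
            .map(e -> "| " + e.getKey() + " | " + e.getValue() + " |")
            .collect(Collectors.joining("\n", "| Metric | Value |\n| --- | --- |\n", "\n"));
}
```

Passing an insertion-ordered map (for example a `LinkedHashMap`) keeps the metrics in a stable order across all three formats.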

mikepapadim self-requested a review May 6, 2026 08:51

orionpapadakis force-pushed the feat/performance-metrics branch from 72ce770 to e966ed1 Compare May 6, 2026 09:18

orionpapadakis wants to merge 5 commits into main from feat/performance-metrics
