
Improve collection of performance/throughput metrics#113

Open
orionpapadakis wants to merge 5 commits into main from feat/performance-metrics

Conversation

@orionpapadakis
Collaborator

This PR addresses #104.

Introduces a RunMetrics singleton that accumulates timing data across the full inference stack and prints a unified performance summary, replacing the previous ad-hoc LastRunMetrics record.
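The PR doesn't show the class layout inline; as a rough sketch, a singleton that accumulates nanosecond durations across the stack could look like the following (all field and method names besides `RunMetrics` are illustrative assumptions, not the PR's actual API):

```java
// Rough sketch of a RunMetrics-style accumulator. Names other than
// RunMetrics are assumptions; the real class tracks more phases
// (load, task-graph compile, warmup, weight copy-in, total).
public final class RunMetrics {
    private static final RunMetrics INSTANCE = new RunMetrics();

    // All durations are accumulated in nanoseconds.
    private long promptEvalDurationNs;
    private long evalDurationNs;
    private long promptEvalCount;
    private long evalCount;

    private RunMetrics() { }

    public static RunMetrics instance() { return INSTANCE; }

    // Prefill phase: token count and wall-clock time.
    public void recordPrefill(long tokens, long durationNs) {
        promptEvalCount += tokens;
        promptEvalDurationNs += durationNs;
    }

    // Decode phase: token count and wall-clock time.
    public void recordDecode(long tokens, long durationNs) {
        evalCount += tokens;
        evalDurationNs += durationNs;
    }

    public long evalCount() { return evalCount; }
    public long evalDurationNs() { return evalDurationNs; }
}
```

Because the instance is shared, call sites anywhere in the inference stack can record into it and a single summary printer can read it at the end of the run.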

Metrics collected

Core (aligned with Ollama-style reporting):
  • load_duration — model file load time, measured in ModelLoader (both overloads)
  • prompt_eval_count / prompt_eval_duration — prefill token count and wall-clock time
  • eval_count / eval_duration — decode token count and wall-clock time
  • total_duration — full inference wall-clock time
TornadoVM-specific:
  • tornado_task_graph_compile_duration — plan/graph construction time
  • tornado_task_graph_warmup_duration — JIT compilation (withPreCompilation())
  • read_only_weights_copy_in_duration — first-execution weight upload (forceCopyInReadOnlyData())

All values stored in nanoseconds.
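Wall-clock nanosecond durations of this kind are typically captured with `System.nanoTime()`; a hypothetical helper (not part of the PR) illustrates the pattern:

```java
public final class PhaseTimer {
    // Runs a phase and returns its wall-clock duration in nanoseconds.
    // Illustrative helper, not part of the PR's actual code.
    public static long timeNs(Runnable phase) {
        long start = System.nanoTime();
        phase.run();
        return System.nanoTime() - start;
    }
}
```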

Derived throughput printed at end of run

  • Standard engine: achieved tok/s: X. Tokens: N, seconds: T
  • Prefill-decode engines: per-phase rates (prefill tok/s, decode tok/s) in addition to total
  • TornadoVM init breakdown printed only when --verbose-init is set — no new flags introduced
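With everything stored in nanoseconds, each of the throughput lines above reduces to one division per phase. A minimal sketch (the helper name is an assumption):

```java
public final class Throughput {
    // Tokens per second from a nanosecond duration; returns 0 for a
    // zero duration to avoid division by zero.
    public static double tokensPerSecond(long tokens, long durationNs) {
        return durationNs == 0 ? 0.0 : tokens / (durationNs / 1e9);
    }

    public static void main(String[] args) {
        // e.g. 128 decode tokens in 2 s of eval_duration -> 64 tok/s
        System.out.println("achieved tok/s: "
                + tokensPerSecond(128, 2_000_000_000L));
    }
}
```

For the prefill-decode engines, the same computation is applied three times: once with the prefill counters, once with the decode counters, and once with their totals.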

How to test

All commands below use Llama-3.2-1B-Instruct-F16.gguf. Each should print a ==== Performance Metrics ==== block at the end.

Standard engine — CPU

./llama-tornado --model <model> --prompt "Tell me a joke"
Expected: single "achieved tok/s" line (no prefill/decode split)

Standard engine — GPU

./llama-tornado --gpu --ptx --model <model> --prompt "Tell me a joke"
Expected: single "achieved tok/s" line

Standard engine — GPU + --verbose-init

./llama-tornado --gpu --ptx --model <model> --prompt "Tell me a joke" --verbose-init
Expected: throughput line + "Compilation & CodeGen / Warmup / Read-only weights Copy-in" lines

Prefill-decode — CPU

./llama-tornado --model <model> --prompt "Tell me a joke" --with-prefill-decode
Expected: Total + Prefill + Decode tok/s lines

Prefill-decode — GPU

./llama-tornado --gpu --ptx --model <model> --prompt "Tell me a joke" --with-prefill-decode --verbose-init
Expected: per-phase throughput + TornadoVM init breakdown

Batch prefill-decode — GPU + CUDA graphs

./llama-tornado --gpu --ptx --model <model> --prompt "Tell me a joke" \
  --with-prefill-decode --batch-prefill-size 32 --cuda-graphs --verbose-init
Expected: per-phase throughput + TornadoVM init breakdown

@orionpapadakis orionpapadakis requested a review from mikepapadim May 5, 2026 12:47
@orionpapadakis orionpapadakis added the enhancement New feature or request label May 5, 2026
@mikepapadim mikepapadim self-requested a review May 6, 2026 08:51
@orionpapadakis orionpapadakis force-pushed the feat/performance-metrics branch from 72ce770 to e966ed1 Compare May 6, 2026 09:18
