LLM inference benchmark runner for any OpenAI-compatible server. Works with any backend — ROCm (Lemonade SDK), CUDA (Ollama, vLLM), Vulkan, or CPU. Runs latency and throughput benchmarks against remote servers at multiple context depths, then generates comparison plots.
- Multi-server support: per-server config files (`config.<NAME>.sh`) with auto-discovery
- Multi-model runs: benchmark several models in a single invocation
- Variable context depths: test prompt processing and token generation from zero context to hundreds of thousands of tokens
- Atomic results: a temp-directory pattern ensures no partial output on failure
- Built-in plotting: matplotlib charts for prompt processing and token generation throughput vs. context depth
- Traceable results: backend versions embedded in every result file

| Tool | Purpose |
|---|---|
| bash 4+ | Script runtime |
| curl | HTTP requests to OpenAI-compatible API |
| python3 | JSON parsing, plot generation |
| uvx / llama-benchy | Benchmark execution (install uv) |
| matplotlib | Chart generation (auto-installed if missing) |

To skip the auto-install, install matplotlib ahead of time:

    pip3 install matplotlib

Quick start:

    # 1. Create a server configuration
    cp config.template.sh config.MYSERVER.sh
    # Edit config.MYSERVER.sh with your server IP, port, models, and depths

    # 2. Run benchmarks
    ./run_bench.sh MYSERVER

    # 3. Check results
    ls results/<timestamp>/

Project layout:

    llamabench/
    ├── .gitignore              # Ignores config files and results
    ├── README.md               # This file
    ├── run_bench.sh            # Main benchmark orchestrator
    ├── plot.py                 # Result visualization (matplotlib)
    ├── config.template.sh      # Configuration template (committed)
    ├── config.<NAME>.sh        # Server configs (gitignored, one per server)
    └── results/                # Benchmark output (gitignored)
        └── <YYYYMMDD_HHMMSS>/
            ├── system-info.json
            ├── system-info.md
            ├── <model>.md      # Per-model benchmark tables
            ├── combined.p.png  # Prompt processing plot
            └── combined.g.png  # Token generation plot


Each server has its own `config.<NAME>.sh` file, which `run_bench.sh` sources. Copy the template and fill in your values:

    cp config.template.sh config.MYSERVER.sh

| Variable | Description | Example |
|---|---|---|
| `IP` | Server IP address | `"192.168.2.238"` |
| `PORT` | API port | `"13305"` |
| `PLOT_PREFIX` | Filename prefix for plots | `"combined."` |
| `DEPTHS` | Array of context depths to test | `(0 8192 32768 65535 128000)` |
| `MODELS` | Array of model names on the server | `("user.MyModel-Q4")` |
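
As an illustration, a filled-in config could look like the sketch below. It reuses the example values from the table and assumes the template uses plain bash variable and array assignments; `config.template.sh` remains the authoritative reference for names and syntax.

```shell
# config.MYSERVER.sh (illustrative sketch; values are the table's examples,
# and the assignment style is an assumption about the template)
IP="192.168.2.238"                    # server IP address
PORT="13305"                          # API port
PLOT_PREFIX="combined."               # plots become combined.p.png / combined.g.png
DEPTHS=(0 8192 32768 65535 128000)    # context depths to test (bash array)
MODELS=("user.MyModel-Q4")            # model names as known to the server (bash array)
```

Because the file is sourced, it is ordinary shell code: quoting and array syntax errors surface when the runner loads it.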

To add a new server:

    # 1. Copy template
    cp config.template.sh config.NEWBOX.sh
    # 2. Edit with your values
    code config.NEWBOX.sh
    # 3. Run
    ./run_bench.sh NEWBOX

The script auto-discovers all `config.*.sh` files (excluding `config.template.sh`) and lists them as available configs.
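
The discovery step can be pictured with a small shell sketch. `list_configs` is a hypothetical helper written for this README, not code lifted from `run_bench.sh`; it assumes only the `config.<NAME>.sh` naming scheme described above.

```shell
# Sketch (POSIX sh): print the NAME part of every config.<NAME>.sh in a
# directory, skipping the template. list_configs is a hypothetical helper.
# The body runs in a subshell "( ... )" so its variables do not leak.
list_configs() (
    dir=${1:-.}
    for f in "$dir"/config.*.sh; do
        [ -e "$f" ] || continue                  # glob matched nothing
        name=${f##*/}                            # strip the directory part
        [ "$name" = "config.template.sh" ] && continue
        name=${name#config.}                     # strip the "config." prefix
        name=${name%.sh}                         # strip the ".sh" suffix
        printf '%s\n' "$name"
    done
)

# Usage: one name per line, e.g. MYSERVER when config.MYSERVER.sh exists
list_configs .
```

Stripping the prefix and suffix with parameter expansion keeps names that contain dots intact, so `config.lab.box1.sh` would be listed as `lab.box1`.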

Usage:

    ./run_bench.sh <CONFIG_NAME> [OPTIONS]

| Argument | Description |
|---|---|
| `CONFIG_NAME` | Name of the config file (`config.<NAME>.sh`). Also accepts `--config NAME`. |

| Option | Description |
|---|---|
| `--ip <addr>` | Override server IP from config |
| `--port <num>` | Override server port from config |
| `--prefix <str>` | Override plot filename prefix from config |
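
One way to handle this command line is sketched below. The function `parse_args` and the `OVERRIDE_*` variable names are hypothetical and not taken from `run_bench.sh`; only the option names and the positional `CONFIG_NAME` come from the tables above.

```shell
# Sketch (POSIX sh) of the documented CLI: a positional CONFIG_NAME
# (or --config NAME) plus --ip/--port/--prefix overrides.
# parse_args and the OVERRIDE_* names are hypothetical.
parse_args() {
    CONFIG_NAME="" OVERRIDE_IP="" OVERRIDE_PORT="" OVERRIDE_PREFIX=""
    while [ $# -gt 0 ]; do
        case "$1" in
            --config|--ip|--port|--prefix)
                [ $# -ge 2 ] || { echo "missing value for $1" >&2; return 1; }
                case "$1" in
                    --config) CONFIG_NAME=$2 ;;
                    --ip)     OVERRIDE_IP=$2 ;;
                    --port)   OVERRIDE_PORT=$2 ;;
                    --prefix) OVERRIDE_PREFIX=$2 ;;
                esac
                shift 2 ;;
            -*) echo "unknown option: $1" >&2; return 1 ;;
            *)  CONFIG_NAME=$1; shift ;;
        esac
    done
}

parse_args LEO3090 --ip 10.0.0.50
IP="192.168.2.238"          # stands in for the value loaded from the config
IP=${OVERRIDE_IP:-$IP}      # an override wins only when it was given
echo "config=$CONFIG_NAME ip=$IP"   # prints: config=LEO3090 ip=10.0.0.50
```

Keeping overrides in separate variables lets them be applied after the config file has been sourced, so the file's values act as defaults.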

Examples:

    # Run with default config values
    ./run_bench.sh LEO3090
    # Override IP for a single run
    ./run_bench.sh LEO3090 --ip 10.0.0.50
    # List available configs
    ./run_bench.sh

A run goes through these steps:

- Config load: source `config.<NAME>.sh`, validate arrays, apply CLI overrides
- Temp dir: create an isolated workspace (cleaned up on failure)
- System info: fetch `/api/v1/system-info`, extract backend versions
- Per-model loop:
  - Unload the current model via `/api/v1/unload`
  - Run `llama-benchy` with all configured depths
  - Save the markdown table, prepend the backend header
- Plot: generate PNG charts from all result files
- Finalize: move temp contents to `results/<timestamp>/` only on success

All intermediate files are written to a temporary directory. The final `results/<timestamp>` folder is created only after all benchmarks and plots succeed. If any step fails, the temp directory is removed automatically via `trap ... EXIT`.
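
The temp-directory pattern described above can be sketched in a few lines of shell. `run_atomically`, `good_job`, and `bad_job` are hypothetical names used for illustration; `run_bench.sh` may structure this differently.

```shell
# Sketch (POSIX sh) of the temp-directory pattern; all names are hypothetical.
# The function body runs in a subshell "( ... )" so the EXIT trap fires when
# the function finishes rather than when the calling shell exits.
run_atomically() (
    final_dir=$1; shift
    tmp_dir=$(mktemp -d) || exit 1
    trap 'rm -rf "$tmp_dir"' EXIT        # cleanup runs on success and failure

    "$@" "$tmp_dir" || exit 1            # the job writes only into tmp_dir

    mkdir -p "$(dirname "$final_dir")" || exit 1
    mv "$tmp_dir" "$final_dir"           # publish only after the job succeeded
)

good_job() { echo "| model | test | t/s |" > "$1/model.md"; }
bad_job()  { echo "partial" > "$1/model.md"; return 1; }

out=$(mktemp -d)
run_atomically "$out/ok" good_job   && echo "ok published"
run_atomically "$out/fail" bad_job  || echo "fail left no output"
ls "$out"                            # lists only: ok
```

After a successful `mv` the temp path no longer exists, so the cleanup in the trap is a harmless no-op; after a failure it removes the partial files.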

Each run produces a timestamped directory in `results/` containing:

`system-info.json` and `system-info.md`: server hardware and backend versions captured at run time.

`<model>.md`: per-model benchmark table in the standard format, with columns:

    | model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |

Test types:

- `pp`: prompt processing (baseline, 2048 tokens)
- `tg`: token generation (32 tokens)
- `ctx_pp/tg @ d<N>`: full-context processing/generation at depth N
- `pp/tg @ d<N>`: incremental processing/generation at depth N

Plots:

- `<prefix>p.png`: prompt processing throughput vs. context depth
- `<prefix>g.png`: token generation throughput vs. context depth

Each line represents one model. The x-axis is context depth and the y-axis is tokens per second.
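
To inspect the numbers behind a plot without running `plot.py`, the test label and depth can be pulled out of a result table with `awk`. This is a rough sketch, not the parser `plot.py` uses: `extract_rows` is a hypothetical helper, it assumes the depth appears as a trailing `@ d<N>` in the test column (no suffix meaning depth 0), and the sample rows are made up for illustration rather than measured.

```shell
# Sketch (POSIX sh + awk): print "label depth t/s" for each data row.
# extract_rows is hypothetical; the label format is an assumption.
extract_rows() {
    awk -F'|' '
        NF < 5 { next }                              # fewer than 3 columns
        {
            label = $3; rate = $4
            gsub(/^[ \t]+|[ \t]+$/, "", label)       # trim the cells
            gsub(/^[ \t]+|[ \t]+$/, "", rate)
        }
        label == "test" || label ~ /^:?-+:?$/ { next }   # header, separator
        {
            depth = 0
            if (match(label, /@ d[0-9]+$/)) {
                depth = substr(label, RSTART + 3) + 0    # digits after "@ d"
                label = substr(label, 1, RSTART - 1)
                sub(/[ \t]+$/, "", label)
            }
            print label, depth, rate
        }
    '
}

# Made-up sample rows, for illustration only
extract_rows <<'EOF'
| model | test | t/s |
|---|---|---|
| user.MyModel-Q4 | pp | 1000.0 |
| user.MyModel-Q4 | tg | 50.0 |
| user.MyModel-Q4 | pp @ d8192 | 800.0 |
| user.MyModel-Q4 | ctx_tg @ d8192 | 40.0 |
EOF
```

For the sample above this prints `pp 0 1000.0`, `tg 0 50.0`, `pp 8192 800.0`, and `ctx_tg 8192 40.0`, one per line.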

`plot.py` is a standalone Python script that generates charts from markdown result files.

    # Plot specific result files
    python plot.py --prefix "output." results/*/model*.md
    # Plot from stdin
    cat results.md | python plot.py -
    # Default output: p.png and g.png
    python plot.py results.md

| Argument | Description |
|---|---|
| `files` | Markdown files to parse (use `-` for stdin) |
| `--prefix` | Output filename prefix (default: empty) |

Troubleshooting:

| Error | Cause | Fix |
|---|---|---|
| `No config files found` | No `config.*.sh` files present (template excluded) | Copy `config.template.sh` to a named config file |
| `MODELS array is empty` | Config file has no models defined | Add entries to the `MODELS` array in your config |
| `Connection refused` | Server not reachable on IP:port | Verify the server is running and the firewall allows traffic |
| `tokenizer not found` | Model name not on Hugging Face | None needed: llama-benchy falls back to the gpt2 tokenizer, so the warning is harmless |
| `matplotlib not found` | Package not installed | Run `pip3 install matplotlib` or let the script auto-install it |

- An OpenAI-compatible inference server (e.g., Lemonade SDK on ROCm, Ollama, or vLLM) must be running and reachable before you start a benchmark
- Models are unloaded between runs to ensure clean state
- Prefix caching is enabled by default in all benchmark runs
- Backend version headers are prepended to each result file for traceability

This project is provided as-is. See individual dependencies for their respective licenses.