llamabench

LLM inference benchmark runner for any OpenAI-compatible server. Works with any backend — ROCm (Lemonade SDK), CUDA (Ollama, vLLM), Vulkan, or CPU. Runs latency and throughput benchmarks against remote servers at multiple context depths, then generates comparison plots.

Features

Multi-server support — per-server config files (config.<NAME>.sh) with auto-discovery
Multi-model runs — benchmark several models in a single invocation
Variable context depths — test prompt processing and token generation from zero-context to hundreds of thousands of tokens
Atomic results — temp-directory pattern ensures no partial output on failure
Built-in plotting — matplotlib charts for prompt processing and token generation throughput vs. context depth
Traceable results — backend versions embedded in every result file

Requirements

| Tool | Purpose |
|---|---|
| bash 4+ | Script runtime |
| curl | HTTP requests to the OpenAI-compatible API |
| python3 | JSON parsing, plot generation |
| uvx / llama-benchy | Benchmark execution (install uv) |
| matplotlib | Chart generation (auto-installed if missing) |

To install matplotlib ahead of time instead of relying on the auto-install:

pip3 install matplotlib

Quick Start

# 1. Create a server configuration
cp config.template.sh config.MYSERVER.sh
# Edit config.MYSERVER.sh with your server IP, port, models, and depths

# 2. Run benchmarks
./run_bench.sh MYSERVER

# 3. Check results
ls results/<timestamp>/

Project Structure

llamabench/
├── .gitignore              # Ignores config files and results
├── README.md               # This file
├── run_bench.sh            # Main benchmark orchestrator
├── plot.py                 # Result visualization (matplotlib)
├── config.template.sh      # Configuration template (committed)
├── config.<NAME>.sh        # Server configs (gitignored, one per server)
└── results/                # Benchmark output (gitignored)
    └── <YYYYMMDD_HHMMSS>/
        ├── system-info.json
        ├── system-info.md
        ├── <model>.md       # Per-model benchmark tables
        ├── combined.p.png   # Prompt processing plot
        └── combined.g.png   # Token generation plot

Configuration

Each server has its own config.<NAME>.sh file sourced by run_bench.sh. Copy the template and fill in your values:

cp config.template.sh config.MYSERVER.sh

Config Variables

| Variable | Description | Example |
|---|---|---|
| `IP` | Server IP address | `"192.168.2.238"` |
| `PORT` | API port | `"13305"` |
| `PLOT_PREFIX` | Filename prefix for plots | `"combined."` |
| `DEPTHS` | Array of context depths to test | `(0 8192 32768 65535 128000)` |
| `MODELS` | Array of model names on the server | `("user.MyModel-Q4")` |

Adding a New Server

# 1. Copy template
cp config.template.sh config.NEWBOX.sh

# 2. Edit with your values
code config.NEWBOX.sh

# 3. Run
./run_bench.sh NEWBOX

The script auto-discovers all config.*.sh files (excluding config.template.sh) and lists them as available configs.
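
The discovery rule can be illustrated with a short Python sketch. This is not the code in run_bench.sh (which is a bash script); the function name `discover_configs` is hypothetical and only mirrors the rule stated above: match config.*.sh, skip the template, and report the `<NAME>` part.

```python
from pathlib import Path


def discover_configs(directory="."):
    """Return sorted config names found as config.<NAME>.sh, skipping the template."""
    names = []
    for path in Path(directory).glob("config.*.sh"):
        # "config.MYSERVER.sh" -> "MYSERVER"
        name = path.name[len("config."):-len(".sh")]
        if name and name != "template":
            names.append(name)
    return sorted(names)
```

Each returned name is what you would pass as `CONFIG_NAME` to run_bench.sh.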

Usage

./run_bench.sh <CONFIG_NAME> [OPTIONS]

Arguments

| Argument | Description |
|---|---|
| `CONFIG_NAME` | The `<NAME>` part of the config file name `config.<NAME>.sh`. Also accepts `--config NAME`. |

Options

| Option | Description |
|---|---|
| `--ip <addr>` | Override server IP from config |
| `--port <num>` | Override server port from config |
| `--prefix <str>` | Override plot filename prefix from config |
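
The override precedence (a value given on the command line wins over the config file) can be sketched in Python as follows. The real script does this in bash; `resolve_settings` and the dictionary layout are assumptions made for illustration, and only the flag and variable names come from the tables above.

```python
import argparse


def resolve_settings(config, argv):
    """Merge config values with CLI overrides; a flag given on the CLI wins."""
    parser = argparse.ArgumentParser(prog="run_bench.sh")
    parser.add_argument("--ip")
    parser.add_argument("--port")
    parser.add_argument("--prefix")
    args = parser.parse_args(argv)

    settings = dict(config)  # copy so the caller's config is left untouched
    for cli_name, key in (("ip", "IP"), ("port", "PORT"), ("prefix", "PLOT_PREFIX")):
        value = getattr(args, cli_name)
        if value is not None:  # flag absent -> keep the config value
            settings[key] = value
    return settings


config = {"IP": "192.168.2.238", "PORT": "13305", "PLOT_PREFIX": "combined."}
settings = resolve_settings(config, ["--ip", "10.0.0.50"])
```

Here only `IP` changes; `PORT` and `PLOT_PREFIX` keep their config values.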

Examples

# Run with default config values
./run_bench.sh LEO3090

# Override IP for a single run
./run_bench.sh LEO3090 --ip 10.0.0.50

# List available configs
./run_bench.sh

Benchmark Flow

1. Config load: source config.<NAME>.sh, validate arrays, apply CLI overrides
2. Temp dir: create an isolated workspace (cleaned up on failure)
3. System info: fetch /api/v1/system-info, extract backend versions
4. Per-model loop:
   - Unload the current model via /api/v1/unload
   - Run llama-benchy with all configured depths
   - Save the markdown table, prepend the backend header
5. Plot: generate PNG charts from all result files
6. Finalize: move temp contents to results/<timestamp>/ only on success
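
The ordering of the per-model loop and the plot step can be sketched in Python. This is a schematic, not the script's implementation: `run_flow` and its callable parameters (`unload`, `bench`, `plot`) are hypothetical stand-ins for the HTTP call, the llama-benchy invocation, and plot.py, and the way the header is joined to the table is an assumption.

```python
def run_flow(models, depths, unload, bench, plot, backend_header):
    """Run the per-model loop, then plot once; returns {model: markdown_text}."""
    results = {}
    for model in models:
        unload()                      # start each model from a clean state
        table = bench(model, depths)  # one benchmark run covers all depths
        # Assumed layout: header, blank line, then the benchmark table.
        results[model] = backend_header + "\n\n" + table
    plot(results)                     # plotting happens once, after all models
    return results
```

Passing the steps in as callables keeps the sketch runnable without a server and makes the ordering easy to check.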

Temp Directory Pattern

All intermediate files are written to a temporary directory. The final results/<timestamp> folder is created only after all benchmarks and plots succeed. If any step fails, the temp directory is removed automatically via trap ... EXIT.
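
The same publish-only-on-success idea can be sketched in Python. The script itself relies on a bash trap; here `try`/`finally` plays that role. `run_atomically`, the `produce` callback, and the `llamabench.` temp prefix are hypothetical names chosen for the example.

```python
import shutil
import tempfile
from pathlib import Path


def run_atomically(results_root, timestamp, produce):
    """Call produce(workdir); publish its files only if it returns without raising."""
    workdir = Path(tempfile.mkdtemp(prefix="llamabench."))
    try:
        produce(workdir)  # benchmarks and plots write into the temp dir
        final_dir = Path(results_root) / timestamp
        final_dir.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(workdir), str(final_dir))
        return final_dir
    finally:
        # On failure this removes the partial output. After a successful
        # move the temp path no longer exists, so this is a no-op.
        shutil.rmtree(workdir, ignore_errors=True)
```

If `produce` raises, the exception propagates and no `results/<timestamp>` directory is created.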

Result Format

Each run produces a timestamped directory in results/:

system-info.json / system-info.md

Server hardware and backend versions captured at run time.

Per-model markdown tables

Each file contains the markdown table produced by llama-benchy for one model, with the backend version header prepended.

Test types:

pp — Prompt processing (baseline, 2048 tokens)
tg — Token generation (32 tokens)
ctx_pp/tg @ d<N> — Full context processing/generation at depth N
pp/tg @ d<N> — Incremental processing/generation at depth N
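
To work with these labels programmatically, a small parser along the following lines may help. The exact label text in the result files is an assumption here: the pattern accepts the forms listed above and also tolerates a token count directly after `pp` or `tg` (for example `pp2048 @ d8192`). `parse_test_label` is a hypothetical helper, not part of plot.py.

```python
import re

# Optional "ctx_" prefix, test kind, optional token count, optional "@ d<N>" depth.
LABEL = re.compile(
    r"^(?P<ctx>ctx_)?(?P<kind>pp|tg)(?P<tokens>\d+)?(?:\s*@\s*d(?P<depth>\d+))?$"
)


def parse_test_label(label):
    """Split a test label into kind, full-context flag, and context depth."""
    match = LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognized test label: {label!r}")
    return {
        "kind": match["kind"],                      # "pp" or "tg"
        "full_context": match["ctx"] is not None,   # True for ctx_pp / ctx_tg
        "depth": int(match["depth"]) if match["depth"] else 0,  # no suffix -> depth 0
    }
```

Labels without a depth suffix are treated as the zero-context baseline.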

Plots

<prefix>p.png — Prompt processing throughput vs context depth
<prefix>g.png — Token generation throughput vs context depth

Each line represents one model. X-axis is context depth, Y-axis is tokens/second.

plot.py

Standalone Python script for generating charts from markdown result files.

# Plot specific result files
python3 plot.py --prefix "output." results/*/model*.md

# Plot from stdin
cat results.md | python3 plot.py -

# Default output: p.png and g.png
python3 plot.py results.md

Arguments

| Argument | Description |
|---|---|
| `files` | Markdown files to parse (use `-` for stdin) |
| `--prefix` | Output filename prefix (default: empty) |
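
As a rough illustration of what parsing a markdown result file involves, the sketch below reads pipe-table rows into dictionaries and derives the two output filenames from the prefix. It is not the code in plot.py: `parse_markdown_table` and `output_names` are hypothetical helpers, and escaped `|` characters inside cells are not handled.

```python
def parse_markdown_table(text):
    """Parse pipe-table lines in text into a list of {header: cell} dicts."""
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip prose such as a prepended backend header
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first table line names the columns
        elif all(set(cell) <= set("-: ") for cell in cells):
            continue  # separator row such as |---|---:|
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def output_names(prefix=""):
    """Return the prompt-processing and token-generation plot filenames."""
    return prefix + "p.png", prefix + "g.png"
```

With an empty prefix the names are p.png and g.png, matching the default described above.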

Troubleshooting

| Error | Cause | Fix |
|---|---|---|
| `No config files found` | No `config.*.sh` files present (template excluded) | Copy `config.template.sh` to a named config file |
| `MODELS array is empty` | Config file has no models defined | Add entries to the `MODELS` array in your config |
| `Connection refused` | Server not reachable at the configured IP and port | Verify the server is running and the firewall allows traffic |
| `tokenizer not found` | Model name not on Hugging Face | None needed: `llama-benchy` falls back to the `gpt2` tokenizer, so the warning is harmless |
| `matplotlib not found` | Package not installed | Run `pip3 install matplotlib` or let the script auto-install it |

Notes

An OpenAI-compatible inference server (e.g., Lemonade SDK on ROCm, Ollama, or vLLM) must be running and reachable before you start a benchmark
Models are unloaded between runs to ensure clean state
Prefix caching is enabled by default in all benchmark runs
Backend version headers are prepended to each result file for traceability

License

This project is provided as-is. See individual dependencies for their respective licenses.
