leoai-81/llamabench

llamabench

LLM inference benchmark runner for any OpenAI-compatible server. Works with any backend — ROCm (Lemonade SDK), CUDA (Ollama, vLLM), Vulkan, or CPU. Runs latency and throughput benchmarks against remote servers at multiple context depths, then generates comparison plots.

Features

  • Multi-server support — per-server config files (config.<NAME>.sh) with auto-discovery
  • Multi-model runs — benchmark several models in a single invocation
  • Variable context depths — test prompt processing and token generation from zero-context to hundreds of thousands of tokens
  • Atomic results — temp-directory pattern ensures no partial output on failure
  • Built-in plotting — matplotlib charts for prompt processing and token generation throughput vs. context depth
  • Traceable results — backend versions embedded in every result file

Requirements

| Tool | Purpose |
|---|---|
| bash 4+ | Script runtime |
| curl | HTTP requests to OpenAI-compatible API |
| python3 | JSON parsing, plot generation |
| uvx / llama-benchy | Benchmark execution (install uv) |
| matplotlib | Chart generation (auto-installed if missing) |

Install matplotlib ahead of time:

pip3 install matplotlib

Quick Start

# 1. Create a server configuration
cp config.template.sh config.MYSERVER.sh
# Edit config.MYSERVER.sh with your server IP, port, models, and depths

# 2. Run benchmarks
./run_bench.sh MYSERVER

# 3. Check results
ls results/<timestamp>/

Project Structure

llamabench/
├── .gitignore              # Ignores config files and results
├── README.md               # This file
├── run_bench.sh            # Main benchmark orchestrator
├── plot.py                 # Result visualization (matplotlib)
├── config.template.sh      # Configuration template (committed)
├── config.<NAME>.sh        # Server configs (gitignored, one per server)
└── results/                # Benchmark output (gitignored)
    └── <YYYYMMDD_HHMMSS>/
        ├── system-info.json
        ├── system-info.md
        ├── <model>.md       # Per-model benchmark tables
        ├── combined.p.png   # Prompt processing plot
        └── combined.g.png   # Token generation plot

Configuration

Each server has its own config.<NAME>.sh file sourced by run_bench.sh. Copy the template and fill in your values:

cp config.template.sh config.MYSERVER.sh

Config Variables

| Variable | Description | Example |
|---|---|---|
| IP | Server IP address | "192.168.2.238" |
| PORT | API port | "13305" |
| PLOT_PREFIX | Filename prefix for plots | "combined." |
| DEPTHS | Array of context depths to test | (0 8192 32768 65535 128000) |
| MODELS | Array of model names on the server | ("user.MyModel-Q4") |
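Put together, a filled-in config might look like the sketch below. It uses only the variable names and example values from the table above; the real config.template.sh may contain additional comments or settings not shown here.

```shell
# config.MYSERVER.sh - sourced by run_bench.sh (illustrative sketch)
IP="192.168.2.238"                   # server IP address
PORT="13305"                         # API port
PLOT_PREFIX="combined."              # plots become combined.p.png / combined.g.png
DEPTHS=(0 8192 32768 65535 128000)   # context depths to test
MODELS=("user.MyModel-Q4")           # model names as known to the server
```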

Adding a New Server

# 1. Copy template
cp config.template.sh config.NEWBOX.sh

# 2. Edit with your values (any editor works; code is the VS Code CLI)
code config.NEWBOX.sh

# 3. Run
./run_bench.sh NEWBOX

The script auto-discovers all config.*.sh files (excluding config.template.sh) and lists them as available configs.
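The discovery step can be sketched roughly as follows. This is an illustration under assumptions, not the code in run_bench.sh: the function name list_configs and its optional directory argument are hypothetical.

```shell
# Hypothetical sketch of config auto-discovery; not taken from run_bench.sh.
# Prints the <NAME> part of every config.<NAME>.sh in a directory,
# skipping the committed template.
list_configs() {
    local dir="${1:-.}" file name
    for file in "$dir"/config.*.sh; do
        [ -e "$file" ] || continue        # glob matched nothing
        name="${file##*/}"                # strip the directory part
        [ "$name" = "config.template.sh" ] && continue
        name="${name#config.}"            # strip the "config." prefix
        name="${name%.sh}"                # strip the ".sh" suffix
        printf '%s\n' "$name"
    done
}
```

Calling list_configs in the repository root would print one config name per line, which is the form accepted as CONFIG_NAME.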

Usage

./run_bench.sh <CONFIG_NAME> [OPTIONS]

Arguments

| Argument | Description |
|---|---|
| CONFIG_NAME | The <NAME> part of the config file name (config.<NAME>.sh). Also accepts --config NAME. |

Options

| Option | Description |
|---|---|
| --ip <addr> | Override server IP from config |
| --port <num> | Override server port from config |
| --prefix <str> | Override plot filename prefix from config |
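One way such overrides can be layered on top of a sourced config is sketched below. The function name apply_overrides and the error messages are assumptions for illustration; only the option names and the IP, PORT, and PLOT_PREFIX variables come from this README.

```shell
# Hypothetical sketch of CLI override handling; not taken from run_bench.sh.
# Call after sourcing config.<NAME>.sh so options win over config values.
apply_overrides() {
    while [ "$#" -gt 0 ]; do
        case "$1" in
            --ip|--port|--prefix) ;;      # known options, handled below
            *) echo "Unknown option: $1" >&2; return 1 ;;
        esac
        if [ "$#" -lt 2 ]; then
            echo "Missing value for $1" >&2
            return 1
        fi
        case "$1" in
            --ip)     IP="$2" ;;
            --port)   PORT="$2" ;;
            --prefix) PLOT_PREFIX="$2" ;;
        esac
        shift 2                           # consume the option and its value
    done
}
```

Checking the argument count before shift 2 matters: with a trailing option that has no value, the shift would fail without consuming anything and the loop would never end.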

Examples

# Run with default config values
./run_bench.sh LEO3090

# Override IP for a single run
./run_bench.sh LEO3090 --ip 10.0.0.50

# List available configs
./run_bench.sh

Benchmark Flow

  1. Config load — Source config.<NAME>.sh, validate arrays, apply CLI overrides
  2. Temp dir — Create isolated workspace (cleaned up on failure)
  3. System info — Fetch /api/v1/system-info, extract backend versions
  4. Per-model loop:
    • Unload current model via /api/v1/unload
    • Run llama-benchy with all configured depths
    • Save markdown table, prepend backend header
  5. Plot — Generate PNG charts from all result files
  6. Finalize — Move temp contents to results/<timestamp>/ only on success
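Before stepping through the flow, it can help to confirm that the server answers at all. The sketch below reuses the /api/v1/system-info endpoint from step 3; the helper names are hypothetical, and it assumes the endpoint responds to a plain HTTP GET, which this README does not state.

```shell
# Hypothetical pre-flight helpers; not taken from run_bench.sh.
# Assumes IP and PORT are set as in config.<NAME>.sh.

# Build the base URL of the server from IP and PORT.
server_url() {
    printf 'http://%s:%s' "$IP" "$PORT"
}

# Return 0 if /api/v1/system-info answers successfully within 5 seconds.
# Assumption: the endpoint accepts a plain GET over HTTP.
server_reachable() {
    curl --silent --fail --max-time 5 --output /dev/null \
        "$(server_url)/api/v1/system-info"
}
```

A caller could then write `server_reachable || echo "Cannot reach $(server_url)" >&2` before starting a long benchmark run.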

Temp Directory Pattern

All intermediate files are written to a temporary directory. The final results/<timestamp> folder is created only after all benchmarks and plots succeed. If any step fails, the temp directory is removed automatically via trap ... EXIT.
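A minimal sketch of this pattern follows. It is not the code from run_bench.sh: the function name run_atomically and the convention of passing the workspace path as the job's last argument are assumptions made for the example.

```shell
# Hypothetical sketch of the temp-directory pattern; not taken from run_bench.sh.
# Usage: run_atomically <results_root> <command> [args...]
# The command receives the workspace path as its last argument.
run_atomically() {
    local results_root="$1"
    shift
    (
        work_dir="$(mktemp -d)" || exit 1
        # Remove the workspace whenever this subshell exits. After a
        # successful move the path no longer exists, so this is a no-op.
        trap 'rm -rf "$work_dir"' EXIT
        "$@" "$work_dir" || exit 1        # any failure aborts before finalize
        mkdir -p "$results_root" || exit 1
        mv "$work_dir" "$results_root/$(date +%Y%m%d_%H%M%S)" || exit 1
    )
}
```

Explicit `|| exit 1` checks are used instead of set -e because bash ignores errexit when a function is called from an if condition or an && / || list, which would let a failed step slip through to the final move.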

Result Format

Each run produces a timestamped directory in results/:

system-info.json / system-info.md

Server hardware and backend versions captured at run time.

Per-model markdown tables

Standard format with columns:

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |

Test types:

  • pp — Prompt processing (baseline, 2048 tokens)
  • tg — Token generation (32 tokens)
  • ctx_pp/tg @ d<N> — Full context processing/generation at depth N
  • pp/tg @ d<N> — Incremental processing/generation at depth N
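Because the results are plain markdown tables, they are easy to post-process outside plot.py. The sketch below pulls the test and t/s columns out of a result file with awk; the function name extract_throughput is hypothetical, and it assumes every data row follows the column order shown above.

```shell
# Hypothetical helper; not part of llamabench.
# Prints "<test><TAB><t/s>" for each data row of a result table.
# Reads the named files, or stdin when no file is given.
extract_throughput() {
    awk -F'|' '
        !/^\|/ { next }                       # skip non-table lines (e.g. headers)
        {
            test = $3; tps = $4               # $1 is empty, $2 is the model column
            gsub(/^[ \t]+|[ \t]+$/, "", test) # trim surrounding whitespace
            gsub(/^[ \t]+|[ \t]+$/, "", tps)
            if (test == "test" || test ~ /^[-:]+$/) next   # header and separator rows
            print test "\t" tps
        }
    ' "$@"
}
```

For example, `extract_throughput results/<timestamp>/<model>.md` lists one test per line with its throughput cell exactly as it appears in the file.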

Plots

  • <prefix>p.png — Prompt processing throughput vs context depth
  • <prefix>g.png — Token generation throughput vs context depth

Each line represents one model. X-axis is context depth, Y-axis is tokens/second.

plot.py

Standalone Python script for generating charts from markdown result files.

# Plot specific result files
python3 plot.py --prefix "output." results/<timestamp>/<model>.md

# Plot from stdin
cat results.md | python3 plot.py -

# Default output: p.png and g.png
python3 plot.py results.md

Arguments

| Argument | Description |
|---|---|
| files | Markdown files to parse (use - for stdin) |
| --prefix | Output filename prefix (default: empty) |

Troubleshooting

| Error | Cause | Fix |
|---|---|---|
| No config files found | No config.*.sh files present (template excluded) | Copy config.template.sh to a named config file |
| MODELS array is empty | Config file has no models defined | Add entries to the MODELS array in your config |
| Connection refused | Server not reachable on IP:port | Verify server is running and firewall allows traffic |
| tokenizer not found | Model name not on Hugging Face | llama-benchy falls back to the gpt2 tokenizer; harmless warning |
| matplotlib not found | Package not installed | Run pip3 install matplotlib or let script auto-install |

Notes

  • An OpenAI-compatible inference server (e.g., Lemonade SDK on ROCm, Ollama, or vLLM) must be running and reachable before you start a benchmark
  • Models are unloaded between runs to ensure clean state
  • Prefix caching is enabled by default in all benchmark runs
  • Backend version headers are prepended to each result file for traceability

License

This project is provided as-is. See individual dependencies for their respective licenses.
