llama-benchy Docker Container

A Docker container for running llama-benchy, a benchmarking tool for LLM inference performance testing.

Prerequisites

Docker 20.10+ or Docker Desktop
A running llama-cpp-server instance (or another OpenAI-compatible API)
curl and jq (for finding available models)

Building the Image

docker build -t llama-benchy .

Finding Your Model Name

Query your llama-cpp-server to list available models:

curl -s http://your-llama-cpp-server:8080/v1/models | jq -r '.data[].id'

Example output:

Qwen3.6-27B-UD-Q4_K_XL.gguf
mistral-7b-instruct-v0.1.gguf

Usage

Quick Run

Run a single benchmark against your LLM server:

docker run --rm \
  llama-benchy \
  --base-url http://your-llama-cpp-server:8080/v1 \
  --model your-model-name

Example with actual values:

docker run --rm \
  llama-benchy \
  --base-url http://192.168.5.33:8080/v1 \
  --model Qwen3.6-27B-UD-Q4_K_XL.gguf \
  --skip-coherence
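
To connect the model listing to the benchmark run, here is a minimal sketch that prints one docker run command per model ID. The function name benchy_commands and the BASE_URL variable are hypothetical names chosen for illustration; only the --base-url and --model flags shown above are used. It prints the commands instead of running them, so the output can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: print one `docker run` command per model ID read from stdin.
# BASE_URL is a placeholder; replace it with your server's address.
BASE_URL="${BASE_URL:-http://your-llama-cpp-server:8080/v1}"

benchy_commands() {
  while IFS= read -r model; do
    # Skip blank lines so stray whitespace in the list is harmless.
    [ -n "$model" ] || continue
    # Values are single-quoted in the output; IDs containing a single quote are not handled.
    printf "docker run --rm llama-benchy --base-url '%s' --model '%s'\n" "$BASE_URL" "$model"
  done
}

# Example (requires a reachable server, curl, and jq):
#   curl -s "$BASE_URL/models" | jq -r '.data[].id' | benchy_commands
```

Feeding it the output of the curl and jq command from Finding Your Model Name prints the commands; appending | sh to the pipeline would run them one after another.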

Example Output

Hardware:

GPU: NVIDIA Quadro RTX 6000 (24GB VRAM)
CUDA Version: 12.8
Driver Version: 570.211.01

A successful benchmark run produces a results table:

| model                       |   test |           t/s |     peak t/s |       ttfr (ms) |    est_ppt (ms) |   e2e_ttft (ms) |
|:----------------------------|-------:|--------------:|-------------:|----------------:|----------------:|----------------:|
| Qwen3.6-27B-UD-Q4_K_XL.gguf | pp2048 | 519.56 ± 6.15 |              | 3341.17 ± 37.20 | 3339.18 ± 37.20 | 3341.25 ± 37.20 |
| Qwen3.6-27B-UD-Q4_K_XL.gguf |   tg32 |  13.47 ± 1.35 | 18.00 ± 1.41 |                 |                 |                 |

llama-benchy (0.3.5)
date: 2026-04-24 21:42:27 | latency mode: api

Key metrics explained:

pp2048: Prompt processing at 2048 tokens - 519.56 tokens/sec throughput
tg32: Token generation (32 tokens) - 13.47 tokens/sec throughput
ttfr: Time to first response (3.34 seconds for this model)
peak t/s: Peak throughput achieved during concurrent requests
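
As a rough worked example of reading the t/s column, the sketch below converts a throughput figure into an approximate generation time. gen_seconds is a hypothetical helper, not part of llama-benchy, and the estimate ignores prompt processing and network latency, so treat the result as a ballpark figure only.

```shell
#!/bin/sh
# Hypothetical helper: estimate seconds needed to generate N tokens at a given rate.
# Usage: gen_seconds <tokens> <tokens_per_second>
gen_seconds() {
  awk -v n="$1" -v r="$2" 'BEGIN {
    if (r <= 0) exit 1          # a non-positive rate has no meaningful answer
    printf "%.2f\n", n / r
  }'
}

# 32 tokens at the tg32 rate from the table above:
#   gen_seconds 32 13.47    # prints 2.38
```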

Using Docker Compose

docker compose up

Note: Update the base-url and model values in docker-compose.yml before running.

Configuration

The container accepts the following parameters:

--base-url: The URL of your LLM API endpoint (required)
--model: The name of the model to benchmark (required)
Additional llama-benchy parameters as needed, such as --skip-coherence and --tokenizer (both used under Troubleshooting)

Refer to the llama-benchy documentation for full parameter details.

Troubleshooting

Connection Error: "Failed to connect to server"

Verify your --base-url is correct and reachable from inside the container (localhost there refers to the container itself, not the host)
Ensure your llama-cpp-server is running
Check firewall rules if using a remote server
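
The checks above can be sketched as a small reachability test against the /models endpoint. check_endpoint is a hypothetical helper name; the containerized variant assumes the public curlimages/curl image, which is not part of this repository.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if GET <base-url>/models succeeds, non-zero otherwise.
check_endpoint() {
  # -s: quiet, -f: treat HTTP errors as failures, --max-time: give up after 5 seconds
  curl -sf --max-time 5 "$1/models" > /dev/null
}

# From the host:
#   check_endpoint http://your-llama-cpp-server:8080/v1 && echo reachable
#
# From inside a container (assumes the curlimages/curl image, whose entrypoint is curl):
#   docker run --rm curlimages/curl -sf --max-time 5 http://your-llama-cpp-server:8080/v1/models
```

If the host check succeeds but the container check fails, the problem is most likely container networking and not the server itself.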

Model Not Found Error

Confirm the model name matches exactly (case-sensitive)
Run the curl command above to list available models
Ensure the model is loaded in your llama-cpp-server
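
Because the match is exact and case-sensitive, a whole-line fixed-string comparison is a convenient way to check a name against the server's list. The helper below is a minimal sketch; has_model is a hypothetical name, and it expects one model ID per line on stdin.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if stdin contains a line exactly equal to $1.
has_model() {
  # -F: fixed string (no regex), -x: match the whole line, -q: no output
  grep -Fxq -- "$1"
}

# Example (requires a reachable server, curl, and jq):
#   curl -s http://your-llama-cpp-server:8080/v1/models | jq -r '.data[].id' \
#     | has_model 'your-model-name' || echo "model not listed"
```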

Coherence Test FAILED

If you see: Error loading tokenizer: [model_name] is not a local folder or custom code which must be executed

This happens when llama-benchy can't load the correct tokenizer for your model. This is common with quantized GGUF models.

Recommended: Skip the coherence test for quantized models

For .gguf quantized models, the coherence test often fails due to tokenizer mismatches, even if the model itself works correctly. Skip it:

docker run --rm \
  llama-benchy \
  --base-url http://192.168.5.33:8080/v1 \
  --model Qwen3.6-27B-UD-Q4_K_XL.gguf \
  --skip-coherence

Before skipping, verify your model works:

curl -s http://192.168.5.33:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen3.6-27B-UD-Q4_K_XL.gguf", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'

If it returns correct answers, the model is fine and --skip-coherence is safe to use.
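
To read the answer without scanning the full JSON, the reply can be extracted with jq. This is a small sketch assuming the server returns the standard OpenAI chat completion shape (choices[0].message.content); reply_text is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print only the assistant's reply from a chat completion
# response supplied on stdin.
reply_text() {
  jq -r '.choices[0].message.content'
}

# Example: append `| reply_text` to the curl command above, e.g.
#   curl -s http://192.168.5.33:8080/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d '{"model": "Qwen3.6-27B-UD-Q4_K_XL.gguf", "messages": [{"role": "user", "content": "What is the capital of France?"}]}' \
#     | reply_text
```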

Alternative: Use a HuggingFace model identifier

If running a standard HuggingFace model (not quantized), specify the tokenizer:

docker run --rm \
  llama-benchy \
  --base-url http://192.168.5.33:8080/v1 \
  --model Qwen/Qwen-7B-Chat \
  --tokenizer Qwen/Qwen-7B-Chat

Permission Denied or Access Issues

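On Linux, you may need to prefix docker commands with sudo
To run docker without sudo, add your user to the docker group, then log out and back in: sudo usermod -aG docker $USER
On macOS with Docker Desktop, sudo is not needed; make sure Docker Desktop is running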
