A Docker container for running llama-benchy, a benchmarking tool for LLM inference performance testing.
- Docker 20.10+ or Docker Desktop
- A running llama-cpp-server instance (or compatible OpenAI-compatible API)
- `curl` and `jq` (for finding available models)
docker build -t llama-benchy .

Query your llama-cpp-server to list available models:

curl -s http://your-llama-cpp-server:8080/v1/models | jq -r '.data[].id'

Example output:
Qwen3.6-27B-UD-Q4_K_XL.gguf
mistral-7b-instruct-v0.1.gguf
Run a single benchmark against your LLM server:
docker run --rm \
  llama-benchy \
  --base-url http://your-llama-cpp-server:8080/v1 \
  --model your-model-name

Example with actual values:

docker run --rm \
  llama-benchy \
  --base-url http://192.168.5.33:8080/v1 \
  --model Qwen3.6-27B-UD-Q4_K_XL.gguf \
  --skip-coherence

Hardware:
GPU: NVIDIA Quadro RTX 6000 (24GB VRAM)
CUDA Version: 12.8
Driver Version: 570.211.01
A successful benchmark run produces a results table:
| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:----------------------------|-------:|--------------:|-------------:|----------------:|----------------:|----------------:|
| Qwen3.6-27B-UD-Q4_K_XL.gguf | pp2048 | 519.56 ± 6.15 | | 3341.17 ± 37.20 | 3339.18 ± 37.20 | 3341.25 ± 37.20 |
| Qwen3.6-27B-UD-Q4_K_XL.gguf | tg32 | 13.47 ± 1.35 | 18.00 ± 1.41 | | | |
llama-benchy (0.3.5)
date: 2026-04-24 21:42:27 | latency mode: api
Key metrics explained:
- pp2048: Prompt processing at 2048 tokens - 519.56 tokens/sec throughput
- tg32: Token generation (32 tokens) - 13.47 tokens/sec throughput
- ttfr: Time to first response (3.34 seconds for this model)
- peak t/s: Peak throughput achieved during concurrent requests
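
To pull a single number out of the results table for scripting, the rows can be split on the `|` column separator. The following is a minimal sketch, assuming the table layout shown above (test name in the second column, t/s in the third). The function name `mean_tps` is hypothetical and not part of llama-benchy.

```shell
# Hypothetical helper (not part of llama-benchy): print the mean t/s for one
# test from a results table read on stdin. Assumes the column layout shown
# above: | model | test | t/s | ...
mean_tps() {
  awk -F'|' -v want="$1" '
    {
      name = $3
      gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim padding around the test name
      if (name == want) {
        split($4, parts, " ")             # "519.56 ± 6.15" -> parts[1] is the mean
        print parts[1]
      }
    }'
}
```

For example, with the table saved to a file (say `results.md`, a made-up name), `mean_tps tg32 < results.md` would print `13.47` for the run shown above.
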
docker compose up

Note: Update the base-url and model values in docker-compose.yml before running.
The container accepts the following parameters:
- `--base-url`: The URL of your LLM API endpoint (required)
- `--model`: The name of the model to benchmark (required)
- Additional llama-benchy parameters as needed
Refer to the llama-benchy documentation for full parameter details.
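
Because only `--base-url` and `--model` are required, a small wrapper can validate both before the container is invoked. This is a minimal sketch, assuming the image tag `llama-benchy` from the build step. `benchy_cmd` is a hypothetical helper name, and it only prints the command rather than running it. Values containing spaces are not quoted in the output.

```shell
# Hypothetical helper (not part of llama-benchy): print the docker command
# for one benchmark run. The first two arguments are the required base URL
# and model name; anything after them is passed through unchanged.
benchy_cmd() {
  base_url="${1:-}"
  model="${2:-}"
  if [ -z "$base_url" ] || [ -z "$model" ]; then
    echo "usage: benchy_cmd BASE_URL MODEL [EXTRA_ARGS...]" >&2
    return 1
  fi
  shift 2
  printf 'docker run --rm llama-benchy --base-url %s --model %s' "$base_url" "$model"
  if [ "$#" -gt 0 ]; then
    printf ' %s' "$@"   # append each extra argument, separated by a space
  fi
  printf '\n'
}
```

Printing the command first makes it easy to review before running it, for instance when looping over several model names.
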
- Verify your `--base-url` is correct and accessible from the container
- Ensure your llama-cpp-server is running
- Check firewall rules if using a remote server
- Confirm the model name matches exactly (case-sensitive)
- Run the curl command above to list available models
- Ensure the model is loaded in your llama-cpp-server
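
The checks above (is the endpoint reachable, does the model name match exactly) can be sketched as shell helpers. This is an illustration under assumptions: `models_url` and `model_listed` are hypothetical names, and the exact-match test relies on `grep -Fx`, which compares whole lines literally and case-sensitively.

```shell
# Hypothetical helpers, not part of llama-benchy.

# Build the model-listing URL from a --base-url value, tolerating one trailing slash.
models_url() {
  printf '%s/models\n' "${1%/}"
}

# Succeed only if the model id in $1 appears verbatim (case-sensitive, whole line)
# in the list of ids read from stdin, one per line.
model_listed() {
  grep -Fxq -- "$1"
}

# Example against a live server (requires curl and jq, as in the listing step):
#   curl -sf --max-time 5 "$(models_url http://your-llama-cpp-server:8080/v1)" \
#     | jq -r '.data[].id' \
#     | model_listed your-model-name && echo "model found"
```

If the `curl` step fails or times out, the problem is connectivity. If it succeeds but `model_listed` fails, the model name is the likely culprit.
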
If you see: Error loading tokenizer: [model_name] is not a local folder or custom code which must be executed
This happens when llama-benchy can't load the correct tokenizer for your model. This is common with quantized GGUF models.
Recommended: Skip the coherence test for quantized models
For .gguf quantized models, the coherence test often fails due to tokenizer mismatches, even if the model itself works correctly. Skip it:
docker run --rm \
  llama-benchy \
  --base-url http://192.168.5.33:8080/v1 \
  --model Qwen3.6-27B-UD-Q4_K_XL.gguf \
  --skip-coherence

Before skipping, verify your model works:

curl -s http://192.168.5.33:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen3.6-27B-UD-Q4_K_XL.gguf", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'

If it returns correct answers, the model is fine and --skip-coherence is safe to use.
Alternative: Use a HuggingFace model identifier
If running a standard HuggingFace model (not quantized), specify the tokenizer:
docker run --rm \
  llama-benchy \
  --base-url http://192.168.5.33:8080/v1 \
  --model Qwen/Qwen-7B-Chat \
  --tokenizer Qwen/Qwen-7B-Chat

- On Linux/Mac, you may need to prefix with `sudo`
- Ensure your user is in the docker group:
sudo usermod -aG docker $USER