geekho-me/llama-benchy-docker

llama-benchy Docker Container

A Docker container for running llama-benchy, a benchmarking tool for LLM inference performance testing.

Prerequisites

  • Docker 20.10+ or Docker Desktop
  • A running llama-cpp-server instance (or another OpenAI-compatible API)
  • curl and jq (for finding available models)

Building the Image

docker build -t llama-benchy .

Finding Your Model Name

Query your llama-cpp-server to list available models:

curl -s http://your-llama-cpp-server:8080/v1/models | jq -r '.data[].id'

Example output:

Qwen3.6-27B-UD-Q4_K_XL.gguf
mistral-7b-instruct-v0.1.gguf
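
To avoid copying the name by hand, the listing can be piped into a small helper that picks one line and stores it in a variable. This is a minimal sketch; `nth_model` is a hypothetical helper name, not part of llama-benchy.

```shell
#!/bin/sh
# nth_model [N]: print the Nth model ID (default: the first) from a
# newline-separated list on stdin, such as the output of the jq command above.
nth_model() {
  sed -n "${1:-1}p"
}

# Usage (needs a reachable server, so it is left commented out):
# MODEL=$(curl -s http://your-llama-cpp-server:8080/v1/models | jq -r '.data[].id' | nth_model 1)
# docker run --rm llama-benchy \
#   --base-url http://your-llama-cpp-server:8080/v1 \
#   --model "$MODEL"
```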

Usage

Quick Run

Run a single benchmark against your LLM server:

docker run --rm \
  llama-benchy \
  --base-url http://your-llama-cpp-server:8080/v1 \
  --model your-model-name

Example with actual values:

docker run --rm \
  llama-benchy \
  --base-url http://192.168.5.33:8080/v1 \
  --model Qwen3.6-27B-UD-Q4_K_XL.gguf \
  --skip-coherence

Example Output

Hardware:

GPU: NVIDIA Quadro RTX 6000 (24GB VRAM)
CUDA Version: 12.8
Driver Version: 570.211.01

A successful benchmark run produces a results table:

| model                       |   test |           t/s |     peak t/s |       ttfr (ms) |    est_ppt (ms) |   e2e_ttft (ms) |
|:----------------------------|-------:|--------------:|-------------:|----------------:|----------------:|----------------:|
| Qwen3.6-27B-UD-Q4_K_XL.gguf | pp2048 | 519.56 ± 6.15 |              | 3341.17 ± 37.20 | 3339.18 ± 37.20 | 3341.25 ± 37.20 |
| Qwen3.6-27B-UD-Q4_K_XL.gguf |   tg32 |  13.47 ± 1.35 | 18.00 ± 1.41 |                 |                 |                 |

llama-benchy (0.3.5)
date: 2026-04-24 21:42:27 | latency mode: api

Key metrics explained:

  • pp2048: Prompt processing at 2048 tokens - 519.56 tokens/sec throughput
  • tg32: Token generation (32 tokens) - 13.47 tokens/sec throughput
  • ttfr: Time to first response (3.34 seconds for this model)
  • peak t/s: Peak throughput achieved during concurrent requests
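
As a rough sanity check, the two rows can be combined into an estimated wall-clock time for one request: time to first response plus the generated tokens divided by generation throughput. The sketch below is only an approximation (it divides by a mean rate and ignores the error bars), and `est_request_seconds` is a hypothetical helper name.

```shell
#!/bin/sh
# est_request_seconds TTFR_MS N_TOKENS TOKENS_PER_SEC
# Prints ttfr (converted to seconds) plus N_TOKENS / TOKENS_PER_SEC.
est_request_seconds() {
  awk -v ttfr_ms="$1" -v n="$2" -v tps="$3" \
    'BEGIN { printf "%.2f\n", ttfr_ms / 1000 + n / tps }'
}

# Values taken from the table above: ttfr 3341.17 ms, 32 tokens at 13.47 t/s.
est_request_seconds 3341.17 32 13.47   # prints 5.72
```

With the figures above, a 2048-token prompt followed by 32 generated tokens takes roughly 5.7 seconds, most of it spent on prompt processing.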

Using Docker Compose

docker compose up

Note: Update the base-url and model values in docker-compose.yml before running.

Configuration

The container accepts the following parameters:

  • --base-url: The URL of your LLM API endpoint (required)
  • --model: The name of the model to benchmark (required)
  • Additional llama-benchy parameters as needed

Refer to the llama-benchy documentation for full parameter details.

Troubleshooting

Connection Error: "Failed to connect to server"

  • Verify your --base-url is correct and reachable from inside the container (localhost inside the container refers to the container itself, not the Docker host)
  • Ensure your llama-cpp-server is running
  • Check firewall rules if using a remote server
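
A common cause of this error is a loopback address in --base-url. The sketch below assumes the server runs on the Docker host itself and rewrites localhost or 127.0.0.1 to host.docker.internal; `to_container_url` is a hypothetical helper name. Docker Desktop resolves that hostname out of the box, while on Linux it has to be mapped with --add-host (available since Docker 20.10).

```shell
#!/bin/sh
# to_container_url URL: replace a loopback host with host.docker.internal
# so the URL can be reached from inside a container. Other hosts pass through.
to_container_url() {
  printf '%s\n' "$1" |
    sed -E 's#^(https?://)(localhost|127\.0\.0\.1)([:/]|$)#\1host.docker.internal\3#'
}

# Usage (needs Docker and a running server, so it is left commented out):
# docker run --rm --add-host=host.docker.internal:host-gateway \
#   llama-benchy \
#   --base-url "$(to_container_url http://localhost:8080/v1)" \
#   --model your-model-name
```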

Model Not Found Error

  • Confirm the model name matches exactly (case-sensitive)
  • Run the curl command above to list available models
  • Ensure the model is loaded in your llama-cpp-server
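
The exact-match requirement can be checked before a run. This is a minimal sketch that reads the model IDs printed by the curl command above and does a whole-line, case-sensitive comparison; `model_available` is a hypothetical helper name.

```shell
#!/bin/sh
# model_available NAME: succeed only if NAME appears as a complete line on stdin.
# -F treats NAME as a literal string, -x requires a whole-line match, -q is quiet.
model_available() {
  grep -Fxq -- "$1"
}

# Usage (needs a reachable server, so it is left commented out):
# curl -s http://your-llama-cpp-server:8080/v1/models | jq -r '.data[].id' \
#   | model_available "your-model-name" || echo "model not served" >&2
```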

Coherence Test FAILED

If you see: Error loading tokenizer: [model_name] is not a local folder or custom code which must be executed

This happens when llama-benchy can't load the correct tokenizer for your model. This is common with quantized GGUF models.

Recommended: Skip the coherence test for quantized models

For .gguf quantized models, the coherence test often fails due to tokenizer mismatches, even if the model itself works correctly. Skip it:

docker run --rm \
  llama-benchy \
  --base-url http://192.168.5.33:8080/v1 \
  --model Qwen3.6-27B-UD-Q4_K_XL.gguf \
  --skip-coherence

Before skipping, verify your model works:

curl -s http://192.168.5.33:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen3.6-27B-UD-Q4_K_XL.gguf", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'

If it returns a sensible answer, the model is fine and --skip-coherence is safe to use.
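
The raw JSON response can be hard to read. A minimal sketch, assuming the server returns the standard OpenAI chat completion shape, extracts only the assistant's text with jq; `chat_reply` is a hypothetical helper name.

```shell
#!/bin/sh
# chat_reply: read a /v1/chat/completions JSON response on stdin and
# print the text of the first choice.
chat_reply() {
  jq -r '.choices[0].message.content'
}

# Usage (needs a reachable server, so it is left commented out):
# curl -s http://192.168.5.33:8080/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d '{"model": "Qwen3.6-27B-UD-Q4_K_XL.gguf", "messages": [{"role": "user", "content": "What is the capital of France?"}]}' \
#   | chat_reply
```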

Alternative: Use a HuggingFace model identifier

If running a standard HuggingFace model (not quantized), specify the tokenizer:

docker run --rm \
  llama-benchy \
  --base-url http://192.168.5.33:8080/v1 \
  --model Qwen/Qwen-7B-Chat \
  --tokenizer Qwen/Qwen-7B-Chat

Permission Denied or Access Issues

  • On Linux/Mac, you may need to prefix docker commands with sudo
  • On Linux, ensure your user is in the docker group: sudo usermod -aG docker $USER (log out and back in for the change to take effect)
