LLM GPU Benchmark Suite

Automated LLM inference benchmarking on consumer GPUs via vast.ai

Spin up GPU instances, run comprehensive benchmarks, collect results, and tear down - all with a single command.


Features

  • One-command benchmarking: python benchmark_runner.py --suite config.yaml
  • Consumer GPU focus: RTX 5060 Ti, 5070 Ti, 5090 (and extensible to others)
  • Multiple workloads: RAG (long context), API (high concurrency), Agentic (LoRA multi-tenant)
  • Comprehensive metrics: Throughput, latency (TTFT, ITL, P95/P99), power, energy efficiency
  • Cost-efficient: Uses vast.ai spot instances, auto-terminates after completion
  • Paper-ready output: Auto-generates LaTeX tables and CSV exports

Quick Start

Prerequisites

  • Python 3.10+
  • vast.ai account with API key
  • AWS S3 bucket for results (optional but recommended)
  • HuggingFace account with token (for gated models)

Installation

git clone https://github.com/hholtmann/llm-consumer-gpu-benchmark.git
cd llm-consumer-gpu-benchmark

# Install dependencies
pip install -r requirements.txt

# Configure credentials
cp .env.example .env
# Edit .env with your API keys

Your First Benchmark

# Run using one of our benchmark configs (reproduces paper results)
python benchmark_runner.py --suite research_results/results_config/rtx5090_1x.yaml

# Or create your own config from the template
cp configs/template.yaml configs/my_benchmark.yaml
# Edit configs/my_benchmark.yaml with your settings
python benchmark_runner.py --suite configs/my_benchmark.yaml

# The script will:
# 1. Find and rent a GPU on vast.ai
# 2. Set up the environment (vLLM, models)
# 3. Run all benchmarks in the config
# 4. Upload results to S3
# 5. Terminate the instance

Configuration

Benchmarks are defined in YAML files. Here's the structure:

name: RTX 5090 1x GPU
description: Benchmark suite for RTX 5090 32GB single GPU

instance:
  gpu_type: RTX 5090          # GPU to rent on vast.ai
  gpu_count: 1                # Number of GPUs
  disk_space_gb: 100          # Disk space for models
  image: holtmann/llm-benchmark:latest
  max_bid_price: 2.0          # Max $/hr for spot instance

s3:
  bucket: ${S3_BUCKET_NAME}   # From .env
  prefix: benchmarks/rtx5090_1x
  upload_json: true
  upload_csv: true

benchmarks:
  - name: qwen3-8b-nvfp4-rag-8k-c8
    model: nvidia/Qwen3-8B-NVFP4
    vllm:
      max_model_len: 9216
      gpu_memory_utilization: 0.9
      dtype: auto
    aiperf:
      endpoint_type: chat
      streaming: true
      concurrency: 8
      synthetic_input_tokens_mean: 8192
      output_tokens_mean: 512
      request_count: 500
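
The ${S3_BUCKET_NAME} reference is filled in from your environment (.env). As a rough illustration of that mechanism, and not necessarily how benchmark_runner.py implements it, a suite file with ${VAR} placeholders can be loaded like this:

import os
import yaml  # PyYAML, assumed available since the suite configs are YAML

def load_suite(path):
    # Expand ${VAR} / $VAR references against os.environ before parsing the YAML
    with open(path) as f:
        return yaml.safe_load(os.path.expandvars(f.read()))

suite = load_suite("configs/my_benchmark.yaml")
print(suite["s3"]["bucket"])  # resolves to the value of S3_BUCKET_NAME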

Workload Types

RAG (Retrieval-Augmented Generation)

Long context, moderate concurrency - typical enterprise RAG pipelines.

- name: qwen3-8b-nvfp4-rag-16k-c4
  aiperf:
    concurrency: 4
    synthetic_input_tokens_mean: 16384  # 16k context
    output_tokens_mean: 512

API (High-Concurrency)

Short context, high concurrency - chatbot/API serving.

- name: qwen3-8b-nvfp4-api-c128
  aiperf:
    concurrency: 128
    synthetic_input_tokens_mean: 256   # Short prompts
    output_tokens_mean: 256

Agentic (Multi-LoRA)

LoRA adapter switching for multi-tenant deployments.

- name: qwen3-8b-nvfp4-agentic-lora-c32
  vllm:
    enable_lora: true
    max_loras: 3
    lora_modules:
      - name: customer-support
        path: /models/loras/customer-support
      - name: code-assistant
        path: /models/loras/code-assistant
  aiperf:
    model_selection_strategy: random  # Randomly select adapter per request
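
aiperf drives the adapter selection itself, but conceptually model_selection_strategy: random means each request targets a randomly chosen adapter. vLLM's OpenAI-compatible server exposes registered LoRA modules under their adapter names, so the behaviour is roughly equivalent to this sketch (the local endpoint URL and adapter names are illustrative):

import random
import requests  # assumed to be installed

ADAPTERS = ["customer-support", "code-assistant"]  # must match the lora_modules names

def send_request(prompt):
    # Send one chat completion to a randomly selected LoRA adapter
    adapter = random.choice(ADAPTERS)
    resp = requests.post(
        "http://localhost:8000/v1/chat/completions",  # assumed local vLLM endpoint
        json={
            "model": adapter,  # LoRA adapters are addressed by name
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 256,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]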

Metrics Collected

Metric                             Description                                    Unit
output_token_throughput            Total tokens generated per second              tok/s
output_token_throughput_per_user   Per-request throughput                         tok/s
time_to_first_token                Latency to first token (avg, P50, P95, P99)    ms
inter_token_latency                Streaming speed (avg, P50, P95, P99)           ms
avg_power_w                        Average GPU power during benchmark             W
wh_per_mtok                        Energy efficiency                              Wh/million tokens
max_temp_c                         Peak GPU temperature                           °C
throttle_pct                       Thermal throttling percentage                  %
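
The power and energy rows come from the GPU telemetry sampled during each run (see gpu_metrics.log) combined with the token counts reported by aiperf. The suite's own post-processing may filter or smooth the samples differently, but the arithmetic is roughly this (the numbers below are made up):

# Back-of-envelope calculation of the energy metrics
power_samples_w = [540, 560, 575, 550]   # e.g. periodic nvidia-smi power readings
duration_s = 1800                        # benchmark wall-clock time in seconds
total_output_tokens = 750_000            # total tokens generated during the run

avg_power_w = sum(power_samples_w) / len(power_samples_w)
energy_wh = avg_power_w * duration_s / 3600            # watts * hours
wh_per_mtok = energy_wh / (total_output_tokens / 1e6)  # Wh per million output tokens

print(f"{avg_power_w:.0f} W avg, {energy_wh:.1f} Wh, {wh_per_mtok:.1f} Wh/Mtok")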

Running Multiple Configs in Parallel

Each config runs independently - perfect for parallel execution:

# Terminal 1
python benchmark_runner.py --suite research_results/results_config/rtx5060ti_1x.yaml

# Terminal 2
python benchmark_runner.py --suite research_results/results_config/rtx5070ti_1x.yaml

# Terminal 3
python benchmark_runner.py --suite research_results/results_config/rtx5090_1x.yaml

# Results go to separate S3 prefixes, no conflicts
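
If you would rather launch everything from one place than juggle terminals, a small wrapper along these lines works too (a sketch, not part of the suite):

import subprocess

# Launch one runner process per suite; each writes to its own S3 prefix
suites = [
    "research_results/results_config/rtx5060ti_1x.yaml",
    "research_results/results_config/rtx5070ti_1x.yaml",
    "research_results/results_config/rtx5090_1x.yaml",
]
procs = [subprocess.Popen(["python", "benchmark_runner.py", "--suite", s]) for s in suites]
for p in procs:
    p.wait()  # block until every suite has finished (or failed)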

Cost Estimates

GPU           Config   Benchmarks   Est. Time    Est. Cost
RTX 5060 Ti   1x       33           ~4-5 hrs     ~$2-4
RTX 5060 Ti   2x       22           ~3 hrs       ~$3-5
RTX 5070 Ti   1x       34           ~4-5 hrs     ~$3-5
RTX 5070 Ti   2x       26           ~3.5 hrs     ~$5-8
RTX 5090      1x       38           ~5-6 hrs     ~$5-10
RTX 5090      2x       30           ~4 hrs       ~$8-15

Costs vary with current vast.ai spot pricing.

Output Structure

Results are organized per benchmark run:

results/
└── 20241217_143052_RTX_5090_1x/
    ├── qwen3-8b-nvfp4-rag-8k-c8/
    │   ├── metadata.json           # Config and status
    │   ├── profile_export_aiperf.json  # Detailed metrics
    │   ├── gpu_metrics.log         # Power/temp timeline
    │   └── vllm.log               # Server logs
    ├── qwen3-8b-nvfp4-rag-16k-c4/
    │   └── ...
    └── summary.json               # Suite-level summary
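
To skim many runs at once, the layout above is easy to walk programmatically. The sketch below only assumes the file names shown in the tree; the JSON keys it reads (e.g. status) are hypothetical, so check your own metadata.json for the real fields:

import json
from pathlib import Path

def collect_runs(results_dir="results"):
    # Yield (run, benchmark, metadata) for every metadata.json under results/
    for meta_path in Path(results_dir).glob("*/*/metadata.json"):
        with open(meta_path) as f:
            meta = json.load(f)
        yield meta_path.parent.parent.name, meta_path.parent.name, meta

for run, bench, meta in collect_runs():
    # 'status' is a hypothetical key; inspect your own metadata.json for the real schema
    print(run, bench, meta.get("status", "?"))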

Generating Paper Tables

Convert results to LaTeX tables:

python scripts/generate_paper_tables.py --results-dir results/ --output paper_tables.tex

Output:

\begin{table}[H]
  \caption{RAG workload throughput on RTX 5090}
  \begin{tabular}{l l c c c c}
    Model & Precision & Context & TPS & TTFT & ITL P95 \\
    \midrule
    Qwen3-8B & NVFP4 & 8k & 422.4 & 565 & 18.4 \\
    Qwen3-8B & NVFP4 & 16k & 225.3 & 1474 & 38.4 \\
    ...
  \end{tabular}
\end{table}

Environment Variables

Create a .env file:

# Required
VAST_API_KEY=your_vast_ai_api_key

# For gated models (Gemma, etc.)
HF_TOKEN=your_huggingface_token

# For S3 upload
AWS_ACCESS_KEY_ID=your_aws_key
AWS_SECRET_ACCESS_KEY=your_aws_secret
S3_BUCKET_NAME=your-bucket-name

# Optional: Custom LoRA adapters
LORA_CUSTOMER_SUPPORT=holtmann/qwen3-8b-customer-support-lora
LORA_TECHNICAL_DOCS=holtmann/qwen3-8b-technical-docs-lora
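
Because a missing key usually only surfaces after an instance has already been rented, it can be worth validating the environment up front. A minimal sketch, assuming python-dotenv is installed:

import os
from dotenv import load_dotenv  # assumed dependency; `pip install python-dotenv` if missing

load_dotenv()  # reads .env from the current directory into os.environ

required = ["VAST_API_KEY"]
optional = ["HF_TOKEN", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "S3_BUCKET_NAME"]

missing = [k for k in required if not os.getenv(k)]
if missing:
    raise SystemExit(f"Missing required env vars: {missing}")
for k in optional:
    if not os.getenv(k):
        print(f"note: {k} not set (needed for gated models / S3 upload)")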

Adding New GPUs

  1. Create a new config file:
cp configs/template.yaml configs/rtx4090_1x.yaml
  2. Update the instance section:
instance:
  gpu_type: RTX 4090
  gpu_count: 1
  3. Adjust benchmarks based on VRAM (24 GB for the 4090; see the rough sizing sketch after these steps):

    • Remove models that won't fit
    • Adjust max context lengths
    • Set appropriate concurrency levels
  4. Run:

python benchmark_runner.py --suite configs/rtx4090_1x.yaml
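
As a rough guide when trimming a config for a smaller card: weight memory scales with parameter count and bytes per parameter, while the KV cache grows with context length and concurrency. A back-of-envelope sketch with illustrative (not measured) numbers:

def fits_in_vram(params_b, bytes_per_param, kv_gb_per_8k_seq, max_ctx, concurrency, vram_gb):
    # Very rough feasibility check; real usage also depends on the vLLM version,
    # activation memory, and the gpu_memory_utilization setting.
    weights_gb = params_b * bytes_per_param                 # 8B at FP16 ~ 16 GB, at 4-bit ~ 4 GB
    kv_gb = kv_gb_per_8k_seq * (max_ctx / 8192) * concurrency
    return weights_gb + kv_gb < 0.9 * vram_gb               # keep ~10% headroom

# Hypothetical numbers: 8B model, 4-bit weights, ~1.2 GB of KV cache per 8k-token sequence
print(fits_in_vram(8, 0.5, 1.2, max_ctx=16384, concurrency=4, vram_gb=24))  # -> True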

Model Compatibility

This suite works with any model supported by vLLM, including:

  • Standard HuggingFace models (Llama, Mistral, Qwen, Gemma, etc.)
  • Quantized models (GPTQ, AWQ, NVFP4, W4A16, etc.)
  • MoE models (Mixtral, GPT-OSS, etc.)
  • Any model with LoRA adapters

Simply specify the HuggingFace model ID in your config:

model: Qwen/Qwen3-8B                       # Popular 2025 model
model: mistralai/Mistral-Small-24B-Instruct-2501  # Mistral's latest
model: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B    # DeepSeek R1 distilled
model: google/gemma-3-12b-it               # Google's Gemma 3

See vLLM Supported Models for the full list.

Docker Image

The benchmark environment is pre-built:

docker pull holtmann/llm-benchmark:latest

Includes:

  • vLLM with NVFP4/MXFP4 support
  • aiperf benchmarking tool
  • CUDA 12.x + cuDNN
  • HuggingFace Hub CLI
  • AWS CLI for S3 uploads

Build your own:

docker build -t my-benchmark:latest -f Dockerfile .

Troubleshooting

OOM Errors

  • Reduce max_model_len in config
  • Lower concurrency
  • Use more aggressive quantization (e.g., NVFP4 instead of BF16)

vast.ai Instance Not Found

  • Increase max_bid_price
  • Try a different GPU type
  • Check vast.ai availability

Model Download Fails

  • Verify HF_TOKEN is set
  • Check model exists on HuggingFace
  • Ensure sufficient disk space

LoRA Adapter Not Found

  • Check adapter name matches config
  • Verify HuggingFace repo exists
  • Check LORA_* env variables

Citation

If you use this benchmark suite in your research, please cite:

@misc{knoop2026privatellminferenceconsumer,
      title={Private LLM Inference on Consumer Blackwell GPUs: A Practical Guide for Cost-Effective Local Deployment in SMEs}, 
      author={Jonathan Knoop and Hendrik Holtmann},
      year={2026},
      eprint={2601.09527},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2601.09527}, 
}

License

MIT License - see LICENSE for details.

Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add your GPU configs or improvements
  4. Submit a pull request

Ideas for contribution:

  • New GPU configurations (RTX 4090, A6000, etc.)
  • Additional workload types
  • Improved metrics collection
  • Documentation improvements

Acknowledgments

  • vLLM - High-throughput LLM serving
  • aiperf - LLM benchmarking tool
  • vast.ai - GPU cloud marketplace
  • NVIDIA - NVFP4 quantization support
