Petronella Technology Group vLLM Multi-GPU Deployment Configs

Production-ready Docker Compose files and configuration templates for deploying vLLM across multi-GPU systems. Covers pipeline parallelism (PCIe) and tensor parallelism (NVLink) strategies for RTX 6000 Pro Blackwell, H100 SXM, and H200 NVL GPUs.

Why This Exists

Deploying large language models across multiple GPUs requires choosing between two parallelism strategies, each with dramatically different performance characteristics depending on your hardware interconnect and workload. Most guides cover single-GPU setups. This repository provides tested, production-oriented configs for real multi-GPU deployments.

The core insight: vLLM's pipeline parallel scheduler fills idle GPU "bubbles" with queued requests from other users. This makes pipeline parallelism (PP) the optimal choice for multi-user API serving on PCIe-connected GPU systems (workstations, standard rack servers). Tensor parallelism (TP) is better for latency-sensitive single-user workloads on NVLink-connected systems (DGX, HGX).

Pipeline Parallelism vs Tensor Parallelism

| | Pipeline Parallelism (PP) | Tensor Parallelism (TP) |
|---|---|---|
| How it splits the model | By layers (GPU 0: layers 0-19, GPU 1: layers 20-39, ...) | By weight matrices (every GPU computes a shard of every layer) |
| Inter-GPU communication | Activation tensors between adjacent stages (~67 MB/pass) | All-reduce after every layer (~160 ops/token for 70B) |
| Required interconnect | PCIe is sufficient | NVLink/NVSwitch required |
| Best for | Multi-user serving (API, many concurrent requests) | Low-latency single-user (chatbot, real-time) |
| GPU utilization at low concurrency | Low (pipeline bubbles) | High (all GPUs always active) |
| GPU utilization at high concurrency | High (bubbles filled with other users) | High, but all-reduce overhead grows |
| Mixed GPU support | Yes | No (requires identical GPUs) |

Decision rule: If your GPUs are connected via PCIe (most setups), use pipeline parallelism. If you have NVLink (DGX/HGX), benchmark both.

See docs/pipeline-vs-tensor.md for the full technical deep-dive.
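
Not sure which interconnect you have? The topology matrix from nvidia-smi shows the link type between every GPU pair: entries like NV1/NV4/NV18 indicate NVLink, while PIX, PXB, PHB, NODE, and SYS indicate PCIe or system-level paths.

# Show the GPU-to-GPU interconnect topology
nvidia-smi topo -m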

Quick Start

1. Prerequisites

You need an NVIDIA driver, Docker, and the NVIDIA Container Toolkit installed. Verify your setup:

# Check GPU detection
nvidia-smi

# Check Docker NVIDIA runtime
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

# Run the full health check
bash scripts/health-check.sh

2. Configure

cd docker-compose
cp .env.example .env

Edit .env with your model choice and HuggingFace token:

VLLM_MODEL=meta-llama/Llama-3.1-70B-Instruct
HF_TOKEN=hf_your_token_here
VLLM_API_KEY=your-secure-key
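
Llama 3.1 models are gated on HuggingFace, so HF_TOKEN must belong to an account that has accepted the model license. A quick way to sanity-check the token before kicking off a long model download (assuming the huggingface_hub CLI is installed) is:

# Confirm the token is valid and belongs to the expected account
pip install -U "huggingface_hub[cli]"
HF_TOKEN=hf_your_token_here huggingface-cli whoami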

3. Deploy

Choose the compose file matching your GPU count and parallelism strategy:

# 2 GPUs, Pipeline Parallel (PCIe)
docker compose -f 2-gpu-pp.yml up -d

# 4 GPUs, Pipeline Parallel (PCIe)
docker compose -f 4-gpu-pp.yml up -d

# 8 GPUs, Pipeline Parallel (PCIe)
docker compose -f 8-gpu-pp.yml up -d

# 8 GPUs, Tensor Parallel (NVLink required)
docker compose -f 8-gpu-tp.yml up -d
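
To stop, update, or inspect a deployment later, the usual Compose lifecycle commands apply (shown here for the 4-GPU file):

# Check container status
docker compose -f 4-gpu-pp.yml ps

# Stop the stack
docker compose -f 4-gpu-pp.yml down

# Pull a newer vLLM image and restart
docker compose -f 4-gpu-pp.yml pull
docker compose -f 4-gpu-pp.yml up -d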

4. Verify

# Check container is running (adjust the name to match the compose file you used)
docker logs -f vllm-4gpu-pp

# Wait for "Uvicorn running on" message, then:
curl http://localhost:8000/health

# Test inference
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-secure-key" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
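
The server exposes the standard OpenAI-compatible endpoints, so you can also confirm which model is loaded before sending chat requests:

# List the models the server is serving
curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer your-secure-key"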

Repository Structure

ptg-vllm-deploy/
├── README.md
├── LICENSE
├── docker-compose/
│   ├── .env.example            # Environment variables template
│   ├── 2-gpu-pp.yml            # 2x GPU pipeline parallel
│   ├── 4-gpu-pp.yml            # 4x GPU pipeline parallel
│   ├── 8-gpu-pp.yml            # 8x GPU pipeline parallel
│   └── 8-gpu-tp.yml            # 8x GPU tensor parallel (NVLink)
├── configs/
│   ├── vllm-rtx6000-pro.yaml   # RTX 6000 Pro Blackwell optimized
│   ├── vllm-h100-sxm.yaml      # H100 SXM optimized
│   ├── vllm-h200-nvl.yaml      # H200 NVL optimized
│   └── vllm-mixed-gpu.yaml     # Heterogeneous GPU config
├── scripts/
│   ├── benchmark.py             # Throughput/latency benchmark
│   ├── health-check.sh          # Pre-deployment GPU verification
│   └── monitor-inference.sh     # Real-time performance dashboard
└── docs/
    ├── pipeline-vs-tensor.md    # PP vs TP technical guide
    └── gpu-memory-calculator.md # Memory sizing formulas & tables

Configuration Guide

Which Compose File?

| Your Hardware | GPU Count | Interconnect | Compose File | Why |
|---|---|---|---|---|
| RTX 4090 workstation | 2 | PCIe 4.0 | 2-gpu-pp.yml | PP over PCIe, bubble filling |
| RTX 6000 Pro workstation | 4 | PCIe 5.0 | 4-gpu-pp.yml | PP over PCIe, high VRAM |
| Multi-GPU rack server | 8 | PCIe 5.0 | 8-gpu-pp.yml | PP for throughput |
| DGX H100 | 8 | NVSwitch | 8-gpu-tp.yml | TP over NVLink for latency |
| HGX H200 | 8 | NVLink 4.0 | 8-gpu-tp.yml | TP over NVLink, massive VRAM |
| Mixed GPUs | 2-8 | PCIe | 2-gpu-pp.yml+ | PP only; see mixed config |

Which Model Fits?

See docs/gpu-memory-calculator.md for full sizing tables. Quick reference:

| Model | Size (BF16) | Minimum GPUs | Recommended Setup |
|---|---|---|---|
| Llama 3.1 8B | 16 GB | 1x 24GB | 1x RTX 4090 |
| Llama 3.1 70B | 140 GB | 2x 80GB or 4x 48GB | 4x RTX 6000 Pro (PP=4) |
| Llama 3.1 405B | 810 GB | 8x 80GB (FP8) | 8x H100 SXM (TP=8) |
| Mixtral 8x7B | 93 GB | 2x 48GB | 2x RTX 6000 Pro (PP=2) |
| Mixtral 8x22B | 282 GB | 4x 80GB | 4x H100 SXM (TP=4) |
| DeepSeek-V3 | 1,342 GB | 8x 141GB (FP8) | 8x H200 NVL (TP=8) |
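
As a rough rule of thumb, weight memory in GB is roughly the parameter count in billions times the bytes per parameter (2 for BF16/FP16, 1 for FP8), before any KV cache or activation overhead. For example, Llama 3.1 70B in BF16 needs about 70 × 2 = 140 GB for weights alone, which is why the table calls for multiple GPUs even before the KV cache is counted.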

Key Environment Variables

| Variable | Default | Description |
|---|---|---|
| VLLM_MODEL | meta-llama/Llama-3.1-70B-Instruct | HuggingFace model ID |
| HF_TOKEN | (required) | HuggingFace access token |
| VLLM_API_KEY | (optional) | API authentication key |
| MAX_MODEL_LEN | 32768 | Maximum context length |
| GPU_MEMORY_UTILIZATION | 0.90 | Fraction of GPU memory to use |
| MAX_NUM_SEQS | 256 | Max concurrent sequences |
| QUANTIZATION | none | none, fp8, awq, gptq |
| DTYPE | auto | auto, bfloat16, float16 |
| SWAP_SPACE | 4 | CPU swap space in GiB |
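
As a hypothetical example (not one of the shipped configs), a 70B deployment that needs more KV-cache headroom might shorten the context window and enable FP8 weight quantization in .env:

VLLM_MODEL=meta-llama/Llama-3.1-70B-Instruct
MAX_MODEL_LEN=16384
GPU_MEMORY_UTILIZATION=0.90
MAX_NUM_SEQS=128
QUANTIZATION=fp8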

GPU-Specific Configs

The configs/ directory contains YAML files with tuned parameters and memory budgets for specific GPU models. These are reference configurations -- copy the relevant settings into your .env file.

Benchmarking

Compare throughput and latency across configurations:

# Install dependency
pip install aiohttp

# Benchmark single endpoint
python scripts/benchmark.py --url http://localhost:8000 --concurrency 1,4,8,16,32

# Compare PP vs TP (run both servers first)
python scripts/benchmark.py --compare \
    --pp-url http://localhost:8000 \
    --tp-url http://localhost:8001 \
    --concurrency 1,4,8,16,32,64

# Save results as JSON
python scripts/benchmark.py --url http://localhost:8000 --output results.json

The benchmark measures:

  • Output tokens per second at each concurrency level
  • Request latency percentiles (p50, p95, p99)
  • Time to first token (TTFT)
  • Requests per second throughput

Monitoring

Real-Time Dashboard

# Monitor GPU + vLLM metrics
bash scripts/monitor-inference.sh --url http://localhost:8000

# Log metrics to CSV for analysis
bash scripts/monitor-inference.sh --url http://localhost:8000 --log metrics.csv

Health Check

# Full pre-deployment check
bash scripts/health-check.sh

# Quick check (skip bandwidth test)
bash scripts/health-check.sh --quick

# Include vLLM server check
bash scripts/health-check.sh --vllm-url http://localhost:8000

The health check verifies: GPU availability, driver version, memory status, temperature, ECC errors, NVLink detection, PCIe configuration, Docker NVIDIA runtime, and system resources.

Prometheus Metrics

vLLM exposes Prometheus metrics at /metrics. Key metrics to monitor:

  • vllm:num_requests_running -- Active requests
  • vllm:num_requests_waiting -- Queued requests
  • vllm:gpu_cache_usage_perc -- KV cache utilization
  • vllm:time_to_first_token_seconds -- TTFT histogram
  • vllm:generation_tokens_total -- Total output tokens
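
To spot-check these values without standing up a full Prometheus stack, pull the endpoint directly:

# Dump the current vLLM metric values
curl -s http://localhost:8000/metrics | grep "^vllm:"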

Who We Are

Petronella Technology Group is a Raleigh, NC-based technology firm specializing in AI infrastructure, cybersecurity, and compliance. We design and deploy multi-GPU inference systems for enterprises running private LLMs -- from workstation builds with RTX 6000 Pro GPUs to full DGX clusters with NVLink.

Our team holds CMMC-RP, CCNA, CWNE, and DFE certifications. We have hands-on experience deploying vLLM, TGI, and other inference engines across PCIe and NVLink-connected GPU systems for clients in healthcare, defense, financial services, and legal.

Related Resources

Contact

License

MIT License. See LICENSE for details.
