Production-ready Docker Compose files and configuration templates for deploying vLLM across multi-GPU systems. Covers pipeline parallelism (PCIe) and tensor parallelism (NVLink) strategies for RTX 6000 Pro Blackwell, H100 SXM, and H200 NVL GPUs.
Deploying large language models across multiple GPUs requires choosing between two parallelism strategies, each with dramatically different performance characteristics depending on your hardware interconnect and workload. Most guides cover single-GPU setups. This repository provides tested, production-oriented configs for real multi-GPU deployments.
The core insight: vLLM's pipeline parallel scheduler fills idle GPU "bubbles" with queued requests from other users. This makes pipeline parallelism (PP) the optimal choice for multi-user API serving on PCIe-connected GPU systems (workstations, standard rack servers). Tensor parallelism (TP) is better for latency-sensitive single-user workloads on NVLink-connected systems (DGX, HGX).
| | Pipeline Parallelism (PP) | Tensor Parallelism (TP) |
|---|---|---|
| How it splits the model | By layers (GPU 0: layers 0-19, GPU 1: layers 20-39, ...) | By weight matrices (every GPU computes a shard of every layer) |
| Inter-GPU communication | Activation tensors between adjacent stages (~67 MB/pass) | All-reduce after every layer (~160 ops/token for 70B) |
| Required interconnect | PCIe is sufficient | NVLink/NVSwitch required |
| Best for | Multi-user serving (API, many concurrent requests) | Low-latency single-user (chatbot, real-time) |
| GPU utilization at low concurrency | Low (pipeline bubbles) | High (all GPUs always active) |
| GPU utilization at high concurrency | High (bubbles filled with other users) | High but all-reduce overhead grows |
| Mixed GPU support | Yes | No (requires identical GPUs) |
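The ~67 MB figure is easy to reconstruct (assuming a Llama-70B-class model with hidden size 8192 and a 4096-token batch in BF16): 8192 × 4096 × 2 bytes ≈ 67 MB per stage handoff -- comfortably within PCIe 4.0/5.0 bandwidth, which is why PP tolerates slower interconnects.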
Decision rule: If your GPUs are connected via PCIe (most setups), use pipeline parallelism. If you have NVLink (DGX/HGX), benchmark both.
See docs/pipeline-vs-tensor.md for the full technical deep-dive.
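The difference shows up directly in the launch flags. A minimal sketch (the compose files in this repo are the tested path; the parallel sizes here are illustrative):

```bash
# Pipeline parallel: split the model's layers into 4 sequential stages,
# one per GPU -- only activation tensors cross the PCIe bus
vllm serve meta-llama/Llama-3.1-70B-Instruct --pipeline-parallel-size 4

# Tensor parallel: shard every weight matrix across 8 GPUs -- an
# all-reduce follows every layer, so NVLink bandwidth matters
vllm serve meta-llama/Llama-3.1-70B-Instruct --tensor-parallel-size 8
```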
To use these configs you'll need:
- NVIDIA GPU(s) with sufficient VRAM for your model
- Docker with NVIDIA Container Toolkit
- Docker Compose v2+
Verify your setup:
```bash
# Check GPU detection
nvidia-smi

# Check Docker NVIDIA runtime
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

# Run the full health check
bash scripts/health-check.sh
```

Then, from the repository root, set up your environment:

```bash
cd docker-compose
cp .env.example .env
```

Edit .env with your model choice and HuggingFace token:
```
VLLM_MODEL=meta-llama/Llama-3.1-70B-Instruct
HF_TOKEN=hf_your_token_here
VLLM_API_KEY=your-secure-key
```

Choose the compose file matching your GPU count and parallelism strategy:
```bash
# 2 GPUs, Pipeline Parallel (PCIe)
docker compose -f 2-gpu-pp.yml up -d

# 4 GPUs, Pipeline Parallel (PCIe)
docker compose -f 4-gpu-pp.yml up -d

# 8 GPUs, Pipeline Parallel (PCIe)
docker compose -f 8-gpu-pp.yml up -d

# 8 GPUs, Tensor Parallel (NVLink required)
docker compose -f 8-gpu-tp.yml up -d
```

Then verify the deployment:

```bash
# Check container is running
docker logs -f vllm-4gpu-pp
# Wait for "Uvicorn running on" message, then:
curl http://localhost:8000/health
# Test inference
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-secure-key" \
-d '{
"model": "meta-llama/Llama-3.1-70B-Instruct",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100
}'
```

Repository layout:

```
ptg-vllm-deploy/
├── README.md
├── LICENSE
├── docker-compose/
│   ├── .env.example             # Environment variables template
│   ├── 2-gpu-pp.yml             # 2x GPU pipeline parallel
│   ├── 4-gpu-pp.yml             # 4x GPU pipeline parallel
│   ├── 8-gpu-pp.yml             # 8x GPU pipeline parallel
│   └── 8-gpu-tp.yml             # 8x GPU tensor parallel (NVLink)
├── configs/
│   ├── vllm-rtx6000-pro.yaml    # RTX 6000 Pro Blackwell optimized
│   ├── vllm-h100-sxm.yaml       # H100 SXM optimized
│   ├── vllm-h200-nvl.yaml       # H200 NVL optimized
│   └── vllm-mixed-gpu.yaml      # Heterogeneous GPU config
├── scripts/
│   ├── benchmark.py             # Throughput/latency benchmark
│   ├── health-check.sh          # Pre-deployment GPU verification
│   └── monitor-inference.sh     # Real-time performance dashboard
└── docs/
    ├── pipeline-vs-tensor.md        # PP vs TP technical guide
    └── gpu-memory-calculator.md     # Memory sizing formulas & tables
```
Match your hardware to a compose file:

| Your Hardware | GPU Count | Interconnect | Compose File | Why |
|---|---|---|---|---|
| RTX 4090 workstation | 2 | PCIe 4.0 | `2-gpu-pp.yml` | PP over PCIe, bubble filling |
| RTX 6000 Pro workstation | 4 | PCIe 5.0 | `4-gpu-pp.yml` | PP over PCIe, high VRAM |
| Multi-GPU rack server | 8 | PCIe 5.0 | `8-gpu-pp.yml` | PP for throughput |
| DGX H100 | 8 | NVSwitch | `8-gpu-tp.yml` | TP over NVLink for latency |
| HGX H200 | 8 | NVLink 4.0 | `8-gpu-tp.yml` | TP over NVLink, massive VRAM |
| Mixed GPUs | 2-8 | PCIe | `2-gpu-pp.yml`+ | PP only; see mixed config |
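Not sure which interconnect you have? The GPU topology matrix answers it: `NV#` entries mean NVLink connections, while `PIX`, `PXB`, `PHB`, or `SYS` mean PCIe paths.

```bash
# Print the GPU-to-GPU link matrix (NV# = NVLink, PIX/PXB/PHB/SYS = PCIe)
nvidia-smi topo -m
```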
See docs/gpu-memory-calculator.md for full sizing tables. Quick reference:
| Model | Size (BF16) | Minimum GPUs | Recommended Setup |
|---|---|---|---|
| Llama 3.1 8B | 16 GB | 1x 24GB | 1x RTX 4090 |
| Llama 3.1 70B | 140 GB | 2x 80GB or 4x 48GB | 4x RTX 6000 Pro (PP=4) |
| Llama 3.1 405B | 810 GB | 8x 80GB (FP8) | 8x H100 SXM (TP=8) |
| Mixtral 8x7B | 93 GB | 2x 48GB | 2x RTX 6000 Pro (PP=2) |
| Mixtral 8x22B | 282 GB | 4x 80GB | 4x H100 SXM (TP=4) |
| DeepSeek-V3 | 1,342 GB | 8x 80GB (FP8) | 8x H200 NVL (TP=8) |
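Rule of thumb behind these numbers: weight memory ≈ parameter count × bytes per parameter (2 for BF16/FP16, 1 for FP8/INT8). Llama 3.1 70B therefore needs about 70 × 10⁹ × 2 B = 140 GB for weights alone, and that total must fit within combined VRAM × GPU_MEMORY_UTILIZATION with headroom left for the KV cache.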
All compose files read their settings from .env:

| Variable | Default | Description |
|---|---|---|
| `VLLM_MODEL` | `meta-llama/Llama-3.1-70B-Instruct` | HuggingFace model ID |
| `HF_TOKEN` | (required) | HuggingFace access token |
| `VLLM_API_KEY` | (optional) | API authentication key |
| `MAX_MODEL_LEN` | `32768` | Maximum context length |
| `GPU_MEMORY_UTILIZATION` | `0.90` | Fraction of GPU memory to use |
| `MAX_NUM_SEQS` | `256` | Max concurrent sequences |
| `QUANTIZATION` | `none` | `none`, `fp8`, `awq`, `gptq` |
| `DTYPE` | `auto` | `auto`, `bfloat16`, `float16` |
| `SWAP_SPACE` | `4` | CPU swap space in GiB |
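As an illustration, a .env for serving Llama 3.1 70B on 4x 96 GB GPUs could combine the defaults above (illustrative values, not a tuned profile -- see configs/ for per-GPU tuning):

```
VLLM_MODEL=meta-llama/Llama-3.1-70B-Instruct
HF_TOKEN=hf_your_token_here
VLLM_API_KEY=your-secure-key
MAX_MODEL_LEN=32768
GPU_MEMORY_UTILIZATION=0.90
MAX_NUM_SEQS=256
QUANTIZATION=none
DTYPE=auto
SWAP_SPACE=4
```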
The configs/ directory contains YAML files with tuned parameters and memory budgets for specific GPU models. These are reference configurations -- copy the relevant settings into your .env file.
- vllm-rtx6000-pro.yaml -- 96 GB GDDR7, PCIe 5.0, optimized for pipeline parallelism
- vllm-h100-sxm.yaml -- 80 GB HBM3, NVLink 4.0, optimized for tensor parallelism
- vllm-h200-nvl.yaml -- 141 GB HBM3e, NVLink 4.0, maximum VRAM per GPU
- vllm-mixed-gpu.yaml -- Heterogeneous GPU strategies
Compare throughput and latency across configurations:
```bash
# Install dependency
pip install aiohttp

# Benchmark single endpoint
python scripts/benchmark.py --url http://localhost:8000 --concurrency 1,4,8,16,32

# Compare PP vs TP (run both servers first)
python scripts/benchmark.py --compare \
  --pp-url http://localhost:8000 \
  --tp-url http://localhost:8001 \
  --concurrency 1,4,8,16,32,64

# Save results as JSON
python scripts/benchmark.py --url http://localhost:8000 --output results.json
```

The benchmark measures:
- Output tokens per second at each concurrency level
- Request latency percentiles (p50, p95, p99)
- Time to first token (TTFT)
- Requests per second throughput
Monitor a running deployment in real time:

```bash
# Monitor GPU + vLLM metrics
bash scripts/monitor-inference.sh --url http://localhost:8000

# Log metrics to CSV for analysis
bash scripts/monitor-inference.sh --url http://localhost:8000 --log metrics.csv
```

Run pre-deployment checks:

```bash
# Full pre-deployment check
bash scripts/health-check.sh

# Quick check (skip bandwidth test)
bash scripts/health-check.sh --quick

# Include vLLM server check
bash scripts/health-check.sh --vllm-url http://localhost:8000
```

The health check verifies: GPU availability, driver version, memory status, temperature, ECC errors, NVLink detection, PCIe configuration, Docker NVIDIA runtime, and system resources.
vLLM exposes Prometheus metrics at /metrics. Key metrics to monitor:
- `vllm:num_requests_running` -- Active requests
- `vllm:num_requests_waiting` -- Queued requests
- `vllm:gpu_cache_usage_perc` -- KV cache utilization
- `vllm:time_to_first_token_seconds` -- TTFT histogram
- `vllm:generation_tokens_total` -- Total output tokens
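To spot-check these without a Prometheus server (assumes the default port used throughout this README):

```bash
# Scrape the metrics endpoint and filter the queue and cache gauges
curl -s http://localhost:8000/metrics | grep -E 'vllm:(num_requests_running|num_requests_waiting|gpu_cache_usage_perc)'
```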
Petronella Technology Group is a Raleigh, NC-based technology firm specializing in AI infrastructure, cybersecurity, and compliance. We design and deploy multi-GPU inference systems for enterprises running private LLMs -- from workstation builds with RTX 6000 Pro GPUs to full DGX clusters with NVLink.
Our team holds CMMC-RP, CCNA, CWNE, and DFE certifications. We have hands-on experience deploying vLLM, TGI, and other inference engines across PCIe and NVLink-connected GPU systems for clients in healthcare, defense, financial services, and legal.
- RTX 6000 Pro Blackwell Multi-GPU vLLM -- Detailed guide for RTX 6000 Pro workstation deployments
- Tensor vs Pipeline Parallelism -- When to use each strategy
- NVIDIA DGX Systems -- Enterprise DGX deployment services
- AI Development Systems -- Custom GPU server builds
- H100/H200 NVLink Cluster Scaling -- Multi-node GPU cluster architecture
- AI Services -- Full AI consulting and deployment services
- Website: petronellatech.com
- Phone: (919) 348-4912
- Consultation: Schedule a Call
MIT License. See LICENSE for details.