Petronella Technology Group vLLM Multi-GPU Deployment Configs

Production-ready Docker Compose files and configuration templates for deploying vLLM across multi-GPU systems. Covers pipeline parallelism (PCIe) and tensor parallelism (NVLink) strategies for RTX 6000 Pro Blackwell, H100 SXM, and H200 NVL GPUs.

Why This Exists

Deploying large language models across multiple GPUs requires choosing between two parallelism strategies, each with dramatically different performance characteristics depending on your hardware interconnect and workload. Most guides cover single-GPU setups. This repository provides tested, production-oriented configs for real multi-GPU deployments.

The core insight: vLLM's pipeline parallel scheduler fills idle GPU "bubbles" with queued requests from other users. This makes pipeline parallelism (PP) the optimal choice for multi-user API serving on PCIe-connected GPU systems (workstations, standard rack servers). Tensor parallelism (TP) is better for latency-sensitive single-user workloads on NVLink-connected systems (DGX, HGX).

Pipeline Parallelism vs Tensor Parallelism

| | Pipeline Parallelism (PP) | Tensor Parallelism (TP) |
|---|---|---|
| How it splits the model | By layers (GPU 0: layers 0-19, GPU 1: layers 20-39, ...) | By weight matrices (every GPU computes a shard of every layer) |
| Inter-GPU communication | Activation tensors between adjacent stages (~67 MB/pass) | All-reduce after every layer (~160 ops/token for 70B) |
| Required interconnect | PCIe is sufficient | NVLink/NVSwitch required |
| Best for | Multi-user serving (API, many concurrent requests) | Low-latency single-user (chatbot, real-time) |
| GPU utilization at low concurrency | Low (pipeline bubbles) | High (all GPUs always active) |
| GPU utilization at high concurrency | High (bubbles filled with other users) | High, but all-reduce overhead grows |
| Mixed GPU support | Yes | No (requires identical GPUs) |

Decision rule: If your GPUs are connected via PCIe (most setups), use pipeline parallelism. If you have NVLink (DGX/HGX), benchmark both.

See docs/pipeline-vs-tensor.md for the full technical deep-dive.
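
Not sure which interconnect you have? The topology matrix from nvidia-smi shows the link type between every GPU pair: entries like NV1/NV4/NV18 indicate NVLink, while PIX, PXB, PHB, NODE, and SYS indicate PCIe or system-level paths.

# Show the GPU-to-GPU interconnect topology
nvidia-smi topo -m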

Quick Start

1. Prerequisites

You need an NVIDIA driver, Docker, and the NVIDIA Container Toolkit installed. Verify your setup:

# Check GPU detection
nvidia-smi

# Check Docker NVIDIA runtime
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

# Run the full health check
bash scripts/health-check.sh

2. Configure

cd docker-compose
cp .env.example .env

Edit .env with your model choice and HuggingFace token:

VLLM_MODEL=meta-llama/Llama-3.1-70B-Instruct
HF_TOKEN=hf_your_token_here
VLLM_API_KEY=your-secure-key
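
Llama 3.1 models are gated on HuggingFace, so HF_TOKEN must belong to an account that has accepted the model license. A quick way to sanity-check the token before kicking off a long model download (assuming the huggingface_hub CLI is installed) is:

# Confirm the token is valid and belongs to the expected account
pip install -U "huggingface_hub[cli]"
HF_TOKEN=hf_your_token_here huggingface-cli whoami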

3. Deploy

Choose the compose file matching your GPU count and parallelism strategy:

# 2 GPUs, Pipeline Parallel (PCIe)
docker compose -f 2-gpu-pp.yml up -d

# 4 GPUs, Pipeline Parallel (PCIe)
docker compose -f 4-gpu-pp.yml up -d

# 8 GPUs, Pipeline Parallel (PCIe)
docker compose -f 8-gpu-pp.yml up -d

# 8 GPUs, Tensor Parallel (NVLink required)
docker compose -f 8-gpu-tp.yml up -d
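
To stop, update, or inspect a deployment later, the usual Compose lifecycle commands apply (shown here for the 4-GPU file):

# Check container status
docker compose -f 4-gpu-pp.yml ps

# Stop the stack
docker compose -f 4-gpu-pp.yml down

# Pull a newer vLLM image and restart
docker compose -f 4-gpu-pp.yml pull
docker compose -f 4-gpu-pp.yml up -d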

4. Verify

# Check container is running (adjust the name to match the compose file you used)
docker logs -f vllm-4gpu-pp

# Wait for "Uvicorn running on" message, then:
curl http://localhost:8000/health

# Test inference
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-secure-key" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
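
The server exposes the standard OpenAI-compatible endpoints, so you can also confirm which model is loaded before sending chat requests:

# List the models the server is serving
curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer your-secure-key"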

Repository Structure

ptg-vllm-deploy/
├── README.md
├── LICENSE
├── docker-compose/
│   ├── .env.example            # Environment variables template
│   ├── 2-gpu-pp.yml            # 2x GPU pipeline parallel
│   ├── 4-gpu-pp.yml            # 4x GPU pipeline parallel
│   ├── 8-gpu-pp.yml            # 8x GPU pipeline parallel
│   └── 8-gpu-tp.yml            # 8x GPU tensor parallel (NVLink)
├── configs/
│   ├── vllm-rtx6000-pro.yaml   # RTX 6000 Pro Blackwell optimized
│   ├── vllm-h100-sxm.yaml      # H100 SXM optimized
│   ├── vllm-h200-nvl.yaml      # H200 NVL optimized
│   └── vllm-mixed-gpu.yaml     # Heterogeneous GPU config
├── scripts/
│   ├── benchmark.py             # Throughput/latency benchmark
│   ├── health-check.sh          # Pre-deployment GPU verification
│   └── monitor-inference.sh     # Real-time performance dashboard
└── docs/
    ├── pipeline-vs-tensor.md    # PP vs TP technical guide
    └── gpu-memory-calculator.md # Memory sizing formulas & tables

Configuration Guide

Which Compose File?

| Your Hardware | GPU Count | Interconnect | Compose File | Why |
|---|---|---|---|---|
| RTX 4090 workstation | 2 | PCIe 4.0 | 2-gpu-pp.yml | PP over PCIe, bubble filling |
| RTX 6000 Pro workstation | 4 | PCIe 5.0 | 4-gpu-pp.yml | PP over PCIe, high VRAM |
| Multi-GPU rack server | 8 | PCIe 5.0 | 8-gpu-pp.yml | PP for throughput |
| DGX H100 | 8 | NVSwitch | 8-gpu-tp.yml | TP over NVLink for latency |
| HGX H200 | 8 | NVLink 4.0 | 8-gpu-tp.yml | TP over NVLink, massive VRAM |
| Mixed GPUs | 2-8 | PCIe | 2-gpu-pp.yml+ | PP only; see mixed config |

Which Model Fits?

See docs/gpu-memory-calculator.md for full sizing tables. Quick reference:

| Model | Size (BF16) | Minimum GPUs | Recommended Setup |
|---|---|---|---|
| Llama 3.1 8B | 16 GB | 1x 24GB | 1x RTX 4090 |
| Llama 3.1 70B | 140 GB | 2x 80GB or 4x 48GB | 4x RTX 6000 Pro (PP=4) |
| Llama 3.1 405B | 810 GB | 8x 80GB (FP8) | 8x H100 SXM (TP=8) |
| Mixtral 8x7B | 93 GB | 2x 48GB | 2x RTX 6000 Pro (PP=2) |
| Mixtral 8x22B | 282 GB | 4x 80GB | 4x H100 SXM (TP=4) |
| DeepSeek-V3 | 1,342 GB | 8x 141GB (FP8) | 8x H200 NVL (TP=8) |
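
As a rough rule of thumb, weight memory in GB is roughly the parameter count in billions times the bytes per parameter (2 for BF16/FP16, 1 for FP8), before any KV cache or activation overhead. For example, Llama 3.1 70B in BF16 needs about 70 × 2 = 140 GB for weights alone, which is why the table calls for multiple GPUs even before the KV cache is counted.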

Key Environment Variables

| Variable | Default | Description |
|---|---|---|
| VLLM_MODEL | meta-llama/Llama-3.1-70B-Instruct | HuggingFace model ID |
| HF_TOKEN | (required) | HuggingFace access token |
| VLLM_API_KEY | (optional) | API authentication key |
| MAX_MODEL_LEN | 32768 | Maximum context length |
| GPU_MEMORY_UTILIZATION | 0.90 | Fraction of GPU memory to use |
| MAX_NUM_SEQS | 256 | Max concurrent sequences |
| QUANTIZATION | none | none, fp8, awq, gptq |
| DTYPE | auto | auto, bfloat16, float16 |
| SWAP_SPACE | 4 | CPU swap space in GiB |
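
As a hypothetical example (not one of the shipped configs), a 70B deployment that needs more KV-cache headroom might shorten the context window and enable FP8 weight quantization in .env:

VLLM_MODEL=meta-llama/Llama-3.1-70B-Instruct
MAX_MODEL_LEN=16384
GPU_MEMORY_UTILIZATION=0.90
MAX_NUM_SEQS=128
QUANTIZATION=fp8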

GPU-Specific Configs

The configs/ directory contains YAML files with tuned parameters and memory budgets for specific GPU models. These are reference configurations -- copy the relevant settings into your .env file.

Benchmarking

Compare throughput and latency across configurations:

# Install dependency
pip install aiohttp

# Benchmark single endpoint
python scripts/benchmark.py --url http://localhost:8000 --concurrency 1,4,8,16,32

# Compare PP vs TP (run both servers first)
python scripts/benchmark.py --compare \
    --pp-url http://localhost:8000 \
    --tp-url http://localhost:8001 \
    --concurrency 1,4,8,16,32,64

# Save results as JSON
python scripts/benchmark.py --url http://localhost:8000 --output results.json

The benchmark measures:

  • Output tokens per second at each concurrency level
  • Request latency percentiles (p50, p95, p99)
  • Time to first token (TTFT)
  • Requests per second throughput

Monitoring

Real-Time Dashboard

# Monitor GPU + vLLM metrics
bash scripts/monitor-inference.sh --url http://localhost:8000

# Log metrics to CSV for analysis
bash scripts/monitor-inference.sh --url http://localhost:8000 --log metrics.csv

Health Check

# Full pre-deployment check
bash scripts/health-check.sh

# Quick check (skip bandwidth test)
bash scripts/health-check.sh --quick

# Include vLLM server check
bash scripts/health-check.sh --vllm-url http://localhost:8000

The health check verifies: GPU availability, driver version, memory status, temperature, ECC errors, NVLink detection, PCIe configuration, Docker NVIDIA runtime, and system resources.

Prometheus Metrics

vLLM exposes Prometheus metrics at /metrics. Key metrics to monitor:

  • vllm:num_requests_running -- Active requests
  • vllm:num_requests_waiting -- Queued requests
  • vllm:gpu_cache_usage_perc -- KV cache utilization
  • vllm:time_to_first_token_seconds -- TTFT histogram
  • vllm:generation_tokens_total -- Total output tokens
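
To spot-check these values without standing up a full Prometheus stack, pull the endpoint directly:

# Dump the current vLLM metric values
curl -s http://localhost:8000/metrics | grep "^vllm:"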

Who We Are

Petronella Technology Group is a Raleigh, NC-based technology firm specializing in AI infrastructure, cybersecurity, and compliance. We design and deploy multi-GPU inference systems for enterprises running private LLMs -- from workstation builds with RTX 6000 Pro GPUs to full DGX clusters with NVLink.

Our team holds CMMC-RP, CCNA, CWNE, and DFE certifications. We have hands-on experience deploying vLLM, TGI, and other inference engines across PCIe and NVLink-connected GPU systems for clients in healthcare, defense, financial services, and legal.

Related Resources

Contact

License

MIT License. See LICENSE for details.
