Executable documentation and knowledge base for running distributed LLM inference using vLLM on HPC clusters.
This repository provides reproducible recipes for deploying large language model inference at scale. Each workflow includes complete environment specifications, step-by-step instructions, and performance benchmarks tested on real GPU clusters.
Key Features:
- Fully Reproducible - Exact package versions, commit hashes, and hardware configs
- Production-Ready - Tested on HPC clusters with real workloads
- Comprehensive Documentation - From environment setup to troubleshooting
- Multiple Parallelism Options - Single GPU, tensor parallel, and multi-node setups
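The choice among single-GPU, tensor-parallel, and multi-node setups is largely driven by whether the model's weights fit in one GPU's memory. A rough back-of-envelope helper (this function is illustrative, not part of the repository; the 1.2 overhead factor is an assumed fudge for runtime overhead, and KV cache needs additional headroom on top):

```python
import math

def min_gpus_for_weights(params_b: float, bytes_per_param: float,
                         gpu_mem_gb: float, overhead: float = 1.2) -> int:
    """Rough lower bound on GPUs needed to hold model weights.

    Weights-only estimate: KV cache and activations need extra memory,
    which the `overhead` factor only coarsely approximates.
    """
    weights_gb = params_b * bytes_per_param  # 1e9 params x bytes -> GB
    return math.ceil(weights_gb * overhead / gpu_mem_gb)

# Qwen2.5-32B in BF16 (2 bytes/param) on an 80 GB A100/H100:
print(min_gpus_for_weights(32, 2.0, 80))   # fits on a single GPU

# A 405B model in FP8 (1 byte/param) on 80 GB GPUs:
print(min_gpus_for_weights(405, 1.0, 80))  # needs a multi-GPU setup
```

Real deployments should be sized from the workflow READMEs and benchmarks rather than this weights-only estimate.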
# 1. Set up environment
cd envs/uv/u260304_vllm
export UV_CACHE_DIR=<your-cache-directory> # Set cache directory
uv venv vllm_env --python 3.12 --seed # Create virtual environment
source vllm_env/bin/activate # Activate environment
uv pip install -r requirements-frozen.txt # Install packages
# 2. Run a workflow
cd ../../.. # Return to repo root
cd workflows/Qwen2.5-32B-Instruct_single-gpu-inference
python simple_inference_test.py
See workflows/ for all available models and configurations.
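Beyond the offline test script, vLLM can also serve models behind an OpenAI-compatible HTTP API (e.g. `vllm serve Qwen/Qwen2.5-32B-Instruct`). A minimal stdlib-only client sketch, assuming such a server is running on the default port; `build_chat_request` and `send` are illustrative helpers, not part of this repository:

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-style /v1/chat/completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def send(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """POST the payload to a running vLLM OpenAI-compatible server."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("Qwen/Qwen2.5-32B-Instruct", "Hello!")
print(json.dumps(payload, indent=2))
# With a server running:
#   send(payload)["choices"][0]["message"]["content"]
```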
├── envs/ # Reproducible runtime environments
├── workflows/ # Model inference recipes and examples
├── reports/ # Benchmarking and evaluation studies
├── workshops/ # Training and educational materials
├── scripts/ # Utility scripts and tools
└── CONTRIBUTING.md # Detailed contribution guidelines
For Quick Testing: Start with a single-GPU workflow:
- Qwen2.5-32B-Instruct - 32B parameter model on A100/H100
For Production Deployment: Review environment specifications in envs/ and select the appropriate workflow from workflows/.
Each workflow specifies its required environment. Navigate to the environment directory and follow setup instructions:
cd envs/uv/u260304_vllm
# Follow README.md for installation
Navigate to your chosen workflow and follow its README:
cd workflows/Qwen2.5-32B-Instruct_single-gpu-inference
# Follow README.md for execution
- See envs for the complete environment catalog.
- See workflows for the complete workflow catalog with specifications.
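On clusters where jobs go through a scheduler, the workflow README is typically wrapped in a SLURM batch script. A minimal single-GPU sketch using the quick-start paths above (the partition name, memory, and time limit are cluster-specific placeholders; consult your site's documentation and the workflow's own scripts):

```shell
#!/bin/bash
#SBATCH --job-name=qwen32b-infer
#SBATCH --partition=gpu          # placeholder: your cluster's GPU partition
#SBATCH --gres=gpu:1             # single-GPU workflow
#SBATCH --cpus-per-task=8
#SBATCH --mem=128G
#SBATCH --time=01:00:00

# Activate the uv environment created during setup
source envs/uv/u260304_vllm/vllm_env/bin/activate

# Run the workflow
cd workflows/Qwen2.5-32B-Instruct_single-gpu-inference
python simple_inference_test.py
```

Submit with `sbatch` from the repository root; multi-node workflows in this repository ship their own production SLURM scripts.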
See CONTRIBUTING.md for detailed guidelines.
See LICENSE for details.
- 2026-03-06: Added Meta-Llama-3.1-405B-Instruct-FP8 multi-node workflow. New workflow for deploying the 405B parameter model with FP8 quantization (382GB storage) on 8×H100 or 4×H200 GPUs. Features ~50% memory reduction vs FP16/BF16, improved throughput, and comprehensive HPC deployment guide with Ray cluster initialization, batch processing examples, and production-ready SLURM scripts.
- 2026-03-04: First uv environment (u260304_vllm) and workflow (Qwen2.5-32B-Instruct single-GPU inference). Includes vLLM 0.11.2 with CUDA 12.9 support and comprehensive documentation following the new contribution guidelines.
- 2025-06-09: Added DeepSeek-R1-0528 workflow - an upgraded version with enhanced math, programming, and logic reasoning. See DeepSeek-R1-0528 workflow for details.
- 2025-06-09: DeepSeek-R1 multi-node deployment. New conda environment (c250609_vllm085) with vLLM 0.8.5.post1 and comprehensive workflow for deploying 671B parameter model with FP8 precision on 16×H100 or 8×H200 GPUs. Includes throughput benchmarks and SLURM scripts.
- 2024-10-09: Added Llama 3.1 workflows from Timothy Ngotiaoco and Max Shad. Two new workflows: Llama 3.1 70B (4×H100) and Llama 3.1 405B (16×H100) with 128k context length support.
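The ~50% memory reduction cited for the FP8 workflow follows directly from the per-parameter byte widths. A back-of-envelope check (weights-only; the actual 382 GB checkpoint differs somewhat because per-tensor precision and exact parameter counts vary):

```python
def weight_gb(params_b: float, bytes_per_param: float) -> float:
    """Weights-only memory footprint: 1e9 params x bytes per param -> GB."""
    return params_b * bytes_per_param

bf16 = weight_gb(405, 2.0)  # BF16: 2 bytes per parameter
fp8 = weight_gb(405, 1.0)   # FP8: 1 byte per parameter
print(f"BF16: {bf16:.0f} GB, FP8: {fp8:.0f} GB, ratio: {fp8 / bf16:.0%}")
```

The same arithmetic explains why the 671B DeepSeek-R1 deployment at FP8 still needs 16×H100 or 8×H200: weights alone occupy roughly 671 GB before KV cache and runtime overhead.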
