Executable documentation and knowledge base for running distributed LLM inference using vLLM on HPC clusters.
This repository provides reproducible recipes for deploying large language model inference at scale. Each workflow includes complete environment specifications, step-by-step instructions, and performance benchmarks tested on real GPU clusters.
Key Features:
- Fully Reproducible - Exact package versions, commit hashes, and hardware configs
- Production-Ready - Tested on HPC clusters with real workloads
- Comprehensive Documentation - From environment setup to troubleshooting
- Multiple Parallelism Options - Single GPU, tensor parallel, and multi-node setups
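The choice among single-GPU, tensor-parallel, and multi-node setups is largely driven by whether the model's weights fit in one GPU's memory. A rough back-of-envelope helper (this function is illustrative, not part of the repository; the 1.2 overhead factor is an assumed fudge for runtime overhead, and KV cache needs additional headroom on top):

```python
import math

def min_gpus_for_weights(params_b: float, bytes_per_param: float,
                         gpu_mem_gb: float, overhead: float = 1.2) -> int:
    """Rough lower bound on GPUs needed to hold model weights.

    Weights-only estimate: KV cache and activations need extra memory,
    which the `overhead` factor only coarsely approximates.
    """
    weights_gb = params_b * bytes_per_param  # 1e9 params x bytes -> GB
    return math.ceil(weights_gb * overhead / gpu_mem_gb)

# Qwen2.5-32B in BF16 (2 bytes/param) on an 80 GB A100/H100:
print(min_gpus_for_weights(32, 2.0, 80))   # fits on a single GPU

# A 405B model in FP8 (1 byte/param) on 80 GB GPUs:
print(min_gpus_for_weights(405, 1.0, 80))  # needs a multi-GPU setup
```

Real deployments should be sized from the workflow READMEs and benchmarks rather than this weights-only estimate.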
# 1. Set up environment
cd envs/uv/u260304_vllm
export UV_CACHE_DIR=<your-cache-directory> # Set cache directory
uv venv vllm_env --python 3.12 --seed # Create virtual environment
source vllm_env/bin/activate # Activate environment
uv pip install -r requirements-frozen.txt # Install packages
# 2. Run a workflow
cd ../../.. # Return to repo root
cd workflows/Qwen2.5-32B-Instruct_single-gpu-inference
python simple_inference_test.py
See workflows/ for all available models and configurations.
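Beyond the offline test script, vLLM can also serve models behind an OpenAI-compatible HTTP API (e.g. `vllm serve Qwen/Qwen2.5-32B-Instruct`). A minimal stdlib-only client sketch, assuming such a server is running on the default port; `build_chat_request` and `send` are illustrative helpers, not part of this repository:

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-style /v1/chat/completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def send(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """POST the payload to a running vLLM OpenAI-compatible server."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("Qwen/Qwen2.5-32B-Instruct", "Hello!")
print(json.dumps(payload, indent=2))
# With a server running:
#   send(payload)["choices"][0]["message"]["content"]
```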
├── envs/ # Reproducible runtime environments
├── workflows/ # Model inference recipes and examples
├── reports/ # Benchmarking and evaluation studies
├── workshops/ # Training and educational materials
├── scripts/ # Utility scripts and tools
└── CONTRIBUTING.md # Detailed contribution guidelines
For Quick Testing: Start with a single-GPU workflow:
- Qwen2.5-32B-Instruct - 32B parameter model on A100/H100
For Production Deployment: Review environment specifications in envs/ and select the appropriate workflow from workflows/.
Each workflow specifies its required environment. Navigate to the environment directory and follow setup instructions:
cd envs/uv/u260304_vllm
# Follow README.md for installation
Navigate to your chosen workflow and follow its README:
cd workflows/Qwen2.5-32B-Instruct_single-gpu-inference
# Follow README.md for execution
- See envs for the complete environment catalog.
- See workflows for the complete workflow catalog with specifications.
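On clusters where jobs go through a scheduler, the workflow README is typically wrapped in a SLURM batch script. A minimal single-GPU sketch using the quick-start paths above (the partition name, memory, and time limit are cluster-specific placeholders; consult your site's documentation and the workflow's own scripts):

```shell
#!/bin/bash
#SBATCH --job-name=qwen32b-infer
#SBATCH --partition=gpu          # placeholder: your cluster's GPU partition
#SBATCH --gres=gpu:1             # single-GPU workflow
#SBATCH --cpus-per-task=8
#SBATCH --mem=128G
#SBATCH --time=01:00:00

# Activate the uv environment created during setup
source envs/uv/u260304_vllm/vllm_env/bin/activate

# Run the workflow
cd workflows/Qwen2.5-32B-Instruct_single-gpu-inference
python simple_inference_test.py
```

Submit with `sbatch` from the repository root; multi-node workflows in this repository ship their own production SLURM scripts.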
See CONTRIBUTING.md for detailed guidelines.
See LICENSE for details.
- 2026-03-06: Added Meta-Llama-3.1-405B-Instruct-FP8 multi-node workflow. New workflow for deploying the 405B parameter model with FP8 quantization (382GB storage) on 8×H100 or 4×H200 GPUs. Features ~50% memory reduction vs FP16/BF16, improved throughput, and comprehensive HPC deployment guide with Ray cluster initialization, batch processing examples, and production-ready SLURM scripts.
- 2026-03-04: First uv environment (u260304_vllm) and workflow (Qwen2.5-32B-Instruct single-GPU inference). Includes vLLM 0.11.2 with CUDA 12.9 support and comprehensive documentation following the new contribution guidelines.
- 2025-06-09: Added DeepSeek-R1-0528 workflow - an upgraded version with enhanced math, programming, and logic reasoning. See DeepSeek-R1-0528 workflow for details.
- 2025-06-09: DeepSeek-R1 multi-node deployment. New conda environment (c250609_vllm085) with vLLM 0.8.5.post1 and comprehensive workflow for deploying 671B parameter model with FP8 precision on 16×H100 or 8×H200 GPUs. Includes throughput benchmarks and SLURM scripts.
- 2024-10-09: Added Llama 3.1 workflows from Timothy Ngotiaoco and Max Shad. Two new workflows: Llama 3.1 70B (4×H100) and Llama 3.1 405B (16×H100) with 128k context length support.
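The ~50% memory reduction cited for the FP8 workflow follows directly from the per-parameter byte widths. A back-of-envelope check (weights-only; the actual 382 GB checkpoint differs somewhat because per-tensor precision and exact parameter counts vary):

```python
def weight_gb(params_b: float, bytes_per_param: float) -> float:
    """Weights-only memory footprint: 1e9 params x bytes per param -> GB."""
    return params_b * bytes_per_param

bf16 = weight_gb(405, 2.0)  # BF16: 2 bytes per parameter
fp8 = weight_gb(405, 1.0)   # FP8: 1 byte per parameter
print(f"BF16: {bf16:.0f} GB, FP8: {fp8:.0f} GB, ratio: {fp8 / bf16:.0%}")
```

The same arithmetic explains why the 671B DeepSeek-R1 deployment at FP8 still needs 16×H100 or 8×H200: weights alone occupy roughly 671 GB before KV cache and runtime overhead.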
