Complete guide for setting up a production-ready vLLM inference environment with CUDA 12.9 support.
This directory contains everything needed to create a reproducible vLLM environment for LLM inference on GPU clusters.
What's included:
- vLLM 0.11.2 with CUDA 12.9 support
- PyTorch 2.9.0+cu129
- NCCL 2.27.5 for multi-GPU communication
- All dependencies with exact versions
- Test scripts to verify installation
- Multi-GPU configuration guides
Tested on:
- Python 3.12.11
- CUDA 12.9.1
- GCC 13.2.0
- Rocky Linux 8 / RHEL 8
- Create and activate a virtual environment:

Warning

uv's cache can contain a very large number of files, so it is important to use a shared cache directory to avoid hitting quota limits and unnecessary re-downloads. Point the UV_CACHE_DIR environment variable at a common location (e.g. in your lab directory, or one provided by the workshop) before running any uv commands.

```bash
export UV_CACHE_DIR=<your cache directory>  # Set cache directory to avoid quota issues
uv venv vllm_env --python 3.12 --seed
source vllm_env/bin/activate
```

- Install packages:

```bash
uv pip install -r requirements-frozen.txt
```

- Run tests to verify the installation:

```bash
python test_vllm_installation.py
```

- Run the inference test (downloads a small model):

```bash
python test_vllm_inference.py
```

Note
The previous steps work because the requirements-frozen.txt file contains exact versions of all packages, including vLLM 0.11.2 and its dependencies for the workshop. For any new project you will need to create your own environment and install the correct versions of vLLM and its dependencies, as explained in the detailed instructions below.
| File | Purpose |
|---|---|
| README.md | This guide (you are here) |
| pyproject.toml | Project configuration for uv |
| uv.lock | Lock file with exact package versions (724KB) |
| requirements-frozen.txt | Alternative pip-compatible requirements |
| test_vllm_installation.py | Quick installation verification (no model) |
| test_vllm_inference.py | Full inference test with small model |
| check_nccl_status.py | Check NCCL and multi-GPU status |
| nccl_and_multi_gpu.md | Detailed multi-GPU setup guide |
Loading modules is only necessary if you need to compile dependencies; otherwise you can skip this step.
```bash
# Request a GPU node (adjust for your scheduler)
# SLURM example:
salloc -N 1 --gres=gpu:1 -t 2:00:00

# Load modules (IMPORTANT: use GCC 13, not 15!)
module purge
module load gcc/13.2.0-fasrc01   # GCC 13 required for CUDA 12.9
module load cuda/12.9.1-fasrc01  # CUDA 12.9
module load cudnn/9.10.2.21_cuda12-fasrc01
module load cmake
```

Install uv:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Create and activate the environment:

```bash
# Create environment
uv venv vllm_env --python 3.12 --seed

# Activate it
source vllm_env/bin/activate

# Optional: Set custom cache location to avoid quota issues
export UV_CACHE_DIR=$PWD/.uv_cache
```

Install vLLM:

```bash
# Build tools (required even when installing prebuilt wheels)
uv pip install ninja packaging wheel setuptools

# Install vLLM with CUDA 12.9 support
uv pip install "vllm==0.11.2" \
    --extra-index-url https://download.pytorch.org/whl/cu129 \
    --index-strategy unsafe-best-match
```

Why these flags?

- `--extra-index-url`: pulls CUDA-enabled PyTorch wheels from the PyTorch index
- `--index-strategy unsafe-best-match`: allows resolving packages such as setuptools from PyPI when both indexes list them
```bash
python test_vllm_installation.py
```

Expected output:

```
vLLM imported successfully (version 0.11.2)
PyTorch imported successfully (version 2.9.0+cu129)
CUDA available: True
CUDA version: 12.9
GPU count: 1
GPU 0: NVIDIA H200
Transformers imported successfully (version 4.57.6)
All tests passed!
```
What it checks:
- vLLM, PyTorch, Transformers imports
- CUDA availability
- GPU detection
- Basic vLLM functionality
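The import checks above could be sketched roughly like this. This is a hypothetical simplification, not the actual contents of test_vllm_installation.py; the `check_package` helper is illustrative:

```python
# Hedged sketch of the import checks; check_package is an illustrative
# helper, not part of the repo's actual test script.
import importlib

def check_package(name):
    """Return the installed version of `name`, or None if it cannot be imported."""
    try:
        mod = importlib.import_module(name)
        return getattr(mod, "__version__", "unknown")
    except ImportError:
        return None

for pkg in ("vllm", "torch", "transformers"):
    version = check_package(pkg)
    status = f"OK (version {version})" if version else "MISSING"
    print(f"{pkg}: {status}")
```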
```bash
python test_vllm_inference.py
```

What it does:

- Downloads the facebook/opt-125m model (~250MB, first run only)
- Runs inference on 3 test prompts
- Displays generated text

Sample output:

```
Prompt 1: Hello, my name is
Generated: John and I am a software engineer...

Prompt 2: The capital of France is
Generated: Paris, which is known for...

Inference completed successfully!
```
```bash
python check_nccl_status.py
```

Shows:
- NCCL availability
- GPU count and names
- Multi-GPU recommendations
- Example tensor parallelism configs
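A minimal sketch of what such a status probe can look like, assuming PyTorch's standard CUDA APIs (the real check_nccl_status.py may report more detail; `summarize_gpu_status` is an illustrative helper):

```python
# Hedged sketch of a GPU/NCCL status probe; summarize_gpu_status is
# illustrative, not part of the repo's script.
def summarize_gpu_status():
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"
    if not torch.cuda.is_available():
        return "No CUDA devices visible -- are you on a GPU node?"
    n = torch.cuda.device_count()
    lines = [f"GPUs detected: {n}"]
    for i in range(n):
        lines.append(f"  GPU {i}: {torch.cuda.get_device_name(i)}")
    # NCCL ships with the CUDA builds of PyTorch
    lines.append(f"NCCL available: {torch.distributed.is_nccl_available()}")
    if n > 1:
        lines.append(f"Tip: try tensor_parallel_size={n} in vLLM")
    return "\n".join(lines)

print(summarize_gpu_status())
```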
The uv.lock file contains exact versions of all packages with checksums.
To recreate this exact environment:
```bash
# Create new environment
uv venv new_env --python 3.12
source new_env/bin/activate

# Install from lock file
uv sync --index-strategy unsafe-best-match
```

Alternatively, requirements-frozen.txt provides a simpler, pip-compatible format with exact versions.
```bash
# Create new environment
uv venv new_env --python 3.12
source new_env/bin/activate

# Install from frozen requirements
uv pip install -r requirements-frozen.txt
```

Single-GPU usage:

```python
from vllm import LLM

# Automatically uses a single GPU
llm = LLM(model="facebook/opt-13b")
```

Characteristics:
- No NCCL communication needed
- Simpler setup
- Limited by single GPU memory
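To see why single-GPU memory becomes the limit, a quick back-of-envelope estimate of the weight footprint helps. The `weights_gib` helper below is illustrative; real deployments also need headroom for the KV cache and activations:

```python
def weights_gib(n_params_billion, bytes_per_param=2):
    """Rough weight footprint in GiB (2 bytes/param for fp16/bf16)."""
    return n_params_billion * 1e9 * bytes_per_param / 2**30

# opt-13b: ~24 GiB of weights -- fits comfortably on one modern GPU
print(f"opt-13b:    ~{weights_gib(13):.0f} GiB")
# llama-2-70b: ~130 GiB of weights -- needs multiple GPUs in practice
print(f"llama-2-70b: ~{weights_gib(70):.0f} GiB")
```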
```bash
# Request multiple GPUs
salloc -N 1 --gres=gpu:4 -t 2:00:00
```

```python
from vllm import LLM

# Automatically uses NCCL for communication
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=4,  # Split across 4 GPUs
)
```

Benefits:
- Run much larger models
- Higher throughput
- NCCL handles all GPU communication
See nccl_and_multi_gpu.md for detailed multi-GPU guide.
```
error: -- unsupported GNU version! gcc versions later than 14 are not supported!
```

Solution: use GCC 13, not GCC 15:

```bash
module load gcc/13.2.0-fasrc01  # NOT gcc/15.x
```

Cause: not pinning the vLLM version explicitly.

Solution: use the exact version:

```bash
uv pip install "vllm==0.11.2" --extra-index-url https://download.pytorch.org/whl/cu129 --index-strategy unsafe-best-match
```

```python
torch.cuda.is_available()  # Returns False
```

Checks:
- Are you on a GPU node? Check with `nvidia-smi`
- Are CUDA modules loaded? Check with `module list`
- Is CUDA visible? Check with `echo $CUDA_VISIBLE_DEVICES`
Solutions:
- Use smaller model or batch size
- Enable quantization (8-bit/4-bit)
- Reduce `gpu_memory_utilization`
- Use multiple GPUs with tensor parallelism
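The mitigations above map onto arguments of vLLM's `LLM()` constructor roughly as follows. This is a sketch with illustrative values, not a recommended configuration:

```python
# Illustrative memory-saving options for vLLM's LLM() constructor;
# values are examples, not recommendations.
memory_saving_kwargs = dict(
    model="meta-llama/Llama-2-70b-hf",
    quantization="awq",            # requires an AWQ-quantized checkpoint
    gpu_memory_utilization=0.85,   # default is 0.9; lower leaves more headroom
    max_model_len=2048,            # shorter context -> smaller KV cache
    tensor_parallel_size=2,        # split weights across 2 GPUs
)
# from vllm import LLM
# llm = LLM(**memory_saving_kwargs)  # run on a GPU node
```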
Cause: Wrong Python environment
Solution: Ensure virtual environment is activated
```bash
source vllm_env/bin/activate
which python  # Should show vllm_env/bin/python
```

Key packages installed:
| Package | Version | Purpose |
|---|---|---|
| vllm | 0.11.2 | LLM inference engine |
| torch | 2.9.0+cu129 | Deep learning framework |
| transformers | 4.57.6 | HuggingFace models |
| nvidia-nccl-cu12 | 2.27.5 | Multi-GPU communication |
| xformers | 0.0.33.post1 | Memory-efficient attention |
| flashinfer-python | 0.5.2 | Fast attention kernels |
Full list: See requirements-frozen.txt
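To confirm an environment actually matches these pins, something like the sketch below can compare installed versions against the table. The `check_pins` helper and the `EXPECTED` dict are illustrative, not part of the repo:

```python
# Sketch: compare installed package versions against the pinned ones.
# check_pins and EXPECTED are illustrative helpers.
from importlib.metadata import PackageNotFoundError, version

# Pins mirror the table above.
EXPECTED = {
    "vllm": "0.11.2",
    "torch": "2.9.0+cu129",
    "transformers": "4.57.6",
}

def check_pins(expected):
    """Map package name -> (wanted version, installed version or None)."""
    results = {}
    for pkg, want in expected.items():
        try:
            found = version(pkg)
        except PackageNotFoundError:
            found = None
        results[pkg] = (want, found)
    return results

for pkg, (want, found) in check_pins(EXPECTED).items():
    status = "OK" if found == want else "MISMATCH"
    print(f"{pkg}: expected {want}, found {found} [{status}]")
```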
```python
from vllm import LLM, SamplingParams

# Initialize model
llm = LLM(model="facebook/opt-125m")

# Set parameters
sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=100
)

# Generate
prompts = ["Hello, how are you?"]
outputs = llm.generate(prompts, sampling_params)
print(outputs[0].outputs[0].text)
```

Quantized model:

```python
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    quantization="awq",  # or "gptq"
    gpu_memory_utilization=0.9
)
```

Tensor parallelism:

```python
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=4,  # Uses NCCL
    max_model_len=4096
)
```

- vLLM Documentation: https://docs.vllm.ai/
- Multi-GPU Guide: nccl_and_multi_gpu.md
- Issue Tracker: https://github.com/vllm-project/vllm/issues
To update this environment:
- Make changes to your environment
- Update the lock file: `uv lock --index-strategy unsafe-best-match`
- Update the frozen requirements: `uv pip freeze > requirements-frozen.txt`
- Test with both installation methods
- Update this README with any new instructions
This setup guide is part of the distributed-inference-vllm project. See LICENSE for details.
- Created by: Naeem Khoshnevis
- Date: 2026-03-04
- Last updated: 2026-03-04
You now have a production-ready vLLM environment:
- vLLM 0.11.2 with CUDA 12.9 support
- Fully reproducible with lock files
- Tested and verified
- Multi-GPU ready (NCCL included)
- Documentation and troubleshooting guides
Next steps:
- Run tests to verify: `python test_vllm_installation.py`
- Try your first inference: `python test_vllm_inference.py`
- Explore multi-GPU: `python check_nccl_status.py`
- Check the setup with ordinary virtual environments (venv) and update instructions if needed
- Add steps for building Docker image + converting to Singularity
- Add instructions on compiling the latest vLLM from source with CUDA 12.9 support (for bleeding-edge users)