A production-ready Docker Compose setup for running vLLM-based LLM inference on CPU-only systems, optimized for macOS with minimal footprint.
This setup demonstrates how to:
- Deploy vLLM for CPU-only inference on macOS
- Serve small, efficient models (SmolLM2 family)
- Optimize resource usage for local development
- Build custom vLLM images with critical patches
Key features:

- CPU-Optimized: Patched vLLM with NUMA node handling for containerized environments
- Small Footprint: Configurable memory limits and model sizes
- macOS Compatible: Thread tuning for Apple Silicon (M1/M2) and Intel Macs
- Production Ready: Health checks, automatic restarts, and resource limits
- Easy Configuration: Environment-based setup with presets
- Interactive Chatbot: Gradio-based web interface included
📚 Teaching a Workshop? Check out our comprehensive workshop guides:
- WORKSHOP.md - Complete 2-3 hour workshop curriculum
- WORKSHOP_SETUP.md - Pre-workshop setup checklist for participants
The workshop covers:
- Comparing default vs. optimized vLLM images
- Understanding Dockerfile optimization techniques
- Building and deploying with Docker Compose
- Creating an interactive chatbot with Gradio
- Performance tuning and optimization experiments
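
The chatbot topic above boils down to a Gradio front end talking to the OpenAI-compatible endpoint that vLLM exposes. A rough, self-contained sketch of that wiring (not the repo's actual app; assumes a recent Gradio, the default port 8009, and the 360M model):

```python
"""Minimal Gradio chat UI backed by the local vLLM endpoint (illustrative sketch)."""
import gradio as gr
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8009/v1", api_key="dummy")
MODEL = "HuggingFaceTB/SmolLM2-360M-Instruct"  # adjust to your preset


def respond(message, history):
    # With type="messages", history arrives as OpenAI-style {"role", "content"} dicts.
    messages = [{"role": m["role"], "content": m["content"]} for m in history]
    messages.append({"role": "user", "content": message})
    reply = client.chat.completions.create(model=MODEL, messages=messages, max_tokens=256)
    return reply.choices[0].message.content


gr.ChatInterface(respond, type="messages", title="SmolLM2 on vLLM (CPU)").launch()
```
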
Prerequisites:

- Docker Desktop for Mac (4.x or later)
- At least 4GB free RAM (8GB recommended)
- 10GB free disk space
Navigate to the project directory:

```bash
cd /path/to/vllm-cpu
```

Edit `.env` to customize the model and resource limits:

```bash
# Use the default balanced preset (360M model)
# Or uncomment one of the presets at the bottom of .env
```

Build and start the service:

```bash
# Build and start vLLM
docker compose up -d

# View logs
docker compose logs -f vllm-cpu

# Wait for model download and initialization (first run may take 5-10 minutes)
```

Test the API:

```bash
# Health check
curl http://localhost:8009/health
# List available models
curl http://localhost:8009/v1/models
# Generate text
curl http://localhost:8009/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "HuggingFaceTB/SmolLM2-360M-Instruct",
"prompt": "What is the capital of France?",
"max_tokens": 100,
"temperature": 0.7
  }'
```

Example presets for `.env` (pick one):

SmolLM2-135M (lightweight):

```bash
MODEL_NAME=HuggingFaceTB/SmolLM2-135M-Instruct
MAX_MODEL_LEN=1024
MEMORY_LIMIT=4G
```

SmolLM2-360M (default, balanced):

```bash
MODEL_NAME=HuggingFaceTB/SmolLM2-360M-Instruct
MAX_MODEL_LEN=2048
MEMORY_LIMIT=8G
```

SmolLM2-1.7B (largest, needs the most memory):

```bash
MODEL_NAME=HuggingFaceTB/SmolLM2-1.7B-Instruct
MAX_MODEL_LEN=4096
MEMORY_LIMIT=12G
```
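
Once the container is up, you can smoke-test the API from Python instead of curl. A minimal sketch (not part of the repo) that polls the health endpoint and then requests a completion; it assumes the default port 8009 and the 360M preset, so adjust `MODEL` to match your `.env`:

```python
"""Poll the vLLM health endpoint, then run a one-shot completion."""
import time

import requests  # pip install requests

BASE_URL = "http://localhost:8009"  # matches VLLM_PORT in .env
MODEL = "HuggingFaceTB/SmolLM2-360M-Instruct"  # adjust to your preset

# First start can take 5-10 minutes while the model downloads.
for _ in range(120):
    try:
        if requests.get(f"{BASE_URL}/health", timeout=2).status_code == 200:
            break
    except requests.ConnectionError:
        pass
    time.sleep(5)
else:
    raise SystemExit("vLLM never became healthy; check `docker compose logs -f vllm-cpu`")

print("Models:", [m["id"] for m in requests.get(f"{BASE_URL}/v1/models").json()["data"]])

resp = requests.post(
    f"{BASE_URL}/v1/completions",
    json={"model": MODEL, "prompt": "What is the capital of France?", "max_tokens": 32},
    timeout=120,
)
print(resp.json()["choices"][0]["text"])
```
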
- Base Image: `openeuler/vllm-cpu:0.9.1-oe2403lts`
  - Pre-built vLLM with CPU optimizations
  - OpenEuler Linux for stability
- NUMA Patch: fixes a division-by-zero on systems without NUMA nodes (common in containers):

  ```dockerfile
  RUN sed -i 's|cpu_count_per_numa = cpu_count // numa_size|cpu_count_per_numa = cpu_count // numa_size if numa_size > 0 else cpu_count|g' \
      /workspace/vllm/vllm/worker/cpu_worker.py
  ```
- Environment Tuning:
  - `VLLM_CPU_KVCACHE_SPACE=1`: limited key-value cache for memory efficiency
  - `OMP_NUM_THREADS=2`: controlled parallelism to avoid CPU thrashing
  - `OPENBLAS_NUM_THREADS=1`: single-threaded BLAS operations
  - `MKL_NUM_THREADS=1`: single-threaded Intel MKL
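
To confirm these variables are actually applied to the running container, you can read them back with `docker inspect`. A small sketch, assuming the container name `vllm-smollm2` used by this compose file:

```python
"""Print the thread/cache tuning variables of the running vLLM container."""
import json
import subprocess

CONTAINER = "vllm-smollm2"  # container_name from docker-compose.yml

raw = subprocess.run(
    ["docker", "inspect", CONTAINER],
    check=True, capture_output=True, text=True,
).stdout
env = json.loads(raw)[0]["Config"]["Env"]  # list of "KEY=VALUE" strings

for entry in sorted(env):
    if any(key in entry for key in ("THREADS", "KVCACHE")):
        print(entry)
```
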
Docker Compose applies CPU and memory limits to prevent system overload:

```yaml
deploy:
  resources:
    limits:
      cpus: '4.0'    # Maximum CPU cores
      memory: 8G     # Maximum RAM
    reservations:
      cpus: '2.0'    # Guaranteed CPU cores
      memory: 4G     # Guaranteed RAM
```

Suggested `.env` settings for Apple Silicon (M1/M2):

```bash
OMP_THREADS=4      # M1/M2 have 8+ cores
CPU_LIMIT=6.0 # Use more cores
MEMORY_LIMIT=12G   # If you have 16GB+ RAM
```

Suggested `.env` settings for Intel Macs:

```bash
OMP_THREADS=2      # Conservative threading
CPU_LIMIT=4.0 # Moderate CPU usage
MEMORY_LIMIT=8G    # Standard allocation
```

If running low on memory:
- Reduce `MAX_MODEL_LEN` (limits the context window)
- Reduce `MAX_NUM_SEQS` (limits concurrent requests)
- Reduce `KVCACHE_SPACE` (limits cached tokens)
- Switch to a smaller model (135M instead of 360M)
For better responsiveness:
- Increase `OMP_THREADS` (if you have CPU headroom)
- Increase `CPU_LIMIT` in `.env`
- Close other resource-intensive applications
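
A quick way to see whether a tuning change helps is to time a single request and compute a rough tokens-per-second figure. A minimal sketch, assuming the default port 8009, the 360M model, and that the server reports token usage in the standard OpenAI format:

```python
"""Rough single-request latency check for the vLLM completions endpoint."""
import time

import requests  # pip install requests

BASE_URL = "http://localhost:8009"  # matches VLLM_PORT in .env
MODEL = "HuggingFaceTB/SmolLM2-360M-Instruct"  # adjust to your preset

start = time.perf_counter()
resp = requests.post(
    f"{BASE_URL}/v1/completions",
    json={"model": MODEL, "prompt": "Explain Docker in one sentence.", "max_tokens": 64},
    timeout=300,
)
elapsed = time.perf_counter() - start

body = resp.json()
tokens = body.get("usage", {}).get("completion_tokens", 0)
print(f"{elapsed:.1f}s total, {tokens} tokens, {tokens / elapsed:.1f} tokens/s")
```
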
See `test_vllm.py` for a complete example:

```bash
python test_vllm.py
```

Chat completion:

```bash
curl http://localhost:8009/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "HuggingFaceTB/SmolLM2-360M-Instruct",
"messages": [
{"role": "user", "content": "Explain Docker in one sentence."}
],
"max_tokens": 50
  }'
```

Streaming completion:

```bash
curl http://localhost:8009/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "HuggingFaceTB/SmolLM2-360M-Instruct",
"prompt": "Write a haiku about containers:",
"max_tokens": 50,
"stream": true
  }'
```

Python (OpenAI client):

```python
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8009/v1",
api_key="dummy" # vLLM doesn't require authentication
)
response = client.chat.completions.create(
model="HuggingFaceTB/SmolLM2-360M-Instruct",
messages=[
{"role": "user", "content": "What is Docker?"}
]
)
print(response.choices[0].message.content)
```
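
The same client can also stream tokens as they are generated, mirroring the `stream: true` curl example above; a brief sketch:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8009/v1", api_key="dummy")

# stream=True yields chunks as the model generates them instead of one final response
stream = client.chat.completions.create(
    model="HuggingFaceTB/SmolLM2-360M-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about containers."}],
    max_tokens=50,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```
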
If the container won't start, check the logs:

```bash
# Check logs
docker compose logs vllm-cpu

# Common issues:
# 1. Insufficient memory - reduce MEMORY_LIMIT in .env
# 2. Model download failed - check internet connection
# 3. Port conflict - change VLLM_PORT in .env
```

If you run out of memory:

```bash
# Stop the service
docker compose down
# Edit .env and reduce memory usage:
# - Switch to SmolLM2-135M-Instruct
# - Set MAX_MODEL_LEN=1024
# - Set MEMORY_LIMIT=4G
# Restart
docker compose up -d
```

If inference is slow:

```bash
# Check CPU usage
docker stats vllm-smollm2
# Increase thread count in .env:
OMP_THREADS=4      # Or higher based on your CPU
```

If the model download appears stuck:

```bash
# Download can take 5-10 minutes on first run
# Monitor progress:
docker compose logs -f vllm-cpu
# If truly stuck, restart:
docker compose restart vllm-cpu
```

Mount a local model directory:

```yaml
volumes:
  - ./models:/workspace/models:ro
```

Then set:

```bash
MODEL_NAME=/workspace/models/my-model
```

Uncomment the `webui` service in `docker-compose.yml`:

```bash
docker compose up -d
# Access UI at http://localhost:3000
```

To serve multiple models, create additional service definitions in `docker-compose.yml` with different ports and models.
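
With several services running, a quick way to check what each one serves is to query `/v1/models` on every port. A small sketch; the second port (8010) is just a hypothetical example of what an additional service might use:

```python
"""List which model each local vLLM service is serving (ports are examples)."""
import requests  # pip install requests

# Hypothetical port assignments: one per service definition in docker-compose.yml
PORTS = [8009, 8010]

for port in PORTS:
    try:
        data = requests.get(f"http://localhost:{port}/v1/models", timeout=5).json()
        print(f"port {port}:", [m["id"] for m in data["data"]])
    except requests.RequestException as exc:
        print(f"port {port}: not reachable ({exc})")
```
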
LangChain can use the same endpoint through its OpenAI wrapper (legacy `langchain.llms` API shown; newer releases use the `langchain-openai` package):

```python
from langchain.llms import OpenAI
llm = OpenAI(
openai_api_base="http://localhost:8009/v1",
openai_api_key="dummy",
model_name="HuggingFaceTB/SmolLM2-360M-Instruct"
)
response = llm("Explain vLLM in one sentence.")
print(response)
```

To update to the latest base image:

```bash
# Pull latest base image
docker compose pull
# Rebuild with no cache
docker compose build --no-cache
# Restart services
docker compose up -d
```

To clean up:

```bash
# Stop and remove containers
docker compose down
# Remove volumes (clears cached models)
docker compose down -v
# Remove built images
docker rmi vllm-cpu-optimized:latest
```

Approximate model requirements:

| Model | Disk Space | RAM (Min) | RAM (Recommended) |
|---|---|---|---|
| SmolLM2-135M | ~500MB | 2GB | 4GB |
| SmolLM2-360M | ~1.3GB | 4GB | 8GB |
| SmolLM2-1.7B | ~6.5GB | 8GB | 12GB |
This deployment configuration is provided as-is. vLLM and the models have their own licenses.