
vLLM CPU-Optimized Deployment for macOS

A production-ready Docker Compose setup for running vLLM-based LLM inference on CPU-only systems, optimized for macOS with a minimal footprint.

Overview

This setup demonstrates how to:

  • Deploy vLLM for CPU-only inference on macOS
  • Serve small, efficient models (SmolLM2 family)
  • Optimize resource usage for local development
  • Build custom vLLM images with critical patches

Features

  • CPU-Optimized: Patched vLLM with NUMA node handling for containerized environments
  • Small Footprint: Configurable memory limits and model sizes
  • macOS Compatible: Thread tuning for Apple Silicon (M1/M2) and Intel Macs
  • Production Ready: Health checks, automatic restarts, and resource limits
  • Easy Configuration: Environment-based setup with presets
  • Interactive Chatbot: Gradio-based web interface included

Workshop

📚 Teaching a Workshop? Check out our comprehensive workshop guide.

The workshop covers:

  • Comparing default vs. optimized vLLM images
  • Understanding Dockerfile optimization techniques
  • Building and deploying with Docker Compose
  • Creating an interactive chatbot with Gradio
  • Performance tuning and optimization experiments

Quick Start

Prerequisites

  • Docker Desktop for Mac (4.x or later)
  • At least 4GB free RAM (8GB recommended)
  • 10GB free disk space

1. Clone/Navigate to Directory

cd /path/to/vllm-cpu

2. Configure Settings (Optional)

Edit .env to customize model and resource limits:

# Use the default balanced preset (360M model)
# Or uncomment one of the presets at the bottom of .env
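
For example, switching to the minimal preset means setting the values below (variable names match those listed under "Configuration Presets" later in this README):

MODEL_NAME=HuggingFaceTB/SmolLM2-135M-Instruct
MAX_MODEL_LEN=1024
MEMORY_LIMIT=4G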

3. Start the Service

# Build and start vLLM
docker compose up -d

# View logs
docker compose logs -f vllm-cpu

# Wait for model download and initialization (first run may take 5-10 minutes)

4. Test the API

# Health check
curl http://localhost:8009/health

# List available models
curl http://localhost:8009/v1/models

# Generate text
curl http://localhost:8009/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "HuggingFaceTB/SmolLM2-360M-Instruct",
    "prompt": "What is the capital of France?",
    "max_tokens": 100,
    "temperature": 0.7
  }'

Configuration Presets

Minimal Footprint (~2GB RAM)

MODEL_NAME=HuggingFaceTB/SmolLM2-135M-Instruct
MAX_MODEL_LEN=1024
MEMORY_LIMIT=4G

Balanced (Default, ~4GB RAM)

MODEL_NAME=HuggingFaceTB/SmolLM2-360M-Instruct
MAX_MODEL_LEN=2048
MEMORY_LIMIT=8G

Maximum Quality (~10GB RAM)

MODEL_NAME=HuggingFaceTB/SmolLM2-1.7B-Instruct
MAX_MODEL_LEN=4096
MEMORY_LIMIT=12G

Architecture

Dockerfile Optimizations

  1. Base Image: openeuler/vllm-cpu:0.9.1-oe2403lts

    • Pre-built vLLM with CPU optimizations
    • OpenEuler Linux for stability
  2. NUMA Patch: Fixes a division-by-zero crash on systems that report no NUMA nodes (sketched in Python after this list)

    RUN sed -i 's|cpu_count_per_numa = cpu_count // numa_size|cpu_count_per_numa = cpu_count // numa_size if numa_size > 0 else cpu_count|g' \
        /workspace/vllm/vllm/worker/cpu_worker.py
  3. Environment Tuning:

    • VLLM_CPU_KVCACHE_SPACE=1: Limits the key-value cache to 1 GB for memory efficiency
    • OMP_NUM_THREADS=2: Controlled parallelism to avoid CPU thrashing
    • OPENBLAS_NUM_THREADS=1: Single-threaded BLAS operations
    • MKL_NUM_THREADS=1: Single-threaded Intel MKL
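
The effect of the NUMA patch above, shown as a minimal Python sketch (the real code lives in vllm/worker/cpu_worker.py; surrounding context is omitted):

# Before the patch: raises ZeroDivisionError when numa_size == 0
# cpu_count_per_numa = cpu_count // numa_size

# After the patch: fall back to all CPUs when no NUMA nodes are reported
cpu_count_per_numa = cpu_count // numa_size if numa_size > 0 else cpu_count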

Resource Limits

Docker Compose applies CPU and memory limits to prevent system overload:

deploy:
  resources:
    limits:
      cpus: '4.0'        # Maximum CPU cores
      memory: 8G         # Maximum RAM
    reservations:
      cpus: '2.0'        # Guaranteed CPU cores
      memory: 4G         # Guaranteed RAM

Performance Tuning

For Apple Silicon (M1/M2/M3)

OMP_THREADS=4          # M1/M2 have 8+ cores
CPU_LIMIT=6.0          # Use more cores
MEMORY_LIMIT=12G       # If you have 16GB+ RAM

For Intel Macs

OMP_THREADS=2          # Conservative threading
CPU_LIMIT=4.0          # Moderate CPU usage
MEMORY_LIMIT=8G        # Standard allocation

Memory Optimization

If running low on memory (a combined .env example follows this list):

  1. Reduce MAX_MODEL_LEN (limits context window)
  2. Reduce MAX_NUM_SEQS (limits concurrent requests)
  3. Reduce KVCACHE_SPACE (limits cached tokens)
  4. Switch to a smaller model (135M instead of 360M)
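
Putting these items together, a low-memory .env might look like this (values are illustrative; MAX_NUM_SEQS and KVCACHE_SPACE are assumed to be configurable via .env like the other variables):

MODEL_NAME=HuggingFaceTB/SmolLM2-135M-Instruct
MAX_MODEL_LEN=1024
MAX_NUM_SEQS=4
KVCACHE_SPACE=1
MEMORY_LIMIT=4G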

CPU Optimization

For better responsiveness:

  1. Increase OMP_THREADS (if you have CPU headroom)
  2. Increase CPU_LIMIT in .env
  3. Close other resource-intensive applications

API Usage Examples

Using Python

See test_vllm.py for a complete example:

python test_vllm.py
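
If the script is not at hand, a minimal equivalent using the requests library might look like this (endpoint and model name as configured above; requests is an assumed dependency):

import requests

# Send a simple completion request to the local vLLM server
response = requests.post(
    "http://localhost:8009/v1/completions",
    json={
        "model": "HuggingFaceTB/SmolLM2-360M-Instruct",
        "prompt": "What is the capital of France?",
        "max_tokens": 50,
        "temperature": 0.7,
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["choices"][0]["text"])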

Using curl

Chat Completion

curl http://localhost:8009/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "HuggingFaceTB/SmolLM2-360M-Instruct",
    "messages": [
      {"role": "user", "content": "Explain Docker in one sentence."}
    ],
    "max_tokens": 50
  }'

Streaming Response

curl http://localhost:8009/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "HuggingFaceTB/SmolLM2-360M-Instruct",
    "prompt": "Write a haiku about containers:",
    "max_tokens": 50,
    "stream": true
  }'

Using OpenAI Python Client

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8009/v1",
    api_key="dummy"  # vLLM doesn't require authentication
)

response = client.chat.completions.create(
    model="HuggingFaceTB/SmolLM2-360M-Instruct",
    messages=[
        {"role": "user", "content": "What is Docker?"}
    ]
)

print(response.choices[0].message.content)
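
The same client can also stream tokens as they are generated; a short sketch reusing the client defined above:

# Stream the reply chunk by chunk instead of waiting for the full response
stream = client.chat.completions.create(
    model="HuggingFaceTB/SmolLM2-360M-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about containers."}],
    max_tokens=50,
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()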

Troubleshooting

Container won't start

# Check logs
docker compose logs vllm-cpu

# Common issues:
# 1. Insufficient memory - reduce MEMORY_LIMIT in .env
# 2. Model download failed - check internet connection
# 3. Port conflict - change VLLM_PORT in .env

Out of memory errors

# Stop the service
docker compose down

# Edit .env and reduce memory usage:
# - Switch to SmolLM2-135M-Instruct
# - Set MAX_MODEL_LEN=1024
# - Set MEMORY_LIMIT=4G

# Restart
docker compose up -d

Slow inference

# Check CPU usage
docker stats vllm-smollm2

# Increase thread count in .env:
OMP_THREADS=4  # Or higher based on your CPU

Model download stuck

# Download can take 5-10 minutes on first run
# Monitor progress:
docker compose logs -f vllm-cpu

# If truly stuck, restart:
docker compose restart vllm-cpu

Advanced Usage

Using Local Models

Mount a local model directory:

volumes:
  - ./models:/workspace/models:ro

Then set:

MODEL_NAME=/workspace/models/my-model

Adding Web UI

Uncomment the webui service in docker-compose.yml:

docker compose up -d
# Access UI at http://localhost:3000

Multi-Model Setup

Create additional service definitions in docker-compose.yml with different ports and models.
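
For illustration, a second service could be sketched as below (service name, host port, and model are placeholders; the remaining keys should mirror the existing vllm-cpu service in docker-compose.yml):

  vllm-cpu-small:
    build: .
    ports:
      - "8010:8000"   # different host port; container port as in the main service
    environment:
      - MODEL_NAME=HuggingFaceTB/SmolLM2-135M-Instruct
      - MAX_MODEL_LEN=1024
    restart: unless-stopped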

Using with LangChain

from langchain.llms import OpenAI

llm = OpenAI(
    openai_api_base="http://localhost:8009/v1",
    openai_api_key="dummy",
    model_name="HuggingFaceTB/SmolLM2-360M-Instruct"
)

response = llm("Explain vLLM in one sentence.")
print(response)

Maintenance

Update vLLM

# Pull latest base image
docker compose pull

# Rebuild with no cache
docker compose build --no-cache

# Restart services
docker compose up -d

Clean Up

# Stop and remove containers
docker compose down

# Remove volumes (clears cached models)
docker compose down -v

# Remove built images
docker rmi vllm-cpu-optimized:latest

Resource Requirements

Model          Disk Space   RAM (Min)   RAM (Recommended)
SmolLM2-135M   ~500MB       2GB         4GB
SmolLM2-360M   ~1.3GB       4GB         8GB
SmolLM2-1.7B   ~6.5GB       8GB         12GB

License

This deployment configuration is provided as-is. vLLM and the models have their own licenses.
