Based on analysis of parallel model performance and Ollama research, here are the critical optimizations for your containerized deployment on a single NVIDIA 40XX GPU.
Variable: OLLAMA_NUM_PARALLEL=8
- Why: Default limit of 4 causes queuing when multiple models are queried simultaneously or in rapid succession.
- Impact: Enables true parallel execution without bottlenecks.
Variable: OLLAMA_MAX_LOADED_MODELS=3
- Why: Rapidly switching between models adds 5-10 second reload delays.
- Impact: Eliminates model reload overhead; multi-model workflows run 50%+ faster.
- Note: Adjust based on your GPU VRAM (3x 7B Q4_0 models ≈ 12-15GB).
Variable: OLLAMA_KEEP_ALIVE=1800 (30 minutes)
- Why: Prevents models from unloading between queries during active sessions.
- Impact: Models stay ready for immediate use (a per-request override is sketched after this list).
Variable: OLLAMA_NUM_GPU=1
- Why: Ensures Ollama properly detects and uses your single GPU.
- Impact: Optimal GPU utilization.
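OLLAMA_KEEP_ALIVE sets the server-wide default, but Ollama's /api/generate and /api/chat endpoints also accept a per-request keep_alive field, so a single hot model can be pinned longer than the rest. A minimal sketch, assuming the API is reachable at localhost:11434 and qwen2.5-coder:7b is already pulled:

```python
import requests

# Per-request keep_alive overrides the server-wide OLLAMA_KEEP_ALIVE.
# It accepts a duration string ("30m", "1h") or a number of seconds;
# -1 keeps the model loaded until it is explicitly unloaded.
resp = requests.post(
    "http://localhost:11434/api/generate",  # assumed host/port
    json={
        "model": "qwen2.5-coder:7b",
        "prompt": "Say hello in one word.",
        "stream": False,
        "keep_alive": "30m",  # matches OLLAMA_KEEP_ALIVE=1800 above
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```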
Recommended model lineup:
- Primary Model: qwen2.5-coder:7b (~4.7GB)
- Secondary Models: qwen2.5-coder:1.5b (~1GB), qwen2.5-coder:3b (~2GB)
- Total VRAM: ~7.7GB
- Performance: High throughput, low switching latency.
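With OLLAMA_MAX_LOADED_MODELS=3, all three models can be made resident up front instead of loading lazily on first use. A minimal preload sketch, assuming the models are already pulled and the API is reachable at localhost:11434 (a generate call without a prompt loads the model but produces no tokens):

```python
import requests

MODELS = ["qwen2.5-coder:7b", "qwen2.5-coder:3b", "qwen2.5-coder:1.5b"]

for model in MODELS:
    # Omitting "prompt" loads the model into memory without generating.
    r = requests.post(
        "http://localhost:11434/api/generate",  # assumed host/port
        json={"model": model},
        timeout=600,
    )
    r.raise_for_status()
    print(f"loaded {model}")
```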
Use Quantized Models (Q4_0 or Q5_0):
- 7B Q4_0 models: ~4-5GB VRAM each.
- 7B Q5_0 models: ~5-6GB VRAM each.
VRAM Planning:
- 16GB GPU: 3x 7B-8B Q4_0 models (~12-15GB) or 2x 7B-8B Q5_0 models (~10-12GB).
- 12GB GPU: 2x 7B-8B Q4_0 models (~8-10GB) or 3x 3B-4B Q4_0 models (~6-9GB).
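These estimates follow from the GGUF block formats: Q4_0 stores 4.5 bits per weight and Q5_0 stores 5.5, plus runtime overhead for the KV cache and buffers. A back-of-the-envelope helper (the 20% overhead factor is an assumption, and real usage varies with context length):

```python
def approx_vram_gb(params_b: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    """Weights take params * bits / 8 bytes; add ~20% for KV cache and buffers."""
    return params_b * bits_per_weight / 8 * overhead

# Qwen2.5-Coder parameter counts (the 7b tag is ~7.6B parameters)
for name, params in [("1.5b", 1.5), ("3b", 3.0), ("7b", 7.6)]:
    q4 = approx_vram_gb(params, 4.5)
    q5 = approx_vram_gb(params, 5.5)
    print(f"qwen2.5-coder:{name}  Q4_0 ≈ {q4:.1f} GB  Q5_0 ≈ {q5:.1f} GB")
```

The Q4_0 figures (~1.0, ~2.0, ~5.1GB) land close to the per-model numbers above.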
Add to services/docker-compose.yml:
```yaml
services:
  ollama:
    environment:
      - OLLAMA_HOST=127.0.0.1:11434
      - OLLAMA_NUM_PARALLEL=8
      - OLLAMA_MAX_LOADED_MODELS=3
      - OLLAMA_KEEP_ALIVE=1800
      - OLLAMA_NUM_GPU=1
      - OLLAMA_NUM_THREAD=8  # Match CPU core count
```

Then restart: ./aixcl stack restart ollama
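After the restart, one way to confirm the settings are behaving as configured is the /api/ps endpoint, which lists currently loaded models, their VRAM footprint, and when each expires. A sketch assuming the API is reachable from where the script runs:

```python
import requests

# /api/ps shows loaded models; with the settings above you should see
# up to 3 models resident, each expiring ~30 minutes after last use.
resp = requests.get("http://localhost:11434/api/ps", timeout=10)
resp.raise_for_status()
for m in resp.json().get("models", []):
    vram_gb = m.get("size_vram", 0) / 1e9
    print(f"{m.get('name')}: {vram_gb:.1f} GB VRAM, expires {m.get('expires_at')}")
```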
Before:
- Sequential/limited parallel execution.
- Model reload delays between queries.
After:
- True parallel execution.
- No reload delays (models stay loaded).
- 50-70% improvement in multi-model workflow response times.
- GPU Memory: Monitor VRAM usage; if OOM errors occur, reduce OLLAMA_MAX_LOADED_MODELS.
- Model Selection: Use quantized models (Q4_0/Q5_0) to maximize how many models fit in VRAM.
- Testing: Verify parallel execution in the logs and measure the actual improvement (see the benchmark sketch below).
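For the measurement step, a rough benchmark is to fire the same request sequentially and then with 8 workers and compare wall-clock time; with OLLAMA_NUM_PARALLEL=8 the parallel run should finish far faster. A sketch (model name and port are assumptions):

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

OLLAMA_URL = "http://localhost:11434"  # assumed host/port
MODEL = "qwen2.5-coder:1.5b"           # any pulled model works

def one_request(_):
    r = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={"model": MODEL, "prompt": "Reverse 'hello'.", "stream": False},
        timeout=300,
    )
    r.raise_for_status()

def timed(workers: int, n: int = 8) -> float:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(one_request, range(n)))
    return time.perf_counter() - start

print(f"sequential (1 worker):  {timed(1):.1f}s")
print(f"parallel   (8 workers): {timed(8):.1f}s")
```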
See ollama-performance-tuning.md for complete details.