Optimize SmarterRouter for your specific workload and hardware.
- Latency vs Throughput
- Configuration Trade-offs
- Profiling Optimization
- VRAM Management
- Database Performance
- Monitoring
## Latency vs Throughput

### Latency-Optimized

Goal: Fastest possible response times for interactive use.
Settings:

```bash
# Pin a small model (1-3B) for instant responses
ROUTER_PINNED_MODEL=phi3:mini

# Prefer smaller models
ROUTER_QUALITY_PREFERENCE=0.2
ROUTER_PREFER_SMALLER_MODELS=true

# Enable response caching
ROUTER_CACHE_ENABLED=true
ROUTER_CACHE_MAX_SIZE=1000
ROUTER_CACHE_TTL_SECONDS=3600

# Fast profiling (less accurate)
ROUTER_PROFILE_PROMPTS_PER_CATEGORY=1

# Reduce cache complexity (exact match only)
ROUTER_EMBED_MODEL=  # leave empty
```

Hardware: GPU with at least 8GB VRAM for the pinned model plus one additional model.
Expected results:
- First token: ~200-500ms (if model already loaded)
- Full response (100 tokens): ~500ms - 2s
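With `ROUTER_EMBED_MODEL` left empty, caching falls back to exact-match lookups. A minimal sketch of how a cache honoring `ROUTER_CACHE_MAX_SIZE` and `ROUTER_CACHE_TTL_SECONDS` might behave — the class, eviction policy, and method names here are illustrative, not SmarterRouter's actual implementation:

```python
import time
from collections import OrderedDict

class ExactMatchCache:
    """Toy LRU cache with TTL, mimicking ROUTER_CACHE_MAX_SIZE / ROUTER_CACHE_TTL_SECONDS."""

    def __init__(self, max_size=1000, ttl_seconds=3600):
        self.max_size = max_size
        self.ttl = ttl_seconds
        self._store = OrderedDict()  # prompt -> (response, inserted_at)

    def get(self, prompt):
        entry = self._store.get(prompt)
        if entry is None:
            return None
        response, inserted_at = entry
        if time.monotonic() - inserted_at > self.ttl:
            del self._store[prompt]  # entry expired
            return None
        self._store.move_to_end(prompt)  # mark as recently used
        return response

    def put(self, prompt, response):
        self._store[prompt] = (response, time.monotonic())
        self._store.move_to_end(prompt)
        while len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict least recently used

cache = ExactMatchCache(max_size=2, ttl_seconds=3600)
cache.put("Hello", "Hi there!")
print(cache.get("Hello"))  # → Hi there!
print(cache.get("hello"))  # exact match only → None
```

Exact matching is cheap but misses paraphrases; the throughput profile below trades that simplicity for semantic caching.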
### Throughput-Optimized

Goal: Maximize requests per minute, minimize cost.
Settings:

```bash
# Use batch processing where possible
ROUTER_CACHE_ENABLED=true
ROUTER_CACHE_MAX_SIZE=2000
ROUTER_CACHE_RESPONSE_MAX_SIZE=500

# Enable semantic caching if you have an embedding model
ROUTER_EMBED_MODEL=nomic-embed-text:latest
ROUTER_CACHE_SIMILARITY_THRESHOLD=0.80

# Aggressive caching
ROUTER_CACHE_TTL_SECONDS=7200  # 2 hours

# Rate limiting to prevent overload
ROUTER_RATE_LIMIT_ENABLED=true
ROUTER_RATE_LIMIT_REQUESTS_PER_MINUTE=200

# Reduce logging overhead
ROUTER_LOG_FORMAT=json
ROUTER_LOG_LEVEL=WARNING
```

Additional:
- Use load balancer in front of multiple router instances
- Use PostgreSQL instead of SQLite for concurrent access
- Increase worker processes if using gunicorn
Expected results:
- 100+ requests/minute with cached responses
- 30-50 requests/minute with fresh generations
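A semantic cache compares the embedding of an incoming prompt against cached prompts and serves a hit when similarity crosses `ROUTER_CACHE_SIMILARITY_THRESHOLD`. A sketch of the cosine-similarity check — the toy vectors and helper names are illustrative, not SmarterRouter's internals:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def is_semantic_hit(query_emb, cached_emb, threshold=0.80):
    # Mirrors ROUTER_CACHE_SIMILARITY_THRESHOLD=0.80: near-duplicate
    # prompts reuse the cached response instead of triggering generation.
    return cosine_similarity(query_emb, cached_emb) >= threshold

# Toy 3-dim embeddings (real embedding models emit hundreds of dimensions)
print(is_semantic_hit([1.0, 0.1, 0.0], [1.0, 0.15, 0.02]))  # very similar → True
print(is_semantic_hit([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))    # orthogonal → False
```

Lowering the threshold raises the hit rate but risks serving a cached answer to a genuinely different question; 0.80 is a middle-of-the-road starting point.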
### Quality-Optimized

Goal: Always select the best possible model, regardless of speed.
Settings:

```bash
ROUTER_QUALITY_PREFERENCE=1.0
ROUTER_JUDGE_ENABLED=true
ROUTER_JUDGE_MODEL=gpt-4o
ROUTER_PROFILE_PROMPTS_PER_CATEGORY=5  # more profiling samples
ROUTER_CACHE_ENABLED=false  # don't compromise quality for speed
```

Trade-offs: Slower responses, higher VRAM usage.
## Configuration Trade-offs

| Setting | Speed Impact | Quality Impact | Memory Impact |
|---|---|---|---|
| `ROUTER_QUALITY_PREFERENCE=0.0` | ✅✅✅ | ↓↓↓ | - |
| `ROUTER_QUALITY_PREFERENCE=1.0` | ↓↓↓ | ✅✅✅ | - |
| `ROUTER_CACHE_ENABLED=true` | ✅✅✅ | - | - |
| `ROUTER_PINNED_MODEL=phi3:mini` | ✅✅ | ↓ | ↑↑↑ |
| `ROUTER_PROFILE_PROMPTS_PER_CATEGORY=1` | ✅ | ↓ | - |
| `ROUTER_JUDGE_ENABLED=true` | - | ✅✅ | - |
## Profiling Optimization

Profiling is a one-time cost, but it can be significant with many models.
```bash
# Fewer prompts per category (default=3)
ROUTER_PROFILE_PROMPTS_PER_CATEGORY=1

# Skip LLM-as-Judge (much faster, less accurate)
ROUTER_JUDGE_ENABLED=false

# Increase timeout to avoid retries
ROUTER_PROFILE_TIMEOUT=180
```

Expected speedup:
- Default (3 prompts + judge): ~30-60 min for 18 models
- Optimized (1 prompt, no judge): ~10-20 min for 18 models
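The speedup comes mostly from cutting the number of profiling generations. A back-of-the-envelope sketch, assuming profiling runs one generation per (model, category, prompt) — the category count of 6 is an illustrative assumption, not a documented value:

```python
def profiling_runs(models, categories, prompts_per_category):
    """Total profiling generations: one per (model, category, prompt)."""
    return models * categories * prompts_per_category

# Illustrative: 18 models, assumed 6 prompt categories
default_runs = profiling_runs(18, 6, 3)    # default prompts_per_category=3
optimized_runs = profiling_runs(18, 6, 1)  # ROUTER_PROFILE_PROMPTS_PER_CATEGORY=1
print(default_runs, optimized_runs)        # → 324 108
print(default_runs / optimized_runs)       # → 3.0 (3x fewer generations)
```

A 3x reduction in generations lines up with the ~30-60 min dropping to ~10-20 min above; skipping the judge removes a second generation pass on top of that.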
SmarterRouter profiles models in parallel by default, with concurrency bounded by `ROUTER_PROFILE_CONCURRENCY`:

```bash
# Maximum concurrent profiling operations
ROUTER_PROFILE_CONCURRENCY=4
```

Higher values mean faster profiling but more VRAM/CPU load.
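One common way to implement such a bound is a semaphore around each profiling task. A sketch under that assumption — `profile_model` is a stand-in, not SmarterRouter's actual code:

```python
import asyncio

async def profile_model(name, sem, active, peak):
    """Stand-in for profiling one model, gated by the concurrency semaphore."""
    async with sem:
        active[0] += 1
        peak[0] = max(peak[0], active[0])  # record observed concurrency
        await asyncio.sleep(0.01)          # pretend to run profiling prompts
        active[0] -= 1

async def profile_all(models, concurrency=4):
    sem = asyncio.Semaphore(concurrency)   # ROUTER_PROFILE_CONCURRENCY
    active, peak = [0], [0]
    await asyncio.gather(*(profile_model(m, sem, active, peak) for m in models))
    return peak[0]

peak = asyncio.run(profile_all([f"model-{i}" for i in range(10)], concurrency=4))
print(peak)  # never exceeds 4
```

All ten tasks are launched at once, but the semaphore ensures at most four profile concurrently, which is what keeps VRAM/CPU load bounded.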
## VRAM Management

```bash
# Set realistic limit (leave 10-15% headroom)
ROUTER_VRAM_MAX_TOTAL_GB=22.0  # for 24GB GPU

# Aggressive unloading
ROUTER_VRAM_AUTO_UNLOAD_ENABLED=true
ROUTER_VRAM_UNLOAD_THRESHOLD_PCT=75
ROUTER_VRAM_UNLOAD_STRATEGY=largest

# Don't pin too many models
# Only ROUTER_PINNED_MODEL is truly pinned; others unload automatically
```

Check current usage:

```bash
curl http://localhost:11436/admin/vram | jq .
```

Watch for:
- `utilization_pct` > 90% → risk of OOM
- `warnings` array non-empty → threshold exceeded
- `free_gb` < 2 → load smaller models or increase the limit
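These checks are easy to automate against the `/admin/vram` response. A sketch assuming the endpoint returns `utilization_pct`, `free_gb`, and `warnings` fields (the exact response shape may differ; the sample payload values are made up):

```python
def vram_alerts(status):
    """Derive alert strings from an /admin/vram-style response dict."""
    alerts = []
    if status.get("utilization_pct", 0) > 90:
        alerts.append("utilization above 90%: risk of OOM")
    if status.get("warnings"):
        alerts.append("unload threshold exceeded: " + "; ".join(status["warnings"]))
    if status.get("free_gb", float("inf")) < 2:
        alerts.append("under 2GB free: load smaller models or raise the limit")
    return alerts

# Example payload (illustrative values only)
sample = {"utilization_pct": 93.5, "free_gb": 1.4, "warnings": ["threshold 75% exceeded"]}
for alert in vram_alerts(sample):
    print(alert)
```

Wiring this into a cron job or monitoring agent turns the manual `curl | jq` check into an alerting signal.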
- Use quantized models: `Q4_K_M`, `Q5_K_S` variants
- Prefer smaller base models: `1B`, `3B`, `7B` over `70B`
- Adjust context size: `OLLAMA_NUM_CTX=2048` reduces memory
- Unload unused models manually: restart with only the necessary models, or adjust `ROUTER_PINNED_MODEL`
## Database Performance

For write-heavy workloads with many models:
```bash
# Increase cache size
ROUTER_DATABASE_URL=sqlite:///data/router.db?cache=shared&cache_size=10000

# Use WAL mode (better concurrency)
# Add to startup script or custom code:
#   engine._engine.execute('PRAGMA journal_mode=WAL')
```

SmarterRouter 2.1.9 includes significant database performance improvements:
- Connection Pooling: automatic connection pooling with `pool_size=10`, `max_overflow=20`
- Batched Queries: VRAM estimates and other operations use batched queries to avoid N+1 patterns
- Query Optimization: reduced redundant queries in model refresh and fallback logic
Impact:
- 90% reduction in database queries for model fallback scenarios
- Eliminated blocking GPU I/O with async wrapper
- Reduced prompt analysis overhead with 5-minute caching
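The WAL pragma mentioned above can be tried out with Python's stdlib `sqlite3`. A minimal, runnable sketch — it uses a throwaway temp file because in-memory databases cannot use WAL; adapting it to SmarterRouter's engine object is left to you:

```python
import os
import sqlite3
import tempfile

# WAL requires a file-backed database; create a throwaway one
path = os.path.join(tempfile.mkdtemp(), "router.db")
conn = sqlite3.connect(path)

# Switch the journal to write-ahead logging: readers no longer block writers
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)  # → wal

conn.execute("CREATE TABLE IF NOT EXISTS models (name TEXT)")
conn.execute("INSERT INTO models VALUES ('phi3:mini')")
conn.commit()
conn.close()
```

WAL is persistent: once set, the database stays in WAL mode across reconnects, so running the pragma once at startup is sufficient.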
To use PostgreSQL instead of SQLite:

```bash
ROUTER_DATABASE_URL=postgresql://user:pass@localhost:5432/smarterrouter
```

Benefits:
- Better concurrency
- Faster writes
- Easier backups
- Connection pooling
Setup:
```bash
docker run -d \
  -e POSTGRES_PASSWORD=secret \
  -e POSTGRES_DB=smarterrouter \
  -p 5432:5432 \
  postgres:15
```

## Monitoring

Scrape http://localhost:11436/metrics every 30 seconds.
Key metrics to alert on:
- `smarterrouter_request_duration_seconds{endpoint="/v1/chat/completions"}` - P95 > 10s?
- `smarterrouter_errors_total` - rising error rate
- `smarterrouter_vram_utilization_pct` - > 85%?
- `smarterrouter_cache_hit_rate` - < 50% suggests cache size too small
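The P95 > 10s alert above can be reproduced offline from a window of request durations. A simple nearest-rank percentile sketch (your monitoring stack will usually compute this for you; the sample window is made up):

```python
import math

def p95(latencies_s):
    """Nearest-rank 95th percentile of a list of durations in seconds."""
    ordered = sorted(latencies_s)
    rank = max(1, math.ceil(0.95 * len(ordered)))  # 1-based nearest rank
    return ordered[rank - 1]

# Illustrative window: 10% of requests are pathologically slow
window = [0.4] * 90 + [12.0] * 10
print(p95(window) > 10)  # → True: this window would fire the P95 > 10s alert
```

P95 is preferable to the mean here because a small fraction of slow generations can hide behind an acceptable average.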
Use `ROUTER_LOG_FORMAT=json` and ship logs to:
- Loki/Grafana
- ELK stack
- Datadog
- CloudWatch
Example log entry:
```json
{
  "timestamp": "2024-02-20T12:34:56.789Z",
  "level": "INFO",
  "request_id": "abc123",
  "endpoint": "/v1/chat/completions",
  "model_selected": "llama3:70b",
  "latency_ms": 1250,
  "cache_hit": false,
  "vram_used_gb": 18.2
}
```

To measure improvements:
```bash
# Install hey (HTTP load testing)
go install github.com/rakyll/hey@latest

# Test with 100 requests, 10 concurrent
hey -c 10 -n 100 \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello"}]}' \
  http://localhost:11436/v1/chat/completions
```

Track:
- Avg latency: Target < 2s (first token < 500ms)
- RPS: Requests per second
- Error rate: Should be 0%
Compare before/after config changes to validate tuning.
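The JSON log entries shown earlier make a before/after comparison straightforward to script. A sketch summarizing a handful of log lines into the tracked metrics (the sample lines and values are made up; only the `latency_ms` and `cache_hit` fields from the log format above are assumed):

```python
import json

log_lines = [
    '{"latency_ms": 1250, "cache_hit": false}',
    '{"latency_ms": 90,   "cache_hit": true}',
    '{"latency_ms": 110,  "cache_hit": true}',
]

entries = [json.loads(line) for line in log_lines]
avg_latency_ms = sum(e["latency_ms"] for e in entries) / len(entries)
cache_hit_rate = sum(e["cache_hit"] for e in entries) / len(entries)
print(f"avg latency: {avg_latency_ms:.0f}ms, cache hit rate: {cache_hit_rate:.0%}")
# → avg latency: 483ms, cache hit rate: 67%
```

Running the same summary over logs captured before and after a config change gives you a concrete number to justify (or revert) the tuning.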