SmarterRouter provides OpenAI-compatible endpoints for seamless integration with existing tools and applications.
- Development: `http://localhost:11436`
- Production: Configure based on your deployment
| Endpoint Type | Authentication Required |
|---|---|
| `/v1/*` (chat, embeddings, models) | No |
| `/admin/*` | Yes - `ROUTER_ADMIN_API_KEY` required |
| `/health`, `/metrics` | No |
Admin endpoints require the header: `Authorization: Bearer your-admin-api-key`
Health check endpoint with subsystem diagnostics.
Response:
```json
{
  "status": "healthy",
  "checks": {
    "database": {"status": "healthy", "details": "Database connection successful"},
    "backend": {"status": "healthy", "details": "Backend initialized"},
    "gpu_monitor": {"status": "healthy", "details": {"gpus": [], "total_gb": 0.0, "used_gb": 0.0}},
    "cache": {"status": "healthy", "details": {"backend": "memory"}},
    "background_tasks": {"status": "healthy", "details": {"count": 3}}
  },
  "version": "2.2.0",
  "request_id": "req_abc123"
}
```

Provider DB note: `checks.database.details.provider_db` includes `available`, `degraded`, and `stale` indicators so operators can detect when provider benchmark data is temporarily serving fallback behavior.
DLQ note: When DLQ is enabled, checks.dlq includes aggregate counts for failed, retrying, and dead background jobs.
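The health payload lends itself to simple alerting; a minimal sketch that flattens `checks` into a list of unhealthy subsystems, using the field names from the sample response above:

```python
import json

def unhealthy_checks(health_json: str) -> list[str]:
    """Return names of subsystems whose status is not 'healthy'."""
    payload = json.loads(health_json)
    return [name for name, check in payload.get("checks", {}).items()
            if check.get("status") != "healthy"]

sample = '''{"status": "degraded",
             "checks": {"database": {"status": "healthy"},
                        "cache": {"status": "unhealthy"}}}'''
print(unhealthy_checks(sample))  # → ['cache']
```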
Prometheus metrics endpoint for monitoring integration.
Content-Type: text/plain; version=0.0.4
Metrics included:
- `smarterrouter_requests_total` - Request count by endpoint and method
- `smarterrouter_request_duration_seconds` - Request duration histogram
- `smarterrouter_errors_total` - Error count by endpoint and type
- `smarterrouter_model_selections_total` - Model selection distribution
- `smarterrouter_cache_hits_total` / `cache_misses_total` - Cache statistics
- `smarterrouter_vram_total_gb`, `vram_used_gb`, `vram_utilization_pct` - GPU memory
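The endpoint serves the Prometheus text exposition format (version 0.0.4). For quick scripting without a Prometheus client library, a minimal parser sketch; it skips `# HELP`/`# TYPE` comments and keeps any label string as part of the sample key:

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse simple Prometheus text-format samples into a dict."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples

sample = """\
# HELP smarterrouter_requests_total Request count
# TYPE smarterrouter_requests_total counter
smarterrouter_requests_total{endpoint="/v1/chat/completions"} 42
smarterrouter_vram_used_gb 18.2
"""
metrics = parse_metrics(sample)
print(metrics["smarterrouter_vram_used_gb"])  # → 18.2
```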
Returns the router itself as a single model.
Response:
```json
{
  "object": "list",
  "data": [
    {
      "id": "smarterrouter/main",
      "object": "model",
      "created": 1708162374.0,
      "owned_by": "local",
      "description": "An intelligent LLM router that automatically selects the best model..."
    }
  ]
}
```

Main chat completion endpoint. Compatible with the OpenAI API format.
Request:
```json
{
  "messages": [
    {"role": "user", "content": "Write a Python function..."}
  ],
  "temperature": 0.7,
  "max_tokens": 500,
  "stream": false
}
```

Streaming: Set `"stream": true` for Server-Sent Events (SSE) format.
Rate limiting: When enabled, chat uses a dedicated per-IP limit via `ROUTER_RATE_LIMIT_CHAT_REQUESTS_PER_MINUTE` and returns HTTP 429 (Rate limit exceeded) when the threshold is exceeded.
Request timeout: End-to-end request processing is bounded by a global timeout (`ROUTER_REQUEST_TIMEOUT_SECONDS`, enabled by default). If exceeded, the endpoint returns HTTP 504 with a `timeout_error` payload.
Error observability: Failure paths on chat and related APIs emit structured error logs with correlation fields (`request_id`, `user_ip`, `model_name`, `prompt_hash`) when available, enabling easier tracing from API responses to backend logs.
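With `"stream": true`, content arrives as SSE `data:` lines terminated by `data: [DONE]`. A minimal consumer sketch, assuming the OpenAI-style chunk shape (`choices[0].delta.content`):

```python
import json

def sse_text(stream_lines):
    """Yield content fragments from OpenAI-style SSE chat chunks."""
    for line in stream_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            yield delta["content"]

lines = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(sse_text(lines)))  # → Hello
```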
Response:
```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "created": 1708162374,
  "model": "smarterrouter/main",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Generated response..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 150,
    "total_tokens": 165
  }
}
```

Note: The `model` field always returns `smarterrouter/main` as the router is the interface. The actual model used is appended to the response signature (see `ROUTER_SIGNATURE_ENABLED`).
Generate vector embeddings for text.
Request:
```json
{
  "model": "nomic-embed-text",
  "input": "The quick brown fox jumps over the lazy dog"
}
```

Multiple inputs:

```json
{
  "model": "nomic-embed-text",
  "input": ["text 1", "text 2", "text 3"]
}
```

Response:
```json
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [0.0123, -0.0456, ...],
      "index": 0
    }
  ],
  "model": "nomic-embed-text",
  "usage": {
    "prompt_tokens": 10,
    "total_tokens": 10
  }
}
```

Submit user feedback to improve routing decisions.
Request:
```json
{
  "response_id": "chatcmpl-...",
  "score": 1.0,
  "comment": "Great answer! Very helpful."
}
```

Parameters:
- `response_id` (required): Response ID from chat completion
- `score` (required): Float 0.0-2.0, where:
  - 0.0 = poor quality
  - 1.0 = acceptable/expected
  - 2.0 = exceptional/exceeded expectations
- `comment` (optional): Text feedback
Response: 200 OK on success
All admin endpoints require the `Authorization: Bearer your-admin-api-key` header.
View performance profiles of all models.
Query params:
- `limit` (default 1000)
- `offset` (legacy pagination, ignored when `cursor` is set)
- `cursor` (cursor-based pagination; returns rows with `name > cursor`)
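Cursor pagination can be driven by following `next_cursor` until it comes back null; a sketch with a stand-in `fetch_page` callback (the real callback would GET this endpoint with the `cursor` query param):

```python
def iter_pages(fetch_page):
    """Yield items across cursor-paginated responses.

    `fetch_page(cursor)` must return a dict with `profiles` and
    `next_cursor` keys, matching this endpoint's response shape.
    """
    cursor = None
    while True:
        page = fetch_page(cursor)
        yield from page.get("profiles", [])
        cursor = page.get("next_cursor")
        if cursor is None:
            break

# Stand-in pages for illustration (no server required):
pages = {
    None: {"profiles": [{"model_name": "codellama:34b"}], "next_cursor": "codellama:34b"},
    "codellama:34b": {"profiles": [{"model_name": "llama3:70b"}], "next_cursor": None},
}
names = [p["model_name"] for p in iter_pages(lambda c: pages[c])]
print(names)  # → ['codellama:34b', 'llama3:70b']
```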
Response:
```json
{
  "cursor": null,
  "next_cursor": "llama3:70b",
  "offset": 0,
  "limit": 100,
  "profiles": [
    {
      "model_name": "llama3:70b",
      "reasoning_score": 0.94,
      "coding_score": 0.87,
      "creativity_score": 0.78,
      "speed_score": 0.34,
      "vram_required_gb": 42.5,
      "last_profiled": "2024-02-20T10:30:00Z",
      "profiling_status": "completed"
    }
  ]
}
```

View aggregated benchmark data from external sources (HuggingFace, LMSYS).
Query params:
- `?model=llama3:70b` - Filter to specific model
- `?limit=100` - Page size
- `?offset=0` - Legacy offset mode (ignored when `cursor` is set)
- `?cursor=llama3:8b` - Cursor mode (returns rows with `ollama_name > cursor`)
Response:
```json
{
  "cursor": null,
  "next_cursor": "llama3:70b",
  "offset": 0,
  "limit": 100,
  "benchmarks": [
    {
      "ollama_name": "llama3:70b",
      "mmlu_score": 0.82,
      "gpqa_score": 0.58,
      "humaneval_score": 0.78,
      "math_500_score": 0.85
    }
  ]
}
```

Trigger manual reprofiling of models.
Query params:
- `?force=true` - Reprofile all models, even if already profiled
- `?models=llama3:70b,codellama:34b` - Specific models only
Response:
```json
{
  "message": "Reprofiling started for 3 models",
  "task_id": "abc123",
  "check_status": "/admin/profiling_status/abc123"
}
```

List dead-letter-queue entries for failed background tasks.
Query params:
- `status` (optional): `failed`, `retrying`, `dead`, or `resolved`
- `limit` (default `50`, max `200`)
- `offset` (default `0`)
Response:
```json
{
  "enabled": true,
  "total": 2,
  "limit": 50,
  "offset": 0,
  "status": "failed",
  "entries": [
    {
      "id": 12,
      "task_name": "benchmark_sync",
      "status": "failed",
      "attempts": 1,
      "max_retries": 3,
      "error_message": "timeout calling provider",
      "payload": {"source": "huggingface"},
      "created_at": "2026-03-16T10:15:00+00:00",
      "last_attempt_at": "2026-03-16T10:16:00+00:00",
      "next_retry_at": "2026-03-16T10:17:00+00:00",
      "resolved_at": null
    }
  ]
}
```

Manually retry a specific DLQ entry.
Response:
```json
{
  "entry_id": 12,
  "success": true,
  "status": "resolved",
  "attempts": 2,
  "next_retry_at": null
}
```

Check status of a profiling task.
Response:
```json
{
  "task_id": "abc123",
  "status": "running",
  "progress": {
    "total": 5,
    "completed": 2,
    "current_model": "codellama:34b"
  },
  "estimated_completion": "2024-02-20T15:30:00Z"
}
```

View real-time VRAM usage and model memory allocation.
Response:
```json
{
  "total_gb": 23.8,
  "used_gb": 18.2,
  "free_gb": 5.6,
  "utilization_pct": 76.5,
  "gpus": [
    {
      "index": 0,
      "total_gb": 23.8,
      "used_gb": 18.2,
      "free_gb": 5.6
    }
  ],
  "loaded_models": [
    {
      "model_name": "llama3:70b",
      "vram_used_gb": 42.5,
      "loaded_at": "2024-02-20T10:15:00Z",
      "last_used": "2024-02-20T14:30:00Z"
    }
  ],
  "warnings": [
    "VRAM utilization above 75% threshold"
  ]
}
```

Invalidate cache entries.
Query params:
- `?type=routing` - Clear routing cache only
- `?type=response` - Clear response cache only
- `?all=true` - Clear all caches (default)
Response:
```json
{
  "message": "Cache invalidated",
  "cleared": {
    "routing_entries": 45,
    "response_entries": 23
  }
}
```

Explain routing decision for a given prompt.
Query params:
- `?prompt=Your prompt here` (required)
- `?category=...` - Override category detection
Response:
```json
{
  "prompt": "Write a Python function for binary search",
  "detected_category": "coding",
  "complexity": 0.42,
  "selected_model": "codellama:34b",
  "scores": {
    "capability_score": 0.87,
    "benchmark_score": 0.81,
    "speed_score": 0.65,
    "final_score": 0.82
  },
  "alternatives_considered": [
    {"model": "llama3:70b", "score": 0.76, "rejected_reason": "Too slow"},
    {"model": "phi3:mini", "score": 0.45, "rejected_reason": "Below minimum size"}
  ]
}
```

| HTTP Status | Meaning | Common Causes |
|---|---|---|
| 200 | Success | - |
| 400 | Bad Request | Invalid JSON, missing required fields |
| 401 | Unauthorized | Invalid/missing admin API key |
| 404 | Not Found | Endpoint doesn't exist |
| 429 | Too Many Requests | Rate limit exceeded |
| 500 | Internal Server Error | Backend failure, model load error |
| 503 | Service Unavailable | Backend not connected, all models failed |
| 504 | Gateway Timeout | Global request timeout exceeded (`ROUTER_REQUEST_TIMEOUT_SECONDS`) |
- General endpoints: 60 requests/minute per IP (configurable)
- Admin endpoints: 10 requests/minute per IP (configurable)
- Chat completions: Also limited by backend provider rate limits
Rate limit exceeded responses include Retry-After header with seconds to wait.
CORS is disabled by default. To enable it for specific origins, set `ROUTER_CORS_ALLOWED_ORIGINS` in `.env`:

```
ROUTER_CORS_ALLOWED_ORIGINS=http://localhost:3000,https://myapp.example.com
```
SmarterRouter is compatible with any OpenAI client library that supports:
- `/v1/chat/completions` endpoint
- `/v1/models` endpoint
- Optional: `/v1/embeddings` endpoint
Tested with:
- OpenAI Python SDK
- OpenWebUI (v0.2+)
- Continue (VS Code extension)
- Cursor IDE
- SillyTavern
- Custom applications
Important: The router presents itself as a single model (smarterrouter/main) to simplify frontend integration. The actual model selection is transparent to the client.
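Because the router speaks the OpenAI wire format, any HTTP client can target it by overriding the base URL. A stdlib sketch that builds (but does not send) a chat request against the development URL from above:

```python
import json
import urllib.request

def chat_completion_request(base_url: str, messages: list[dict]) -> urllib.request.Request:
    """Build an OpenAI-compatible chat request against the router.

    The request object is constructed but not sent here; pass it to
    urllib.request.urlopen(...) against a running router instance.
    """
    body = {"messages": messages, "stream": False}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = chat_completion_request(
    "http://localhost:11436",
    [{"role": "user", "content": "Hello"}],
)
print(req.full_url)  # → http://localhost:11436/v1/chat/completions
```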