SmarterRouter provides OpenAI-compatible endpoints for seamless integration with existing tools and applications.
- Development: `http://localhost:11436`
- Production: Configure based on your deployment
| Endpoint Type | Authentication Required |
|---|---|
| `/v1/*` (chat, embeddings, models) | No |
| `/admin/*` | Yes - `ROUTER_ADMIN_API_KEY` required |
| `/health`, `/metrics` | No |
Admin endpoints require the header: `Authorization: Bearer your-admin-api-key`
Health check endpoint with subsystem diagnostics.
Response:
```json
{
  "status": "healthy",
  "checks": {
    "database": {"status": "healthy", "details": "Database connection successful"},
    "backend": {"status": "healthy", "details": "Backend initialized"},
    "gpu_monitor": {"status": "healthy", "details": {"gpus": [], "total_gb": 0.0, "used_gb": 0.0}},
    "cache": {"status": "healthy", "details": {"backend": "memory"}},
    "background_tasks": {"status": "healthy", "details": {"count": 3}}
  },
  "version": "2.2.0",
  "request_id": "req_abc123"
}
```

Provider DB note: `checks.database.details.provider_db` includes `available`, `degraded`, and `stale` indicators so operators can detect when provider benchmark data is temporarily serving fallback behavior.
DLQ note: When DLQ is enabled, checks.dlq includes aggregate counts for failed, retrying, and dead background jobs.
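The health payload lends itself to simple alerting; a minimal sketch that flattens `checks` into a list of unhealthy subsystems, using the field names from the sample response above:

```python
import json

def unhealthy_checks(health_json: str) -> list[str]:
    """Return names of subsystems whose status is not 'healthy'."""
    payload = json.loads(health_json)
    return [name for name, check in payload.get("checks", {}).items()
            if check.get("status") != "healthy"]

sample = '''{"status": "degraded",
             "checks": {"database": {"status": "healthy"},
                        "cache": {"status": "unhealthy"}}}'''
print(unhealthy_checks(sample))  # → ['cache']
```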
Prometheus metrics endpoint for monitoring integration.
Content-Type: text/plain; version=0.0.4
Metrics included:
- `smarterrouter_requests_total` - Request count by endpoint and method
- `smarterrouter_request_duration_seconds` - Request duration histogram
- `smarterrouter_errors_total` - Error count by endpoint and type
- `smarterrouter_model_selections_total` - Model selection distribution
- `smarterrouter_cache_hits_total` / `cache_misses_total` - Cache statistics
- `smarterrouter_vram_total_gb`, `vram_used_gb`, `vram_utilization_pct` - GPU memory
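The endpoint serves the Prometheus text exposition format (version 0.0.4). For quick scripting without a Prometheus client library, a minimal parser sketch; it skips `# HELP`/`# TYPE` comments and keeps any label string as part of the sample key:

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse simple Prometheus text-format samples into a dict."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples

sample = """\
# HELP smarterrouter_requests_total Request count
# TYPE smarterrouter_requests_total counter
smarterrouter_requests_total{endpoint="/v1/chat/completions"} 42
smarterrouter_vram_used_gb 18.2
"""
metrics = parse_metrics(sample)
print(metrics["smarterrouter_vram_used_gb"])  # → 18.2
```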
Returns the router itself as a single model.
Response:
```json
{
  "object": "list",
  "data": [
    {
      "id": "smarterrouter/main",
      "object": "model",
      "created": 1708162374.0,
      "owned_by": "local",
      "description": "An intelligent LLM router that automatically selects the best model..."
    }
  ]
}
```

Main chat completion endpoint. Compatible with the OpenAI API format.
Request:
```json
{
  "messages": [
    {"role": "user", "content": "Write a Python function..."}
  ],
  "temperature": 0.7,
  "max_tokens": 500,
  "stream": false
}
```

Streaming: Set `"stream": true` for Server-Sent Events (SSE) format.
Rate limiting: When enabled, chat uses a dedicated per-IP limit via `ROUTER_RATE_LIMIT_CHAT_REQUESTS_PER_MINUTE` and returns HTTP 429 (Rate limit exceeded) when the threshold is exceeded.
Request timeout: End-to-end request processing is bounded by a global timeout (`ROUTER_REQUEST_TIMEOUT_SECONDS`, enabled by default). If exceeded, the endpoint returns HTTP 504 with a `timeout_error` payload.
Error observability: Failure paths on chat and related APIs emit structured error logs with correlation fields (`request_id`, `user_ip`, `model_name`, `prompt_hash`) when available, enabling easier tracing from API responses to backend logs.
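With `"stream": true`, content arrives as SSE `data:` lines terminated by `data: [DONE]`. A minimal consumer sketch, assuming the OpenAI-style chunk shape (`choices[0].delta.content`):

```python
import json

def sse_text(stream_lines):
    """Yield content fragments from OpenAI-style SSE chat chunks."""
    for line in stream_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            yield delta["content"]

lines = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(sse_text(lines)))  # → Hello
```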
Response:
```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "created": 1708162374,
  "model": "smarterrouter/main",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Generated response..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 150,
    "total_tokens": 165
  }
}
```

Note: The `model` field always returns `smarterrouter/main` as the router is the interface. The actual model used is appended to the response signature (see `ROUTER_SIGNATURE_ENABLED`).
Generate vector embeddings for text.
Request:
```json
{
  "model": "nomic-embed-text",
  "input": "The quick brown fox jumps over the lazy dog"
}
```

Multiple inputs:

```json
{
  "model": "nomic-embed-text",
  "input": ["text 1", "text 2", "text 3"]
}
```

Response:
```json
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [0.0123, -0.0456, ...],
      "index": 0
    }
  ],
  "model": "nomic-embed-text",
  "usage": {
    "prompt_tokens": 10,
    "total_tokens": 10
  }
}
```

Submit user feedback to improve routing decisions.
Request:
```json
{
  "response_id": "chatcmpl-...",
  "score": 1.0,
  "comment": "Great answer! Very helpful."
}
```

Parameters:
- `response_id` (required): Response ID from chat completion
- `score` (required): Float 0.0-2.0, where:
  - 0.0 = poor quality
  - 1.0 = acceptable/expected
  - 2.0 = exceptional/exceeded expectations
- `comment` (optional): Text feedback
Response: 200 OK on success
All admin endpoints require the `Authorization: Bearer your-admin-api-key` header.
View performance profiles of all models.
Query params:
- `limit` (default 1000)
- `offset` (legacy pagination, ignored when `cursor` is set)
- `cursor` (cursor-based pagination; returns rows with `name > cursor`)
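Cursor pagination can be driven by following `next_cursor` until it comes back null; a sketch with a stand-in `fetch_page` callback (the real callback would GET this endpoint with the `cursor` query param):

```python
def iter_pages(fetch_page):
    """Yield items across cursor-paginated responses.

    `fetch_page(cursor)` must return a dict with `profiles` and
    `next_cursor` keys, matching this endpoint's response shape.
    """
    cursor = None
    while True:
        page = fetch_page(cursor)
        yield from page.get("profiles", [])
        cursor = page.get("next_cursor")
        if cursor is None:
            break

# Stand-in pages for illustration (no server required):
pages = {
    None: {"profiles": [{"model_name": "codellama:34b"}], "next_cursor": "codellama:34b"},
    "codellama:34b": {"profiles": [{"model_name": "llama3:70b"}], "next_cursor": None},
}
names = [p["model_name"] for p in iter_pages(lambda c: pages[c])]
print(names)  # → ['codellama:34b', 'llama3:70b']
```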
Response:
```json
{
  "cursor": null,
  "next_cursor": "llama3:70b",
  "offset": 0,
  "limit": 100,
  "profiles": [
    {
      "model_name": "llama3:70b",
      "reasoning_score": 0.94,
      "coding_score": 0.87,
      "creativity_score": 0.78,
      "speed_score": 0.34,
      "vram_required_gb": 42.5,
      "last_profiled": "2024-02-20T10:30:00Z",
      "profiling_status": "completed"
    }
  ]
}
```

View aggregated benchmark data from external sources (HuggingFace, LMSYS).
Query params:
- `?model=llama3:70b` - Filter to specific model
- `?limit=100` - Page size
- `?offset=0` - Legacy offset mode (ignored when `cursor` is set)
- `?cursor=llama3:8b` - Cursor mode (returns rows with `ollama_name > cursor`)
Response:
```json
{
  "cursor": null,
  "next_cursor": "llama3:70b",
  "offset": 0,
  "limit": 100,
  "benchmarks": [
    {
      "ollama_name": "llama3:70b",
      "mmlu_score": 0.82,
      "gpqa_score": 0.58,
      "humaneval_score": 0.78,
      "math_500_score": 0.85
    }
  ]
}
```

Trigger manual reprofiling of models.
Query params:
- `?force=true` - Reprofile all models, even if already profiled
- `?models=llama3:70b,codellama:34b` - Specific models only
Response:
```json
{
  "message": "Reprofiling started for 3 models",
  "task_id": "abc123",
  "check_status": "/admin/profiling_status/abc123"
}
```

List dead-letter-queue entries for failed background tasks.
Query params:
- `status` (optional): `failed`, `retrying`, `dead`, or `resolved`
- `limit` (default `50`, max `200`)
- `offset` (default `0`)
Response:
```json
{
  "enabled": true,
  "total": 2,
  "limit": 50,
  "offset": 0,
  "status": "failed",
  "entries": [
    {
      "id": 12,
      "task_name": "benchmark_sync",
      "status": "failed",
      "attempts": 1,
      "max_retries": 3,
      "error_message": "timeout calling provider",
      "payload": {"source": "huggingface"},
      "created_at": "2026-03-16T10:15:00+00:00",
      "last_attempt_at": "2026-03-16T10:16:00+00:00",
      "next_retry_at": "2026-03-16T10:17:00+00:00",
      "resolved_at": null
    }
  ]
}
```

Manually retry a specific DLQ entry.
Response:
```json
{
  "entry_id": 12,
  "success": true,
  "status": "resolved",
  "attempts": 2,
  "next_retry_at": null
}
```

Check status of a profiling task.
Response:
```json
{
  "task_id": "abc123",
  "status": "running",
  "progress": {
    "total": 5,
    "completed": 2,
    "current_model": "codellama:34b"
  },
  "estimated_completion": "2024-02-20T15:30:00Z"
}
```

View real-time VRAM usage and model memory allocation.
Response:
```json
{
  "total_gb": 23.8,
  "used_gb": 18.2,
  "free_gb": 5.6,
  "utilization_pct": 76.5,
  "gpus": [
    {
      "index": 0,
      "total_gb": 23.8,
      "used_gb": 18.2,
      "free_gb": 5.6
    }
  ],
  "loaded_models": [
    {
      "model_name": "llama3:70b",
      "vram_used_gb": 42.5,
      "loaded_at": "2024-02-20T10:15:00Z",
      "last_used": "2024-02-20T14:30:00Z"
    }
  ],
  "warnings": [
    "VRAM utilization above 75% threshold"
  ]
}
```

Invalidate cache entries.
Query params:
- `?type=routing` - Clear routing cache only
- `?type=response` - Clear response cache only
- `?all=true` - Clear all caches (default)
Response:
```json
{
  "message": "Cache invalidated",
  "cleared": {
    "routing_entries": 45,
    "response_entries": 23
  }
}
```

Explain routing decision for a given prompt.
Query params:
- `?prompt=Your prompt here` (required)
- `?category=...` - Override category detection
Response:
```json
{
  "prompt": "Write a Python function for binary search",
  "detected_category": "coding",
  "complexity": 0.42,
  "selected_model": "codellama:34b",
  "scores": {
    "capability_score": 0.87,
    "benchmark_score": 0.81,
    "speed_score": 0.65,
    "final_score": 0.82
  },
  "alternatives_considered": [
    {"model": "llama3:70b", "score": 0.76, "rejected_reason": "Too slow"},
    {"model": "phi3:mini", "score": 0.45, "rejected_reason": "Below minimum size"}
  ]
}
```

| HTTP Status | Meaning | Common Causes |
|---|---|---|
| 200 | Success | - |
| 400 | Bad Request | Invalid JSON, missing required fields |
| 401 | Unauthorized | Invalid/missing admin API key |
| 404 | Not Found | Endpoint doesn't exist |
| 429 | Too Many Requests | Rate limit exceeded |
| 500 | Internal Server Error | Backend failure, model load error |
| 503 | Service Unavailable | Backend not connected, all models failed |
| 504 | Gateway Timeout | Global request timeout exceeded (`ROUTER_REQUEST_TIMEOUT_SECONDS`) |
- General endpoints: 60 requests/minute per IP (configurable)
- Admin endpoints: 10 requests/minute per IP (configurable)
- Chat completions: Also limited by backend provider rate limits
Rate limit exceeded responses include Retry-After header with seconds to wait.
CORS is disabled by default. To enable it for specific origins, set `ROUTER_CORS_ALLOWED_ORIGINS` in `.env`:

```
ROUTER_CORS_ALLOWED_ORIGINS=http://localhost:3000,https://myapp.example.com
```
SmarterRouter is compatible with any OpenAI client library that supports:
- `/v1/chat/completions` endpoint
- `/v1/models` endpoint
- Optional: `/v1/embeddings` endpoint
Tested with:
- OpenAI Python SDK
- OpenWebUI (v0.2+)
- Continue (VS Code extension)
- Cursor IDE
- SillyTavern
- Custom applications
Important: The router presents itself as a single model (smarterrouter/main) to simplify frontend integration. The actual model selection is transparent to the client.
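Because the router speaks the OpenAI wire format, any HTTP client can target it by overriding the base URL. A stdlib sketch that builds (but does not send) a chat request against the development URL from above:

```python
import json
import urllib.request

def chat_completion_request(base_url: str, messages: list[dict]) -> urllib.request.Request:
    """Build an OpenAI-compatible chat request against the router.

    The request object is constructed but not sent here; pass it to
    urllib.request.urlopen(...) against a running router instance.
    """
    body = {"messages": messages, "stream": False}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = chat_completion_request(
    "http://localhost:11436",
    [{"role": "user", "content": "Hello"}],
)
print(req.full_url)  # → http://localhost:11436/v1/chat/completions
```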