Nexus exposes an OpenAI-compatible API gateway that unifies local and cloud LLM backends behind a single endpoint. All responses follow the OpenAI format — Nexus-specific metadata is conveyed exclusively through X-Nexus-* headers.
For setup and configuration, see the Getting Started guide.
| Method | Path | Description |
|---|---|---|
| POST | `/v1/chat/completions` | Chat completion (streaming and non-streaming) |
| POST | `/v1/embeddings` | Generate text embeddings |
| GET | `/v1/models` | List available models from healthy backends |
| POST | `/v1/models/load` | Load model on specific backend (lifecycle API) |
| DELETE | `/v1/models/{id}` | Unload model from specific backend (lifecycle API) |
| POST | `/v1/models/migrate` | Migrate model between backends (lifecycle API) |
| GET | `/v1/fleet/recommendations` | Fleet intelligence recommendations (lifecycle API) |
| GET | `/health` | System health with backend/model counts |
| GET | `/v1/stats` | JSON stats: uptime, request counts, per-backend metrics |
| GET | `/metrics` | Prometheus text-format metrics |
| GET | `/` | Web dashboard (embedded, real-time via WebSocket) |
## POST /v1/chat/completions

OpenAI-compatible chat completion endpoint. Supports both streaming and non-streaming responses.
Request:
```json
{
  "model": "llama3:70b",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Hello!" }
  ],
  "stream": true,
  "temperature": 0.7,
  "max_tokens": 1000
}
```

| Field | Type | Required | Description |
|---|---|---|---|
| `model` | string | Yes | Model identifier (supports aliases) |
| `messages` | array | Yes | Conversation messages (system, user, assistant) |
| `stream` | boolean | No | Enable Server-Sent Events streaming (default: false) |
| `temperature` | number | No | Sampling temperature (0.0–2.0) |
| `max_tokens` | integer | No | Maximum tokens to generate |
Response (non-streaming):
```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1700000000,
  "model": "llama3:70b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 20,
    "completion_tokens": 10,
    "total_tokens": 30
  }
}
```

Response (streaming):
When `stream: true`, the response uses Server-Sent Events (SSE). Each event is a `data:` line containing a JSON chunk, terminated by `data: [DONE]`:

```
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"delta":{"content":"Hello"},"index":0}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"delta":{"content":"!"},"index":0,"finish_reason":"stop"}]}
data: [DONE]
```
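Because Nexus is OpenAI-compatible, any OpenAI client can consume this stream unchanged. A minimal sketch with the official `openai` Python package; the base URL and placeholder API key are assumptions, so adjust them for your deployment:

```python
from openai import OpenAI

# Point the standard OpenAI client at the Nexus gateway.
# Base URL and api_key are assumptions; whether Nexus enforces auth
# is not covered by this reference.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

stream = client.chat.completions.create(
    model="llama3:70b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    stream=True,
)

# Each chunk mirrors the chat.completion.chunk SSE payloads above.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```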
## POST /v1/embeddings

OpenAI-compatible embeddings endpoint. Generates vector representations of text input. Works with Ollama and OpenAI backends that support embedding models.
Request:
```json
{
  "model": "nomic-embed-text",
  "input": "The quick brown fox jumps over the lazy dog"
}
```

The `input` field accepts a single string or an array of strings for batch embedding:

```json
{
  "model": "nomic-embed-text",
  "input": ["First document", "Second document", "Third document"]
}
```

| Field | Type | Required | Description |
|---|---|---|---|
| `model` | string | Yes | Embedding model identifier |
| `input` | string \| string[] | Yes | Text to embed — single string or array of strings |
| `encoding_format` | string | No | Encoding format (e.g., "float") |
Response:
```json
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [0.0023, -0.0091, 0.0152, ...],
      "index": 0
    }
  ],
  "model": "nomic-embed-text",
  "usage": {
    "prompt_tokens": 10,
    "total_tokens": 10
  }
}
```

| Field | Type | Description |
|---|---|---|
| `object` | string | Always "list" |
| `data` | array | Array of embedding objects |
| `data[].object` | string | Always "embedding" |
| `data[].embedding` | float[] | Vector representation of the input |
| `data[].index` | integer | Index corresponding to the input position |
| `model` | string | Model used to generate the embeddings |
| `usage.prompt_tokens` | integer | Number of tokens in the input |
| `usage.total_tokens` | integer | Total tokens processed |
Supported backends: Ollama (e.g., `nomic-embed-text`, `all-minilm`), OpenAI (e.g., `text-embedding-3-small`, `text-embedding-ada-002`).
Error responses:
- `400` — Empty input or invalid request format
- `404` — Model not found on any backend
- `502` — Backend agent not registered or agent error
- `503` — No healthy backend with embeddings support available
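A short batch-embedding sketch, again assuming the `localhost:8000` base URL and a placeholder API key:

```python
from openai import OpenAI

# Base URL and api_key are assumptions; adjust for your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.embeddings.create(
    model="nomic-embed-text",
    input=["First document", "Second document", "Third document"],
)

# One embedding per input, in input order (see data[].index above).
for item in resp.data:
    print(item.index, len(item.embedding))
```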
## GET /v1/models

Lists all available models from healthy backends. Each entry corresponds to a specific model on a specific backend.
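Since the endpoint mirrors OpenAI's model listing, the standard client works unchanged (base URL and key are assumptions as before):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# owned_by carries the backend name rather than an organization.
for model in client.models.list():
    print(model.id, model.owned_by)
```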
Response:
```json
{
  "object": "list",
  "data": [
    {
      "id": "llama3:70b",
      "object": "model",
      "created": 1700000000,
      "owned_by": "backend-name"
    }
  ]
}
```

## GET /health

System health check with backend and model counts.
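A minimal liveness-probe sketch using `requests`, assuming Nexus at `http://localhost:8000` (the port shown in the metrics section):

```python
import requests

# /health is served at the gateway root, not under /v1.
health = requests.get("http://localhost:8000/health", timeout=5).json()
assert health["status"] == "healthy", health
print(health["backends"])  # e.g. {'total': 3, 'healthy': 2, 'unhealthy': 1}
```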
Response:
```json
{
  "status": "healthy",
  "version": "0.4.0",
  "uptime_seconds": 3600,
  "backends": { "total": 3, "healthy": 2, "unhealthy": 1 },
  "models": { "total": 5 }
}
```

## GET /v1/stats

JSON stats endpoint for dashboards and debugging. Returns uptime, per-backend request counts, latency, and pending request depth.
Example response fields:
- `uptime_seconds` — time since Nexus started
- `total_requests` — aggregate request count
- `backends[]` — per-backend stats including request count, average latency, and pending depth
## GET /metrics

Prometheus text-format metrics. Configure your Prometheus scraper to target:

```
http://<nexus-host>:8000/metrics
```
Exported metrics include:
- Request counters and duration histograms
- Error rates
- Backend latency
- Token usage
- Fleet state gauges
- Reconciler pipeline timing
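Outside of Prometheus itself, the exposition text can be inspected with the `prometheus_client` parser. This is a sketch with the same assumed host and port; it prints whatever metric names Nexus actually exports:

```python
import requests
from prometheus_client.parser import text_string_to_metric_families

# Fetch and parse the Prometheus exposition text.
body = requests.get("http://localhost:8000/metrics", timeout=5).text
for family in text_string_to_metric_families(body):
    for sample in family.samples:
        print(sample.name, sample.labels, sample.value)
```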
## GET /

Embedded web dashboard (HTML/JS/CSS) with real-time monitoring via WebSocket. See the WebSocket Protocol documentation for details on the real-time update format.
## Response headers

Nexus adds `X-Nexus-*` response headers to expose routing decisions without modifying the OpenAI-compatible JSON body. This keeps Nexus fully transparent to existing OpenAI client libraries.
| Header | Description | Example |
|---|---|---|
| `X-Nexus-Backend` | Backend that handled the request | `local-ollama` |
| `X-Nexus-Backend-Type` | `local` or `cloud` | `local` |
| `X-Nexus-Route-Reason` | Why this backend was chosen | `capability-match` |
| `X-Nexus-Cost-Estimated` | Estimated cost in USD (cloud only) | `0.0023` |
| `X-Nexus-Privacy-Zone` | Privacy zone of the backend | `restricted` |
| `X-Nexus-Fallback-Model` | Model used if fallback occurred | `gpt-3.5-turbo` |
| `X-Nexus-Rejection-Reasons` | Why backends were excluded (on 503) | `privacy_zone_mismatch` |
| `X-Nexus-Rejection-Details` | Detailed rejection context (on 503) | JSON details |
## Request headers

| Header | Description |
|---|---|
| `X-Nexus-Strict` | Enforce same-or-higher capability tier (default behavior) |
| `X-Nexus-Flexible` | Allow higher-tier substitution when the exact tier is unavailable |
| `X-Nexus-Priority` | Queue priority: `high` or `normal` (default: `normal`). When all capable backends are at capacity and request queuing is enabled, high-priority requests are dequeued before normal-priority requests. Invalid values default to `normal`. |
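With the `openai` Python SDK, these can be attached per request through `extra_headers`. Note that the value for `X-Nexus-Flexible` below is a placeholder; this reference doesn't specify what value the flag headers expect:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="llama3:70b",
    messages=[{"role": "user", "content": "Hello!"}],
    # Ask Nexus to dequeue this request ahead of normal-priority traffic
    # and allow a higher-tier substitute if the exact tier is unavailable.
    # "1" is a placeholder; the expected flag value is not documented here.
    extra_headers={"X-Nexus-Priority": "high", "X-Nexus-Flexible": "1"},
)
print(resp.choices[0].message.content)
```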
## Structured 503 errors

When no backend can serve a request, Nexus returns HTTP 503 with actionable context instead of a generic error. This follows the project principle of honest failures over silent quality downgrades.
```json
{
  "error": {
    "message": "No backend available for model 'gpt-4' with required capabilities",
    "type": "service_unavailable",
    "code": "no_available_backend",
    "context": {
      "required_tier": 4,
      "available_backends": ["ollama-local"],
      "privacy_zone_required": "restricted",
      "eta_seconds": null
    }
  }
}
```

The context object provides enough information for clients to take corrective action — for example, relaxing privacy constraints or falling back to a different model.
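A client-side sketch that acts on the structured 503, using the field names from the example above (base URL assumed as before):

```python
import requests

r = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={"model": "gpt-4",
          "messages": [{"role": "user", "content": "Hello!"}]},
    timeout=60,
)

if r.status_code == 503:
    err = r.json()["error"]
    ctx = err.get("context", {})
    # Rejection headers and context explain why routing failed.
    print("Rejected:", r.headers.get("X-Nexus-Rejection-Reasons"))
    print("Required tier:", ctx.get("required_tier"))
    # Corrective action: retry against a backend the gateway reports as up.
    print("Backends that are up:", ctx.get("available_backends", []))
else:
    print(r.json()["choices"][0]["message"]["content"])
```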