REST API Reference

Nexus exposes an OpenAI-compatible API gateway that unifies local and cloud LLM backends behind a single endpoint. All responses follow the OpenAI format — Nexus-specific metadata is conveyed exclusively through X-Nexus-* headers.

For setup and configuration, see the Getting Started guide.

Quick Reference

| Method | Path | Description |
| --- | --- | --- |
| POST | /v1/chat/completions | Chat completion (streaming and non-streaming) |
| POST | /v1/embeddings | Generate text embeddings |
| GET | /v1/models | List available models from healthy backends |
| POST | /v1/models/load | Load model on specific backend (lifecycle API) |
| DELETE | /v1/models/{id} | Unload model from specific backend (lifecycle API) |
| POST | /v1/models/migrate | Migrate model between backends (lifecycle API) |
| GET | /v1/fleet/recommendations | Fleet intelligence recommendations (lifecycle API) |
| GET | /health | System health with backend/model counts |
| GET | /v1/stats | JSON stats: uptime, request counts, per-backend metrics |
| GET | /metrics | Prometheus text-format metrics |
| GET | / | Web dashboard (embedded, real-time via WebSocket) |

Endpoints

POST /v1/chat/completions

OpenAI-compatible chat completion endpoint. Supports both streaming and non-streaming responses.

Request:

{
  "model": "llama3:70b",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Hello!" }
  ],
  "stream": true,
  "temperature": 0.7,
  "max_tokens": 1000
}

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Yes | Model identifier (supports aliases) |
| messages | array | Yes | Conversation messages (system, user, assistant) |
| stream | boolean | No | Enable Server-Sent Events streaming (default: false) |
| temperature | number | No | Sampling temperature (0.0–2.0) |
| max_tokens | integer | No | Maximum tokens to generate |

Response (non-streaming):

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1700000000,
  "model": "llama3:70b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 20,
    "completion_tokens": 10,
    "total_tokens": 30
  }
}

Response (streaming):

When stream: true, the response uses Server-Sent Events (SSE). Each event is a data: line containing a JSON chunk, and the stream is terminated by a final data: [DONE] event:

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"delta":{"content":"Hello"},"index":0}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"delta":{"content":"!"},"index":0,"finish_reason":"stop"}]}

data: [DONE]
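
Because the endpoint is OpenAI-compatible, existing OpenAI client libraries can call it unchanged. A minimal Python sketch using the official openai SDK (v1.x), assuming Nexus listens on http://localhost:8000 (the port used in the /metrics example below) and that the API key is an arbitrary placeholder:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="nexus")

# Non-streaming: returns a single chat.completion object as shown above.
resp = client.chat.completions.create(
    model="llama3:70b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    temperature=0.7,
    max_tokens=1000,
)
print(resp.choices[0].message.content)

# Streaming: the SDK consumes the SSE chunks shown above until data: [DONE].
stream = client.chat.completions.create(
    model="llama3:70b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)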

POST /v1/embeddings

OpenAI-compatible embeddings endpoint. Generates vector representations of text input. Works with Ollama and OpenAI backends that support embedding models.

Request:

{
  "model": "nomic-embed-text",
  "input": "The quick brown fox jumps over the lazy dog"
}

The input field accepts a single string or an array of strings for batch embedding:

{
  "model": "nomic-embed-text",
  "input": ["First document", "Second document", "Third document"]
}

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Yes | Embedding model identifier |
| input | string or string[] | Yes | Text to embed: a single string or an array of strings |
| encoding_format | string | No | Encoding format (e.g., "float") |

Response:

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [0.0023, -0.0091, 0.0152, ...],
      "index": 0
    }
  ],
  "model": "nomic-embed-text",
  "usage": {
    "prompt_tokens": 10,
    "total_tokens": 10
  }
}

| Field | Type | Description |
| --- | --- | --- |
| object | string | Always "list" |
| data | array | Array of embedding objects |
| data[].object | string | Always "embedding" |
| data[].embedding | float[] | Vector representation of the input |
| data[].index | integer | Index corresponding to the input position |
| model | string | Model used to generate the embeddings |
| usage.prompt_tokens | integer | Number of tokens in the input |
| usage.total_tokens | integer | Total tokens processed |

Supported backends: Ollama (e.g., nomic-embed-text, all-minilm), OpenAI (e.g., text-embedding-3-small, text-embedding-ada-002).

Error responses:

  • 400 — Empty input or invalid request format
  • 404 — Model not found on any backend
  • 502 — Backend agent not registered or agent error
  • 503 — No healthy backend with embeddings support available
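
The same client works for embeddings. A sketch under the same localhost assumptions as above:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="nexus")

# Batch request: one embedding per input string, returned in input order.
resp = client.embeddings.create(
    model="nomic-embed-text",
    input=["First document", "Second document", "Third document"],
)
for item in resp.data:
    print(item.index, len(item.embedding))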

GET /v1/models

Lists all available models from healthy backends. Each entry corresponds to a specific model on a specific backend.

Response:

{
  "object": "list",
  "data": [
    {
      "id": "llama3:70b",
      "object": "model",
      "created": 1700000000,
      "owned_by": "backend-name"
    }
  ]
}
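
A sketch listing models through the SDK, under the same assumptions as above:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="nexus")

# One entry per (model, backend) pair; owned_by names the backend.
for model in client.models.list():
    print(model.id, "on", model.owned_by)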

GET /health

System health check with backend and model counts.

Response:

{
  "status": "healthy",
  "version": "0.4.0",
  "uptime_seconds": 3600,
  "backends": { "total": 3, "healthy": 2, "unhealthy": 1 },
  "models": { "total": 5 }
}
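
A minimal liveness probe, assuming Nexus on localhost:8000:

import requests

health = requests.get("http://localhost:8000/health", timeout=5).json()
print(health["status"],
      f'{health["backends"]["healthy"]}/{health["backends"]["total"]} backends healthy')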

GET /v1/stats

JSON stats endpoint for dashboards and debugging. Returns uptime, per-backend request counts, latency, and pending request depth.

Example response fields:

  • uptime_seconds — time since Nexus started
  • total_requests — aggregate request count
  • backends[] — per-backend stats including request count, average latency, and pending depth
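
A sketch that dumps these stats; the exact keys inside each backends[] entry are not specified here, so the example prints each entry whole:

import requests

stats = requests.get("http://localhost:8000/v1/stats", timeout=5).json()
print("uptime:", stats["uptime_seconds"], "total requests:", stats["total_requests"])
for backend in stats.get("backends", []):
    print(backend)  # per-backend request count, average latency, pending depth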

GET /metrics

Metrics in the Prometheus text exposition format. Configure your Prometheus scraper to target:

http://<nexus-host>:8000/metrics

Exported metrics include:

  • Request counters and duration histograms
  • Error rates
  • Backend latency
  • Token usage
  • Fleet state gauges
  • Reconciler pipeline timing

GET /

Embedded web dashboard (HTML/JS/CSS) with real-time monitoring via WebSocket. See the WebSocket Protocol documentation for details on the real-time update format.


Nexus-Transparent Protocol Headers

Nexus adds X-Nexus-* response headers to expose routing decisions without modifying the OpenAI-compatible JSON body. This keeps Nexus fully transparent to existing OpenAI client libraries.

Response Headers

| Header | Description | Example |
| --- | --- | --- |
| X-Nexus-Backend | Backend that handled the request | local-ollama |
| X-Nexus-Backend-Type | local or cloud | local |
| X-Nexus-Route-Reason | Why this backend was chosen | capability-match |
| X-Nexus-Cost-Estimated | Estimated cost in USD (cloud only) | 0.0023 |
| X-Nexus-Privacy-Zone | Privacy zone of the backend | restricted |
| X-Nexus-Fallback-Model | Model used if fallback occurred | gpt-3.5-turbo |
| X-Nexus-Rejection-Reasons | Why backends were excluded (on 503) | privacy_zone_mismatch |
| X-Nexus-Rejection-Details | Detailed rejection context (on 503) | JSON details |
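
Reading these headers requires access to the raw HTTP response. With the openai Python SDK that is the with_raw_response accessor; a sketch under the same localhost assumptions as above:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="nexus")

raw = client.chat.completions.with_raw_response.create(
    model="llama3:70b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(raw.headers.get("X-Nexus-Backend"))       # e.g. local-ollama
print(raw.headers.get("X-Nexus-Route-Reason"))  # e.g. capability-match
completion = raw.parse()  # the ordinary OpenAI-format body
print(completion.choices[0].message.content)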

Request Headers

| Header | Description |
| --- | --- |
| X-Nexus-Strict | Enforce same-or-higher capability tier (default behavior) |
| X-Nexus-Flexible | Allow higher-tier substitution when the exact tier is unavailable |
| X-Nexus-Priority | Queue priority: high or normal (default: normal). When all capable backends are at capacity and request queuing is enabled, high-priority requests are dequeued before normal-priority requests. Invalid values default to normal. |
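
Request headers can be attached per call with the SDK's extra_headers option. Note that the table above documents the X-Nexus-Flexible header but not its value format; treating it as a flag with value "true" is an assumption:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="nexus")

resp = client.chat.completions.create(
    model="llama3:70b",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_headers={
        "X-Nexus-Priority": "high",  # documented values: high | normal
        "X-Nexus-Flexible": "true",  # value format assumed, not documented above
    },
)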

Actionable Error Responses

When no backend can serve a request, Nexus returns HTTP 503 with actionable context instead of a generic error. This follows the project principle of honest failures over silent quality downgrades.

{
  "error": {
    "message": "No backend available for model 'gpt-4' with required capabilities",
    "type": "service_unavailable",
    "code": "no_available_backend",
    "context": {
      "required_tier": 4,
      "available_backends": ["ollama-local"],
      "privacy_zone_required": "restricted",
      "eta_seconds": null
    }
  }
}

The context object provides enough information for clients to take corrective action — for example, relaxing privacy constraints or falling back to a different model.
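
A sketch of that corrective-action loop on the client side, using the openai SDK's APIStatusError (which exposes the raw 503 body):

import openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="nexus")

try:
    client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": "Hello!"}],
    )
except openai.APIStatusError as e:
    if e.status_code != 503:
        raise
    ctx = e.response.json()["error"].get("context", {})
    # e.g. retry against a backend Nexus reports as available,
    # or relax the privacy constraint if policy allows.
    print("no backend available; candidates:", ctx.get("available_backends"))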