This guide walks you through installing Nexus, starting the server, and making your first LLM request — all in about five minutes.
Prerequisites depend on how you install:

- From source: Rust 1.87+ (with `cargo`)
- Docker: Docker installed and running
- Pre-built binary: No dependencies; just download and run
You'll also need at least one LLM backend running (e.g., Ollama on localhost:11434).
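If you don't have a backend yet, the quickest option is usually Ollama. A minimal sketch, assuming Ollama is already installed and using `llama3` purely as an example model:

```bash
# In one terminal: start the Ollama server (defaults to localhost:11434)
ollama serve

# In another terminal: pull a model so Nexus has something to route to
ollama pull llama3
```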
Choose one of the three installation methods:
From source:

```bash
# Clone and install
git clone https://github.com/leocamello/nexus.git
cd nexus
cargo install --path .

# Generate a default configuration file
nexus config init
```

With Docker:

```bash
# Run with default settings
docker run -d -p 8000:8000 leocamello/nexus

# Run with a custom config file
docker run -d -p 8000:8000 \
  -v $(pwd)/nexus.toml:/home/nexus/nexus.toml \
  leocamello/nexus serve --config nexus.toml

# Run with host network (required for mDNS auto-discovery)
docker run -d --network host leocamello/nexus
```

Pre-built binary: download the latest binary for your platform from GitHub Releases, extract it, and place it in your PATH.
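For example, on Linux (a sketch only; the archive name below is a placeholder, since the actual file name depends on the release and platform):

```bash
# Extract the downloaded release archive (placeholder file name)
tar -xzf nexus-x86_64-linux.tar.gz

# Move the extracted binary onto your PATH
sudo mv nexus /usr/local/bin/
```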
Once installed, start the server:

```bash
nexus serve
```

Nexus starts on http://localhost:8000 by default. If mDNS discovery is enabled (the default), it will automatically find backends like Ollama on your local network.
You can customize the host and port:

```bash
nexus serve --port 9000 --host 127.0.0.1
```

Or use a specific config file:

```bash
nexus serve --config nexus.toml
```

Check that the server is up:

```bash
curl http://localhost:8000/health
```

You should see a JSON response with system status, backend count, and model count.
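The exact shape is not shown here, but expect something along these lines (the field names `status`, `backends`, and `models` are illustrative, not guaranteed):

```json
{
  "status": "healthy",
  "backends": 2,
  "models": 5
}
```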
List the available models:

```bash
curl http://localhost:8000/v1/models
```

This returns all models discovered across your backends.
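If you have `jq` installed, you can pull out just the model names. This assumes the OpenAI-style list format, i.e. a `data` array of objects with an `id` field:

```bash
# Print one model ID per line (handy for picking a model for the requests below)
curl -s http://localhost:8000/v1/models | jq -r '.data[].id'
```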
Now make a chat completion request:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:70b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

For streaming responses, add `"stream": true`:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:70b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```

Tip: Replace `llama3:70b` with any model name returned by `/v1/models`.
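Because the endpoint is OpenAI-compatible, the streamed response arrives as Server-Sent Events. The chunks below are an abridged illustration of that general shape, not the exact payload Nexus emits:

```text
data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hel"}}]}

data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"lo!"}}]}

data: [DONE]
```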
Nexus ships with a full CLI for managing backends and models at runtime:
```bash
# List discovered backends
nexus backends list

# Add a backend manually (type is auto-detected)
nexus backends add http://localhost:11434 --name my-ollama --type ollama

# Remove a backend
nexus backends remove <ID>

# List available models (optionally filter by backend)
nexus models
nexus models --backend <ID>

# Show system health
nexus health

# Generate a config file
nexus config init --output nexus.toml

# JSON output for scripting
nexus backends list --json
nexus health --json
nexus models --json
```

Enable tab-completion for your shell:
```bash
# Bash
nexus completions bash > ~/.bash_completion.d/nexus

# Zsh
nexus completions zsh > ~/.zsh/completions/_nexus

# Fish
nexus completions fish > ~/.config/fish/completions/nexus.fish
```

Nexus works out of the box with zero configuration. For more control, create a `nexus.toml` file:

```bash
nexus config init
```

Here's a fully annotated example:
```toml
# nexus.toml

[server]
host = "0.0.0.0"
port = 8000

[discovery]
# Auto-discover backends on your network via mDNS
enabled = true

# --- Local Backends ---

[[backends]]
name = "local-ollama"
url = "http://localhost:11434"
type = "ollama"
priority = 1

[[backends]]
name = "gpu-server"
url = "http://192.168.1.100:8000"
type = "vllm"
priority = 2

# --- Cloud Backend (requires API key via env var) ---
# [[backends]]
# name = "openai-cloud"
# url = "https://api.openai.com"
# type = "openai"
# priority = 100
# api_key_env = "OPENAI_API_KEY"  # Reads the key from this env var
# zone = "open"                   # Privacy zone: open | internal | confidential | restricted
# tier = 3                        # Capability tier: 1 (fast) | 2 (standard) | 3 (premium)

# --- Routing ---
[routing]
strategy = "smart"  # smart | round_robin | priority_only | random

# Model aliases: map common names to your local models
[routing.aliases]
"gpt-4" = "llama3:70b"
"gpt-3.5-turbo" = "mistral:7b"

# Fallback chains: try alternatives if the primary model is unavailable
[routing.fallbacks]
"llama3:70b" = ["qwen2:72b", "mixtral:8x7b"]

# Budget: monthly spending limits for cloud backends (optional)
# [routing.budget]
# monthly_limit_usd = 50.0
# soft_limit_percent = 75            # At 75%: prefer local backends
# hard_limit_action = "block_cloud"  # At 100%: warn | block_cloud | block_all

# --- Quality Tracking ---
# [quality]
# enabled = true
# window_1h = 3600            # Rolling window for 1-hour metrics (seconds)
# window_24h = 86400          # Rolling window for 24-hour metrics (seconds)
# update_interval = 30        # How often to recompute quality scores (seconds)
# error_rate_threshold = 0.5  # Exclude backends with error rate above this

# --- Request Queuing ---
# [queue]
# enabled = true
# max_queue_size = 100          # Maximum queued requests before rejecting
# default_timeout_seconds = 30  # How long to wait before 503
# priority_header = "X-Nexus-Priority"  # Header for request priority (high | normal)
```
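With the alias config above in place, a request for `gpt-4` should be served by the mapped local model. A quick way to see this, using nothing beyond the endpoint already shown:

```bash
# "gpt-4" is an alias, so Nexus routes this to llama3:70b per [routing.aliases]
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Which model are you?"}]
  }'
```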
You can also configure Nexus via environment variables:

| Variable | Description | Default |
|---|---|---|
| `NEXUS_CONFIG` | Config file path | `nexus.toml` |
| `NEXUS_PORT` | Listen port | `8000` |
| `NEXUS_HOST` | Listen address | `0.0.0.0` |
| `NEXUS_LOG_LEVEL` | Log level (trace/debug/info/warn/error) | `info` |
| `NEXUS_LOG_FORMAT` | Log format (pretty/json) | `pretty` |
| `NEXUS_DISCOVERY` | Enable mDNS discovery | `true` |
| `NEXUS_HEALTH_CHECK` | Enable health checking | `true` |
Precedence: CLI args > Environment variables > Config file > Defaults
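For example, the variables in the table can be set inline when launching the server (the values here are arbitrary examples):

```bash
# Override the port and switch to JSON logs for this run only
NEXUS_PORT=9000 NEXUS_LOG_FORMAT=json nexus serve
```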
Nexus includes a built-in web dashboard for real-time monitoring. Open your browser to:
http://localhost:8000/
Features:
- Real-time backend health monitoring with status indicators
- Model availability matrix across backends
- Request history with durations and error details
- WebSocket-based live updates (with HTTP polling fallback)
- Dark mode support (follows system preference)
- Fully responsive — works on desktop, tablet, and mobile
- Works without JavaScript (graceful degradation with auto-refresh)
Nexus is OpenAI-compatible, so it works with any tool that speaks the OpenAI API.
Point your AI coding assistant's API endpoint to:
http://localhost:8000
Or use the OpenAI Python client directly:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="llama3:70b",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
```

Nexus supports the OpenAI-compatible embeddings endpoint for turning text into vector representations. This works with Ollama and OpenAI backends that support embedding models.
Single input:

```bash
curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed-text",
    "input": "Nexus is a distributed LLM orchestrator"
  }'
```

Batch input:
```bash
curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed-text",
    "input": [
      "First document to embed",
      "Second document to embed",
      "Third document to embed"
    ]
  }'
```

The response follows the OpenAI format, an array of embedding vectors with token usage:
```json
{
  "object": "list",
  "data": [
    { "object": "embedding", "embedding": [0.1, 0.2, ...], "index": 0 }
  ],
  "model": "nomic-embed-text",
  "usage": { "prompt_tokens": 8, "total_tokens": 8 }
}
```

Tip: Use `nomic-embed-text` with Ollama or `text-embedding-3-small` with OpenAI backends. Any model listed by `/v1/models` on an embeddings-capable backend will work.
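The same call works from the OpenAI Python client shown earlier. A brief sketch, reusing the `client` configured above and assuming `nomic-embed-text` is available on one of your backends:

```python
# Embed a batch of documents through Nexus
result = client.embeddings.create(
    model="nomic-embed-text",
    input=["First document to embed", "Second document to embed"]
)
print(len(result.data), "vectors,", len(result.data[0].embedding), "dimensions each")
```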
Nexus exposes metrics for monitoring and debugging:
```bash
# Prometheus metrics (for Grafana, Prometheus, etc.)
curl http://localhost:8000/metrics

# JSON stats (uptime, per-backend request counts, latency)
curl http://localhost:8000/v1/stats | jq
```

Configure your Prometheus scraper to target `http://<nexus-host>:8000/metrics` for request counters, duration histograms, error rates, backend latency, and token usage gauges.
- API Reference — Full endpoint documentation
- Architecture — System internals, module structure, and data flows
- Roadmap — Feature index (F01–F23) and version history
- Example Config — Full annotated configuration file