Status: Draft
Priority: P1
Branch: feature/f06-intelligent-router
Dependencies: F02 (Backend Registry), F03 (Health Checker)
An intelligent request routing system that selects the best backend for each request based on model requirements, backend capabilities, and current system state.
- Route requests to backends that can fulfill model and capability requirements
- Balance load across backends using configurable scoring
- Support model aliases for transparent model substitution
- Provide fallback chains for resilience
- Make routing decisions in < 1ms with no external calls
- GPU/resource scheduling (backends manage their own resources)
- Request queuing (requests are routed immediately or rejected)
- Model downloading or management
- Load prediction or auto-scaling
As a developer using an OpenAI client
I want requests to be routed to a backend that has my requested model
So that I can use any model available in my cluster without knowing which backend hosts it
Priority: P0 (Core functionality)
Acceptance Scenarios:
- Given backends A (llama3:8b) and B (mistral:7b) are healthy
  When I request model "llama3:8b"
  Then the request is routed to backend A
- Given no backend has model "gpt-5"
  When I request model "gpt-5"
  Then I receive a 404 error with message "Model 'gpt-5' not found"
As a developer sending multimodal requests
I want requests to be routed only to backends that support the required capabilities
So that my vision/tool requests don't fail due to capability mismatch
Priority: P0 (Core functionality)
Acceptance Scenarios:
- Given backend A has llama3 (no vision) and backend B has llava (vision)
  When I send a request with image_url in messages
  Then the request is routed to backend B
- Given backend A has llama3 (no tools) and backend B has llama3 (tools)
  When I send a request with a tools array
  Then the request is routed to backend B
- Given no backend supports vision for model "llama3:8b"
  When I send a vision request for "llama3:8b"
  Then I receive a 400 error explaining the capability mismatch
As a system administrator
I want requests distributed based on backend load and latency
So that no single backend becomes overwhelmed
Priority: P0 (Core functionality)
Acceptance Scenarios:
- Given backends A (10 pending requests) and B (2 pending requests) both have llama3
  When I request model "llama3:8b"
  Then the request is more likely to route to backend B
- Given backends A (50ms avg latency) and B (200ms avg latency)
  When I request a model both support
  Then backend A receives the higher score
As a developer migrating from OpenAI
I want to use familiar model names like "gpt-4" that map to local models
So that I don't need to change my client code
Priority: P1 (Enhanced functionality)
Acceptance Scenarios:
- Given alias "gpt-4" → "llama3:70b" is configured
  When I request model "gpt-4"
  Then the request is routed to a backend with "llama3:70b"
- Given alias "gpt-4" → "llama3:70b" but no backend has llama3:70b
  When I request model "gpt-4"
  Then I receive a 404 error mentioning both the alias and target model
As a system administrator
I want to configure fallback models when primary models are unavailable
So that requests succeed even when preferred backends are down
Priority: P1 (Enhanced functionality)
Acceptance Scenarios:
- Given fallback chain "claude-3-opus" → ["llama3:70b", "mistral:7b"]
  And no backend has claude-3-opus or llama3:70b
  When I request model "claude-3-opus"
  Then the request is routed to a backend with "mistral:7b"
- Given all models in the fallback chain are unavailable
  When I request the primary model
  Then I receive a 503 error listing the attempted models
As a system administrator
I want to choose different routing strategies for different use cases
So that I can optimize for my specific workload
Priority: P1 (Enhanced functionality)
Acceptance Scenarios:
- Given strategy is "round_robin" with 3 healthy backends
  When I send 6 requests
  Then each backend receives exactly 2 requests
- Given strategy is "priority_only" with backends at priority 1 and 2
  When I send requests
  Then all requests go to the priority 1 backend
- Given strategy is "random"
  When I send 100 requests to 3 backends
  Then the distribution is approximately even (each backend gets 25-45 requests)
Requirements are extracted from the incoming ChatCompletionRequest:
pub struct RequestRequirements {
/// Model name from request
pub model: String,
/// Estimated token count (characters / 4)
pub estimated_tokens: u32,
/// Request contains image_url in messages
pub needs_vision: bool,
/// Request has tools array
pub needs_tools: bool,
/// Request needs JSON mode (response_format.type == "json_object")
pub needs_json_mode: bool,
}
impl RequestRequirements {
pub fn from_request(request: &ChatCompletionRequest) -> Self;
}

Detection Logic:
| Requirement | Detection Method |
|---|---|
| Vision | Any messages[*].content[*].type == "image_url" |
| Tools | tools array present and non-empty |
| JSON Mode | response_format.type == "json_object" |
| Token Estimate | sum(len(m.content) for m in messages) / 4 where content is the text string (for multipart content, only text parts are counted) |
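The detection rules above can be sketched as follows; `Message` and `ContentPart` are simplified stand-ins for the real ChatCompletionRequest message types, not the actual API structs:

```rust
// Simplified stand-ins for the request's message types (hypothetical shapes).
pub struct ContentPart {
    pub part_type: String, // "text" or "image_url"
    pub text: String,
}

pub struct Message {
    pub content: Vec<ContentPart>,
}

/// Estimate tokens as total text characters / 4, counting only text parts.
pub fn estimate_tokens(messages: &[Message]) -> u32 {
    let chars: usize = messages
        .iter()
        .flat_map(|m| m.content.iter())
        .filter(|p| p.part_type == "text")
        .map(|p| p.text.len())
        .sum();
    (chars / 4) as u32
}

/// Vision is needed if any content part is an image_url.
pub fn needs_vision(messages: &[Message]) -> bool {
    messages
        .iter()
        .flat_map(|m| m.content.iter())
        .any(|p| p.part_type == "image_url")
}
```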
┌──────────────────────────────────────────────────────────────┐
│ select_backend(request) │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ 1. Extract requirements from request │
│ - model_name, estimated_tokens │
│ - needs_vision, needs_tools, needs_json_mode │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ 2. Get candidate backends for model │
│ registry.get_backends_for_model(model_name) │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ 3. Filter by health status (Healthy only) │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ 4. Filter by capabilities │
│ - context_length >= estimated_tokens │
│ - supports_vision if needs_vision │
│ - supports_tools if needs_tools │
└──────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────┐
│ Any candidates? │
└─────────────────┘
│ No │ Yes
▼ │
┌────────────────────────────┐ │
│ 5a. Try alias resolution │ │
│ If alias exists, retry │ │
│ with aliased model │ │
└────────────────────────────┘ │
│ No alias │
▼ │
┌────────────────────────────┐ │
│ 5b. Try fallback chain │ │
│ For each fallback: │ │
│ retry with that model │ │
└────────────────────────────┘ │
│ No fallback │
▼ │
┌────────────────────────────┐ │
│ Return NoBackendAvailable │ │
│ error with details │ │
└────────────────────────────┘ │
│
▼
┌──────────────────────────────────────────────────────────────┐
│ 6. Apply routing strategy │
│ - smart: score and select best │
│ - round_robin: next in rotation │
│ - priority_only: lowest priority number │
│ - random: random selection │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ Return selected backend │
└──────────────────────────────────────────────────────────────┘
pub struct ScoringWeights {
pub priority: u32, // Default: 50
pub load: u32, // Default: 30
pub latency: u32, // Default: 20
}
impl Default for ScoringWeights {
fn default() -> Self {
Self { priority: 50, load: 30, latency: 20 }
}
}
pub fn score(backend: &Backend, weights: &ScoringWeights) -> u32 {
let priority_score = 100 - backend.priority.min(100);
let load_score = 100 - backend.pending_requests().min(100);
let latency_score = 100 - (backend.avg_latency_ms() / 10).min(100);
(priority_score * weights.priority
+ load_score * weights.load
+ latency_score * weights.latency) / 100
}

Score Components:

| Component | Calculation | Range | Weight |
|---|---|---|---|
| Priority | 100 - min(priority, 100) | 0-100 | 50% |
| Load | 100 - min(pending_requests, 100) | 0-100 | 30% |
| Latency | 100 - min(avg_latency_ms / 10, 100) | 0-100 | 20% |
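As a worked example of the formula (plain integers stand in for the real `Backend` accessors): a backend with priority 10, 5 pending requests, and 80ms average latency scores (90·50 + 95·30 + 92·20) / 100 = 91 under default weights.

```rust
pub struct ScoringWeights {
    pub priority: u32,
    pub load: u32,
    pub latency: u32,
}

/// Same formula as the spec's score(); inputs over 100 (or 1000ms of
/// latency) are clamped so each component stays in 0-100.
pub fn score(priority: u32, pending_requests: u32, avg_latency_ms: u32, w: &ScoringWeights) -> u32 {
    let priority_score = 100 - priority.min(100);
    let load_score = 100 - pending_requests.min(100);
    let latency_score = 100 - (avg_latency_ms / 10).min(100);
    (priority_score * w.priority + load_score * w.load + latency_score * w.latency) / 100
}
```

Note that a backend with priority above 100 contributes a priority component of 0 rather than underflowing, per the edge-case rules below.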
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum RoutingStrategy {
#[default]
Smart,
RoundRobin,
PriorityOnly,
Random,
}

| Strategy | Selection Logic | Use Case |
|---|---|---|
| Smart | Score by priority + load + latency, select highest | Default, balanced |
| RoundRobin | Rotate through candidates in order | Even distribution |
| PriorityOnly | Always select lowest priority number | Dedicated primary |
| Random | Random selection from candidates | Testing, chaos |
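The RoundRobin strategy can be sketched with the atomic counter held by the Router: the counter indexes into the candidate list, so concurrent callers rotate through backends without locks. This is a sketch under those assumptions, not the shipped implementation:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Pick the next candidate in rotation. The shared counter is incremented
/// atomically, so concurrent callers each get a distinct slot.
pub fn round_robin<'a, T>(counter: &AtomicU64, candidates: &'a [T]) -> Option<&'a T> {
    if candidates.is_empty() {
        return None;
    }
    let n = counter.fetch_add(1, Ordering::Relaxed) as usize;
    candidates.get(n % candidates.len())
}
```

With three healthy candidates, six successive calls visit each backend exactly twice, matching the round_robin acceptance scenario.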
#[derive(Debug, thiserror::Error)]
pub enum RoutingError {
#[error("Model '{model}' not found")]
ModelNotFound { model: String },
#[error("No healthy backend available for model '{model}'")]
NoHealthyBackend { model: String },
#[error("No backend supports required capabilities for model '{model}': {missing:?}")]
CapabilityMismatch { model: String, missing: Vec<String> },
#[error("All backends in fallback chain unavailable: {chain:?}")]
FallbackChainExhausted { chain: Vec<String> },
}

pub struct Router {
/// Reference to backend registry
registry: Arc<Registry>,
/// Routing strategy
strategy: RoutingStrategy,
/// Scoring weights for smart strategy
weights: ScoringWeights,
/// Model aliases (alias → target)
aliases: HashMap<String, String>,
/// Fallback chains (model → [fallback1, fallback2, ...])
fallbacks: HashMap<String, Vec<String>>,
/// Round-robin counter (atomic for thread safety)
round_robin_counter: AtomicU64,
}

[routing]
# Routing strategy: smart, round_robin, priority_only, random
strategy = "smart"
# Maximum retry attempts on backend failure
max_retries = 2
[routing.weights]
# Scoring weights for smart strategy (must sum to 100)
priority = 50
load = 30
latency = 20
[routing.aliases]
# Model aliases for OpenAI compatibility
"gpt-4" = "llama3:70b"
"gpt-4-turbo" = "llama3:70b"
"gpt-3.5-turbo" = "llama3:8b"
"claude-3-opus" = "llama3:70b"
"claude-3-sonnet" = "mistral:7b"
[routing.fallbacks]
# Fallback chains when primary model unavailable
"llama3:70b" = ["llama3:8b", "mistral:7b"]
"claude-3-opus" = ["llama3:70b", "mistral:7b"]

Environment Variable Overrides:

| Config | Environment Variable | Example |
|---|---|---|
| routing.strategy | NEXUS_ROUTING_STRATEGY | round_robin |
| routing.max_retries | NEXUS_ROUTING_MAX_RETRIES | 3 |
The router integrates with the existing API layer:
// In POST /v1/chat/completions handler
async fn chat_completions(
State(state): State<AppState>,
Json(request): Json<ChatCompletionRequest>,
) -> Result<Response, ApiError> {
// Extract requirements
let requirements = RequestRequirements::from_request(&request);
// Select backend
let backend = state.router.select_backend(&requirements)?;
// Proxy request to backend
proxy_request(&backend, request).await
}

| Metric | Target | Maximum |
|---|---|---|
| Routing decision time | < 1ms | 2ms |
| Memory per alias | 100 bytes | 500 bytes |
| Memory per fallback chain | 200 bytes | 1KB |
- Routing decisions must be thread-safe
- Multiple concurrent routing decisions allowed
- Round-robin counter uses atomic operations
- No locks during candidate scoring
- No external calls during routing (use cached registry data)
- Graceful degradation when all backends unhealthy
- Clear error messages for debugging
| Condition | Behavior |
|---|---|
| No backends registered | Return ModelNotFound error |
| All backends unhealthy | Return NoHealthyBackend error |
| Empty model name in request | Return 400 Bad Request |
| Unknown routing strategy | Use Smart as default |
| Condition | Behavior |
|---|---|
| Circular alias (a→b→a) | Detect and return error (aliases are single-level) |
| Alias points to unavailable model | Try fallback chain for aliased model |
| Empty fallback chain | Treat as no fallback configured |
| Fallback model also has fallbacks | Do not chain fallbacks (single level only) |
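Taken together, the alias and fallback rules imply a fixed, single-level resolution order. The hypothetical helper below (not part of the Router API) sketches it: the alias is resolved at most once, and only the resolved model's own fallback chain is consulted:

```rust
use std::collections::HashMap;

/// Order in which models are attempted for a requested name:
/// alias target (or the name itself), then that model's fallbacks.
/// Aliases never chain, and fallbacks of fallbacks are ignored.
pub fn resolution_order(
    model: &str,
    aliases: &HashMap<String, String>,
    fallbacks: &HashMap<String, Vec<String>>,
) -> Vec<String> {
    // Resolve the alias exactly once; a second lookup would mean chaining.
    let primary = aliases
        .get(model)
        .cloned()
        .unwrap_or_else(|| model.to_string());
    let mut order = vec![primary.clone()];
    // Append only the primary model's own fallback chain (single level).
    if let Some(chain) = fallbacks.get(&primary) {
        order.extend(chain.iter().cloned());
    }
    order
}
```

For the example config above, requesting "gpt-4" would attempt llama3:70b, then llama3:8b, then mistral:7b before failing with FallbackChainExhausted.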
| Condition | Behavior |
|---|---|
| Vision request, no vision backends | Return CapabilityMismatch with "vision" |
| Tools request, no tools backends | Return CapabilityMismatch with "tools" |
| Context too long for all backends | Return CapabilityMismatch with "context_length" |
| Multiple missing capabilities | List all in error response |
| Condition | Behavior |
|---|---|
| All backends same score | Return first candidate |
| Backend with priority > 100 | Clamp to 100 in score calculation |
| No latency data yet | Use 0ms (best possible score) |
| Pending requests > 100 | Clamp to 100 in score calculation |
- Requirements extraction from various request types
- Scoring function with different weights
- Each routing strategy in isolation
- Alias resolution (including circular detection)
- Fallback chain traversal
- Capability matching logic
- Score function always returns value in valid range
- Round-robin distributes evenly over N iterations
- Smart strategy always selects highest-scoring backend
- Alias resolution terminates (no infinite loops)
- End-to-end routing through API
- Routing with live registry updates
- Fallback behavior when backends go down
- Concurrent routing decisions
- Routing decision < 1ms with 100 backends
- Routing decision < 1ms with 1000 models
- No degradation under concurrent load
- src/registry/mod.rs - Backend and model data
- src/api/types.rs - ChatCompletionRequest type
- src/config.rs - RoutingConfig
- None new (uses existing: thiserror, tracing)
src/
├── routing/
│ ├── mod.rs # Router struct and main logic
│ ├── requirements.rs # RequestRequirements extraction
│ ├── scoring.rs # Scoring function and weights
│ ├── strategies.rs # RoutingStrategy implementations
│ └── error.rs # RoutingError types
└── config.rs # Add RoutingConfig
- AC-01: Routes to backend with exact model match
- AC-02: Filters candidates by health status (Healthy only)
- AC-03: Filters by vision capability when request has images
- AC-04: Filters by tools capability when request has tools
- AC-05: Filters by context length (estimated tokens vs model limit)
- AC-06: Scores backends using priority, load, latency
- AC-07: Resolves model aliases transparently
- AC-08: Traverses fallback chain when model unavailable
- AC-09: Detects and prevents circular aliases
- AC-10: Returns descriptive errors for all failure cases
- AC-11: Smart strategy selects highest-scoring backend
- AC-12: Round-robin distributes evenly
- AC-13: Priority-only always selects lowest priority number
- AC-14: Random strategy provides approximate even distribution
- AC-15: Routing decision completes in < 1ms
- AC-16: Thread-safe concurrent routing decisions