Status: Draft
Priority: P1
Branch: feature/f06-intelligent-router
Dependencies: F02 (Backend Registry), F03 (Health Checker)
An intelligent request routing system that selects the best backend for each request based on model requirements, backend capabilities, and current system state.
- Route requests to backends that can fulfill model and capability requirements
- Balance load across backends using configurable scoring
- Support model aliases for transparent model substitution
- Provide fallback chains for resilience
- Make routing decisions in < 1ms with no external calls
- GPU/resource scheduling (backends manage their own resources)
- Request queuing (requests are routed immediately or rejected)
- Model downloading or management
- Load prediction or auto-scaling
As a developer using an OpenAI client
I want requests to be routed to a backend that has my requested model
So that I can use any model available in my cluster without knowing which backend hosts it
Priority: P0 (Core functionality)
Acceptance Scenarios:
- Given backends A (llama3:8b) and B (mistral:7b) are healthy
  When I request model "llama3:8b"
  Then the request is routed to backend A
- Given no backend has model "gpt-5"
  When I request model "gpt-5"
  Then I receive a 404 error with message "Model 'gpt-5' not found"
As a developer sending multimodal requests
I want requests to be routed only to backends that support the required capabilities
So that my vision/tool requests don't fail due to capability mismatch
Priority: P0 (Core functionality)
Acceptance Scenarios:
- Given backend A has llama3 (no vision) and backend B has llava (vision)
  When I send a request with image_url in messages
  Then the request is routed to backend B
- Given backend A has llama3 (no tools) and backend B has llama3 (tools)
  When I send a request with a tools array
  Then the request is routed to backend B
- Given no backend supports vision for model "llama3:8b"
  When I send a vision request for "llama3:8b"
  Then I receive a 400 error explaining the capability mismatch
As a system administrator
I want requests distributed based on backend load and latency
So that no single backend becomes overwhelmed
Priority: P0 (Core functionality)
Acceptance Scenarios:
- Given backends A (10 pending requests) and B (2 pending requests) both have llama3
  When I request model "llama3:8b"
  Then the request is more likely to route to backend B
- Given backends A (50ms avg latency) and B (200ms avg latency)
  When I request a model both support
  Then backend A receives the higher score
As a developer migrating from OpenAI
I want to use familiar model names like "gpt-4" that map to local models
So that I don't need to change my client code
Priority: P1 (Enhanced functionality)
Acceptance Scenarios:
- Given alias "gpt-4" → "llama3:70b" is configured
  When I request model "gpt-4"
  Then the request is routed to a backend with "llama3:70b"
- Given alias "gpt-4" → "llama3:70b" but no backend has llama3:70b
  When I request model "gpt-4"
  Then I receive a 404 error mentioning both the alias and target model
As a system administrator
I want to configure fallback models when primary models are unavailable
So that requests succeed even when preferred backends are down
Priority: P1 (Enhanced functionality)
Acceptance Scenarios:
- Given fallback chain "claude-3-opus" → ["llama3:70b", "mistral:7b"]
  And no backend has claude-3-opus or llama3:70b
  When I request model "claude-3-opus"
  Then the request is routed to a backend with "mistral:7b"
- Given all models in the fallback chain are unavailable
  When I request the primary model
  Then I receive a 503 error listing the attempted models
As a system administrator
I want to choose different routing strategies for different use cases
So that I can optimize for my specific workload
Priority: P1 (Enhanced functionality)
Acceptance Scenarios:
- Given strategy is "round_robin" with 3 healthy backends
  When I send 6 requests
  Then each backend receives exactly 2 requests
- Given strategy is "priority_only" with backends at priority 1 and 2
  When I send requests
  Then all requests go to the priority 1 backend
- Given strategy is "random"
  When I send 100 requests to 3 backends
  Then the distribution is approximately even (each backend gets 25-45 requests)
Requirements are extracted from the incoming ChatCompletionRequest:
pub struct RequestRequirements {
/// Model name from request
pub model: String,
/// Estimated token count (characters / 4)
pub estimated_tokens: u32,
/// Request contains image_url in messages
pub needs_vision: bool,
/// Request has tools array
pub needs_tools: bool,
/// Request needs JSON mode (response_format.type == "json_object")
pub needs_json_mode: bool,
}
impl RequestRequirements {
pub fn from_request(request: &ChatCompletionRequest) -> Self;
}

Detection Logic:
| Requirement | Detection Method |
|---|---|
| Vision | Any messages[*].content[*].type == "image_url" |
| Tools | tools array present and non-empty |
| JSON Mode | response_format.type == "json_object" |
| Token Estimate | sum(len(m.content) for m in messages) / 4 where content is the text string (for multipart content, only text parts are counted) |
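The detection rules above can be sketched as follows; `Message` and `ContentPart` are simplified stand-ins for the real ChatCompletionRequest message types, not the actual API structs:

```rust
// Simplified stand-ins for the request's message types (hypothetical shapes).
pub struct ContentPart {
    pub part_type: String, // "text" or "image_url"
    pub text: String,
}

pub struct Message {
    pub content: Vec<ContentPart>,
}

/// Estimate tokens as total text characters / 4, counting only text parts.
pub fn estimate_tokens(messages: &[Message]) -> u32 {
    let chars: usize = messages
        .iter()
        .flat_map(|m| m.content.iter())
        .filter(|p| p.part_type == "text")
        .map(|p| p.text.len())
        .sum();
    (chars / 4) as u32
}

/// Vision is needed if any content part is an image_url.
pub fn needs_vision(messages: &[Message]) -> bool {
    messages
        .iter()
        .flat_map(|m| m.content.iter())
        .any(|p| p.part_type == "image_url")
}
```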
┌──────────────────────────────────────────────────────────────┐
│ select_backend(request) │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ 1. Extract requirements from request │
│ - model_name, estimated_tokens │
│ - needs_vision, needs_tools, needs_json_mode │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ 2. Get candidate backends for model │
│ registry.get_backends_for_model(model_name) │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ 3. Filter by health status (Healthy only) │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ 4. Filter by capabilities │
│ - context_length >= estimated_tokens │
│ - supports_vision if needs_vision │
│ - supports_tools if needs_tools │
└──────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────┐
│ Any candidates? │
└─────────────────┘
│ No │ Yes
▼ │
┌────────────────────────────┐ │
│ 5a. Try alias resolution │ │
│ If alias exists, retry │ │
│ with aliased model │ │
└────────────────────────────┘ │
│ No alias │
▼ │
┌────────────────────────────┐ │
│ 5b. Try fallback chain │ │
│ For each fallback: │ │
│ retry with that model │ │
└────────────────────────────┘ │
│ No fallback │
▼ │
┌────────────────────────────┐ │
│ Return NoBackendAvailable │ │
│ error with details │ │
└────────────────────────────┘ │
│
▼
┌──────────────────────────────────────────────────────────────┐
│ 6. Apply routing strategy │
│ - smart: score and select best │
│ - round_robin: next in rotation │
│ - priority_only: lowest priority number │
│ - random: random selection │
└──────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ Return selected backend │
└──────────────────────────────────────────────────────────────┘
pub struct ScoringWeights {
pub priority: u32, // Default: 50
pub load: u32, // Default: 30
pub latency: u32, // Default: 20
}
impl Default for ScoringWeights {
fn default() -> Self {
Self { priority: 50, load: 30, latency: 20 }
}
}
pub fn score(backend: &Backend, weights: &ScoringWeights) -> u32 {
let priority_score = 100 - backend.priority.min(100);
let load_score = 100 - backend.pending_requests().min(100);
let latency_score = 100 - (backend.avg_latency_ms() / 10).min(100);
(priority_score * weights.priority
+ load_score * weights.load
+ latency_score * weights.latency) / 100
}

Score Components:

| Component | Calculation | Range | Weight |
|---|---|---|---|
| Priority | 100 - min(priority, 100) | 0-100 | 50% |
| Load | 100 - min(pending_requests, 100) | 0-100 | 30% |
| Latency | 100 - min(avg_latency_ms / 10, 100) | 0-100 | 20% |
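As a worked example of the formula (plain integers stand in for the real `Backend` accessors): a backend with priority 10, 5 pending requests, and 80ms average latency scores (90·50 + 95·30 + 92·20) / 100 = 91 under default weights.

```rust
pub struct ScoringWeights {
    pub priority: u32,
    pub load: u32,
    pub latency: u32,
}

/// Same formula as the spec's score(); inputs over 100 (or 1000ms of
/// latency) are clamped so each component stays in 0-100.
pub fn score(priority: u32, pending_requests: u32, avg_latency_ms: u32, w: &ScoringWeights) -> u32 {
    let priority_score = 100 - priority.min(100);
    let load_score = 100 - pending_requests.min(100);
    let latency_score = 100 - (avg_latency_ms / 10).min(100);
    (priority_score * w.priority + load_score * w.load + latency_score * w.latency) / 100
}
```

Note that a backend with priority above 100 contributes a priority component of 0 rather than underflowing, per the edge-case rules below.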
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum RoutingStrategy {
#[default]
Smart,
RoundRobin,
PriorityOnly,
Random,
}

| Strategy | Selection Logic | Use Case |
|---|---|---|
| Smart | Score by priority + load + latency, select highest | Default, balanced |
| RoundRobin | Rotate through candidates in order | Even distribution |
| PriorityOnly | Always select lowest priority number | Dedicated primary |
| Random | Random selection from candidates | Testing, chaos |
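The RoundRobin strategy can be sketched with the atomic counter held by the Router: the counter indexes into the candidate list, so concurrent callers rotate through backends without locks. This is a sketch under those assumptions, not the shipped implementation:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Pick the next candidate in rotation. The shared counter is incremented
/// atomically, so concurrent callers each get a distinct slot.
pub fn round_robin<'a, T>(counter: &AtomicU64, candidates: &'a [T]) -> Option<&'a T> {
    if candidates.is_empty() {
        return None;
    }
    let n = counter.fetch_add(1, Ordering::Relaxed) as usize;
    candidates.get(n % candidates.len())
}
```

With three healthy candidates, six successive calls visit each backend exactly twice, matching the round_robin acceptance scenario.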
#[derive(Debug, thiserror::Error)]
pub enum RoutingError {
#[error("Model '{model}' not found")]
ModelNotFound { model: String },
#[error("No healthy backend available for model '{model}'")]
NoHealthyBackend { model: String },
#[error("No backend supports required capabilities for model '{model}': {missing:?}")]
CapabilityMismatch { model: String, missing: Vec<String> },
#[error("All backends in fallback chain unavailable: {chain:?}")]
FallbackChainExhausted { chain: Vec<String> },
}

pub struct Router {
/// Reference to backend registry
registry: Arc<Registry>,
/// Routing strategy
strategy: RoutingStrategy,
/// Scoring weights for smart strategy
weights: ScoringWeights,
/// Model aliases (alias → target)
aliases: HashMap<String, String>,
/// Fallback chains (model → [fallback1, fallback2, ...])
fallbacks: HashMap<String, Vec<String>>,
/// Round-robin counter (atomic for thread safety)
round_robin_counter: AtomicU64,
}

[routing]
# Routing strategy: smart, round_robin, priority_only, random
strategy = "smart"
# Maximum retry attempts on backend failure
max_retries = 2
[routing.weights]
# Scoring weights for smart strategy (must sum to 100)
priority = 50
load = 30
latency = 20
[routing.aliases]
# Model aliases for OpenAI compatibility
"gpt-4" = "llama3:70b"
"gpt-4-turbo" = "llama3:70b"
"gpt-3.5-turbo" = "llama3:8b"
"claude-3-opus" = "llama3:70b"
"claude-3-sonnet" = "mistral:7b"
[routing.fallbacks]
# Fallback chains when primary model unavailable
"llama3:70b" = ["llama3:8b", "mistral:7b"]
"claude-3-opus" = ["llama3:70b", "mistral:7b"]

Environment Variable Overrides:

| Config | Environment Variable | Example |
|---|---|---|
| routing.strategy | NEXUS_ROUTING_STRATEGY | round_robin |
| routing.max_retries | NEXUS_ROUTING_MAX_RETRIES | 3 |
The router integrates with the existing API layer:
// In POST /v1/chat/completions handler
async fn chat_completions(
State(state): State<AppState>,
Json(request): Json<ChatCompletionRequest>,
) -> Result<Response, ApiError> {
// Extract requirements
let requirements = RequestRequirements::from_request(&request);
// Select backend
let backend = state.router.select_backend(&requirements)?;
// Proxy request to backend
proxy_request(&backend, request).await
}

| Metric | Target | Maximum |
|---|---|---|
| Routing decision time | < 1ms | 2ms |
| Memory per alias | 100 bytes | 500 bytes |
| Memory per fallback chain | 200 bytes | 1KB |
- Routing decisions must be thread-safe
- Multiple concurrent routing decisions allowed
- Round-robin counter uses atomic operations
- No locks during candidate scoring
- No external calls during routing (use cached registry data)
- Graceful degradation when all backends unhealthy
- Clear error messages for debugging
| Condition | Behavior |
|---|---|
| No backends registered | Return ModelNotFound error |
| All backends unhealthy | Return NoHealthyBackend error |
| Empty model name in request | Return 400 Bad Request |
| Unknown routing strategy | Use Smart as default |
| Condition | Behavior |
|---|---|
| Circular alias (a→b→a) | Detect and return error (aliases are single-level) |
| Alias points to unavailable model | Try fallback chain for aliased model |
| Empty fallback chain | Treat as no fallback configured |
| Fallback model also has fallbacks | Do not chain fallbacks (single level only) |
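Taken together, the alias and fallback rules imply a fixed, single-level resolution order. The hypothetical helper below (not part of the Router API) sketches it: the alias is resolved at most once, and only the resolved model's own fallback chain is consulted:

```rust
use std::collections::HashMap;

/// Order in which models are attempted for a requested name:
/// alias target (or the name itself), then that model's fallbacks.
/// Aliases never chain, and fallbacks of fallbacks are ignored.
pub fn resolution_order(
    model: &str,
    aliases: &HashMap<String, String>,
    fallbacks: &HashMap<String, Vec<String>>,
) -> Vec<String> {
    // Resolve the alias exactly once; a second lookup would mean chaining.
    let primary = aliases
        .get(model)
        .cloned()
        .unwrap_or_else(|| model.to_string());
    let mut order = vec![primary.clone()];
    // Append only the primary model's own fallback chain (single level).
    if let Some(chain) = fallbacks.get(&primary) {
        order.extend(chain.iter().cloned());
    }
    order
}
```

For the example config above, requesting "gpt-4" would attempt llama3:70b, then llama3:8b, then mistral:7b before failing with FallbackChainExhausted.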
| Condition | Behavior |
|---|---|
| Vision request, no vision backends | Return CapabilityMismatch with "vision" |
| Tools request, no tools backends | Return CapabilityMismatch with "tools" |
| Context too long for all backends | Return CapabilityMismatch with "context_length" |
| Multiple missing capabilities | List all in error response |
| Condition | Behavior |
|---|---|
| All backends same score | Return first candidate |
| Backend with priority > 100 | Clamp to 100 in score calculation |
| No latency data yet | Use 0ms (best possible score) |
| Pending requests > 100 | Clamp to 100 in score calculation |
- Requirements extraction from various request types
- Scoring function with different weights
- Each routing strategy in isolation
- Alias resolution (including circular detection)
- Fallback chain traversal
- Capability matching logic
- Score function always returns value in valid range
- Round-robin distributes evenly over N iterations
- Smart strategy always selects highest-scoring backend
- Alias resolution terminates (no infinite loops)
- End-to-end routing through API
- Routing with live registry updates
- Fallback behavior when backends go down
- Concurrent routing decisions
- Routing decision < 1ms with 100 backends
- Routing decision < 1ms with 1000 models
- No degradation under concurrent load
- src/registry/mod.rs - Backend and model data
- src/api/types.rs - ChatCompletionRequest type
- src/config.rs - RoutingConfig
- None new (uses existing: thiserror, tracing)
src/
├── routing/
│ ├── mod.rs # Router struct and main logic
│ ├── requirements.rs # RequestRequirements extraction
│ ├── scoring.rs # Scoring function and weights
│ ├── strategies.rs # RoutingStrategy implementations
│ └── error.rs # RoutingError types
└── config.rs # Add RoutingConfig
- AC-01: Routes to backend with exact model match
- AC-02: Filters candidates by health status (Healthy only)
- AC-03: Filters by vision capability when request has images
- AC-04: Filters by tools capability when request has tools
- AC-05: Filters by context length (estimated tokens vs model limit)
- AC-06: Scores backends using priority, load, latency
- AC-07: Resolves model aliases transparently
- AC-08: Traverses fallback chain when model unavailable
- AC-09: Detects and prevents circular aliases
- AC-10: Returns descriptive errors for all failure cases
- AC-11: Smart strategy selects highest-scoring backend
- AC-12: Round-robin distributes evenly
- AC-13: Priority-only always selects lowest priority number
- AC-14: Random strategy provides approximate even distribution
- AC-15: Routing decision completes in < 1ms
- AC-16: Thread-safe concurrent routing decisions