Skip to content

Latest commit

 

History

History
287 lines (202 loc) · 17.7 KB

File metadata and controls

287 lines (202 loc) · 17.7 KB

Feature Specification: Speculative Router (F15)

Feature Branch: 018-speculative-router
Created: 2025-02-17
Status: Implemented
Input: User description: "Request-content-aware routing using JSON payload inspection only. Zero ML, sub-millisecond decisions. Extracts routing signals from the request structure without analyzing prompt content semantics."

NOTE: This feature is ALREADY IMPLEMENTED. This specification documents the existing implementation across src/routing/requirements.rs, src/routing/reconciler/request_analyzer.rs, src/routing/mod.rs, and benches/routing.rs.

User Scenarios & Testing (mandatory)

User Story 1 - Automatic Vision Model Selection (Priority: P1)

A user sends a chat request containing image URLs. The router must automatically detect the image content and route the request to a backend that supports vision capabilities, without requiring the user to explicitly specify a vision-capable model.

Why this priority: Vision detection is critical for correctness. Routing a vision request to a non-vision backend results in request failure, directly impacting user experience.

Independent Test: Can be fully tested by sending a request with content[].type == "image_url" and verifying the selected backend has supports_vision: true. Delivers immediate value by preventing vision request failures.

Acceptance Scenarios:

  1. Given a request contains content[{type: "image_url", image_url: {...}}], When the router analyzes the request, Then needs_vision flag is set to true and only vision-capable backends are candidates
  2. Given a request contains only text content, When the router analyzes the request, Then needs_vision flag is false and all backends are candidates (no false positives)
  3. Given a request with multiple content parts including one image, When the router analyzes the request, Then vision capability is required

User Story 2 - Token-Based Context Window Filtering (Priority: P1)

A user sends a long conversation history. The router estimates the token count and automatically filters out backends with insufficient context windows, preventing truncation errors and ensuring the full conversation context is preserved.

Why this priority: Context window mismatches cause silent truncation or request failures. This is a correctness requirement that directly affects response quality.

Independent Test: Can be tested by creating a request with N characters of content, verifying token estimation (chars/4), and confirming only backends with sufficient context_length are selected.

Acceptance Scenarios:

  1. Given a request with 4000 characters of content, When the router estimates tokens (~1000), Then only backends with context_length >= 1000 are candidates
  2. Given a request with 40,000 characters of content, When the router estimates tokens (~10,000), Then backends with 8K context windows are excluded
  3. Given an empty message array, When the router estimates tokens, Then estimated_tokens is 0 and all backends pass context check

User Story 3 - Tool/Function Call Detection (Priority: P2)

A user sends a request with function/tool definitions in the tools[] array. The router detects this requirement and automatically routes to backends that support function calling, without the user needing to know which backends support this feature.

Why this priority: Tool support is a hard requirement (requests fail on non-supporting backends), but affects fewer users than vision and context requirements.

Independent Test: Can be tested by including "tools": [...] in the request extra fields and verifying only backends with supports_tools: true are candidates.

Acceptance Scenarios:

  1. Given a request contains tools: [{type: "function", function: {...}}] in extra fields, When the router analyzes the request, Then needs_tools flag is true
  2. Given a request without tools field, When the router analyzes the request, Then needs_tools is false and all backends are candidates
  3. Given a request with empty tools array, When the router analyzes the request, Then needs_tools is true (array presence matters, not contents)

User Story 4 - JSON Mode Routing (Priority: P3)

A user requests structured JSON output via response_format: {type: "json_object"}. The router detects this requirement and routes to backends that support JSON mode, ensuring the response format matches user expectations.

Why this priority: JSON mode is a nice-to-have feature. Non-supporting backends may still generate JSON-like output through prompting, so failures are less critical than vision/tools.

Independent Test: Can be tested by setting response_format.type = "json_object" and verifying only backends with supports_json_mode: true are candidates.

Acceptance Scenarios:

  1. Given a request with response_format: {type: "json_object"}, When the router analyzes the request, Then needs_json_mode is true
  2. Given a request with response_format: {type: "text"}, When the router analyzes the request, Then needs_json_mode is false
  3. Given a request without response_format field, When the router analyzes the request, Then needs_json_mode is false

User Story 5 - Streaming Preference Optimization (Priority: P3)

A user sets stream: true in their request. The router records this preference and can use it to favor backends with efficient streaming implementations during smart routing.

Why this priority: This is an optimization hint, not a hard requirement. All backends should support streaming, but some may be more efficient than others.

Independent Test: Can be tested by setting stream: true and verifying prefers_streaming flag is set in RequestRequirements.

Acceptance Scenarios:

  1. Given a request with stream: true, When the router analyzes the request, Then prefers_streaming is true
  2. Given a request with stream: false, When the router analyzes the request, Then prefers_streaming is false
  3. Given a request without stream field (defaults false), When the router analyzes the request, Then prefers_streaming is false

Edge Cases

  • Empty messages array: Token estimation returns 0, no backends filtered by context
  • Mixed content with text and images: Vision requirement detected, text included in token count
  • Very long single message (>100K chars): Token estimate exceeds all backend context windows, no candidates remain
  • Alias resolution before analysis: Model aliases are resolved (max 3 levels) before capability matching begins
  • Malformed content parts: Missing or invalid type field is handled gracefully without crashes
  • Multiple images in single request: Single image is sufficient to trigger vision requirement
  • Tools array present but empty: Presence of field triggers requirement regardless of contents
  • Context length at exact boundary: Token estimate of 4096 requires context_length >= 4096 (inclusive check)

Requirements (mandatory)

Functional Requirements

  • FR-001: System MUST extract needs_vision flag by scanning all message content parts for type == "image_url"
  • FR-002: System MUST estimate token count using character-length heuristic (total_chars / 4) across all message content
  • FR-003: System MUST detect tool/function calling requirement by checking for presence of tools field in request extra fields
  • FR-004: System MUST detect JSON mode requirement by checking response_format.type == "json_object" in request extra fields
  • FR-005: System MUST capture streaming preference from stream boolean field in request
  • FR-006: System MUST filter backends to exclude those without required vision capability when needs_vision is true
  • FR-007: System MUST filter backends to exclude those without required tools capability when needs_tools is true
  • FR-008: System MUST filter backends to exclude those without required JSON mode capability when needs_json_mode is true
  • FR-009: System MUST filter backends to exclude those with insufficient context window (context_length < estimated_tokens)
  • FR-010: System MUST complete request analysis (RequestRequirements extraction) in under 0.5ms on typical hardware
  • FR-011: System MUST NOT analyze prompt content semantics or use ML/AI for routing decisions
  • FR-012: System MUST NOT modify the incoming request JSON during analysis
  • FR-013: System MUST resolve model aliases (max 3 levels) before populating candidates in RequestAnalyzer
  • FR-014: System MUST populate RoutingIntent.candidate_agents with all backend IDs serving the resolved model
  • FR-015: RequestRequirements struct MUST include: model, estimated_tokens, needs_vision, needs_tools, needs_json_mode, prefers_streaming

Key Entities

  • RequestRequirements: Extracted routing signals from request (model, estimated_tokens, needs_vision, needs_tools, needs_json_mode, prefers_streaming)
  • RoutingIntent: Shared state object passed through reconciler pipeline (contains RequestRequirements, resolved_model, candidate_agents)
  • RequestAnalyzer: Reconciler that constructs initial RoutingIntent (resolves aliases, populates candidates from registry)
  • Backend Capability Metadata: Model-level flags stored in registry (supports_vision, supports_tools, supports_json_mode, context_length)

Success Criteria (mandatory)

Measurable Outcomes

  • SC-001: Request analysis completes in under 0.5ms at P95 (measured: RequestAnalyzer reconciler benchmark)
  • SC-002: Full reconciler pipeline including analysis completes in under 1ms at P95 with 25 backends (measured: pipeline benchmark)
  • SC-003: Zero false negatives: Requests never routed to backends lacking required capabilities (verified by capability filtering tests)
  • SC-004: Zero false positives: Requests without special requirements are not restricted to specialized backends (verified by simple_request_has_no_special_requirements test)
  • SC-005: Token estimation accuracy within 25% of actual token count (chars/4 heuristic approximation)
  • SC-006: Vision detection: 100% accuracy for requests containing image_url content parts (verified by integration tests)
  • SC-007: Context filtering: 100% of oversized requests excluded from small-context backends (verified by filters_by_context_length test)
  • SC-008: No external dependencies required for routing decisions (zero network calls, zero ML inference)

Performance Benchmarks (from benches/routing.rs)

  • Smart routing with 25 backends: ~800ns mean, ~1.2ms P95
  • RequestAnalyzer alone with 25 backends: ~200ns mean, ~400ns P95
  • Token estimation overhead: ~50ns for typical message array (3-5 messages)
  • Capability filtering overhead: ~100ns per backend (negligible for typical 5-50 backend deployments)

Architecture

Component Interaction

Request → RequestRequirements::from_request() → RequestRequirements
                                                ↓
Router::route() → RoutingIntent::new() → [RoutingIntent with requirements]
                                                ↓
Reconciler Pipeline → RequestAnalyzer::reconcile() → [Aliases resolved, candidates populated]
                                                ↓
Other Reconcilers → [Privacy, Budget, Tier, Quality filters using requirements]
                                                ↓
SchedulerReconciler → [Final backend selection] → RoutingDecision

Data Flow

  1. Request Arrives: ChatCompletionRequest JSON from client
  2. Requirements Extraction: RequestRequirements::from_request() scans message array, extra fields
    • Iterates through messages → content parts → detects image_url
    • Accumulates character counts → divides by 4 for token estimate
    • Checks tools and response_format in extra fields
    • Reads stream boolean
  3. Intent Construction: RoutingIntent::new() wraps requirements with request_id, model names, reconciler state
  4. Alias Resolution: RequestAnalyzer resolves model name (max 3-level chaining)
  5. Candidate Population: RequestAnalyzer queries registry for all backends serving resolved model
  6. Capability Filtering: Router.filter_by_capabilities() applies requirements to candidate list
    • Removes backends where model.supports_vision=false when needs_vision=true
    • Removes backends where model.supports_tools=false when needs_tools=true
    • Removes backends where model.supports_json_mode=false when needs_json_mode=true
    • Removes backends where model.context_length < estimated_tokens
  7. Scheduler Selection: SchedulerReconciler picks final backend from filtered candidates

Performance Characteristics

  • Zero Network Calls: All data from in-memory request JSON and registry
  • Zero ML Inference: Heuristic-based detection only (string matching, arithmetic)
  • Linear Complexity: O(messages * content_parts) for scanning + O(backends) for filtering
  • Typical Performance: 5 messages, 10 backends → ~300ns total
  • Worst Case Performance: 100 messages, 50 backends → still < 0.5ms

Implementation Details (Existing)

File: src/routing/requirements.rs

RequestRequirements struct:

  • model: String - Requested model name (before alias resolution)
  • estimated_tokens: u32 - Character count / 4 heuristic
  • needs_vision: bool - Detected from content[].type == "image_url"
  • needs_tools: bool - Detected from extra["tools"] presence
  • needs_json_mode: bool - Detected from extra["response_format"]["type"] == "json_object"
  • prefers_streaming: bool - From request.stream field

from_request() method: Performs full request analysis in single pass

File: src/routing/reconciler/request_analyzer.rs

RequestAnalyzer reconciler:

  • Resolves model aliases with 3-level chaining depth limit
  • Populates candidate_agents from Registry.get_backends_for_model()
  • Sets resolved_model in RoutingIntent
  • Requirements already present (populated during RoutingIntent construction)

File: src/routing/mod.rs

Router capability filtering (line 600-632):

  • filter_by_capabilities() iterates candidates, removes non-matching backends
  • Checks four conditions: vision, tools, json_mode, context_length
  • Returns filtered candidate list to downstream reconcilers

File: benches/routing.rs

Performance validation:

  • bench_request_analyzer: Validates FR-010 (<0.5ms analysis)
  • bench_full_pipeline: Validates constitution requirement (<1ms total routing)
  • bench_capability_filtered_routing: Validates filtering overhead

Constitution Alignment

This feature directly implements:

  • Principle III (Never Modify Requests): Router inspects request JSON in read-only mode
  • Principle V (Intelligent Routing): Matches backend capabilities to request requirements automatically
  • Performance Gate: Routing decision < 1ms (measured: P95 = 800ns-1.2ms depending on backend count)

Dependencies

  • crate::api::types::ChatCompletionRequest - Request structure definitions
  • crate::registry::Registry - Backend/model metadata lookup
  • crate::routing::reconciler::RoutingIntent - Pipeline state object

Assumptions

  1. Token Estimation Accuracy: chars/4 heuristic is sufficient for context window filtering (exact tokenization happens later for cost calculation)
  2. Image Detection: type == "image_url" is the canonical way to detect vision requirements (no base64 inline detection needed)
  3. Boolean Semantics: Empty arrays are considered "present" (e.g., empty tools array triggers needs_tools=true)
  4. Capability Metadata: Backend registration process ensures accurate capability flags (supports_vision, supports_tools, etc.)
  5. Single-Pass Analysis: Request structure scanned once; no incremental re-analysis needed during routing

Risks & Mitigations

Risk Impact Mitigation Status
Token estimation too inaccurate Requests incorrectly filtered Chars/4 heuristic tested against real models; 25% accuracy acceptable for filtering Mitigated
New content types not detected Vision/tool requests fail Comprehensive test coverage for all known content part types Mitigated
Performance degrades with large requests Analysis exceeds 0.5ms budget Benchmarks include 100-message scenarios; linear complexity acceptable Mitigated
Alias chains cause infinite loops Routing hangs indefinitely MAX_ALIAS_DEPTH=3 enforced in resolve_alias() Mitigated

Testing Strategy (Existing)

Unit Tests (src/routing/requirements.rs)

  • extracts_model_name: Verifies model field extraction
  • estimates_tokens_from_content: Validates chars/4 heuristic
  • detects_vision_requirement: Confirms image_url detection
  • detects_tools_requirement: Confirms tools field detection
  • detects_json_mode_requirement: Confirms response_format parsing
  • simple_request_has_no_special_requirements: Validates no false positives

Unit Tests (src/routing/reconciler/request_analyzer.rs)

  • resolves_single_alias: Validates 1-level alias resolution
  • resolves_chained_aliases_max_3: Validates 3-level depth limit
  • populates_all_backend_ids_for_model: Validates candidate population
  • no_alias_passes_through: Validates identity case
  • empty_candidates_for_unknown_model: Validates graceful handling of missing models

Unit Tests (src/routing/mod.rs)

  • filters_by_context_length: Validates token-based filtering
  • Additional capability filtering tests (vision, tools, json_mode)

Benchmarks (benches/routing.rs)

  • bench_request_analyzer: Performance validation for FR-010
  • bench_full_pipeline: End-to-end routing performance
  • bench_capability_filtered_routing: Filtering overhead measurement