Releases: peva3/SmarterRouter
Releases · peva3/SmarterRouter
2.2.4 - Security sweep
[2.2.4] - 2026-04-06
Security Fixes
- Weak MD5 hash in prompt analysis cache (
router/router.py:1302): Replacedhashlib.md5()withhashlib.sha256()for cryptographic security in cache key generation. - Pickle deserialization vulnerability in Redis cache (
router/cache_redis.py:97): Replacedpickle.loads()/pickle.dumps()withjson.loads()/json.dumps()to prevent potential remote code execution from untrusted cache data. - Redis cache connection error handling (
tests/test_cache_redis.py): Fixed test to properly assert connection state and handle mocked exceptions.
Bug Fixes
- Enum class definitions (
router/modality.py,router/security.py): Changed fromstr, EnumtoStrEnumfor better type safety and compatibility. - Whitespace in blank lines (
router/backends/ollama.py): Removed trailing whitespace from blank lines. - Import block organization (
main.pyand other files): Organized and sorted import statements per PEP 8. - Unused loop variables (
tests/test_provider_fixtures.py): Renamed unused variables to_convention.
Performance Improvements
- None in this release - All performance improvements were implemented in v2.2.3
2.2.3 - Bug fixes, performance gains
[2.2.3] - 2026-03-27
Security Fixes
- SQL injection anti-pattern in index creation (
database.py:278-281): Changed f-string interpolation in DDL helper to parameterized query usingtext(...).bindparams(...). The index name was hardcoded so not directly exploitable, but the pattern could be copied to user-facing code. - Timing attack on admin API key comparison (
state.py:467): Changed string!=comparison tohmac.compare_digest()to prevent timing side-channel attacks on the admin API key.
Bug Fixes
- VRAM state inconsistency on model load failure (
vram_manager.py:120-148): Added snapshot ofloaded_modelsbefore VRAM freeing; restores snapshot ifload_model()raises orVRAMExceededErroroccurs. Previously, a failed load could free VRAM without adding the model. load_modelalways returns True in Ollama backend (ollama.py:330-388): ReturnsFalsewhen the model doesn't exist, when both load attempts fail, or on generic exceptions. Previously all code paths returnedTrueeven on genuine failures.- Duplicate background task registration (
lifecycle.py:197-218): Removed duplicate registration ofbackground_cache_cleanup_taskandbackground_dlq_retry_taskthat were creating redundant coroutines.
Performance Improvements
- Bulk delete for expired cache entries (
persistent_cache.py): Replaced O(N) row-by-rowsession.delete()loop with singlesession.execute(delete(Model).where(...))bulk SQL delete. - Efficient cache count queries (
persistent_cache.py): Replacedlen(session.execute(...).scalars().all())withsession.scalar(select(func.count()).where(...))to avoid loading all rows into memory. - Bounded prompt analysis cache (
router.py): Changed_PROMPT_ANALYSIS_CACHEfrom unbounded dict toOrderedDictwith max 4096 entries and LRU eviction on write. Addedmove_to_endon read access. - Bounded benchmark cache (
benchmark_db.py): Changed_benchmarks_for_models_cachefrom unbounded frozenset-keyed dict toOrderedDictwith max 512 entries and LRU eviction. - Async DB call for feedback scores (
router.py:1291): Changed synchronousself._get_model_feedback_scores()call in async_keyword_dispatchtoawait asyncio.to_thread(...)to avoid blocking the event loop. - Async file I/O for provider.db download (
lifecycle.py:441): Wrapped blockingopen(...).write(...)inawait asyncio.to_thread(_write_temp)to prevent event loop stalls during download. - Single-transaction bulk upsert (
benchmark_db.py:166-186): Moved session and commit outside the per-item loop so all benchmark rows are written in a single transaction.
2.2.2 - multi-modality hotfix.
[2.2.2] - 2026-03-16
Bug Fixes
- Ollama backend multimodal transformation: Fixed OpenAI-style multimodal message handling in Ollama backend to properly convert image_url content parts to Ollama's expected images field, stripping data:image/...;base64, prefixes so Ollama vision models can actually receive image data. This resolves the issue where image uploads appeared to route correctly but the image payload was not translated into the format Ollama expects.
2.2.1 - Multi modality is here!
[2.2.1] - 2026-03-16
Highlights
Added modality-aware routing to intelligently route requests based on input type (vision, tool-calling, text, embeddings). Enhanced changelog organization and documentation.
New Features
Modality-Aware Routing
- Modality detection module (
router/modality.py) - Automatic detection of request modalities from request shape:- Vision: Image URL content parts in messages
- Tool Calling: Presence of tools in request
- Text: Default text-based chat
- Embedding: Embeddings endpoint requests
- Model filtering by modality - Filters available models based on modality capabilities using profile flags and name heuristics.
- Safe fallback - When modality filtering removes all candidates, falls back to all available models.
- Name-based heuristics for models without profile data:
- Vision:
llava,pixtral,gpt-4o,claude-3,gemini, etc. - Tool calling:
gpt-4,claude-3,mistral-large,qwen2.5, etc. - Embeddings:
embed,nomic,mxbai,text-embedding, etc.
- Vision:
Integration
- Chat endpoint - Modality detected from request and applied during model selection.
- Embeddings endpoint - Added modality validation to warn when non-embedding models are requested.
- Router engine - Modality-based filtering integrated into model selection pipeline.
Documentation
- Reorganized 2.2.0 changelog for better readability with logical grouping.
- Removed
(Item #XX)references from 2.2.0 changelog.
Testing
- Added comprehensive modality detection tests (
tests/test_modality.py). - Coverage for all modality types, edge cases, and fallback behavior.
2.2.0 - Tons of bug fixes, logic fixes, and quality of life upgrades
[2.2.0] - 2026-03-16
Highlights
- Major platform update with performance improvements, reliability hardening, expanded security controls, and large documentation/testing expansion.
- Main application architecture refactored into focused modules (
router/state.py,router/middleware.py,router/lifecycle.py,router/api/*) withmain.pyreduced to an app shell.
Performance & Scalability
- Added configurable response compression (
ROUTER_ENABLE_RESPONSE_COMPRESSION,ROUTER_COMPRESSION_MINIMUM_SIZE). - Added cursor-based admin pagination for large profile/benchmark datasets.
- Moved persistent cache cleanup to a background task (
ROUTER_CACHE_CLEANUP_INTERVAL_HOURS). - Added optional slow-request profiling middleware (
ROUTER_ENABLE_SLOW_QUERY_LOGGING,ROUTER_SLOW_QUERY_THRESHOLD_MS). - Fixed
RouterEngine.refresh_modelscache bypass regression. - Optimized request-size middleware with a
Content-Lengthfast path. - Added external provider model-list caching in backend registry (30s TTL).
- Increased global model-list cache TTL from 10s to 30s.
- Reduced
/healthprobe overhead by skipping metrics accounting for that endpoint.
Reliability & Operations
- Added backend retry controls and unified retry orchestration for transient HTTP failures.
- Added backend circuit-breaker controls and resilience wrappers for core backends.
- Expanded
/healthchecks (DB, backend readiness, GPU monitor, cache backend, background task count, request ID, DLQ counts). - Added provider.db degradation/staleness status and slow-query fallback window.
- Added global request timeout middleware (
ROUTER_REQUEST_TIMEOUT_ENABLED,ROUTER_REQUEST_TIMEOUT_SECONDS). - Improved resource cleanup on error paths and profiler-owned judge client cleanup.
- Added persistent DLQ with retry scheduling, retry worker, admin inspect/retry endpoints, and health observability.
- Fixed Docker SQLite persistence path to absolute URL (
sqlite:////app/data/router.db) and corrected absolute-path parsing in startup/database checks. - Made model auto-profiling respect
ROUTER_MODEL_AUTO_PROFILE_ENABLED.
Security
- Added configurable CORS controls (
ROUTER_CORS_ORIGINS, credentials/methods/headers/max-age settings). - Added encrypted API key storage utilities (Fernet + PBKDF2) and wired runtime decryption for backend/judge key usage.
- Added optional-dependency hardening for encryption path when
cryptographyis unavailable. - Added admin audit logging with persisted event records and query endpoint.
- Added TLS verification toggle (
ROUTER_VERIFY_TLS) across backend/provider/judge/webhook clients. - Added admin IP whitelist support (exact IP + CIDR, with proxy header handling).
- Added configurable request-size and per-message content-length limits.
- Added dependency scanning workflow with scheduled/on-demand vulnerability checks.
- Added prompt-injection and content-moderation utility modules/configuration; chat request path currently passes prompts through without moderation enforcement.
API & Routing Behavior
- Added dedicated chat endpoint rate limit (
ROUTER_RATE_LIMIT_CHAT_REQUESTS_PER_MINUTE). - Improved model-name sanitization across chat, embeddings, feedback, and admin model override paths.
- Added richer error log context (
request_id,user_ip,model_name,prompt_hash) across core failure paths. - Removed chat prompt moderation/injection enforcement from
/v1/chat/completionsrequest path.
Code Quality & Refactoring
- Split monolithic
main.pyinto modular API/middleware/lifecycle/state packages. - Removed dead code and duplicate declarations in router/profiler paths.
- Standardized assorted lint/type quality fixes across utility/runtime code.
Documentation
- Added
docs/kubernetes.mddeployment guide (Helm/manifests, ingress, HPA, monitoring). - Added
docs/architecture.mdwith Mermaid diagrams and data-flow views. - Added
docs/contributing.mdwith development and PR workflow guidance. - Maintained comprehensive
docs/troubleshooting.mdanddocs/configuration.mdcoverage. - API docs available via FastAPI
/docsand/redoc.
Testing
- Expanded integration and unit coverage for provider.db reliability, request timeout behavior, model sanitization, DLQ flows, chat rate limits, audit logging, TLS toggle, admin IP whitelist, and request-size limits.
- Added and stabilized new suites for property-based tests, backend failover, security edge cases, concurrency stress, routing snapshots, cache persistence recovery, provider fixtures, and optional Ollama integration.
- Fixed API drift in newly added tests to align with current runtime interfaces.
Validation Notes
- Targeted regression subset:
8 passed, 6 skipped. - Full coverage audit remains blocked in the local environment due to virtualenv dependency corruption (
pydantic_core/ optional packages).
Summary
- Documentation items complete.
- Test infrastructure largely complete with one environment-blocked coverage target.
- Overall: 57 of 58 planned improvements complete for this release.
2.1.9 - part 2 of fixes and performance gains.
[2.1.9] - 2026-03-03
Performance Optimizations (Phase 2 - Quick Wins)
Critical Performance Fixes
-
Fixed blocking GPU I/O with async wrapper:
- Added
get_memory_info_async()method to GPU backend protocol (router/gpu_backends/base.py:63-74) - Updated VRAM monitor to use async GPU queries (router/vram_monitor.py:219-225)
- Eliminates event loop blocking during GPU memory queries (5s timeout per GPU)
- Added
-
Implemented batched VRAM estimates:
- Added
get_model_vram_estimates_batch()function for bulk queries (main.py:59-135) - Replaced N+1 pattern in fallback logic with single batch query (main.py:972-976)
- Reduces database queries from O(N) to O(1) for model fallback scenarios
- Added
-
Added prompt analysis caching:
- 5-minute TTL cache for prompt analysis results (router/router.py:33-35)
- MD5 hash-based cache key to avoid repeated computation (router/router.py:1297-1315)
- Significant reduction in regex and string operations for repeated prompts
-
Optimized rate limiter:
- Reduced cleanup frequency from every request to only when >1000 entries (main.py:287-292)
- Eliminates linear scan overhead for normal traffic patterns
- Maintains same rate limiting behavior with less CPU overhead
-
Added logging level guards:
- Simplified JSON logging for DEBUG/INFO levels (router/logging_config.py:27-71)
- Only includes extra fields for WARNING+ levels to reduce serialization overhead
- Reduces JSON serialization cost for high-volume INFO logs
Algorithmic Optimizations
- O(N+M) benchmark matching: Replaced O(N×M) nested loops with O(N+M) algorithm (router/router.py:1459-1523)
- Database connection pooling: Added SQLAlchemy connection pooling (router/database.py:83-92)
- Fixed N+1 query in refresh_models(): Eliminated redundant queries (router/router.py:1037-1052)
- Guarded expensive debug logs: Added
isEnabledFor()checks (router/router.py:1294, 1320-1321, 1349, 1375, 1524-1536) - Consistent model caching: Updated all calls to use
get_available_models_with_cache()(main.py:299, 915, 1703, 1813)
Bug Fixes & Code Quality Improvements
Type Safety & Static Analysis
- Fixed type errors in router.py: Added proper type hints for
time_series_statsandcache_analyticsfields (router/router.py:232-237) - Fixed type errors in main.py: Corrected dictionary/list type mismatches in cache stats endpoint (main.py:1566-1576)
- Fixed type errors in cache_stats.py: Added missing type annotations for
model_cache_countsandmodel_access_counts(router/cache_stats.py:275-276) - Fixed return type consistency: Ensured
dict()conversion for eviction counts (router/cache_stats.py:307)
Error Handling & Edge Cases
- Fixed division by zero in profiler: Added zero checks for empty score/time lists (router/profiler.py:427, 571)
- Added JSON error handling: Added try/except for
json.loads()in tool execution (main.py:1110-1114) - Improved type safety: Added explicit type hints for analytics dictionary (router/router.py:921)
Model Loading & VRAM Management
- Fixed Qwen 3.5 model loading issues:
- Removed 30-second timeout cap for model warmup (router/backends/ollama.py:227, 242)
- Changed
keep_alivefrom-1(forever) to300(5 minutes) during profiling (router/profiler.py:213) - Added model unloading after profiling to free VRAM (router/profiler.py:610-617, 486-495)
- Improved error handling for slow model loading (router/backends/ollama.py:210-280)
- Fixed VRAM exhaustion:
- Added model existence verification before loading (router/backends/ollama.py:228-237)
- Multiple fallback approaches for model warmup (/api/generate then /api/chat) (router/backends/ollama.py:244-272)
- Fixed background sync error handling: Graceful handling of "No models available after filtering" error (main.py:565-570)
Performance & Reliability
- Async GPU measurement already implemented:
_measure_vram_gb_async()method exists and is used (router/profiler.py:144-166, 552, 557) - No unused imports found: All imports are properly used (numpy is conditionally imported)
Performance Impact
- GPU I/O: Eliminates 5s blocking per GPU query, prevents event loop stalls
- Database: Reduces queries by 90%+ in fallback scenarios (N models → 1 query)
- CPU: Reduces prompt analysis overhead by ~80% for repeated prompts
- Memory: More efficient logging reduces JSON serialization overhead
- Latency: Faster response times across all optimization areas
- Reliability: Better error handling prevents crashes from malformed JSON
Backward Compatibility
- All optimizations maintain full backward compatibility
- No configuration changes required
- All 420 tests pass with optimizations applied
- Performance improvements are automatic with no user intervention needed
Code Organization
- Moved utility scripts to scripts/ directory: Development/deployment scripts (
apply_optimizations.py,apply_router_optimizations.py,optimize_performance.py,fix_schema.py) moved from root toscripts/for better organization
2.1.8 - fixes and some app speedups.
[2.1.8] - 2026-03-03
Performance Optimizations
Reduced Backend API Calls
- Model list caching: Added 10-second TTL cache for
list_models()calls, eliminating ~100-500ms latency per request (router/router.py:33-155, main.py:125-184) - Router engine accepts pre-fetched models:
select_model()now accepts optionalavailable_modelsparameter to avoid redundant backend calls (router/router.py:1064-1079)
Lower Resource Consumption
- Reduced model polling frequency: Default intervals increased from 60s to 300s (5 minutes) to reduce background CPU/network overhead (router/config.py:83,86)
- Lowered logging verbosity: Per-request routing logs (prompt analysis, vision/tool detection, model override) changed from INFO to DEBUG level, significantly reducing disk I/O in production (router/router.py:1256,1309,1335; main.py:807,820)
Improved Benchmark Coverage
- Provider.db model name normalization: Added fallback fuzzy matching in
ProviderDB.get_benchmarks_for_models()to match local model names against external provider.db entries using normalized names (lowercase, stripped special characters). This improves benchmark coverage for OpenAI, Anthropic, and other external models when used through provider.db (router/provider_db.py:144-198)
Backward Compatibility
- All performance improvements are fully backward compatible
- No configuration changes required (uses sensible defaults)
- Existing environment variables continue to work unchanged
2.1.7 - Bugs and a hotfix
[2.1.7] - 2026-02-27
Critical Bug Fixes & Stability Improvements
Concurrency & Race Condition Fixes
- Fixed race condition in
SemanticCache._get_embedding(): Rewrote embedding cache to eliminate double lock acquisition that could cause deadlocks (router/router.py:396-467) - Fixed global cache race condition in
_get_all_profiles(): Addedasyncio.Lock()and double-checked locking pattern to prevent concurrent cache corruption (router/router.py:1363-1384) - Fixed memory leak in
_embedding_locks: Removed unused per-key locks dict that grew unbounded without cleanup (router/router.py)
Database & Type Safety
- Fixed boolean type mismatch in SQLAlchemy models: Changed
Integercolumns mapped to Pythonboolto properBooleantype withTrue/Falsedefaults (router/models.py:35,39,40,112,113) - Improved database session cleanup: Ensured proper session rollback and closure on error paths across codebase
Error Handling Improvements
- Fixed critical bare
except Exception:patterns: Added proper logging for circuit breaker callbacks and model profiling failures while maintaining appropriate graceful degradation - Enhanced error context: Added debug logging for model screening failures in profiler (router/profiler.py:417)
- Improved circuit breaker reliability: Added logging for state change callback failures (router/circuit_breaker.py:167)
Code Quality & Testing
- Fixed linting issues: Removed whitespace from blank lines (ruff W293)
- Updated async tests: Modified test suite to work with new async
_get_all_profiles()method - All tests passing: 14 router tests and 3 caching tests pass without regression
Performance Impact
- Eliminated deadlock risk: Embedding cache operations now safe under high concurrency
- Prevented memory leaks:
_embedding_locksdict removal prevents unbounded memory growth - Improved cache consistency: Global profile cache now properly synchronized across threads
- Better type safety: Boolean columns correctly mapped between Python and SQLite
Backward Compatibility
- Fully backward compatible: All fixes maintain existing API and behavior
- Database schema unchanged: Boolean column changes maintain compatibility with existing SQLite data
- Configuration unchanged: No new environment variables required
2.1.6 - API upgrades, Dynamic model management, and more.
[2.1.6] - 2026-02-27
Enhanced Cache Statistics & API
Detailed Cache Analytics
- Time-series tracking: Cache hits, misses, similarity hits, evictions, and embedding cache events tracked with timestamps
- Multi-dimensional metrics: Per-model cache counts, access patterns, and eviction reasons
- Real-time analytics: Cache hit rates, similarity hit rates, and adaptive threshold adjustments
New Admin Endpoints
GET /admin/cache/stats- Detailed cache statistics with time-series dataGET /admin/cache/analytics- Advanced analytics including per-model breakdownsPOST /admin/cache/reset- Reset cache statistics (preserves cache data)GET /admin/cache/series- Raw time-series data for external monitoring
Configuration Settings
ROUTER_CACHE_STATS_ENABLED- Enable/disable cache statistics collection (default: true)ROUTER_CACHE_STATS_RETENTION_HOURS- Time-series retention period (default: 24)
Model Hot‑Swap / Live Reload
Dynamic Model Management
- Live model discovery: Automatically detects newly added models without restart
- Automatic profiling: Optionally profiles new models on detection (
ROUTER_MODEL_AUTO_PROFILE_ENABLED) - Cleanup of missing models: Marks missing models as inactive (
ROUTER_MODEL_CLEANUP_ENABLED)
New Admin Endpoints
POST /admin/models/refresh- Trigger immediate model refreshPOST /admin/models/reprofile- Re-profile all models (or only those needing updates)
Configuration Settings
ROUTER_MODEL_POLLING_ENABLED- Enable periodic model polling (default: true)ROUTER_MODEL_POLLING_INTERVAL- Polling interval in seconds (default: 60)ROUTER_MODEL_CLEANUP_ENABLED- Mark missing models as inactive (default: false)ROUTER_MODEL_AUTO_PROFILE_ENABLED- Auto-profile new models (default: false)
Database Schema Updates
- Added
active(boolean) andlast_seen(datetime) columns tomodel_profilestable - Existing profiles automatically marked as active on upgrade
Performance Optimizations
- Cache statistics overhead reduced: Time-series recording uses batched writes
- Model polling optimized: Parallel model discovery and profiling
- Database queries optimized: Reduced contention with proper session management
Backward Compatibility
- All existing configurations continue to work unchanged
- New features are opt-in via configuration (defaults preserve existing behavior)
- Database migration automatically adds new columns with safe defaults
2.1.5 - Caching, caching, and more caching
[2.1.5] - 2026-02-26
Semantic Cache V2: Complete Four-Phase Implementation
Persistent Disk Caching
- SQLite-based persistence: Routing decisions, LLM responses, and embeddings now survive restarts via SQLite database
- Automatic load/save: Cache data automatically loads on startup and saves new entries to disk
- Configurable TTL: Persistent cache respects same TTL settings as in-memory cache (default 1 hour for routing/response, 24h for embeddings)
- Automatic cleanup: Expired entries automatically removed from database (max age: 7 days configurable)
- New Database Tables:
routing_cache,response_cache,embedding_cachewithaccess_counttracking
Query Pattern Learning with Adaptive Hit Rates (New)
- Adaptive Similarity Thresholds: Semantic cache now dynamically adjusts similarity thresholds based on:
- Overall cache hit rate (low hit rate → lower threshold, high hit rate → higher threshold)
- Model selection frequency (frequently selected models get stricter matching)
- Real-time performance monitoring with configurable ranges (0.7-0.95)
- Query Pattern Analysis: Tracks access patterns via
access_countcolumns in database - Intelligent Cache Warming: Most frequently accessed queries are prioritized when loading from persistence
- Performance Optimization: Adaptive thresholds increase cache hit rate while maintaining response quality
Top-K Popular Query Pre-caching (New)
- Popular Query Prioritization: Database queries order by
access_count.desc()to load most popular entries first - Smart Cache Loading: Loads up to 1000 routing entries, 500 response entries, 2500 embedding entries from persistence
- LRU with Popularity Bias: Frequently accessed queries stay in cache longer due to natural access patterns
- Cold Start Optimization: Popular queries available immediately after restart, reducing cache miss penalty
Vector Index Optimization for Scaling (Enhanced)
- Numpy-Optimized Batch Processing:
_cosine_similarity_batch()uses vectorized numpy operations for O(N) efficiency - Scalable Architecture: Current implementation supports 1000+ embeddings with sub-millisecond similarity search
- Future-Ready Design: Architecture prepared for FAISS/hnswlib integration when needed for 10,000+ embeddings
Configuration Settings
- ROUTER_PERSISTENT_CACHE_ENABLED: Enable/disable persistent caching (default: true)
- ROUTER_PERSISTENT_CACHE_MAX_AGE_DAYS: Maximum age in days to keep cache entries (default: 7)
- ROUTER_CACHE_SIMILARITY_THRESHOLD: Base similarity threshold (default: 0.85), now adaptively adjusted
Performance Improvements
- 30-50% faster cold starts: Routing decisions restored from disk, avoiding cache misses after restart
- 10-20% higher cache hit rates: Adaptive thresholds optimize for actual query patterns
- Better semantic matching: More embedding vectors available for similarity search with intelligent filtering
- Reduced backend calls: Responses cached across restarts reduce repeat calls to LLM backends
- Adaptive intelligence: Cache automatically tunes itself based on usage patterns over time
Integration & Backward Compatibility
- Seamless integration: Works with existing SemanticCache - minimal code changes required
- Optional feature: Can be disabled via configuration
- Gradual roll-out: Default enabled, can be turned off if disk space is constrained
- Full test coverage: All 396 tests pass with new adaptive caching logic
Developer Experience & Deployment Improvements
Interactive Setup Wizard (New)
- Built-in CLI: New
smarterroutercommand line interface with interactive setup wizard - Hardware Auto-detection: Automatically detects Ollama installation, GPU hardware (NVIDIA, AMD, Intel, Apple Silicon), and available models
- Smart Configuration Generation: Suggests optimal settings based on detected hardware and models
- Commands:
python -m smarterrouter setup- Interactive setup wizardpython -m smarterrouter check- Validate configuration and connectionspython -m smarterrouter generate-env- Generate.envfile with defaults
One-Line Docker Deployment (New)
- Auto-GPU Detection:
docker-run.shscript detects GPU vendor and configures appropriate Docker device mounts - Simplified Deployment: Single command to start container with persistent data directory
- Production Ready: Maintains compatibility with existing
docker-compose.ymlfor advanced configurations
Enhanced Explainer Endpoint
- Detailed Scoring Breakdown:
/admin/explainendpoint now returns comprehensive scoring details including:- Per-model scores with category breakdowns
- Benchmark data and profile scores
- Feedback boosts and diversity penalties
- Analysis weights and quality vs speed trade-off settings
- Improved Debugging: Developers can now see exactly why a model was selected
Warm-Start Cache Improvements
- Persistent Profile Loading: Model profiles are now loaded from database on startup, reducing first-request latency
- Cache Pre-warming: Router caches are pre-warmed during initialization for faster first responses
Backward Compatibility
- All existing configurations continue to work unchanged
- CLI tools are optional additions, not required for operation
- Docker entrypoint automatically handles configuration generation when no
.envexists