Note: This document replaces the original ARCHITECTUREv1.md document that captured our initial design thinking. This revision reflects the production architecture vision that emerged after completing the POC.
Accelerate AI adoption by making LLM deployment accessible to every organization:
- Reduce time-to-production for LLM deployments through intelligent automation
- Lower the barrier to entry for customers new to GPU and model infrastructure
- Simplify Kubernetes for LLM-based solutions with guided configuration
Ensure our customers fulfill their AI goals and maximize ROI of their GenAI investments through deep visibility and personalized recommendations.
Deploying LLMs in production environments remains complex and error-prone:
- Application Developers struggle to translate business needs into infrastructure choices (model selection, GPU type, SLO targets)
- Cluster Admins face overprovisioning risks without guidance on aligning workloads to budgets and SLAs
- Trial-and-error is the default, extending time-to-production and inflating costs
NeuralNav streamlines the path from concept to production-ready LLM deployments through:
- Conversational requirements gathering - Natural language input eliminates need for infrastructure expertise
- SLO-driven capacity planning - Translate business needs into technical specifications automatically
- Multi-criteria recommendation ranking - Present optimal trade-offs between accuracy, cost, latency, and complexity
- Ready-to-use configurations - Generate production-ready Kubernetes manifests for immediate deployment
NeuralNav consists of five major components:
┌───────────────────────────────────────────────────────────────────────────┐
│ UI Layer (Presentation) │
│ Streamlit (current) → React (future) │
│ Conversational Interface | Spec Editor | Recommendation Viewer | Config │
└───────────────────────────────────────────────────────────────────────────┘
↕ API Gateway (FastAPI)
┌───────────────────────────────────────────────────────────────────────────┐
│ Backend Services │
├────────────────────┬────────────────────┬────────────────────┬────────────┤
│ Intent │ │ │ │
│ Extraction │ Specification │ Recommendation │ Config │
│ Service │ Service │ Service │ Service │
│ │ │ │ │
│ Conversation │ User Intent │ Specification │ Selected │
│ ↓ │ ↓ │ ↓ │ Recommend │
│ Structured │ Complete │ Ranked │ ↓ │
│ Intent │ Specification │ Recommendations │ YAML │
│ │ │ │ Files │
└────────────────────┴────────────────────┴────────────────────┴────────────┘
↕ Data Access Layer
┌───────────────────────────────────────────────────────────────────────────┐
│ Knowledge Base (Data Layer) │
│ Embedded DB: Performance Benchmarks, Accuracy Benchmarks │
│ JSON Files: SLO Templates, Model Catalog, Hardware Profiles │
└───────────────────────────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────────────────┐
│ User Interface (Presentation) │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌────────┐ │
│ │Conversational│ │Specification │ │Recommendation│ │ Config │ │
│ │ Interface │ │ Editor │ │Viewer/Select │ │ Viewer │ │
│ │ │ │ │ │ │ │ │ │
│ │ Natural lang │ │ Edit SLOs, │ │ 5 ranked │ │ Display│ │
│ │ input, show │ │ priorities, │ │ views, trade │ │ YAML, │ │
│ │ intent │ │ thresholds │ │ off analysis │ │download│ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └────────┘ │
│ │
│ User Conversation ────────────────────────────▶ Config Files │
└───────────────────────────────────────────────────────────────────┘
Current Implementation: Streamlit Future Migration: React + WebSocket backend
Provides unified interface across all system capabilities through four integrated views:
- Accepts natural language description of use case
- Displays extracted intent for user review
- Allows clarification and refinement through multi-turn dialogue
- Obtains user approval before proceeding to specification
- Displays auto-generated deployment specification
- Allows editing of multiple fields:
- SLO targets: TTFT, ITL, E2E latency thresholds and percentile (p90/p95/p99)
- Quality thresholds: Minimum accuracy score, maximum cost
- User priorities: Weights for accuracy, cost, latency, complexity in balanced ranking
- Future: Editing of traffic profile parameters (prompt/output tokens, QPS), power/cooling budgets
- Obtains user approval before sending to Recommendation Service
- Displays ranked recommendations in multiple views:
- List View: Top recommendations in each category (Best Accuracy, Lowest Cost, Lowest Latency, Simplest, Balanced)
- Future: Pareto frontier charts for two-dimensional trade-offs (e.g., cost vs. accuracy)
- Future: Multi-dimensional visualization (>2 dimensions)
- Allows user to select preferred configuration
- Provides detailed explanations for each recommendation
- Displays generated configuration files for selected solution
- Allows download or copy of configuration files
- Current Target: KServe on Kubernetes (InferenceService YAML)
- Future Options: Direct Linux server deployment, other orchestration platforms
- Upload benchmark JSON files into PostgreSQL (append mode, duplicates skipped)
- Reset the benchmark database (removes all benchmark data, preserves schema)
- View database statistics: total benchmarks, distinct models, hardware types, traffic profile distribution
- Enables remote management of NeuralNav deployments (e.g., on Kubernetes) without requiring shell access
Backend components interact with the UI via FastAPI REST endpoints.
┌───────────────────────────────────────────────────────────────────┐
│ Intent Extraction Service │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ LLM Processor │ │ Schema Validator│ │ Intent Builder │ │
│ │ │ │ │ │ │ │
│ │ Ollama/qwen2.5 │──▶│ Pydantic schema │──▶│ Structured │ │
│ │ Engineered │ │ validation │ │ intent object │ │
│ │ prompts │ │ │ │ │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │
│ User Conversation ─────────────────────────▶ Structured Intent │
└───────────────────────────────────────────────────────────────────┘
Purpose: Transform natural language conversation into structured user intent
Input: User's conversational description of use case and requirements
Technology:
- Small LLM for structured extraction (currently Ollama with qwen2.5:7b)
- Pydantic schema for intent validation
- Engineered prompts for consistent extraction
Processing:
- Extract use case type (chatbot, code completion, summarization, etc.)
- Identify user count and scale requirements
- Determine user priorities (accuracy, cost, latency, complexity)
- Capture domain specialization needs (e.g., medical, legal, financial)
Output: Structured intent object with:
- Use case classification
- Expected user scale
- Priority preferences (accuracy vs. cost vs. latency)
- Subject matter domain (optional)
Interface: REST endpoint /api/v1/extract
Notes:
- May require LLM fine-tuning for improved extraction accuracy
- Alternative approaches to LLM-based extraction may be considered
┌───────────────────────────────────────────────────────────────────┐
│ Specification Service │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Traffic Profile │ │ SLO Target │ │ Benchmark │ │
│ │ Generator │ │ Mapper │ │ Selector │ │
│ │ │ │ │ │ │ │
│ │ Use case → │──▶│ Use case → │──▶│ Use case → │ │
│ │ tokens, QPS │ │ TTFT, ITL, E2E │ │ accuracy metrics│ │
│ │ │ │ targets │ │ │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │
│ Structured Intent ───────────────────────▶ Complete Specification│
└───────────────────────────────────────────────────────────────────┘
Purpose: Generate complete deployment specification from user intent
Input: Structured intent from Intent Extraction Service
Processing:
- Traffic Profile Generation: Map use case → computational load shape
- Prompt token length (e.g., 512, 1024, 4096)
- Output token length (e.g., 256, 512, 1024)
- Expected queries per second (QPS) based on user count
- SLO Target Mapping: Assign latency requirements based on use case
- TTFT (Time to First Token) target
- ITL (Inter-Token Latency) target
- E2E (End-to-End) target
- Percentile selection (p90, p95, or p99)
- Throughput (requests per second) requirement
- Accuracy Benchmark Selection: Identify relevant quality metrics based on use case and subject matter
Output: Complete deployment specification including:
- Performance Requirements: TTFT, ITL, E2E latency targets, RPS throughput
- Workload Profile: Mean input/output tokens, distribution characteristics
- Accuracy Benchmarks: Use-case specific quality evaluation criteria (for scoring)
- Quality Thresholds: Minimum accuracy score, maximum cost (for filtering)
- User Priorities: Weights for accuracy, cost, latency, complexity trade-offs
- Future: Power budget, cooling budget thresholds
Interface: Called internally after Intent Extraction; specification displayed in UI Spec Editor for user review and modification
Design Principle: Specifications are always editable - users can adjust SLO targets, priorities, and constraints before proceeding to recommendations
┌───────────────────────────────────────────────────────────────────┐
│ Recommendation Service │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Config Finder │ │ Scorer │ │ Analyzer │ │
│ │ │ │ │ │ │ │
│ │ Query benchmarks│──▶│ Score configs │──▶│ Generate │ │
│ │ Filter by SLOs │ │ on 5 dimensions:│ │ ranked views: │ │
│ │ Calculate │ │ - Accuracy │ │ - Best Accuracy │ │
│ │ replicas │ │ - Cost │ │ - Lowest Cost │ │
│ │ │ │ - Latency │ │ - Lowest Latency│ │
│ │ │ │ - Complexity │ │ - Simplest │ │
│ │ │ │ - Balanced │ │ - Balanced │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │
│ Specification ─────────────────────────────────▶ Ranked Options │
└───────────────────────────────────────────────────────────────────┘
Purpose: Find optimal model + GPU configurations that meet specification requirements
Input: User-approved specification from UI (potentially modified from auto-generated version)
Major Sub-Components:
Role: Query performance database and generate viable configurations
Processing:
- Query performance database for configurations meeting specification
requirements:
- Filter by exact traffic profile match (prompt tokens → output tokens)
- Enforce SLO compliance (TTFT p95 ≤ target, ITL p95 ≤ target, E2E p95 ≤ target)
- Enforce quality thresholds (accuracy ≥ minimum, cost ≤ maximum)
- Apply optional user constraints (included/excluded hardware, preferred models)
- Calculate replicas required to meet throughput:
replicas = ceil((required_QPS × 1.2) / QPS_per_GPU_config)- 20% safety headroom for production stability
- Generate complete solution:
- Hardware type (e.g., NVIDIA L4, A100-80GB, H100)
- Tensor parallelism degree (GPUs per instance)
- Number of replicas (independent instances)
- Expected latency metrics (TTFT, ITL, E2E)
- Predicted throughput (RPS)
Future Enhancement: Use simulator or analytical models to generate options without real benchmark data
Role: Evaluate each configuration on multiple criteria
Four Scoring Dimensions (each 0-100 scale):
-
Accuracy (Primary Factor)
- Use-case specific quality scores from benchmark databases
- Fallback: Parameter count heuristic (7B → 55, 70B → 85, 405B → 95)
- Source: Artificial Analysis benchmarks, weighted by use case
-
Cost (Primary Factor)
- Monthly operational cost (GPU hourly rate × GPU count × 730 hours)
- Non-linear scoring to create meaningful differentiation
- Inverse relationship: lower cost = higher score
-
Latency (Secondary Factor)
- SLO compliance headroom: how much better than target
- Important for user experience but all configs already meet SLOs
- Extra credit for exceeding requirements, but diminishing returns
-
Complexity (Secondary Factor)
- Deployment and operational simplicity
- Function of GPU count, tensor parallelism, replica count
- Fewer resources = higher score (easier to manage)
Future Primary Factors:
- Power requirements (datacenter constraints)
- Cooling requirements (thermal budget)
Note on Scoring Strategy:
- All four factors scored equally (0-100) in current implementation
- Balanced composite uses weighted combination with higher weights on primary factors (accuracy, cost)
- Future may formalize Primary/Secondary distinction with explicit weight tiers
Role: Rank solutions and present trade-offs
Processing:
-
Generate five ranked views of recommendations:
- Best Accuracy: Sorted by quality/capability
- Lowest Cost: Sorted by price efficiency
- Lowest Latency: Sorted by SLO headroom
- Simplest: Sorted by deployment complexity (fewest GPUs)
- Balanced: Sorted by weighted composite score (user priorities)
-
Provide detailed rationale for each recommendation:
- Why this configuration was selected
- Trade-offs vs. alternatives
- SLO compliance margin
Output: Ranked list of deployment recommendations with:
- Model + GPU configuration details
- Predicted performance metrics
- Cost estimates (hourly, monthly)
- Quality/complexity/latency scores
- Trade-off analysis vs. alternatives
Interface: REST endpoint /api/v1/ranked-recommend-from-spec
┌───────────────────────────────────────────────────────────────────┐
│ Configuration Service │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Template │ │ Renderer │ │ Validator │ │
│ │ Library │ │ │ │ │ │
│ │ │ │ │ │ │ │
│ │ KServe, vLLM, │──▶│ Jinja2 template │──▶│ Resource limits,│ │
│ │ HPA, Service- │ │ processing │ │ security checks │ │
│ │ Monitor │ │ │ │ │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │
│ Selected Recommendation ───────────────────────▶ YAML Files │
└───────────────────────────────────────────────────────────────────┘
Purpose: Generate production-ready Kubernetes configuration files
Input: User-selected recommendation from UI
Processing:
- Generate Kubernetes manifests using Jinja2 templates:
- KServe InferenceService YAML
- vLLM runtime configuration
- Horizontal Pod Autoscaler (HPA) policies
- Prometheus ServiceMonitor for observability
- Validate resource requests and security settings
Output:
- Configuration files ready for user to deploy via
kubectl apply -f - Files available for download or copy from UI
Interface: REST endpoint /api/v1/deploy
Initial Target: KServe on Kubernetes Future Options: Direct Linux server deployment, other orchestration platforms
These are critical infrastructure that enable the services but are not domain components:
Purpose: Coordinate multi-service workflows and provide REST API
Technology: FastAPI
Responsibilities:
- Expose HTTP endpoints for UI to call services
- Orchestrate multi-step flows (conversation → recommendation → deployment)
- Manage workflow state (session-based currently, persistent store future)
- Database management endpoints (
/api/v1/db/status,/api/v1/db/upload-benchmarks,/api/v1/db/reset) for remote benchmark data management - Handle authentication and authorization (future)
Purpose: Store performance data, configuration, and deployment history
Technology: Hybrid storage approach
- Embedded Database: Benchmark data and deployment outcomes (currently PostgreSQL, migrating to DuckDB or SQLite for simpler deployment)
- JSON files: Configuration as code (SLO templates, model catalog, hardware profiles)
- Data Management: Benchmark data can be loaded, inspected, and reset via CLI, REST API, or UI Configuration tab
Collections:
- Performance Benchmarks: Performance metrics for (model, GPU, traffic profile) combinations
- Accuracy Benchmarks: Accuracy metrics for multiple use cases
- Use Case SLO Templates: Default targets for 9 standard use cases
- Model Catalog: Curated models with task compatibility metadata
- Hardware Profiles: GPU specifications and pricing
- Deployment Outcomes (future): Actual performance data for continuous learning
1. User Message
↓
2. Intent Extraction Service
- Natural language → structured intent
- Extract: use case, user count, priorities, domain
↓
3. Specification Service
- Intent → traffic profile (prompt/output tokens, QPS)
- Use case → SLO targets (TTFT, ITL, E2E)
- Select accuracy benchmarks
↓
4. UI Specification Editor
- Display auto-generated specification
- User reviews and edits (SLOs, priorities, constraints)
- Obtain approval
↓
5. Recommendation Service
a. Config Finder: Query benchmarks, calculate replicas
b. Scorer: Evaluate on 4 dimensions (accuracy, cost, latency, complexity)
c. Analyzer: Generate 5 ranked views
↓
6. UI Recommendation Viewer
- Display ranked options (Best Accuracy, Lowest Cost, etc.)
- Show trade-off analysis
- User selects preferred configuration
↓
7. Configuration Service
- Generate Kubernetes YAML manifests
- User downloads or copies configuration files
- User deploys to cluster via kubectl apply -f
1. Deployed Service
- Collect actual performance metrics (latency, throughput, traffic patterns)
- Store telemetry data
↓
2. Telemetry UI (Future Component)
- Load and display deployment telemetry
- Identify optimization opportunities
- User triggers re-evaluation
↓
3. Intent Extraction Service (Alternate API)
- Telemetry data → structured intent
- Extract actual traffic patterns, observed latencies, real QPS
↓
4. Specification Service
- Generate specification based on actual observed behavior
- More accurate than initial estimates
↓
5. [Joins Primary Path]
- Recommendation Service → UI Recommendation Viewer → Configuration Service
- User can compare current deployment vs. optimized recommendations
Rationale for Telemetry Path:
- Initial conversational path relies on assumptions (use case, request rate, traffic profile)
- Real deployments may evolve over time (user behavior changes, workload patterns shift)
- Telemetry data provides ground truth for optimization
- Enables continuous improvement: deploy → measure → optimize → redeploy
Design Note: Telemetry path is an alternate entry point into the system. Both paths converge at the Specification Service and share the same recommendation and deployment workflows.
- React Migration: Professional frontend with improved state management, WebSocket-based real-time updates
- Advanced Visualizations:
- Pareto frontier charts for two-dimensional trade-offs
- Multi-dimensional radar charts (>2 dimensions)
- Interactive cost/latency sliders with live recommendation updates
- Telemetry UI: Interface for loading and analyzing actual deployment metrics
- Observability Feedback Loop: Store actual performance → improve future recommendations
- Drift Detection: Alert when deployed service deviates from predicted behavior
- Power/Cooling Budgets: Datacenter constraint modeling (power consumption, thermal limits)
- Statistical Traffic Distributions: Full distributions (mean, variance, percentiles) instead of point estimates
- Advanced SLO Types: Streaming latency, first-chunk time, custom user-perceived metrics
- Simulator-Based Planning: Generate recommendations without real benchmark data (analytical models, discrete event simulation)
- Multi-Model Deployments: Ensemble recommendations, A/B testing configurations, multi-model systems on heterogeneous hardware
- Agentic System Support: Recommendations for agent architectures with multiple coordinated models
- Fine-Tuning Integration: Suggest when custom fine-tuning improves quality vs. larger generic models
- Additional Runtimes: Support for llm-d and other inference engines beyond vLLM
- Multi-Platform Support: Configuration templates for direct Linux server, cloud-managed services, edge deployments
- Advanced Kubernetes Features: Templates for custom operators, advanced autoscaling (KEDA), multi-cluster configurations
- External Access Patterns: Ingress/Route generation, TLS certificate configuration
- Bring Your Own Data: Allow users to upload custom performance and/or accuracy benchmarks to support new models, hardware, and traffic profiles
NeuralNav is designed to support pluggable components, allowing customization for different environments and requirements.
- Support multiple LLM backends for intent extraction and other LLM tasks
- Options: Local models (Ollama), cloud APIs (OpenAI, Anthropic), user-provided endpoints
- Standard interface: OpenAI-compatible API for interoperability
- Model Catalog: User-provided or curated model lists
- Benchmarks: Custom performance and accuracy data (see "Bring Your Own Data" above)
- SLO Templates: Custom use case definitions with organization-specific defaults
- GPU specifications, pricing, power consumption, and cooling requirements
- User-provided data for proprietary or on-premises hardware
- Integration with external data sources for cloud pricing updates
- Conversational-First: All complexity hidden behind natural language (no forms or wizards)
- Specification Editability: Users can review and modify auto-generated specs before committing
- Multi-Criteria Optimization: Balanced recommendations across accuracy, cost, latency, and complexity
- SLO-Driven Planning: Recommendations guaranteed to meet user's latency and throughput requirements
- Explainability: Clear rationale for why each recommendation was made, trade-off analysis
- Threshold-Based Filtering: Editable accuracy and cost thresholds ensure all options meet minimum quality and budget requirements
- Ready-to-Deploy Configurations: Generate complete configuration files, minimizing manual YAML editing
| Term | Definition |
|---|---|
| E2E | End-to-End latency - total time from request submission to complete response |
| HPA | Horizontal Pod Autoscaler - Kubernetes component that scales pods based on metrics |
| InferenceService | KServe custom resource for deploying ML models on Kubernetes |
| ITL | Inter-Token Latency - time between successive tokens in streaming response |
| KServe | Kubernetes-native platform for serving ML models |
| p90/p95/p99 | Percentile thresholds - the value below which 90%/95%/99% of observations fall (selectable in specification) |
| QPS | Queries Per Second - request throughput rate |
| RPS | Requests Per Second - synonymous with QPS |
| SLO | Service Level Objective - target performance threshold (e.g., "TTFT p95 < 200ms") |
| Tensor Parallelism | Splitting a model across multiple GPUs to handle larger models or increase throughput |
| Traffic Profile | Expected workload shape: input token count, output token count, request rate |
| TTFT | Time to First Token - latency from request to first token of streaming response |
| vLLM | High-performance LLM inference engine with PagedAttention |
This architecture document describes the production vision for NeuralNav. The current POC implementation has successfully validated the core workflow and demonstrates the viability of the conversational approach to LLM deployment planning.
The separation of Intent Extraction Service and Specification Service reflects the logical architecture - these are distinct conceptual steps with different responsibilities. Current implementation combines them for simplicity, but the interface boundaries are preserved to enable future separation if needed.
Detailed implementation specifics and data schemas will be similar to those described in ARCHITECTURE.md, but some changes may be made and described in future versions of this document.
The following areas are not addressed in this version but may be considered for future iterations:
- Security - Authentication, authorization, and data privacy considerations
- Error Handling - Failure modes and recovery strategies (e.g., LLM extraction failures, no matching configurations)
- NeuralNav Deployment - How NeuralNav itself is deployed and scaled
- Multi-Tenancy - User/organization isolation and separate knowledge bases
- Data Versioning - Strategy for updating model catalogs, benchmarks, and SLO templates over time
- Non-Functional Requirements - System response times, availability targets, and operational characteristics