Confidently navigate LLM deployments from concept to production.
The system addresses a critical challenge: how do you translate business requirements into the right model and infrastructure choices without expensive trial-and-error?
NeuralNav guides you from concept to production LLM deployments through SLO-driven capacity planning. Conversationally define your requirements—NeuralNav translates them into traffic profiles, performance targets, and cost constraints. Get intelligent model and GPU recommendations based on real benchmarks. Explore alternatives, compare tradeoffs, deploy with one click, and monitor actual performance—staying on course as your needs evolve.
The code in this repository implements the NeuralNav Phase 2 MVP with production-grade data management. Phase 1 (POC) demonstrated the end-to-end workflow with synthetic data. Phase 2 adds PostgreSQL for benchmark storage, a traffic profile framework aligned with GuideLLM standards, experience-driven SLO mapping, and p95 percentile targets for conservative guarantees.
- 🗣️ Conversational Requirements Gathering - Describe your use case in natural language
- 📊 SLO-Driven Capacity Planning - Translate business needs into technical specifications (traffic profiles, latency targets, cost constraints)
- 🎯 Intelligent Recommendations - Get optimal model + GPU configurations backed by real benchmark data
- 🔍 What-If Analysis - Explore alternatives and compare cost vs. latency tradeoffs
- ⚡ One-Click Deployment - Generate production-ready KServe/vLLM YAML and deploy to Kubernetes
- 📈 Performance Monitoring - Track actual deployment status and test inference in real-time
- 💻 GPU-Free Development - vLLM simulator enables local testing without GPU hardware
- Extract Intent - LLM-powered analysis converts your description into structured requirements
- Map to Traffic Profile - Match use case to one of 4 GuideLLM benchmark configurations
- Set SLO Targets - Auto-generate TTFT (p95), ITL (p95), and E2E (p95) targets based on experience class
- Query Benchmarks - Exact match on (model, GPU, traffic profile) from PostgreSQL database
- Filter by SLOs - Find configurations meeting all p95 latency targets
- Plan Capacity - Calculate required replicas based on throughput requirements
- Generate & Deploy - Create validated Kubernetes YAML and deploy to local or production clusters
- Monitor & Validate - Track deployment status and test inference endpoints
Required before running make setup:
- macOS or Linux (Windows via WSL2)
- Docker Desktop (must be running)
- Python 3.13 -
brew install python@3.13 - uv -
curl -LsSf https://astral.sh/uv/install.sh | sh - Ollama -
brew install ollama - kubectl -
brew install kubectl - KIND -
brew install kind
Get up and running in 4 commands:
make setup # Install dependencies, pull Ollama model
make start # Start all services (DB + Ollama + Backend + UI)
make db-load-blis # Load BLIS benchmark data
make cluster-start # Optional: Create local KIND cluster with vLLM simulator for testing deploymentsThen open http://localhost:8501 in your browser.
Note: PostgreSQL runs as a Docker container (neuralnav-postgres) with benchmark data. All db-load-* commands append to existing data. Use make db-reset first for a clean database. Benchmark data can also be uploaded and managed via the UI's Configuration tab or the REST API (/api/v1/db/upload-benchmarks).
Stop everything:
make stop # Stop Backend + UI (leaves Ollama and DB running)
make stop-all # Stop all services including Ollama and DB
make cluster-stop # Delete cluster (optional)- Describe your use case in the chat interface
- Example: "I need a customer service chatbot for 5000 users with low latency"
- Analyze use case - Click "Analyze Use Case" to extract intent
- Generate specification - Click "Generate Specification" to create traffic profile and SLO targets
- Review specification - Edit SLO targets, priorities, or constraints if needed
- Generate recommendations - Click "Generate Recommendations" to find optimal configurations
- Select a recommendation - Review ranked options and click "Select"
- Deploy - Go to the "Deployment" tab to review, copy, or download generated deployment files
See docs/ARCHITECTURE.md for detailed system design.
- ✅ Foundation: Project structure, synthetic data, LLM client (Ollama)
- ✅ Core Recommendation Engine: Intent extraction, traffic profiling, model recommendation, capacity planning
- ✅ FastAPI Backend: REST endpoints, orchestration workflow, knowledge base access
- ✅ Streamlit UI: Chat interface, recommendation display, specification editor
- ✅ Deployment Automation: YAML generation (KServe/vLLM/HPA/ServiceMonitor), Kubernetes deployment
- ✅ Local Kubernetes: KIND cluster support, KServe installation, cluster management
- ✅ vLLM Simulator: GPU-free development mode with realistic latency simulation
- ✅ Monitoring & Testing: Real-time deployment status, inference testing UI, cluster observability
- ✅ Database Management: Upload/reset benchmark data via REST API or UI Configuration tab (no shell access needed)
| Component | Technology |
|---|---|
| Backend | FastAPI, Pydantic |
| Frontend | Streamlit |
| LLM | Ollama (qwen2.5:7b) |
| Data | PostgreSQL |
| YAML Generation | Jinja2 templates |
| Kubernetes | KIND (local), KServe v0.14.0 |
| Deployment | kubectl |
make help # Show all available commands
make start # Start all services (DB + Ollama + Backend + UI)
make stop # Stop Backend + UI (leaves Ollama and DB running)
make stop-all # Stop everything including Ollama and DB
make restart # Restart all services
make logs-backend # Show backend logs
make logs-ui # Show UI logs
# Database (PostgreSQL)
make db-start # Start PostgreSQL (initializes schema on first run)
make db-load-blis # Load BLIS benchmark data (appends)
make db-load-estimated # Load estimated performance data (appends)
make db-load-interpolated # Load interpolated benchmark data (appends)
make db-load-guidellm # Load benchmark data created with GuideLLM (not included in repo yet) (appends)
make db-reset # Reset database (remove all data and reinitialize)
make db-shell # Open PostgreSQL shell
make db-query-models # Query available models in database
make db-query-traffic # Query traffic patterns in database
# Kubernetes
make cluster-status # Check Kubernetes cluster status
make clean-deployments # Delete all InferenceServices
# Testing
make test # Run all tests (requires DB and Ollama)
make test-unit # Run unit tests (no external dependencies)
make test-db # Run database tests (requires PostgreSQL)
make test-integration # Run integration tests (requires Ollama and DB)
make clean # Remove generated filesNeuralNav includes a GPU-free simulator for local development:
- No GPU required - Run deployments on any laptop
- OpenAI-compatible API -
/v1/completionsand/v1/chat/completions - Realistic latency - Uses benchmark data to simulate TTFT/ITL
- Fast deployment - Pods become Ready in ~10-15 seconds
The deployment mode defaults to production (real vLLM with GPUs). Switch between production and simulator modes at runtime using the Configuration tab in the UI, or via the REST API:
GET /api/v1/deployment-mode- Check current modePUT /api/v1/deployment-mode- Set mode ({"mode": "simulator"}or{"mode": "production"})
See docs/DEVELOPER_GUIDE.md for details.
- Developer Guide - Development workflows, testing, debugging
- Architecture - Detailed system design and component specifications
- Traffic and SLOs - Traffic profile framework and experience-driven SLOs (Phase 2)
- Logging Guide - Logging system and debugging
- Claude Code Guidance - AI assistant instructions for contributors
Phase 2 MVP improvements (now complete):
- ✅ PostgreSQL Database - Production-grade benchmark storage with psycopg2
- ✅ Traffic Profile Framework - 4 GuideLLM standard configurations: (512→256), (1024→1024), (4096→512), (10240→1536)
- ✅ Experience-Driven SLOs - 9 use cases mapped to 5 experience classes (instant, conversational, interactive, deferred, batch)
- ✅ p95 Percentiles - More conservative SLO guarantees (changed from p90)
- ✅ ITL Terminology - Inter-Token Latency instead of TPOT (Time Per Output Token)
- ✅ Exact Traffic Matching - No fuzzy matching, exact (prompt_tokens, output_tokens) queries
- ✅ Pre-calculated E2E - E2E latency stored in benchmarks for accuracy
- ✅ Enhanced SLO Filtering - Find configurations meeting all p95 targets
- Production-Grade Ingress - External access with TLS, authentication, rate limiting
- Production GPU Validation - End-to-end testing with real GPU clusters
- Feedback Loop - Actual metrics → benchmark updates
- Statistical Traffic Models - Full distributions (not just point estimates)
- Multi-Dimensional Benchmarks - Concurrency, batching, KV cache effects
- Security Hardening - YAML validation, RBAC, network policies
- Multi-Tenancy - Namespaces, resource quotas, isolation
- Advanced Simulation - SimPy, Monte Carlo for what-if analysis
We are early in development of this project, but contributions are welcome.
See CLAUDE.md for AI assistant guidance when making changes.
This project is licensed under Apache License 2.0. See the LICENSE file for details.