This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
This repository contains the architecture design for NeuralNav, an open-source system that guides users from concept to production-ready LLM deployments through conversational AI and intelligent capacity planning.
Key Principle: The core functionality is complete and working end-to-end. The project is preparing for release.
- docs/ARCHITECTURE.md: Comprehensive system architecture document
  - 9 core components with technology recommendations
  - Enhanced data schemas for SLO-driven deployment planning
  - Phase 1 (3-month) vs Phase 2+ implementation strategy
  - Knowledge Base schemas with 7 data collections
- docs/architecture-diagram.md: Visual architecture representations
  - Mermaid component diagrams
  - Sequence diagrams showing end-to-end flows
  - State machine for workflow orchestration
  - Entity-relationship diagrams for data models
- src/neuralnav/: Python package (PyPA src layout)
  - api/: FastAPI REST API layer
    - app.py: FastAPI app factory
    - dependencies.py: Singleton dependency injection
    - routes/: Modular endpoint handlers (health, intent, specification, recommendation, configuration, reference_data, database)
  - intent_extraction/: Intent Extraction Service
    - extractor.py: LLM-powered intent extraction from natural language
    - service.py: IntentExtractionService facade
  - specification/: Specification Service
    - traffic_profile.py: Traffic profile and SLO target generation
    - service.py: SpecificationService facade
  - recommendation/: Recommendation Service
    - config_finder.py: GPU capacity planning with SLO filtering
    - scorer.py: 4-dimension scoring (accuracy, price, latency, complexity)
    - analyzer.py: 5 ranked list generation
    - service.py: RecommendationService facade
    - quality/: Use-case quality scoring (Artificial Analysis benchmarks)
  - configuration/: Configuration Service
    - generator.py: Jinja2 YAML generation for KServe/vLLM
    - validator.py: YAML validation
    - service.py: ConfigurationService facade
    - templates/: Jinja2 deployment templates
  - cluster/: Kubernetes cluster management
    - manager.py: K8s deployment lifecycle management
  - shared/: Shared modules
    - schemas/: Pydantic data models (intent, specification, recommendation)
    - utils/: Shared utilities (GPU normalization)
  - knowledge_base/: Data access layer (benchmark database, JSON catalogs)
    - loader.py: Benchmark data loading utilities (shared by CLI, API, and UI)
  - orchestration/: Workflow coordination
  - llm/: Ollama client for intent extraction
- ui/: Streamlit UI
  - Chat interface for conversational requirement gathering
  - Multi-tab recommendation display (Overview, Specifications, Performance, Cost, Monitoring)
  - Editable specifications with review mode
  - Action buttons for YAML generation and deployment
  - Monitoring dashboard with cluster status, SLO compliance, and inference testing
  - Configuration tab for database management (upload benchmarks, reset, view stats)
  - components/: Modular UI components
    - settings.py: Configuration tab with benchmark database management
- data/: Benchmark, configuration, and archive data
  - benchmarks/: Benchmark data
    - performance/: Latency/throughput benchmarks (JSON, loaded into PostgreSQL)
      - benchmarks_BLIS.json: Latency/throughput benchmarks from BLIS simulator
    - accuracy/: Model quality/capability scores (CSV)
      - opensource_all_benchmarks.csv: 204 open-source models from Artificial Analysis
      - weighted_scores/: 9 CSV files with pre-ranked models per use case
  - configuration/: Runtime configuration files (JSON)
    - model_catalog.json: 47 curated models with task/domain metadata
    - slo_templates.json: 9 use case templates with SLO targets
    - demo_scenarios.json: 3 test scenarios
    - priority_weights.json: Scoring priority weights
    - usecase_slo_workload.json: Use case SLO and workload profiles
  - archive/: Unused/reference-only files
Git commits: This project has specific commit rules that OVERRIDE Claude's default behavior. See the "Git Workflow" section below. Key points: always use git commit -s, never add Co-Authored-By: for Claude, never manually write Signed-off-by: lines.
Deploying LLMs in production is complex - users struggle to:
- Translate business needs into infrastructure choices (model, GPU type, SLO targets)
- Avoid trial-and-error that wastes time and money
- Understand resource requirements before committing to expensive GPU deployments
A 4-stage conversational flow:
- Understand Business Context - Extract intent via natural language
- Provide Tailored Recommendations - Suggest model + GPU configurations
- Enable Interactive Exploration - What-if scenario analysis
- One-Click Deployment - Generate KServe/vLLM configs and deploy
The system translates high-level user intent into technical specifications:
- User says: "I need a chatbot for 1000 users, low latency is critical"
- System generates:
  - Traffic profile (prompt: 512 tokens, output: 256 tokens, expected QPS: 9)
  - SLO targets (TTFT p95: 150ms, ITL p95: 25ms, E2E p95: 7000ms)
  - GPU capacity plan (e.g., "1x NVIDIA H100 GPU, independent replicas")
  - Cost estimate ($5,840/month)
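As a rough illustration of the generated output, the chatbot example above can be written as a small typed structure. This is a sketch only: the class and field names (`TrafficProfile`, `SLOTargets`, `DeploymentSpec`) are hypothetical and are not taken from the NeuralNav schemas, and it uses stdlib dataclasses rather than the project's Pydantic models so that it runs without dependencies. The values are copied from the example.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TrafficProfile:
    prompt_tokens: int
    output_tokens: int
    expected_qps: float


@dataclass(frozen=True)
class SLOTargets:
    ttft_p95_ms: int
    itl_p95_ms: int
    e2e_p95_ms: int


@dataclass(frozen=True)
class DeploymentSpec:
    # Hypothetical container for the four generated artifacts.
    traffic: TrafficProfile
    slo: SLOTargets
    gpu_plan: str
    monthly_cost_usd: int


# Values copied from the "chatbot for 1000 users" example.
chatbot_spec = DeploymentSpec(
    traffic=TrafficProfile(prompt_tokens=512, output_tokens=256, expected_qps=9),
    slo=SLOTargets(ttft_p95_ms=150, itl_p95_ms=25, e2e_p95_ms=7000),
    gpu_plan="1x NVIDIA H100",
    monthly_cost_usd=5840,
)

print(asdict(chatbot_spec))
```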
NeuralNav is structured as a layered architecture:
UI Layer (Horizontal - Presentation):
- Conversational Interface, Specification Editor, Recommendation Visualizer, Monitoring Dashboard
- Technology: Streamlit (current) → React (future)
Core Engines (Vertical - Backend Services):
- Intent & Specification Engine - Transform conversation into complete deployment spec
  - LLM-powered intent extraction (Ollama qwen2.5:7b)
  - Use case → traffic profile mapping (4 GuideLLM standards)
  - SLO template lookup and specification generation
- Recommendation Engine - Find optimal model + GPU configurations
  - Multi-criteria scoring (accuracy, price, latency, complexity)
  - Capacity planning (GPU count, deployment topology)
  - SLO compliance filtering with near-miss tolerance
  - Ranked lists generation (5 views: best accuracy, lowest cost, etc.)
- Deployment Engine - Generate and deploy Kubernetes configs
  - YAML generation (Jinja2 templates)
  - K8s deployment lifecycle management
- Observability Engine - Monitor deployed services
  - Health monitoring and inference testing (current)
  - Performance tracking and feedback loop (future)
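The "SLO compliance filtering with near-miss tolerance" step in the Recommendation Engine can be pictured as follows. This is a minimal sketch under stated assumptions, not the logic in config_finder.py: the `Benchmark` class, the `slo_status` function, the 10% tolerance, and the sample rows are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Benchmark:
    # One row keyed by (model, gpu, tensor_parallel), with p95 latencies in ms.
    model: str
    gpu: str
    tensor_parallel: int
    ttft_p95_ms: float
    itl_p95_ms: float
    e2e_p95_ms: float


def slo_status(bench, ttft_ms, itl_ms, e2e_ms, tolerance=0.10):
    """Return 'pass', 'near_miss', or 'fail' for one benchmark row.

    The tolerance (10% here) is an assumed value for illustration.
    """
    pairs = [
        (bench.ttft_p95_ms, ttft_ms),
        (bench.itl_p95_ms, itl_ms),
        (bench.e2e_p95_ms, e2e_ms),
    ]
    if all(measured <= target for measured, target in pairs):
        return "pass"
    if all(measured <= target * (1 + tolerance) for measured, target in pairs):
        return "near_miss"
    return "fail"


# Hypothetical rows checked against the chatbot SLO targets (150 / 25 / 7000 ms).
rows = [
    Benchmark("model-a", "H100", 1, 120, 20, 6000),
    Benchmark("model-b", "L4", 2, 160, 24, 6900),
    Benchmark("model-c", "L4", 1, 200, 40, 9000),
]
statuses = [slo_status(r, ttft_ms=150, itl_ms=25, e2e_ms=7000) for r in rows]
print(statuses)
```

Keeping near misses as a separate status, rather than dropping them, lets a caller still show configurations that narrowly exceed a target.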
Infrastructure (Not numbered as core engines):
- API Gateway (FastAPI) - Coordinates workflow between UI and engines
- Knowledge Base (Data Layer) - Hybrid storage:
  - PostgreSQL: Benchmarks, deployment outcomes
  - JSON files: SLO templates, model catalog, hardware profiles
Development Tools:
- vLLM Simulator - GPU-free development and testing
- Model Benchmarks (PostgreSQL): TTFT/ITL/E2E/throughput benchmarks for (model, GPU, tensor_parallel) combinations (source: BLIS simulator)
- Use Case SLO Templates (JSON): 9 use cases mapped to 4 GuideLLM traffic profiles with experience-driven SLO targets
- Model Catalog (JSON): 47 curated, approved models with task/domain metadata
- Model Quality Scores (CSV): Use-case specific scores from Artificial Analysis benchmarks (204 models)
- Use Case Configs (JSON): Benchmark weights, SLO targets, and workload profiles per use case
- Deployment Outcomes (PostgreSQL, future): Actual performance data for feedback loop
The recommendation engine uses multi-criteria scoring to rank configurations:
4 Scoring Dimensions (each 0-100 scale):
- Accuracy/Quality: Use-case specific model capability from Artificial Analysis benchmarks
  - Source: data/benchmarks/accuracy/weighted_scores/*.csv
  - Fallback: Parameter count heuristic if model not in benchmark data
- Price: Cost efficiency (inverse of monthly cost, normalized)
- Latency: SLO compliance and headroom from performance benchmark database
- Complexity: Deployment simplicity (fewer GPUs = higher score)
Default Weights: 40% accuracy, 40% price, 10% latency, 10% complexity
5 Ranked Views:
- best_accuracy: Sorted by model capability
- lowest_cost: Sorted by price efficiency
- lowest_latency: Sorted by SLO headroom
- simplest: Sorted by deployment complexity
- balanced: Sorted by weighted composite score
Key Files:
- src/neuralnav/recommendation/scorer.py - Calculates 4 scores
- src/neuralnav/recommendation/quality/usecase_scorer.py - Artificial Analysis benchmark scoring
- src/neuralnav/recommendation/analyzer.py - Generates 5 ranked lists
- src/neuralnav/recommendation/config_finder.py - Orchestrates scoring during capacity planning
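To make the weighting concrete, here is a minimal sketch of how a composite score and the five ranked views could be computed from the four 0-100 dimension scores. It assumes a plain weighted sum with the default weights listed above. The function name `balanced_score` and the two sample configurations are hypothetical, and the real scorer.py and analyzer.py may differ.

```python
# Default weights from this document: 40% accuracy, 40% price, 10% latency, 10% complexity.
DEFAULT_WEIGHTS = {"accuracy": 0.40, "price": 0.40, "latency": 0.10, "complexity": 0.10}


def balanced_score(scores, weights=DEFAULT_WEIGHTS):
    """Weighted sum of the four dimension scores (each on a 0-100 scale)."""
    return sum(weights[dim] * scores[dim] for dim in weights)


# Hypothetical candidates with made-up scores, for illustration only.
candidates = {
    "config_a": {"accuracy": 90, "price": 50, "latency": 80, "complexity": 100},
    "config_b": {"accuracy": 70, "price": 90, "latency": 60, "complexity": 50},
}


def rank_by(dimension):
    """Names sorted by a single dimension score, best first."""
    return sorted(candidates, key=lambda name: candidates[name][dimension], reverse=True)


ranked_views = {
    "best_accuracy": rank_by("accuracy"),
    "lowest_cost": rank_by("price"),
    "lowest_latency": rank_by("latency"),
    "simplest": rank_by("complexity"),
    "balanced": sorted(
        candidates, key=lambda name: balanced_score(candidates[name]), reverse=True
    ),
}

for view, names in ranked_views.items():
    print(view, names)
```

With these sample numbers, config_a leads on accuracy while config_b leads the balanced view, which shows how the 40% price weight can outweigh an accuracy advantage.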
Requirements: Python 3.11+ (3.13 recommended on macOS), uv, Docker or Podman, kubectl, kind, ollama.
This project uses uv (by Astral) for Python package management. Do not use pip or pip install.
- Install dependencies: uv sync --extra ui --extra dev (reads from pyproject.toml + uv.lock)
- Run Python commands: uv run python ... (not bare python)
- Run tools: uv run pytest, uv run ruff, uv run uvicorn, etc.
- Add a dependency: uv add <package> (updates pyproject.toml and uv.lock)
- Source of truth: pyproject.toml defines all dependencies; there is no top-level requirements.txt
Note: ui/requirements.txt and simulator/requirements.txt exist separately for their Docker builds.
make setup # Full setup (prereqs + backend + UI + Ollama)
make setup-backend # Python env only (uv sync --extra ui --extra dev)
make start # Start all (DB + Ollama + Backend + UI)
make stop # Stop Backend + UI (leaves DB and Ollama running)
make stop-all # Stop everything
make health # Check all service health
Service URLs: UI http://localhost:8501, Backend http://localhost:8000 (Swagger at /docs), Ollama http://localhost:11434, DB postgresql://postgres:neuralnav@localhost:5432/neuralnav
make test-unit # Unit tests only (no DB or Ollama needed)
make test-db # Database tests (requires PostgreSQL with data)
make test-integration # Integration tests (requires Ollama + DB)
make test # All tests
# Run a single test file or test function:
cd src && uv run pytest ../tests/path/to/test_file.py -v
cd src && uv run pytest ../tests/path/to/test_file.py::test_function_name -v
Test markers: @pytest.mark.unit, @pytest.mark.database, @pytest.mark.integration. Tests run from the src/ directory (cd src && uv run pytest ../tests/).
make lint # Ruff linter (src/ and ui/)
make format # Ruff auto-format
make typecheck # Mypy type checking (src/ and ui/)
CI runs on PRs to main: ruff check + format check on src/ and tests/, mypy on src/, unit tests on Python 3.11 and 3.12 with coverage. All must pass.
make db-start # Start PostgreSQL container (auto-creates schema)
make db-stop # Stop PostgreSQL
make db-reset # Remove and reinitialize
make db-load-blis # Load BLIS benchmark data
make db-load-estimated # Load estimated performance data
make db-shell # Open psql shell
make docker-up # Start all services via Docker Compose
make docker-down # Stop all
make docker-up-dev # Development mode with live reload
make cluster-start # Create KIND cluster + load simulator image
make cluster-stop # Delete cluster
make cluster-status # Show status
make clean-deployments # Delete all InferenceServices
make build-backend # Build backend Docker image
make build-simulator # Build vLLM simulator image
Container runtime auto-detects Docker or Podman. Override with CONTAINER_TOOL=podman make ....
- docs/ARCHITECTURE.md and docs/architecture-diagram.md must stay synchronized:
  - If you change component descriptions in ARCHITECTURE.md, update architecture-diagram.md diagrams
  - If you add/remove components, update both files
  - Components are referenced by name (not numbered) for clarity and flexibility
- Phase 1 uses Python for all components (rapid development, stack consistency)
  - Go migration for Deployment Automation Engine is a possible future option (see Possible Future Enhancements in ARCHITECTURE.md)
- Phase 1 uses point estimates for traffic (avg prompt length, avg QPS)
  - Benchmarks collected using vLLM default configuration (dynamic batching enabled)
  - Phase 2 adds full statistical distributions (mean, variance, tail) and multi-dimensional benchmarks
- SLO metrics use p95 percentiles (Phase 2):
  - TTFT (Time to First Token): p95 - pre-calculated in benchmarks
  - ITL (Inter-Token Latency): p95 - pre-calculated in benchmarks (replaces TPOT terminology)
  - E2E Latency: p95 - pre-calculated in benchmarks from actual measurements
  - Throughput: requests/sec and tokens/sec
  - Rationale:
    - p95 is more conservative than p90, providing better UX guarantees
    - E2E latency is measured directly from benchmarks under realistic load conditions
    - Benchmarks are organized around 4 GuideLLM traffic profiles for exact matching
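For readers less familiar with percentile metrics, the sketch below shows why p95 is the more conservative target. It uses the nearest-rank definition of a percentile, which is one common convention and is an assumption here. The benchmark data in this project is pre-calculated, so this is not how NeuralNav derives its numbers, and the latency samples are synthetic.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Synthetic TTFT samples in milliseconds: 100, 101, ..., 119.
ttft_ms = [100 + i for i in range(20)]

p90 = percentile(ttft_ms, 90)
p95 = percentile(ttft_ms, 95)
print(f"p90={p90}ms p95={p95}ms")
```

Because p95 is never lower than p90 on the same samples, a configuration that meets a p95 target also meets the same target at p90, but not the other way around.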
- Editable specifications: Users must be able to review and modify auto-generated specs before deployment
- Feedback loop: Actual deployment outcomes feed back into Knowledge Base to improve future recommendations
- Use "NeuralNav" as the project name
- Use "TTFT" for Time to First Token (not "time-to-first-token")
- Use "ITL" for Inter-Token Latency (Phase 2 terminology, replaces TPOT)
- Use "SLO" for Service Level Objective
- Use "E2E" for End-to-End latency
- Use "p95" for 95th percentile metrics (Phase 2 standard, more conservative than p90)
- GPU configurations: "2x NVIDIA L4" or "4x A100-80GB" (not "2 L4s")
All API endpoints must follow these rules:
- Prefix: Every route file uses APIRouter(prefix="/api/v1"). Individual route decorators use relative paths (e.g., @router.post("/recommend")), not full paths.
- Health check exception: /health stays at root with no prefix (standard for load balancer probes). This is the only endpoint outside /api/v1/.
- Versioning: All endpoints are under /api/v1/. When a v2 is needed, add new route files with prefix="/api/v2".
- Naming: Use kebab-case for multi-word paths (e.g., /deploy-to-cluster, /ranked-recommend-from-spec).
- When adding a new route file: Set prefix="/api/v1" on the APIRouter and use relative paths in all decorators. Register the router in src/neuralnav/api/routes/__init__.py and include it in src/neuralnav/api/app.py.
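The path rules above are mechanical enough to check in code. The helper below is a hypothetical sketch and not part of the repository: `check_route_path` takes the string passed to a route decorator and returns any violations of the relative-path and kebab-case conventions. It deliberately uses only the standard library and does not import FastAPI.

```python
import re

# One kebab-case path segment: lowercase alphanumerics joined by single hyphens.
_KEBAB_SEGMENT = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


def check_route_path(path):
    """Return a list of convention violations for a decorator path (empty if none)."""
    if path == "/health":
        return []  # the one endpoint allowed at the root, outside /api/v1/
    problems = []
    if not path.startswith("/"):
        problems.append("path must start with '/'")
    if path.startswith("/api/"):
        problems.append("use a relative path; the router supplies the /api/v1 prefix")
    for segment in path.strip("/").split("/"):
        if segment.startswith("{") and segment.endswith("}"):
            continue  # path parameter such as {name}; not checked here
        if not _KEBAB_SEGMENT.match(segment):
            problems.append(f"segment '{segment}' is not kebab-case")
    return problems


for candidate in ["/deploy-to-cluster", "/api/v1/recommend", "/deployToCluster"]:
    print(candidate, check_route_path(candidate))
```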
Adding a new use case template:
- Add corresponding entry to data/configuration/slo_templates.json
- Create weighted scores CSV in data/benchmarks/accuracy/weighted_scores/
- Add use case to UseCaseQualityScorer.USE_CASE_FILES in usecase_quality_scorer.py
- Update docs/USE_CASE_METHODOLOGY.md with benchmark weighting rationale
- Update docs/ARCHITECTURE.md if needed
Adding a new SLO metric:
- Update DeploymentIntent schema in Intent & Specification Engine (docs/ARCHITECTURE.md)
- Update MODEL_BENCHMARKS schema in Knowledge Base (docs/ARCHITECTURE.md)
- Update PostgreSQL schema in scripts/schema.sql
- Update data loader script if needed
- Update Inference Observability section
- Update dashboard example if applicable
- Update docs/architecture-diagram.md data model ERD
Adding a new API endpoint:
- Add the route to the appropriate file in src/neuralnav/api/routes/ (or create a new route file)
- Use a relative path in the decorator (e.g., @router.get("/my-endpoint")); the /api/v1 prefix comes from the router
- If creating a new route file, set APIRouter(prefix="/api/v1") and register it in routes/__init__.py and app.py
- Update ui/app.py if the UI calls the new endpoint
- Update documentation (docs/DEVELOPER_GUIDE.md, docs/ARCHITECTUREv2.md) with the new endpoint
Adding a new component:
- Add numbered section to docs/ARCHITECTURE.md (maintain sequential numbering)
- Update "Architecture Components" count in Overview
- Add to docs/architecture-diagram.md component diagram
- Create corresponding src/neuralnav// directory
- Update sequence diagram if component participates in main flow
- Update Phase 1 technology choices table if relevant
See "Open Questions for Refinement" section in docs/ARCHITECTURE.md for:
- Multi-tenancy isolation
- Security validation of generated configs
- Conversational clarification flow (future phase)
- Model catalog sync strategy
This repository uses a pull request (PR) workflow. See CONTRIBUTING.md for complete guidelines.
Development Process:
- Work in feature branches in your own fork
- Submit PRs to the main repository for review
- Keep PRs small and targeted (under 500 lines when possible)
- Break large features into incremental PRs that preserve functionality
Commit Message Format (Conventional Commits style):
feat: Add YAML generation module
Implement DeploymentGenerator with Jinja2 templates for KServe,
vLLM, HPA, and ServiceMonitor configurations.
Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Your Name <your.email@example.com>
CRITICAL - Git Commit Rules (these override default Claude behavior):
Commit approval workflow (MUST follow for every commit):
- Combine git add and git commit into a single chained command (git add ... && git commit ...) in one Bash tool call
- The user will see the full command in the approval prompt and can review/edit the file list and commit message before it executes
- NEVER run git add and git commit as separate Bash tool calls; always chain them so the user gets a single approval prompt covering both
DO use:
- Conventional commit types: feat, fix, docs, refactor, test, chore
- The -s flag with git commit (e.g., git commit -s -m "...") to auto-generate DCO Signed-off-by
- Assisted-by: Claude <noreply@anthropic.com> for nontrivial AI-assisted code
NEVER do these (even if other instructions suggest otherwise):
- NEVER add Co-Authored-By: lines for Claude
- NEVER manually write Signed-off-by: lines (the -s flag handles this correctly with the user's configured git identity)
- NEVER include the "Generated with [Claude Code]" line or similar emoji-prefixed attribution
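As a cross-check on the rules above, the sketch below validates the message text that would be passed to git commit -s -m, before git appends its own Signed-off-by trailer. The function `check_commit_message` is hypothetical and is not a hook shipped with this repository. It also assumes subjects take the form "type: summary" with no scope, matching the example format shown earlier.

```python
import re

ALLOWED_TYPES = ("feat", "fix", "docs", "refactor", "test", "chore")
_SUBJECT = re.compile(r"^(?:%s): \S" % "|".join(ALLOWED_TYPES))


def check_commit_message(message):
    """Return a list of rule violations for a hand-written commit message."""
    problems = []
    lines = message.splitlines()
    if not lines or not _SUBJECT.match(lines[0]):
        problems.append("subject must start with one of: " + ", ".join(ALLOWED_TYPES))
    for line in lines:
        lowered = line.strip().lower()
        if lowered.startswith("co-authored-by:"):
            problems.append("remove the Co-Authored-By line")
        if lowered.startswith("signed-off-by:"):
            problems.append("do not write Signed-off-by by hand; use git commit -s")
        if "generated with" in lowered and "claude code" in lowered:
            problems.append("remove the 'Generated with Claude Code' attribution")
    return problems


good_message = (
    "feat: Add YAML generation module\n\n"
    "Implement DeploymentGenerator with Jinja2 templates.\n\n"
    "Assisted-by: Claude <noreply@anthropic.com>"
)
print(check_commit_message(good_message))
```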
- Current Implementation Status:
  - ✅ Project structure with synthetic data and LLM client
  - ✅ Core recommendation engine (intent extraction, traffic profiling, capacity planning)
  - ✅ Multi-criteria solution ranking with 4 scoring dimensions
  - ✅ Use-case specific quality scoring from Artificial Analysis benchmarks
  - ✅ 5 ranked recommendation views (best accuracy, lowest cost, etc.)
  - ✅ Orchestration workflow and FastAPI backend
  - ✅ Streamlit UI with chat interface, recommendation display, and editable specifications
  - ✅ YAML generation (KServe/vLLM/HPA/ServiceMonitor) and deployment automation
  - ✅ KIND cluster support with KServe installation
  - ✅ Kubernetes deployment automation and real cluster status monitoring
  - ✅ vLLM simulator for GPU-free development
  - ✅ Inference testing UI with end-to-end deployment validation
  - ✅ Database management via REST API and UI Configuration tab
- The Knowledge Base schemas are critical - any implementation must support all collections
- SLO-driven capacity planning is the core differentiator - don't simplify this away
- Use data in data/ directory for POC; production uses PostgreSQL for latency benchmarks
- Benchmarks use vLLM default configuration with dynamic batching (no fixed batch_size)
The system now supports two deployment modes:
Simulator Mode:
- Purpose: GPU-free development and testing on local machines
- Location: simulator/ directory contains the vLLM simulator service
- Docker Image: vllm-simulator:latest (single image for all models)
- Configuration: Set DeploymentGenerator(simulator_mode=True) in src/neuralnav/api/dependencies.py
- Benefits:
  - No GPU hardware required
  - Fast deployment (~10-15 seconds to Ready)
  - Predictable behavior for demos
  - Works on KIND (Kubernetes in Docker)
  - Uses actual benchmark data for realistic latency simulation
Real vLLM Mode:
- Purpose: Actual model inference with GPUs
- Configuration: Set DeploymentGenerator(simulator_mode=False) in src/neuralnav/api/dependencies.py
- Requirements:
  - GPU-enabled Kubernetes cluster
  - NVIDIA GPU Operator installed
  - HuggingFace token secret for model downloads
  - Sufficient GPU resources (based on recommendations)
- Behavior:
  - Downloads actual models from HuggingFace
  - Real GPU inference
  - Production-grade performance
Use Simulator Mode for:
- Local development and testing
- UI/UX iteration
- Workflow validation
- Demos and presentations
- CI/CD testing (no GPU required)
Use Real vLLM Mode for:
- Production deployments
- Performance benchmarking
- Model quality validation
- GPU utilization testing
The deployment template (src/neuralnav/configuration/templates/kserve-inferenceservice.yaml.j2) uses Jinja2 conditionals:
- {% if simulator_mode %} - Uses vllm-simulator:latest, no GPU resources, fast health checks
- {% else %} - Uses vllm/vllm-openai:v0.6.2, requests GPUs, longer health checks
Single codebase supports both modes - just toggle the flag!
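The effect of the flag can be mirrored in plain Python. This is an illustrative sketch of the branch, not the Jinja2 template or the DeploymentGenerator class. The function name `container_settings` and the shape of the returned dictionary are hypothetical, and the health check values are left as the qualitative labels used above because the source gives no concrete timings.

```python
def container_settings(simulator_mode, gpu_count=1):
    """Pick container settings the way the template's if/else branch is described."""
    if simulator_mode:
        return {
            "image": "vllm-simulator:latest",
            "gpu_count": 0,  # simulator requests no GPU resources
            "health_checks": "fast",
        }
    return {
        "image": "vllm/vllm-openai:v0.6.2",
        "gpu_count": gpu_count,  # taken from the recommendation
        "health_checks": "longer",
    }


print(container_settings(simulator_mode=True))
print(container_settings(simulator_mode=False, gpu_count=2))
```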