CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

This repository contains the architecture design for NeuralNav, an open-source system that guides users from concept to production-ready LLM deployments through conversational AI and intelligent capacity planning.

Key Principle: The core functionality is complete and working end-to-end. The project is preparing for release.

Repository Structure

  • docs/ARCHITECTURE.md: Comprehensive system architecture document

    • 9 core components with technology recommendations
    • Enhanced data schemas for SLO-driven deployment planning
    • Phase 1 (3-month) vs Phase 2+ implementation strategy
    • Knowledge Base schemas with 7 data collections
  • docs/architecture-diagram.md: Visual architecture representations

    • Mermaid component diagrams
    • Sequence diagrams showing end-to-end flows
    • State machine for workflow orchestration
    • Entity-relationship diagrams for data models
  • src/neuralnav/: Python package (PyPA src layout)

    • api/: FastAPI REST API layer
      • app.py: FastAPI app factory
      • dependencies.py: Singleton dependency injection
      • routes/: Modular endpoint handlers (health, intent, specification, recommendation, configuration, reference_data, database)
    • intent_extraction/: Intent Extraction Service
      • extractor.py: LLM-powered intent extraction from natural language
      • service.py: IntentExtractionService facade
    • specification/: Specification Service
      • traffic_profile.py: Traffic profile and SLO target generation
      • service.py: SpecificationService facade
    • recommendation/: Recommendation Service
      • config_finder.py: GPU capacity planning with SLO filtering
      • scorer.py: 4-dimension scoring (accuracy, price, latency, complexity)
      • analyzer.py: 5 ranked list generation
      • service.py: RecommendationService facade
      • quality/: Use-case quality scoring (Artificial Analysis benchmarks)
    • configuration/: Configuration Service
      • generator.py: Jinja2 YAML generation for KServe/vLLM
      • validator.py: YAML validation
      • service.py: ConfigurationService facade
      • templates/: Jinja2 deployment templates
    • cluster/: Kubernetes cluster management
      • manager.py: K8s deployment lifecycle management
    • shared/: Shared modules
      • schemas/: Pydantic data models (intent, specification, recommendation)
      • utils/: Shared utilities (GPU normalization)
    • knowledge_base/: Data access layer (benchmark database, JSON catalogs)
      • loader.py: Benchmark data loading utilities (shared by CLI, API, and UI)
    • orchestration/: Workflow coordination
    • llm/: Ollama client for intent extraction
  • ui/: Streamlit UI

    • Chat interface for conversational requirement gathering
    • Multi-tab recommendation display (Overview, Specifications, Performance, Cost, Monitoring)
    • Editable specifications with review mode
    • Action buttons for YAML generation and deployment
    • Monitoring dashboard with cluster status, SLO compliance, and inference testing
    • Configuration tab for database management (upload benchmarks, reset, view stats)
    • components/: Modular UI components
      • settings.py: Configuration tab with benchmark database management
  • data/: Benchmark, configuration, and archive data

    • benchmarks/: Benchmark data
      • performance/: Latency/throughput benchmarks (JSON, loaded into PostgreSQL)
        • benchmarks_BLIS.json: Latency/throughput benchmarks from BLIS simulator
      • accuracy/: Model quality/capability scores (CSV)
        • opensource_all_benchmarks.csv: 204 open-source models from Artificial Analysis
        • weighted_scores/: 9 CSV files with pre-ranked models per use case
    • configuration/: Runtime configuration files (JSON)
      • model_catalog.json: 47 curated models with task/domain metadata
      • slo_templates.json: 9 use case templates with SLO targets
      • demo_scenarios.json: 3 test scenarios
      • priority_weights.json: Scoring priority weights
      • usecase_slo_workload.json: Use case SLO and workload profiles
    • archive/: Unused/reference-only files

Important Behavioral Notes for Claude

Git commits: This project has specific commit rules that OVERRIDE Claude's default behavior. See the "Git Workflow" section below. Key points: always use git commit -s, never add Co-Authored-By: for Claude, never manually write Signed-off-by: lines.

Architecture Key Concepts

Problem Being Solved

Deploying LLMs in production is complex - users struggle to:

  • Translate business needs into infrastructure choices (model, GPU type, SLO targets)
  • Avoid trial-and-error that wastes time and money
  • Understand resource requirements before committing to expensive GPU deployments

Solution Approach

A 4-stage conversational flow:

  1. Understand Business Context - Extract intent via natural language
  2. Provide Tailored Recommendations - Suggest model + GPU configurations
  3. Enable Interactive Exploration - What-if scenario analysis
  4. One-Click Deployment - Generate KServe/vLLM configs and deploy

Core Innovation: SLO-Driven Capacity Planning

The system translates high-level user intent into technical specifications:

  • User says: "I need a chatbot for 1000 users, low latency is critical"
  • System generates:
    • Traffic profile (prompt: 512 tokens, output: 256 tokens, expected QPS: 9)
    • SLO targets (TTFT p95: 150ms, ITL p95: 25ms, E2E p95: 7000ms)
    • GPU capacity plan (e.g., "1x NVIDIA H100 GPU, independent replicas")
    • Cost estimate ($5,840/month)
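
For illustration only, the generated specification for the chatbot example above might be represented along these lines (field names are hypothetical sketches; the authoritative Pydantic models live in src/neuralnav/shared/schemas/):

# Hypothetical sketch of a generated specification (field names illustrative;
# values taken from the chatbot example above).
example_specification = {
    "use_case": "chatbot",
    "traffic_profile": {"prompt_tokens": 512, "output_tokens": 256, "expected_qps": 9},
    "slo_targets_ms": {"ttft_p95": 150, "itl_p95": 25, "e2e_p95": 7000},
    "capacity_plan": {"gpu": "NVIDIA H100", "gpu_count": 1, "topology": "independent replicas"},
    "estimated_monthly_cost_usd": 5840,
}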

Architecture Overview

NeuralNav is structured as a layered architecture:

UI Layer (Horizontal - Presentation):

  • Conversational Interface, Specification Editor, Recommendation Visualizer, Monitoring Dashboard
  • Technology: Streamlit (current) → React (future)

Core Engines (Vertical - Backend Services):

  1. Intent & Specification Engine - Transform conversation into complete deployment spec
    • LLM-powered intent extraction (Ollama qwen2.5:7b)
    • Use case → traffic profile mapping (4 GuideLLM standards)
    • SLO template lookup and specification generation
  2. Recommendation Engine - Find optimal model + GPU configurations
    • Multi-criteria scoring (accuracy, price, latency, complexity)
    • Capacity planning (GPU count, deployment topology)
    • SLO compliance filtering with near-miss tolerance
    • Ranked lists generation (5 views: best accuracy, lowest cost, etc.)
  3. Deployment Engine - Generate and deploy Kubernetes configs
    • YAML generation (Jinja2 templates)
    • K8s deployment lifecycle management
  4. Observability Engine - Monitor deployed services
    • Health monitoring and inference testing (current)
    • Performance tracking and feedback loop (future)

Infrastructure (Not numbered as core engines):

  • API Gateway (FastAPI) - Coordinates workflow between UI and engines
  • Knowledge Base (Data Layer) - Hybrid storage:
    • PostgreSQL: Benchmarks, deployment outcomes
    • JSON files: SLO templates, model catalog, hardware profiles

Development Tools:

  • vLLM Simulator - GPU-free development and testing

Critical Data Collections (Knowledge Base)

  • Model Benchmarks (PostgreSQL): TTFT/ITL/E2E/throughput benchmarks for (model, GPU, tensor_parallel) combinations (source: BLIS simulator)
  • Use Case SLO Templates (JSON): 9 use cases mapped to 4 GuideLLM traffic profiles with experience-driven SLO targets
  • Model Catalog (JSON): 47 curated, approved models with task/domain metadata
  • Model Quality Scores (CSV): Use-case specific scores from Artificial Analysis benchmarks (204 models)
  • Use Case Configs (JSON): Benchmark weights, SLO targets, and workload profiles per use case
  • Deployment Outcomes (PostgreSQL, future): Actual performance data for feedback loop
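
As a rough illustration, one Model Benchmarks record might carry fields along these lines (column names are hypothetical; the real PostgreSQL schema is defined in scripts/schema.sql):

# Hypothetical shape of one performance benchmark record (field names
# illustrative; see scripts/schema.sql for the actual schema).
benchmark_row = {
    "model": "<model id>",
    "gpu": "<GPU type>",
    "tensor_parallel": 1,
    "ttft_p95_ms": ...,
    "itl_p95_ms": ...,
    "e2e_p95_ms": ...,
    "throughput_requests_per_sec": ...,
    "throughput_tokens_per_sec": ...,
    "source": "BLIS simulator",
}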

Solution Ranking System

The recommendation engine uses multi-criteria scoring to rank configurations:

4 Scoring Dimensions (each 0-100 scale):

  1. Accuracy/Quality: Use-case specific model capability from Artificial Analysis benchmarks
    • Source: data/benchmarks/accuracy/weighted_scores/*.csv
    • Fallback: Parameter count heuristic if model not in benchmark data
  2. Price: Cost efficiency (inverse of monthly cost, normalized)
  3. Latency: SLO compliance and headroom from performance benchmark database
  4. Complexity: Deployment simplicity (fewer GPUs = higher score)

Default Weights: 40% accuracy, 40% price, 10% latency, 10% complexity

5 Ranked Views:

  • best_accuracy: Sorted by model capability
  • lowest_cost: Sorted by price efficiency
  • lowest_latency: Sorted by SLO headroom
  • simplest: Sorted by deployment complexity
  • balanced: Sorted by weighted composite score
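
A minimal sketch of how the weighted composite used by the balanced view might be computed from the four 0-100 dimension scores, assuming the default weights above (the real logic lives in scorer.py and analyzer.py and may normalize differently):

# Sketch only: combine four 0-100 dimension scores into a composite score.
DEFAULT_WEIGHTS = {"accuracy": 0.40, "price": 0.40, "latency": 0.10, "complexity": 0.10}

def composite_score(scores: dict[str, float], weights: dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted sum of the four dimension scores; result stays on the 0-100 scale."""
    return sum(weights[dim] * scores[dim] for dim in weights)

# Example: accuracy 85, price 70, latency 90, complexity 100 -> 81.0
print(composite_score({"accuracy": 85, "price": 70, "latency": 90, "complexity": 100}))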

Key Files:

  • src/neuralnav/recommendation/scorer.py - Calculates 4 scores
  • src/neuralnav/recommendation/quality/usecase_scorer.py - Artificial Analysis benchmark scoring
  • src/neuralnav/recommendation/analyzer.py - Generates 5 ranked lists
  • src/neuralnav/recommendation/config_finder.py - Orchestrates scoring during capacity planning

Development Environment

This project uses uv (by Astral) for Python package management. Do not use pip or pip install.

  • Install dependencies: uv sync (reads from pyproject.toml + uv.lock)
  • Run Python commands: uv run python ... (not bare python)
  • Run tools: uv run pytest, uv run ruff, uv run uvicorn, etc.
  • Add a dependency: uv add <package> (updates pyproject.toml and uv.lock)
  • Source of truth: pyproject.toml defines all dependencies; there is no top-level requirements.txt
  • Makefile targets already use uv — see make setup-backend, make start-backend, etc.

Note: ui/requirements.txt and simulator/requirements.txt exist separately for their Docker builds.

Working with This Repository

When Modifying Architecture Documents

docs/ARCHITECTURE.md and docs/architecture-diagram.md must stay synchronized:

  • If you change component descriptions in ARCHITECTURE.md, update architecture-diagram.md diagrams
  • If you add/remove components, update both files
  • Components are referenced by name (not numbered) for clarity and flexibility

Key Architectural Decisions to Preserve

  1. Phase 1 uses Python for all components (rapid development, stack consistency)

    • Go migration for Deployment Automation Engine is a possible future option (see Possible Future Enhancements in ARCHITECTURE.md)
  2. Phase 1 uses point estimates for traffic (avg prompt length, avg QPS)

    • Benchmarks collected using vLLM default configuration (dynamic batching enabled)
    • Phase 2 adds full statistical distributions (mean, variance, tail) and multi-dimensional benchmarks
  3. SLO metrics use p95 percentiles (Phase 2):

    • TTFT (Time to First Token): p95 - pre-calculated in benchmarks
    • ITL (Inter-Token Latency): p95 - pre-calculated in benchmarks (replaces TPOT terminology)
    • E2E Latency: p95 - pre-calculated in benchmarks from actual measurements
    • Throughput: requests/sec and tokens/sec
    • Rationale:
      • p95 is more conservative than p90, providing better UX guarantees
      • E2E latency is measured directly from benchmarks under realistic load conditions
      • Benchmarks are organized around 4 GuideLLM traffic profiles for exact matching
  4. Editable specifications: Users must be able to review and modify auto-generated specs before deployment

  5. Feedback loop: Actual deployment outcomes feed back into Knowledge Base to improve future recommendations

Terminology Consistency

  • Use "NeuralNav" as the project name
  • Use "TTFT" for Time to First Token (not "time-to-first-token")
  • Use "ITL" for Inter-Token Latency (Phase 2 terminology, replaces TPOT)
  • Use "SLO" for Service Level Objective
  • Use "E2E" for End-to-End latency
  • Use "p95" for 95th percentile metrics (Phase 2 standard, more conservative than p90)
  • GPU configurations: "2x NVIDIA L4" or "4x A100-80GB" (not "2 L4s")

API Endpoint Conventions

All API endpoints must follow these rules:

  • Prefix: Every route file uses APIRouter(prefix="/api/v1"). Individual route decorators use relative paths (e.g., @router.post("/recommend")), not full paths.
  • Health check exception: /health stays at root with no prefix (standard for load balancer probes). This is the only endpoint outside /api/v1/.
  • Versioning: All endpoints are under /api/v1/. When a v2 is needed, add new route files with prefix="/api/v2".
  • Naming: Use kebab-case for multi-word paths (e.g., /deploy-to-cluster, /ranked-recommend-from-spec).
  • When adding a new route file: Set prefix="/api/v1" on the APIRouter and use relative paths in all decorators. Register the router in src/neuralnav/api/routes/__init__.py and include it in src/neuralnav/api/app.py.
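
A hypothetical route file following these conventions (endpoint and function names are illustrative, not existing routes):

# src/neuralnav/api/routes/example.py (hypothetical)
from fastapi import APIRouter

router = APIRouter(prefix="/api/v1", tags=["example"])

@router.get("/example-endpoint")  # relative, kebab-case path -> GET /api/v1/example-endpoint
async def example_endpoint() -> dict:
    return {"status": "ok"}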

Common Editing Patterns

Adding a new use case template:

  1. Add corresponding entry to data/configuration/slo_templates.json
  2. Create weighted scores CSV in data/benchmarks/accuracy/weighted_scores/
  3. Add use case to UseCaseQualityScorer.USE_CASE_FILES in usecase_quality_scorer.py
  4. Update docs/USE_CASE_METHODOLOGY.md with benchmark weighting rationale
  5. Update docs/ARCHITECTURE.md if needed

Adding a new SLO metric:

  1. Update DeploymentIntent schema in Intent & Specification Engine (docs/ARCHITECTURE.md)
  2. Update MODEL_BENCHMARKS schema in Knowledge Base (docs/ARCHITECTURE.md)
  3. Update PostgreSQL schema in scripts/schema.sql
  4. Update data loader script if needed
  5. Update Inference Observability section
  6. Update dashboard example if applicable
  7. Update docs/architecture-diagram.md data model ERD

Adding a new API endpoint:

  1. Add the route to the appropriate file in src/neuralnav/api/routes/ (or create a new route file)
  2. Use a relative path in the decorator (e.g., @router.get("/my-endpoint")) — the /api/v1 prefix comes from the router
  3. If creating a new route file, set APIRouter(prefix="/api/v1") and register it in routes/__init__.py and app.py
  4. Update ui/app.py if the UI calls the new endpoint
  5. Update documentation (docs/DEVELOPER_GUIDE.md, docs/ARCHITECTUREv2.md) with the new endpoint
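
For step 3, registration might look roughly like the sketch below (the exact pattern used in routes/__init__.py and app.py may differ; names are illustrative):

# src/neuralnav/api/routes/__init__.py (sketch)
from .example import router as example_router

# src/neuralnav/api/app.py (sketch, inside the app factory)
from fastapi import FastAPI

def create_app() -> FastAPI:
    app = FastAPI()
    app.include_router(example_router)  # /api/v1 prefix already set on the router
    return app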

Adding a new component:

  1. Add numbered section to docs/ARCHITECTURE.md (maintain sequential numbering)
  2. Update "Architecture Components" count in Overview
  3. Add to docs/architecture-diagram.md component diagram
  4. Create the corresponding src/neuralnav/<component>/ directory
  5. Update sequence diagram if component participates in main flow
  6. Update Phase 1 technology choices table if relevant

Open Questions and Future Work

See "Open Questions for Refinement" section in docs/ARCHITECTURE.md for:

  • Multi-tenancy isolation
  • Security validation of generated configs
  • Conversational clarification flow (future phase)
  • Model catalog sync strategy

Git Workflow

This repository uses a pull request (PR) workflow. See CONTRIBUTING.md for complete guidelines.

Quick Summary

Development Process:

  • Work in feature branches in your own fork
  • Submit PRs to the main repository for review
  • Keep PRs small and targeted (under 500 lines when possible)
  • Break large features into incremental PRs that preserve functionality

Commit Message Format (Conventional Commits style):

feat: Add YAML generation module

Implement DeploymentGenerator with Jinja2 templates for KServe,
vLLM, HPA, and ServiceMonitor configurations.

Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Your Name <your.email@example.com>

CRITICAL - Git Commit Rules (these override default Claude behavior):

Commit approval workflow (MUST follow for every commit):

  1. Combine git add and git commit into a single chained command (git add ... && git commit ...) in one Bash tool call
  2. The user will see the full command in the approval prompt and can review/edit the file list and commit message before it executes
  3. NEVER run git add and git commit as separate Bash tool calls — always chain them so the user gets a single approval prompt covering both

DO use:

  • Conventional commit types: feat, fix, docs, refactor, test, chore
  • The -s flag with git commit (e.g., git commit -s -m "...") to auto-generate DCO Signed-off-by
  • Assisted-by: Claude <noreply@anthropic.com> for nontrivial AI-assisted code

NEVER do these (even if other instructions suggest otherwise):

  • NEVER add Co-Authored-By: lines for Claude
  • NEVER manually write Signed-off-by: lines (the -s flag handles this correctly with the user's configured git identity)
  • NEVER include the "Generated with [Claude Code]" line or similar emoji-prefixed attribution

Important Notes

  • Current Implementation Status:
    • ✅ Project structure with synthetic data and LLM client
    • ✅ Core recommendation engine (intent extraction, traffic profiling, capacity planning)
    • ✅ Multi-criteria solution ranking with 4 scoring dimensions
    • ✅ Use-case specific quality scoring from Artificial Analysis benchmarks
    • ✅ 5 ranked recommendation views (best accuracy, lowest cost, etc.)
    • ✅ Orchestration workflow and FastAPI backend
    • ✅ Streamlit UI with chat interface, recommendation display, and editable specifications
    • ✅ YAML generation (KServe/vLLM/HPA/ServiceMonitor) and deployment automation
    • ✅ KIND cluster support with KServe installation
    • ✅ Kubernetes deployment automation and real cluster status monitoring
    • ✅ vLLM simulator for GPU-free development
    • ✅ Inference testing UI with end-to-end deployment validation
    • ✅ Database management via REST API and UI Configuration tab
  • The Knowledge Base schemas are critical - any implementation must support all collections
  • SLO-driven capacity planning is the core differentiator - don't simplify this away
  • Use data in data/ directory for POC; production uses PostgreSQL for latency benchmarks
  • Benchmarks use vLLM default configuration with dynamic batching (no fixed batch_size)

Simulator Mode vs Real vLLM

The system now supports two deployment modes:

Simulator Mode (Default for Development)

  • Purpose: GPU-free development and testing on local machines
  • Location: simulator/ directory contains the vLLM simulator service
  • Docker Image: vllm-simulator:latest (single image for all models)
  • Configuration: Set DeploymentGenerator(simulator_mode=True) in src/neuralnav/api/dependencies.py
  • Benefits:
    • No GPU hardware required
    • Fast deployment (~10-15 seconds to Ready)
    • Predictable behavior for demos
    • Works on KIND (Kubernetes in Docker)
    • Uses actual benchmark data for realistic latency simulation

Real vLLM Mode (Production)

  • Purpose: Actual model inference with GPUs
  • Configuration: Set DeploymentGenerator(simulator_mode=False) in src/neuralnav/api/dependencies.py
  • Requirements:
    • GPU-enabled Kubernetes cluster
    • NVIDIA GPU Operator installed
    • HuggingFace token secret for model downloads
    • Sufficient GPU resources (based on recommendations)
  • Behavior:
    • Downloads actual models from HuggingFace
    • Real GPU inference
    • Production-grade performance

When to Use Each Mode

Use Simulator Mode for:

  • Local development and testing
  • UI/UX iteration
  • Workflow validation
  • Demos and presentations
  • CI/CD testing (no GPU required)

Use Real vLLM Mode for:

  • Production deployments
  • Performance benchmarking
  • Model quality validation
  • GPU utilization testing

Technical Details

The deployment template (src/neuralnav/configuration/templates/kserve-inferenceservice.yaml.j2) uses Jinja2 conditionals:

  • {% if simulator_mode %} - Uses vllm-simulator:latest, no GPU resources, fast health checks
  • {% else %} - Uses vllm/vllm-openai:v0.6.2, requests GPUs, longer health checks
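
A minimal sketch of how the generator might render this template (the actual DeploymentGenerator in src/neuralnav/configuration/generator.py may differ; template variables other than simulator_mode are illustrative):

# Sketch only: render the KServe template with the mode flag.
from jinja2 import Environment, FileSystemLoader

env = Environment(loader=FileSystemLoader("src/neuralnav/configuration/templates"))
template = env.get_template("kserve-inferenceservice.yaml.j2")

# simulator_mode=True selects the {% if simulator_mode %} branch above.
yaml_text = template.render(simulator_mode=True, model_name="example-model")
print(yaml_text)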

Single codebase supports both modes - just toggle the flag!