CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

This repository contains the architecture design for NeuralNav, an open-source system that guides users from concept to production-ready LLM deployments through conversational AI and intelligent capacity planning.

Key Principle: The core functionality is complete and working end-to-end. The project is preparing for release.

Repository Structure

  • docs/ARCHITECTURE.md: Comprehensive system architecture document

    • 9 core components with technology recommendations
    • Enhanced data schemas for SLO-driven deployment planning
    • Phase 1 (3-month) vs Phase 2+ implementation strategy
    • Knowledge Base schemas with 7 data collections
  • docs/architecture-diagram.md: Visual architecture representations

    • Mermaid component diagrams
    • Sequence diagrams showing end-to-end flows
    • State machine for workflow orchestration
    • Entity-relationship diagrams for data models
  • src/neuralnav/: Python package (PyPA src layout)

    • api/: FastAPI REST API layer
      • app.py: FastAPI app factory
      • dependencies.py: Singleton dependency injection
      • routes/: Modular endpoint handlers (health, intent, specification, recommendation, configuration, reference_data, database)
    • intent_extraction/: Intent Extraction Service
      • extractor.py: LLM-powered intent extraction from natural language
      • service.py: IntentExtractionService facade
    • specification/: Specification Service
      • traffic_profile.py: Traffic profile and SLO target generation
      • service.py: SpecificationService facade
    • recommendation/: Recommendation Service
      • config_finder.py: GPU capacity planning with SLO filtering
      • scorer.py: 4-dimension scoring (accuracy, price, latency, complexity)
      • analyzer.py: 5 ranked list generation
      • service.py: RecommendationService facade
      • quality/: Use-case quality scoring (Artificial Analysis benchmarks)
    • configuration/: Configuration Service
      • generator.py: Jinja2 YAML generation for KServe/vLLM
      • validator.py: YAML validation
      • service.py: ConfigurationService facade
      • templates/: Jinja2 deployment templates
    • cluster/: Kubernetes cluster management
      • manager.py: K8s deployment lifecycle management
    • shared/: Shared modules
      • schemas/: Pydantic data models (intent, specification, recommendation)
      • utils/: Shared utilities (GPU normalization)
    • knowledge_base/: Data access layer (benchmark database, JSON catalogs)
      • loader.py: Benchmark data loading utilities (shared by CLI, API, and UI)
    • orchestration/: Workflow coordination
    • llm/: Ollama client for intent extraction
  • ui/: Streamlit UI

    • Chat interface for conversational requirement gathering
    • Multi-tab recommendation display (Overview, Specifications, Performance, Cost, Monitoring)
    • Editable specifications with review mode
    • Action buttons for YAML generation and deployment
    • Monitoring dashboard with cluster status, SLO compliance, and inference testing
    • Configuration tab for database management (upload benchmarks, reset, view stats)
    • components/: Modular UI components
      • settings.py: Configuration tab with benchmark database management
  • data/: Benchmark, configuration, and archive data

    • benchmarks/: Benchmark data
      • performance/: Latency/throughput benchmarks (JSON, loaded into PostgreSQL)
        • benchmarks_BLIS.json: Latency/throughput benchmarks from BLIS simulator
      • accuracy/: Model quality/capability scores (CSV)
        • opensource_all_benchmarks.csv: 204 open-source models from Artificial Analysis
        • weighted_scores/: 9 CSV files with pre-ranked models per use case
    • configuration/: Runtime configuration files (JSON)
      • model_catalog.json: 47 curated models with task/domain metadata
      • slo_templates.json: 9 use case templates with SLO targets
      • demo_scenarios.json: 3 test scenarios
      • priority_weights.json: Scoring priority weights
      • usecase_slo_workload.json: Use case SLO and workload profiles
    • archive/: Unused/reference-only files

Important Behavioral Notes for Claude

Git commits: This project has specific commit rules that OVERRIDE Claude's default behavior. See the "Git Workflow" section below. Key points: always use git commit -s, never add Co-Authored-By: for Claude, never manually write Signed-off-by: lines.

Architecture Key Concepts

Problem Being Solved

Deploying LLMs in production is complex - users struggle to:

  • Translate business needs into infrastructure choices (model, GPU type, SLO targets)
  • Avoid trial-and-error that wastes time and money
  • Understand resource requirements before committing to expensive GPU deployments

Solution Approach

A 4-stage conversational flow:

  1. Understand Business Context - Extract intent via natural language
  2. Provide Tailored Recommendations - Suggest model + GPU configurations
  3. Enable Interactive Exploration - What-if scenario analysis
  4. One-Click Deployment - Generate KServe/vLLM configs and deploy

Core Innovation: SLO-Driven Capacity Planning

The system translates high-level user intent into technical specifications:

  • User says: "I need a chatbot for 1000 users, low latency is critical"
  • System generates:
    • Traffic profile (prompt: 512 tokens, output: 256 tokens, expected QPS: 9)
    • SLO targets (TTFT p95: 150ms, ITL p95: 25ms, E2E p95: 7000ms)
    • GPU capacity plan (e.g., "1x NVIDIA H100 GPU, independent replicas")
    • Cost estimate ($5,840/month)
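
For illustration only, the generated specification for the chatbot example above might be represented along these lines (field names are hypothetical sketches; the authoritative Pydantic models live in src/neuralnav/shared/schemas/):

# Hypothetical sketch of a generated specification (field names illustrative;
# values taken from the chatbot example above).
example_specification = {
    "use_case": "chatbot",
    "traffic_profile": {"prompt_tokens": 512, "output_tokens": 256, "expected_qps": 9},
    "slo_targets_ms": {"ttft_p95": 150, "itl_p95": 25, "e2e_p95": 7000},
    "capacity_plan": {"gpu": "NVIDIA H100", "gpu_count": 1, "topology": "independent replicas"},
    "estimated_monthly_cost_usd": 5840,
}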

Architecture Overview

NeuralNav is structured as a layered architecture:

UI Layer (Horizontal - Presentation):

  • Conversational Interface, Specification Editor, Recommendation Visualizer, Monitoring Dashboard
  • Technology: Streamlit (current) → React (future)

Core Engines (Vertical - Backend Services):

  1. Intent & Specification Engine - Transform conversation into complete deployment spec
    • LLM-powered intent extraction (Ollama qwen2.5:7b)
    • Use case → traffic profile mapping (4 GuideLLM standards)
    • SLO template lookup and specification generation
  2. Recommendation Engine - Find optimal model + GPU configurations
    • Multi-criteria scoring (accuracy, price, latency, complexity)
    • Capacity planning (GPU count, deployment topology)
    • SLO compliance filtering with near-miss tolerance
    • Ranked lists generation (5 views: best accuracy, lowest cost, etc.)
  3. Deployment Engine - Generate and deploy Kubernetes configs
    • YAML generation (Jinja2 templates)
    • K8s deployment lifecycle management
  4. Observability Engine - Monitor deployed services
    • Health monitoring and inference testing (current)
    • Performance tracking and feedback loop (future)

Infrastructure (Not numbered as core engines):

  • API Gateway (FastAPI) - Coordinates workflow between UI and engines
  • Knowledge Base (Data Layer) - Hybrid storage:
    • PostgreSQL: Benchmarks, deployment outcomes
    • JSON files: SLO templates, model catalog, hardware profiles

Development Tools:

  • vLLM Simulator - GPU-free development and testing

Critical Data Collections (Knowledge Base)

  • Model Benchmarks (PostgreSQL): TTFT/ITL/E2E/throughput benchmarks for (model, GPU, tensor_parallel) combinations (source: BLIS simulator)
  • Use Case SLO Templates (JSON): 9 use cases mapped to 4 GuideLLM traffic profiles with experience-driven SLO targets
  • Model Catalog (JSON): 47 curated, approved models with task/domain metadata
  • Model Quality Scores (CSV): Use-case specific scores from Artificial Analysis benchmarks (204 models)
  • Use Case Configs (JSON): Benchmark weights, SLO targets, and workload profiles per use case
  • Deployment Outcomes (PostgreSQL, future): Actual performance data for feedback loop
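
As a rough illustration, one Model Benchmarks record might carry fields along these lines (column names are hypothetical; the real PostgreSQL schema is defined in scripts/schema.sql):

# Hypothetical shape of one performance benchmark record (field names
# illustrative; see scripts/schema.sql for the actual schema).
benchmark_row = {
    "model": "<model id>",
    "gpu": "<GPU type>",
    "tensor_parallel": 1,
    "ttft_p95_ms": ...,
    "itl_p95_ms": ...,
    "e2e_p95_ms": ...,
    "throughput_requests_per_sec": ...,
    "throughput_tokens_per_sec": ...,
    "source": "BLIS simulator",
}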

Solution Ranking System

The recommendation engine uses multi-criteria scoring to rank configurations:

4 Scoring Dimensions (each 0-100 scale):

  1. Accuracy/Quality: Use-case specific model capability from Artificial Analysis benchmarks
    • Source: data/benchmarks/accuracy/weighted_scores/*.csv
    • Fallback: Parameter count heuristic if model not in benchmark data
  2. Price: Cost efficiency (inverse of monthly cost, normalized)
  3. Latency: SLO compliance and headroom from performance benchmark database
  4. Complexity: Deployment simplicity (fewer GPUs = higher score)

Default Weights: 40% accuracy, 40% price, 10% latency, 10% complexity

5 Ranked Views:

  • best_accuracy: Sorted by model capability
  • lowest_cost: Sorted by price efficiency
  • lowest_latency: Sorted by SLO headroom
  • simplest: Sorted by deployment complexity
  • balanced: Sorted by weighted composite score
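
A minimal sketch of how the weighted composite used by the balanced view might be computed from the four 0-100 dimension scores, assuming the default weights above (the real logic lives in scorer.py and analyzer.py and may normalize differently):

# Sketch only: combine four 0-100 dimension scores into a composite score.
DEFAULT_WEIGHTS = {"accuracy": 0.40, "price": 0.40, "latency": 0.10, "complexity": 0.10}

def composite_score(scores: dict[str, float], weights: dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted sum of the four dimension scores; result stays on the 0-100 scale."""
    return sum(weights[dim] * scores[dim] for dim in weights)

# Example: accuracy 85, price 70, latency 90, complexity 100 -> 81.0
print(composite_score({"accuracy": 85, "price": 70, "latency": 90, "complexity": 100}))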

Key Files:

  • src/neuralnav/recommendation/scorer.py - Calculates 4 scores
  • src/neuralnav/recommendation/quality/usecase_scorer.py - Artificial Analysis benchmark scoring
  • src/neuralnav/recommendation/analyzer.py - Generates 5 ranked lists
  • src/neuralnav/recommendation/config_finder.py - Orchestrates scoring during capacity planning

Development Environment

This project uses uv (by Astral) for Python package management. Do not use pip or pip install.

  • Install dependencies: uv sync (reads from pyproject.toml + uv.lock)
  • Run Python commands: uv run python ... (not bare python)
  • Run tools: uv run pytest, uv run ruff, uv run uvicorn, etc.
  • Add a dependency: uv add <package> (updates pyproject.toml and uv.lock)
  • Source of truth: pyproject.toml defines all dependencies; there is no top-level requirements.txt
  • Makefile targets already use uv — see make setup-backend, make start-backend, etc.

Note: ui/requirements.txt and simulator/requirements.txt exist separately for their Docker builds.

Working with This Repository

When Modifying Architecture Documents

docs/ARCHITECTURE.md and docs/architecture-diagram.md must stay synchronized:

  • If you change component descriptions in ARCHITECTURE.md, update architecture-diagram.md diagrams
  • If you add/remove components, update both files
  • Components are referenced by name (not numbered) for clarity and flexibility

Key Architectural Decisions to Preserve

  1. Phase 1 uses Python for all components (rapid development, stack consistency)

    • Go migration for Deployment Automation Engine is a possible future option (see Possible Future Enhancements in ARCHITECTURE.md)
  2. Phase 1 uses point estimates for traffic (avg prompt length, avg QPS)

    • Benchmarks collected using vLLM default configuration (dynamic batching enabled)
    • Phase 2 adds full statistical distributions (mean, variance, tail) and multi-dimensional benchmarks
  3. SLO metrics use p95 percentiles (Phase 2):

    • TTFT (Time to First Token): p95 - pre-calculated in benchmarks
    • ITL (Inter-Token Latency): p95 - pre-calculated in benchmarks (replaces TPOT terminology)
    • E2E Latency: p95 - pre-calculated in benchmarks from actual measurements
    • Throughput: requests/sec and tokens/sec
    • Rationale:
      • p95 is more conservative than p90, providing better UX guarantees
      • E2E latency is measured directly from benchmarks under realistic load conditions
      • Benchmarks are organized around 4 GuideLLM traffic profiles for exact matching
  4. Editable specifications: Users must be able to review and modify auto-generated specs before deployment

  5. Feedback loop: Actual deployment outcomes feed back into Knowledge Base to improve future recommendations

Terminology Consistency

  • Use "NeuralNav" as the project name
  • Use "TTFT" for Time to First Token (not "time-to-first-token")
  • Use "ITL" for Inter-Token Latency (Phase 2 terminology, replaces TPOT)
  • Use "SLO" for Service Level Objective
  • Use "E2E" for End-to-End latency
  • Use "p95" for 95th percentile metrics (Phase 2 standard, more conservative than p90)
  • GPU configurations: "2x NVIDIA L4" or "4x A100-80GB" (not "2 L4s")

API Endpoint Conventions

All API endpoints must follow these rules:

  • Prefix: Every route file uses APIRouter(prefix="/api/v1"). Individual route decorators use relative paths (e.g., @router.post("/recommend")), not full paths.
  • Health check exception: /health stays at root with no prefix (standard for load balancer probes). This is the only endpoint outside /api/v1/.
  • Versioning: All endpoints are under /api/v1/. When a v2 is needed, add new route files with prefix="/api/v2".
  • Naming: Use kebab-case for multi-word paths (e.g., /deploy-to-cluster, /ranked-recommend-from-spec).
  • When adding a new route file: Set prefix="/api/v1" on the APIRouter and use relative paths in all decorators. Register the router in src/neuralnav/api/routes/__init__.py and include it in src/neuralnav/api/app.py.
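
A hypothetical route file following these conventions (endpoint and function names are illustrative, not existing routes):

# src/neuralnav/api/routes/example.py (hypothetical)
from fastapi import APIRouter

router = APIRouter(prefix="/api/v1", tags=["example"])

@router.get("/example-endpoint")  # relative, kebab-case path -> GET /api/v1/example-endpoint
async def example_endpoint() -> dict:
    return {"status": "ok"}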

Common Editing Patterns

Adding a new use case template:

  1. Add corresponding entry to data/configuration/slo_templates.json
  2. Create weighted scores CSV in data/benchmarks/accuracy/weighted_scores/
  3. Add use case to UseCaseQualityScorer.USE_CASE_FILES in usecase_quality_scorer.py
  4. Update docs/USE_CASE_METHODOLOGY.md with benchmark weighting rationale
  5. Update docs/ARCHITECTURE.md if needed

Adding a new SLO metric:

  1. Update DeploymentIntent schema in Intent & Specification Engine (docs/ARCHITECTURE.md)
  2. Update MODEL_BENCHMARKS schema in Knowledge Base (docs/ARCHITECTURE.md)
  3. Update PostgreSQL schema in scripts/schema.sql
  4. Update data loader script if needed
  5. Update Inference Observability section
  6. Update dashboard example if applicable
  7. Update docs/architecture-diagram.md data model ERD

Adding a new API endpoint:

  1. Add the route to the appropriate file in src/neuralnav/api/routes/ (or create a new route file)
  2. Use a relative path in the decorator (e.g., @router.get("/my-endpoint")) — the /api/v1 prefix comes from the router
  3. If creating a new route file, set APIRouter(prefix="/api/v1") and register it in routes/__init__.py and app.py
  4. Update ui/app.py if the UI calls the new endpoint
  5. Update documentation (docs/DEVELOPER_GUIDE.md, docs/ARCHITECTUREv2.md) with the new endpoint
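
For step 3, registration might look roughly like the sketch below (the exact pattern used in routes/__init__.py and app.py may differ; names are illustrative):

# src/neuralnav/api/routes/__init__.py (sketch)
from .example import router as example_router

# src/neuralnav/api/app.py (sketch, inside the app factory)
from fastapi import FastAPI

def create_app() -> FastAPI:
    app = FastAPI()
    app.include_router(example_router)  # /api/v1 prefix already set on the router
    return app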

Adding a new component:

  1. Add numbered section to docs/ARCHITECTURE.md (maintain sequential numbering)
  2. Update "Architecture Components" count in Overview
  3. Add to docs/architecture-diagram.md component diagram
  4. Create the corresponding src/neuralnav/<component>/ directory
  5. Update sequence diagram if component participates in main flow
  6. Update Phase 1 technology choices table if relevant

Open Questions and Future Work

See "Open Questions for Refinement" section in docs/ARCHITECTURE.md for:

  • Multi-tenancy isolation
  • Security validation of generated configs
  • Conversational clarification flow (future phase)
  • Model catalog sync strategy

Git Workflow

This repository uses a pull request (PR) workflow. See CONTRIBUTING.md for complete guidelines.

Quick Summary

Development Process:

  • Work in feature branches in your own fork
  • Submit PRs to the main repository for review
  • Keep PRs small and targeted (under 500 lines when possible)
  • Break large features into incremental PRs that preserve functionality

Commit Message Format (Conventional Commits style):

feat: Add YAML generation module

Implement DeploymentGenerator with Jinja2 templates for KServe,
vLLM, HPA, and ServiceMonitor configurations.

Assisted-by: Claude <noreply@anthropic.com>
Signed-off-by: Your Name <your.email@example.com>

CRITICAL - Git Commit Rules (these override default Claude behavior):

Commit approval workflow (MUST follow for every commit):

  1. Combine git add and git commit into a single chained command (git add ... && git commit ...) in one Bash tool call
  2. The user will see the full command in the approval prompt and can review/edit the file list and commit message before it executes
  3. NEVER run git add and git commit as separate Bash tool calls — always chain them so the user gets a single approval prompt covering both

DO use:

  • Conventional commit types: feat, fix, docs, refactor, test, chore
  • The -s flag with git commit (e.g., git commit -s -m "...") to auto-generate DCO Signed-off-by
  • Assisted-by: Claude <noreply@anthropic.com> for nontrivial AI-assisted code

NEVER do these (even if other instructions suggest otherwise):

  • NEVER add Co-Authored-By: lines for Claude
  • NEVER manually write Signed-off-by: lines (the -s flag handles this correctly with the user's configured git identity)
  • NEVER include the "Generated with [Claude Code]" line or similar emoji-prefixed attribution

Important Notes

  • Current Implementation Status:
    • ✅ Project structure with synthetic data and LLM client
    • ✅ Core recommendation engine (intent extraction, traffic profiling, capacity planning)
    • ✅ Multi-criteria solution ranking with 4 scoring dimensions
    • ✅ Use-case specific quality scoring from Artificial Analysis benchmarks
    • ✅ 5 ranked recommendation views (best accuracy, lowest cost, etc.)
    • ✅ Orchestration workflow and FastAPI backend
    • ✅ Streamlit UI with chat interface, recommendation display, and editable specifications
    • ✅ YAML generation (KServe/vLLM/HPA/ServiceMonitor) and deployment automation
    • ✅ KIND cluster support with KServe installation
    • ✅ Kubernetes deployment automation and real cluster status monitoring
    • ✅ vLLM simulator for GPU-free development
    • ✅ Inference testing UI with end-to-end deployment validation
    • ✅ Database management via REST API and UI Configuration tab
  • The Knowledge Base schemas are critical - any implementation must support all collections
  • SLO-driven capacity planning is the core differentiator - don't simplify this away
  • Use data in data/ directory for POC; production uses PostgreSQL for latency benchmarks
  • Benchmarks use vLLM default configuration with dynamic batching (no fixed batch_size)

Simulator Mode vs Real vLLM

The system now supports two deployment modes:

Simulator Mode (Default for Development)

  • Purpose: GPU-free development and testing on local machines
  • Location: simulator/ directory contains the vLLM simulator service
  • Docker Image: vllm-simulator:latest (single image for all models)
  • Configuration: Set DeploymentGenerator(simulator_mode=True) in src/neuralnav/api/dependencies.py
  • Benefits:
    • No GPU hardware required
    • Fast deployment (~10-15 seconds to Ready)
    • Predictable behavior for demos
    • Works on KIND (Kubernetes in Docker)
    • Uses actual benchmark data for realistic latency simulation

Real vLLM Mode (Production)

  • Purpose: Actual model inference with GPUs
  • Configuration: Set DeploymentGenerator(simulator_mode=False) in src/neuralnav/api/dependencies.py
  • Requirements:
    • GPU-enabled Kubernetes cluster
    • NVIDIA GPU Operator installed
    • HuggingFace token secret for model downloads
    • Sufficient GPU resources (based on recommendations)
  • Behavior:
    • Downloads actual models from HuggingFace
    • Real GPU inference
    • Production-grade performance

When to Use Each Mode

Use Simulator Mode for:

  • Local development and testing
  • UI/UX iteration
  • Workflow validation
  • Demos and presentations
  • CI/CD testing (no GPU required)

Use Real vLLM Mode for:

  • Production deployments
  • Performance benchmarking
  • Model quality validation
  • GPU utilization testing

Technical Details

The deployment template (src/neuralnav/configuration/templates/kserve-inferenceservice.yaml.j2) uses Jinja2 conditionals:

  • {% if simulator_mode %} - Uses vllm-simulator:latest, no GPU resources, fast health checks
  • {% else %} - Uses vllm/vllm-openai:v0.6.2, requests GPUs, longer health checks
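
A minimal sketch of how the generator might render this template (the actual DeploymentGenerator in src/neuralnav/configuration/generator.py may differ; template variables other than simulator_mode are illustrative):

# Sketch only: render the KServe template with the mode flag.
from jinja2 import Environment, FileSystemLoader

env = Environment(loader=FileSystemLoader("src/neuralnav/configuration/templates"))
template = env.get_template("kserve-inferenceservice.yaml.j2")

# simulator_mode=True selects the {% if simulator_mode %} branch above.
yaml_text = template.render(simulator_mode=True, model_name="example-model")
print(yaml_text)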

Single codebase supports both modes - just toggle the flag!