Commit a783c03

Merge pull request #82 from anfredette/refactor
refactor: Align backend module structure with updated architecture
2 parents 67c11ff + dcf2609 commit a783c03


58 files changed (+2396 −1431 lines)

CLAUDE.md

Lines changed: 52 additions & 18 deletions
@@ -23,17 +23,34 @@ This repository contains the architecture design for **NeuralNav**, an open-sour
   - Entity-relationship diagrams for data models

 - **backend/**: Python backend implementation
-  - **api/**: FastAPI REST endpoints with CORS support
-  - **context_intent/**: Intent extraction, traffic profiles, Pydantic schemas
-  - **recommendation/**: Multi-criteria scoring and ranking
-    - `solution_scorer.py`: 4-dimension scoring (accuracy, price, latency, complexity)
-    - `model_evaluator.py`: Use-case fit scoring
-    - `usecase_quality_scorer.py`: Artificial Analysis benchmark integration
-    - `ranking_service.py`: 5 ranked list generation
-    - `capacity_planner.py`: GPU capacity planning with SLO filtering
-  - **knowledge_base/**: Data access (benchmark database, JSON catalogs)
+  - **api/**: FastAPI REST API layer
+    - `app.py`: FastAPI app factory
+    - `dependencies.py`: Singleton dependency injection
+    - **routes/**: Modular endpoint handlers (health, intent, specification, recommendation, configuration, reference_data)
+  - **intent_extraction/**: Intent Extraction Service
+    - `extractor.py`: LLM-powered intent extraction from natural language
+    - `service.py`: IntentExtractionService facade
+  - **specification/**: Specification Service
+    - `traffic_profile.py`: Traffic profile and SLO target generation
+    - `service.py`: SpecificationService facade
+  - **recommendation/**: Recommendation Service
+    - `config_finder.py`: GPU capacity planning with SLO filtering
+    - `scorer.py`: 4-dimension scoring (accuracy, price, latency, complexity)
+    - `analyzer.py`: 5 ranked list generation
+    - `service.py`: RecommendationService facade
+    - **quality/**: Use-case quality scoring (Artificial Analysis benchmarks)
+  - **configuration/**: Configuration Service
+    - `generator.py`: Jinja2 YAML generation for KServe/vLLM
+    - `validator.py`: YAML validation
+    - `service.py`: ConfigurationService facade
+    - **templates/**: Jinja2 deployment templates
+  - **cluster/**: Kubernetes cluster management
+    - `manager.py`: K8s deployment lifecycle management
+  - **shared/**: Shared modules
+    - **schemas/**: Pydantic data models (intent, specification, recommendation)
+    - **utils/**: Shared utilities (GPU normalization)
+  - **knowledge_base/**: Data access layer (benchmark database, JSON catalogs)
   - **orchestration/**: Workflow coordination
-  - **deployment/**: Jinja2 templates for KServe/vLLM YAML generation
   - **llm/**: Ollama client for intent extraction

 - **ui/**: Streamlit UI
@@ -148,11 +165,11 @@ The recommendation engine uses **multi-criteria scoring** to rank configurations
 - `balanced`: Sorted by weighted composite score

 **Key Files**:
-- `backend/src/recommendation/solution_scorer.py` - Calculates 4 scores
-- `backend/src/recommendation/model_evaluator.py` - Legacy accuracy scoring (use-case fit)
-- `backend/src/recommendation/usecase_quality_scorer.py` - Artificial Analysis benchmark scoring
-- `backend/src/recommendation/ranking_service.py` - Generates 5 ranked lists
-- `backend/src/recommendation/capacity_planner.py` - Orchestrates scoring during capacity planning
+
+- `backend/src/recommendation/scorer.py` - Calculates 4 scores
+- `backend/src/recommendation/quality/usecase_scorer.py` - Artificial Analysis benchmark scoring
+- `backend/src/recommendation/analyzer.py` - Generates 5 ranked lists
+- `backend/src/recommendation/config_finder.py` - Orchestrates scoring during capacity planning

 ## Working with This Repository

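To make the `balanced` list concrete, here is a rough sketch of a weighted composite over the four score dimensions; the weights, field names, and candidate data are illustrative assumptions, not the actual `scorer.py` logic.

```python
# Illustrative weighted composite over the four dimensions named above.
# Weights and dict layout are assumptions for this sketch only.
WEIGHTS = {"accuracy": 0.4, "price": 0.3, "latency": 0.2, "complexity": 0.1}


def composite_score(scores: dict) -> float:
    """Weighted sum of per-dimension scores, each expected in [0, 1]."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)


candidates = [
    {"name": "2x L4", "accuracy": 0.8, "price": 0.9, "latency": 0.6, "complexity": 0.9},
    {"name": "4x A100-80GB", "accuracy": 0.95, "price": 0.4, "latency": 0.9, "complexity": 0.7},
]

# A "balanced" list: sorted by composite score, best first.
ranked = sorted(candidates, key=composite_score, reverse=True)
print([c["name"] for c in ranked])  # ['2x L4', '4x A100-80GB']
```

The real scorer may normalize, invert, or weight these dimensions differently; this only shows the shape of a composite-then-sort ranking.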
@@ -196,6 +213,16 @@ The recommendation engine uses **multi-criteria scoring** to rank configurations
 - Use "**p95**" for 95th percentile metrics (Phase 2 standard, more conservative than p90)
 - GPU configurations: "2x NVIDIA L4" or "4x A100-80GB" (not "2 L4s")

+### API Endpoint Conventions
+
+All API endpoints **must** follow these rules:
+
+- **Prefix**: Every route file uses `APIRouter(prefix="/api/v1")`. Individual route decorators use relative paths (e.g., `@router.post("/recommend")`), **not** full paths.
+- **Health check exception**: `/health` stays at root with no prefix (standard for load balancer probes). This is the only endpoint outside `/api/v1/`.
+- **Versioning**: All endpoints are under `/api/v1/`. When a v2 is needed, add new route files with `prefix="/api/v2"`.
+- **Naming**: Use kebab-case for multi-word paths (e.g., `/deploy-to-cluster`, `/ranked-recommend-from-spec`).
+- **When adding a new route file**: Set `prefix="/api/v1"` on the `APIRouter` and use relative paths in all decorators. Register the router in `backend/src/api/routes/__init__.py` and include it in `backend/src/api/app.py`.
+
 ### Common Editing Patterns

 **Adding a new use case template**:
@@ -214,6 +241,13 @@ The recommendation engine uses **multi-criteria scoring** to rank configurations
 5. Update dashboard example if applicable
 6. Update docs/architecture-diagram.md data model ERD

+**Adding a new API endpoint**:
+1. Add the route to the appropriate file in `backend/src/api/routes/` (or create a new route file)
+2. Use a relative path in the decorator (e.g., `@router.get("/my-endpoint")`) — the `/api/v1` prefix comes from the router
+3. If creating a new route file, set `APIRouter(prefix="/api/v1")` and register it in `routes/__init__.py` and `app.py`
+4. Update `ui/app.py` if the UI calls the new endpoint
+5. Update documentation (docs/DEVELOPER_GUIDE.md, docs/ARCHITECTUREv2.md) with the new endpoint
+
 **Adding a new component**:
 1. Add numbered section to docs/ARCHITECTURE.md (maintain sequential numbering)
 2. Update "Architecture Components" count in Overview
@@ -294,7 +328,7 @@ The system now supports two deployment modes:
 - **Purpose**: GPU-free development and testing on local machines
 - **Location**: `simulator/` directory contains the vLLM simulator service
 - **Docker Image**: `vllm-simulator:latest` (single image for all models)
-- **Configuration**: Set `DeploymentGenerator(simulator_mode=True)` in `backend/src/api/routes.py`
+- **Configuration**: Set `DeploymentGenerator(simulator_mode=True)` in `backend/src/api/dependencies.py`
 - **Benefits**:
   - No GPU hardware required
   - Fast deployment (~10-15 seconds to Ready)
@@ -304,7 +338,7 @@ The system now supports two deployment modes:

 ### Real vLLM Mode (Production)
 - **Purpose**: Actual model inference with GPUs
-- **Configuration**: Set `DeploymentGenerator(simulator_mode=False)` in `backend/src/api/routes.py`
+- **Configuration**: Set `DeploymentGenerator(simulator_mode=False)` in `backend/src/api/dependencies.py`
 - **Requirements**:
   - GPU-enabled Kubernetes cluster
   - NVIDIA GPU Operator installed
@@ -332,7 +366,7 @@ The system now supports two deployment modes:

 ### Technical Details

-The deployment template (`backend/src/deployment/templates/kserve-inferenceservice.yaml.j2`) uses Jinja2 conditionals:
+The deployment template (`backend/src/configuration/templates/kserve-inferenceservice.yaml.j2`) uses Jinja2 conditionals:
 - `{% if simulator_mode %}` - Uses `vllm-simulator:latest`, no GPU resources, fast health checks
 - `{% else %}` - Uses `vllm/vllm-openai:v0.6.2`, requests GPUs, longer health checks
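The conditional can be exercised on its own; here is a minimal sketch rendering an illustrative `image:` line rather than the full KServe template.

```python
from jinja2 import Template

# Illustrative fragment of the simulator_mode conditional; the real
# template renders a full KServe InferenceService, not just this line.
fragment = Template(
    "image: {% if simulator_mode %}vllm-simulator:latest"
    "{% else %}vllm/vllm-openai:v0.6.2{% endif %}"
)

print(fragment.render(simulator_mode=True))   # image: vllm-simulator:latest
print(fragment.render(simulator_mode=False))  # image: vllm/vllm-openai:v0.6.2
```

Because the switch lives in the template, flipping `simulator_mode` in `dependencies.py` changes the rendered YAML without touching generator code.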

Makefile

Lines changed: 3 additions & 3 deletions
@@ -186,7 +186,7 @@ start-backend: ## Start FastAPI backend
 		printf "$(YELLOW)Backend already running (PID: $$(cat $(BACKEND_PID)))$(NC)\n"; \
 	else \
 		cd $(BACKEND_DIR) && \
-		( uv run uvicorn src.api.routes:app --reload --host 0.0.0.0 --port 8000 > ../$(LOG_DIR)/backend.log 2>&1 & echo $$! > ../$(BACKEND_PID) ); \
+		( uv run uvicorn src.api.app:app --reload --host 0.0.0.0 --port 8000 > ../$(LOG_DIR)/backend.log 2>&1 & echo $$! > ../$(BACKEND_PID) ); \
 		sleep 2; \
 		printf "$(GREEN)✓ Backend started (PID: $$(cat $(BACKEND_PID)))$(NC)\n"; \
 	fi
@@ -215,12 +215,12 @@ stop: ## Stop all services
 	fi
 	@# Kill any remaining NeuralNav processes by pattern matching
 	@pkill -f "streamlit run ui/app.py" 2>/dev/null || true
-	@pkill -f "uvicorn src.api.routes:app" 2>/dev/null || true
+	@pkill -f "uvicorn src.api.app:app" 2>/dev/null || true
 	@# Give processes time to exit gracefully
 	@sleep 1
 	@# Force kill if still running
 	@pkill -9 -f "streamlit run ui/app.py" 2>/dev/null || true
-	@pkill -9 -f "uvicorn src.api.routes:app" 2>/dev/null || true
+	@pkill -9 -f "uvicorn src.api.app:app" 2>/dev/null || true
 	@printf "$(GREEN)✓ All NeuralNav services stopped$(NC)\n"
 	@# Don't stop Ollama as it might be used by other apps
 	@printf "$(YELLOW)Note: Ollama left running (use 'pkill ollama' to stop manually)$(NC)\n"

backend/TESTING.md

Lines changed: 4 additions & 4 deletions
@@ -36,7 +36,7 @@ cd backend
 source venv/bin/activate

 python -c "
-from src.context_intent.extractor import IntentExtractor
+from src.intent_extraction import IntentExtractor

 extractor = IntentExtractor()
 intent = extractor.extract_intent(
@@ -55,8 +55,8 @@ print(f' Cost Priority: {intent.cost_priority}')

 ```bash
 python -c "
-from src.context_intent.schema import DeploymentIntent
-from src.recommendation.traffic_profile import TrafficProfileGenerator
+from src.shared.schemas import DeploymentIntent
+from src.specification import TrafficProfileGenerator

 intent = DeploymentIntent(
 use_case='chatbot_conversational',
@@ -85,7 +85,7 @@ print(f' E2E p90: {slo.e2e_p90_target_ms}ms')

 ```bash
 python -c "
-from src.context_intent.schema import DeploymentIntent
+from src.shared.schemas import DeploymentIntent
 from src.recommendation.model_evaluator import ModelEvaluator

 intent = DeploymentIntent(

backend/src/api/app.py

Lines changed: 66 additions & 0 deletions
New file (all 66 lines added):

"""FastAPI application factory for NeuralNav API."""

import logging
import os

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

from .routes import (
    configuration_router,
    health_router,
    intent_router,
    recommendation_router,
    reference_data_router,
    specification_router,
)

# Configure logging
debug_mode = os.getenv("NEURALNAV_DEBUG", "false").lower() == "true"
log_level = logging.DEBUG if debug_mode else logging.INFO
logging.basicConfig(
    level=log_level,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
logger = logging.getLogger(__name__)


def create_app() -> FastAPI:
    """Create and configure the FastAPI application."""
    app = FastAPI(
        title="NeuralNav API",
        description="API for LLM deployment recommendations",
        version="0.1.0",
    )

    # Add CORS middleware
    app.add_middleware(
        CORSMiddleware,
        allow_origins=["*"],  # In production, specify actual origins
        allow_credentials=True,
        allow_methods=["*"],
        allow_headers=["*"],
    )

    # Include all routers
    app.include_router(health_router)
    app.include_router(intent_router)
    app.include_router(specification_router)
    app.include_router(recommendation_router)
    app.include_router(configuration_router)
    app.include_router(reference_data_router)

    logger.info(f"NeuralNav API starting with log level: {logging.getLevelName(log_level)}")

    return app


# Create the app instance for uvicorn
app = create_app()


if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=8000)

backend/src/api/dependencies.py

Lines changed: 103 additions & 0 deletions
New file (all 103 lines added):

"""Shared dependencies for API routes.

This module provides singleton instances and dependency injection
for the API routes. All shared state is initialized here.
"""

import logging
import os

from ..cluster import KubernetesClusterManager, KubernetesDeploymentError
from ..configuration import DeploymentGenerator, YAMLValidator
from ..knowledge_base.model_catalog import ModelCatalog
from ..knowledge_base.slo_templates import SLOTemplateRepository
from ..orchestration.workflow import RecommendationWorkflow

# Configure logging
debug_mode = os.getenv("NEURALNAV_DEBUG", "false").lower() == "true"
log_level = logging.DEBUG if debug_mode else logging.INFO
logging.basicConfig(
    level=log_level,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
logger = logging.getLogger(__name__)

# Singleton instances
_workflow: RecommendationWorkflow | None = None
_model_catalog: ModelCatalog | None = None
_slo_repo: SLOTemplateRepository | None = None
_deployment_generator: DeploymentGenerator | None = None
_yaml_validator: YAMLValidator | None = None
_cluster_manager: KubernetesClusterManager | None = None


def get_workflow() -> RecommendationWorkflow:
    """Get the recommendation workflow singleton."""
    global _workflow
    if _workflow is None:
        _workflow = RecommendationWorkflow()
    return _workflow


def get_model_catalog() -> ModelCatalog:
    """Get the model catalog singleton."""
    global _model_catalog
    if _model_catalog is None:
        _model_catalog = ModelCatalog()
    return _model_catalog


def get_slo_repo() -> SLOTemplateRepository:
    """Get the SLO template repository singleton."""
    global _slo_repo
    if _slo_repo is None:
        _slo_repo = SLOTemplateRepository()
    return _slo_repo


def get_deployment_generator() -> DeploymentGenerator:
    """Get the deployment generator singleton."""
    global _deployment_generator
    if _deployment_generator is None:
        # Use simulator mode by default (no GPU required for development)
        _deployment_generator = DeploymentGenerator(simulator_mode=True)
    return _deployment_generator


def get_yaml_validator() -> YAMLValidator:
    """Get the YAML validator singleton."""
    global _yaml_validator
    if _yaml_validator is None:
        _yaml_validator = YAMLValidator()
    return _yaml_validator


def get_cluster_manager(namespace: str = "default") -> KubernetesClusterManager | None:
    """Get or create a cluster manager.

    Returns None if cluster is not accessible.
    """
    global _cluster_manager
    if _cluster_manager is None:
        try:
            _cluster_manager = KubernetesClusterManager(namespace=namespace)
            logger.info("Kubernetes cluster manager initialized successfully")
        except KubernetesDeploymentError as e:
            logger.warning(f"Kubernetes cluster not accessible: {e}")
            return None
    return _cluster_manager


def get_cluster_manager_or_raise(namespace: str = "default") -> KubernetesClusterManager:
    """Get or create a cluster manager, raising an exception if not accessible."""
    manager = get_cluster_manager(namespace)
    if manager is None:
        try:
            return KubernetesClusterManager(namespace=namespace)
        except KubernetesDeploymentError as e:
            from fastapi import HTTPException

            raise HTTPException(
                status_code=503, detail=f"Kubernetes cluster not accessible: {str(e)}"
            ) from e
    return manager