Commit 4be89e3

Merge pull request #37 from anfredette/merge-yuvalluria
Merge merge-yuvalluria branch into main

2 parents d45fe71 + 3340d3f commit 4be89e3

26 files changed (+8673 −2540 lines)

CLAUDE.md

Lines changed: 71 additions & 21 deletions
@@ -23,10 +23,18 @@ This repository contains the architecture design for **Compass**, an open-source
   - Entity-relationship diagrams for data models

 - **backend/**: Python backend implementation
-  - Component modules: context_intent, recommendation, knowledge_base, orchestration, api
-  - LLM integration with Ollama client
-  - FastAPI REST endpoints with CORS support
-  - Pydantic schemas for type safety
+  - **api/**: FastAPI REST endpoints with CORS support
+  - **context_intent/**: Intent extraction, traffic profiles, Pydantic schemas
+  - **recommendation/**: Multi-criteria scoring and ranking
+    - `solution_scorer.py`: 4-dimension scoring (accuracy, price, latency, complexity)
+    - `model_evaluator.py`: Use-case fit scoring
+    - `usecase_quality_scorer.py`: Artificial Analysis benchmark integration
+    - `ranking_service.py`: 5 ranked list generation
+    - `capacity_planner.py`: GPU capacity planning with SLO filtering
+  - **knowledge_base/**: Data access (benchmark database, JSON catalogs)
+  - **orchestration/**: Workflow coordination
+  - **deployment/**: Jinja2 templates for KServe/vLLM YAML generation
+  - **llm/**: Ollama client for intent extraction

 - **ui/**: Streamlit UI
   - Chat interface for conversational requirement gathering
@@ -35,11 +43,18 @@ This repository contains the architecture design for **Compass**, an open-source
   - Action buttons for YAML generation and deployment
   - Monitoring dashboard with cluster status, SLO compliance, and inference testing

-- **data/**: Synthetic benchmark and catalog data for POC
-  - benchmarks.json: 24 model+GPU combinations with vLLM performance data
-  - model_catalog.json: 10 approved models with metadata
-  - slo_templates.json: 7 use case templates
-  - demo_scenarios.json: 3 test scenarios
+- **data/**: Benchmark and catalog data
+  - **model_catalog.json**: 47 curated models with task/domain metadata
+  - **slo_templates.json**: 9 use case templates with SLO targets
+  - **benchmarks/models/**: Model benchmark data
+    - `opensource_all_benchmarks.csv`: 204 open-source models from Artificial Analysis
+    - `model_pricing.csv`: GPU pricing data
+  - **business_context/use_case/**: Use-case specific quality scoring
+    - `weighted_scores/`: 9 CSV files with pre-ranked models per use case
+    - `configs/`: Use case configuration files (weights, SLOs, workloads)
+    - `USE_CASE_METHODOLOGY.md`: Explains benchmark weighting strategy
+  - **benchmarks_BLIS.json**: Latency/throughput benchmarks from BLIS simulator (loaded into PostgreSQL)
+  - **demo_scenarios.json**: 3 test scenarios

 ## Architecture Key Concepts

@@ -75,13 +90,14 @@ Compass is structured as a layered architecture:

 **Core Engines** (Vertical - Backend Services):
 1. **Intent & Specification Engine** - Transform conversation into complete deployment spec
-   - LLM-powered intent extraction (Ollama llama3.1:8b)
+   - LLM-powered intent extraction (Ollama qwen2.5:7b)
    - Use case → traffic profile mapping (4 GuideLLM standards)
    - SLO template lookup and specification generation
 2. **Recommendation Engine** - Find optimal model + GPU configurations
-   - Model selection and ranking
+   - Multi-criteria scoring (accuracy, price, latency, complexity)
    - Capacity planning (GPU count, deployment topology)
-   - SLO compliance filtering
+   - SLO compliance filtering with near-miss tolerance
+   - Ranked lists generation (5 views: best accuracy, lowest cost, etc.)
 3. **Deployment Engine** - Generate and deploy Kubernetes configs
    - YAML generation (Jinja2 templates)
    - K8s deployment lifecycle management
@@ -99,11 +115,41 @@ Compass is structured as a layered architecture:
 - **vLLM Simulator** - GPU-free development and testing

 ### Critical Data Collections (Knowledge Base)
-- **Model Benchmarks** (PostgreSQL): TTFT/ITL/E2E/throughput for (model, GPU, traffic_profile) combinations
+- **Model Benchmarks** (PostgreSQL): TTFT/ITL/E2E/throughput benchmarks for (model, GPU, tensor_parallel) combinations (source: BLIS simulator)
 - **Use Case SLO Templates** (JSON): 9 use cases mapped to 4 GuideLLM traffic profiles with experience-driven SLO targets
-- **Model Catalog** (JSON): 40 curated, approved models with task/domain metadata
+- **Model Catalog** (JSON): 47 curated, approved models with task/domain metadata
+- **Model Quality Scores** (CSV): Use-case specific scores from Artificial Analysis benchmarks (204 models)
+- **Use Case Configs** (JSON): Benchmark weights, SLO targets, and workload profiles per use case
 - **Deployment Outcomes** (PostgreSQL, future): Actual performance data for feedback loop

+### Solution Ranking System
+
+The recommendation engine uses **multi-criteria scoring** to rank configurations:
+
+**4 Scoring Dimensions** (each 0-100 scale):
+1. **Accuracy/Quality**: Use-case specific model capability from Artificial Analysis benchmarks
+   - Source: `data/business_context/use_case/weighted_scores/*.csv`
+   - Fallback: Parameter count heuristic if model not in benchmark data
+2. **Price**: Cost efficiency (inverse of monthly cost, normalized)
+3. **Latency**: SLO compliance and headroom from performance benchmark database
+4. **Complexity**: Deployment simplicity (fewer GPUs = higher score)
+
+**Default Weights**: 40% accuracy, 40% price, 10% latency, 10% complexity
+
+**5 Ranked Views**:
+- `best_accuracy`: Sorted by model capability
+- `lowest_cost`: Sorted by price efficiency
+- `lowest_latency`: Sorted by SLO headroom
+- `simplest`: Sorted by deployment complexity
+- `balanced`: Sorted by weighted composite score
+
+**Key Files**:
+- `backend/src/recommendation/solution_scorer.py` - Calculates 4 scores
+- `backend/src/recommendation/model_evaluator.py` - Legacy accuracy scoring (use-case fit)
+- `backend/src/recommendation/usecase_quality_scorer.py` - Artificial Analysis benchmark scoring
+- `backend/src/recommendation/ranking_service.py` - Generates 5 ranked lists
+- `backend/src/recommendation/capacity_planner.py` - Orchestrates scoring during capacity planning
+
 ## Working with This Repository

 ### When Modifying Architecture Documents
@@ -149,10 +195,11 @@ Compass is structured as a layered architecture:
 ### Common Editing Patterns

 **Adding a new use case template**:
-1. Add to Intent & Specification Engine's USE_CASE_TEMPLATES in docs/ARCHITECTURE.md
-2. Add corresponding entry to data/slo_templates.json
-3. Update Knowledge Base → Use Case SLO Templates schema in docs/ARCHITECTURE.md
-4. Update examples if relevant
+1. Add corresponding entry to `data/slo_templates.json`
+2. Create weighted scores CSV in `data/business_context/use_case/weighted_scores/`
+3. Add use case to `UseCaseQualityScorer.USE_CASE_FILES` in `usecase_quality_scorer.py`
+4. Update `USE_CASE_METHODOLOGY.md` with benchmark weighting rationale
+5. Update docs/ARCHITECTURE.md if needed

 **Adding a new SLO metric**:
 1. Update DeploymentIntent schema in Intent & Specification Engine (docs/ARCHITECTURE.md)
@@ -213,17 +260,20 @@ Signed-off-by: Your Name <your.email@example.com>

 - **Current Implementation Status**:
   - ✅ Project structure with synthetic data and LLM client
-  - ✅ Core recommendation engine (intent extraction, traffic profiling, model recommendation, capacity planning)
+  - ✅ Core recommendation engine (intent extraction, traffic profiling, capacity planning)
+  - ✅ Multi-criteria solution ranking with 4 scoring dimensions
+  - ✅ Use-case specific quality scoring from Artificial Analysis benchmarks
+  - ✅ 5 ranked recommendation views (best accuracy, lowest cost, etc.)
   - ✅ Orchestration workflow and FastAPI backend
   - ✅ Streamlit UI with chat interface, recommendation display, and editable specifications
   - ✅ YAML generation (KServe/vLLM/HPA/ServiceMonitor) and deployment automation
   - ✅ KIND cluster support with KServe installation
   - ✅ Kubernetes deployment automation and real cluster status monitoring
   - ✅ vLLM simulator for GPU-free development
   - ✅ Inference testing UI with end-to-end deployment validation
-- The Knowledge Base schemas are critical - any implementation must support all 7 collections
+- The Knowledge Base schemas are critical - any implementation must support all collections
 - SLO-driven capacity planning is the core differentiator - don't simplify this away
-- Use synthetic data in data/ directory for POC; production would use a database (e.g., PostgreSQL)
+- Use data in data/ directory for POC; production uses PostgreSQL for latency benchmarks
 - Benchmarks use vLLM default configuration with dynamic batching (no fixed batch_size)

 ## Simulator Mode vs Real vLLM
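The weighted composite and the five ranked views described in the CLAUDE.md changes (4 dimensions, 40/40/10/10 default weights) can be sketched standalone. Function and dictionary names here are illustrative only, not the repository's actual API:

```python
# Illustrative sketch of the 4-dimension weighted composite and the 5 ranked
# views; the real logic lives in solution_scorer.py and ranking_service.py.
DEFAULT_WEIGHTS = {"accuracy": 0.40, "price": 0.40, "latency": 0.10, "complexity": 0.10}


def balanced_score(scores, weights=None):
    """Weighted composite of the 4 per-dimension scores (each 0-100)."""
    w = weights or DEFAULT_WEIGHTS
    return sum(scores[dim] * w[dim] for dim in w)


def ranked_views(configs):
    """Build the 5 ranked views. Every dimension is scored so higher = better
    (price score is cost efficiency, complexity score is simplicity)."""
    by_dim = {
        "best_accuracy": "accuracy",
        "lowest_cost": "price",
        "lowest_latency": "latency",
        "simplest": "complexity",
    }
    views = {name: sorted(configs, key=lambda c: c["scores"][dim], reverse=True)
             for name, dim in by_dim.items()}
    views["balanced"] = sorted(configs, key=lambda c: balanced_score(c["scores"]),
                               reverse=True)
    return views
```

Note how a single per-dimension score table drives all five views: four are plain sorts on one dimension, and `balanced` reuses the same scores through the weighted sum.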

Makefile

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@ SIMULATOR_IMAGE ?= vllm-simulator
 SIMULATOR_TAG ?= latest
 SIMULATOR_FULL_IMAGE := $(REGISTRY)/$(REGISTRY_ORG)/$(SIMULATOR_IMAGE):$(SIMULATOR_TAG)

-OLLAMA_MODEL ?= llama3.1:8b
+OLLAMA_MODEL ?= qwen2.5:7b
 KIND_CLUSTER_NAME ?= compass-poc

 BACKEND_DIR := backend
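Because the Makefile uses `?=` (conditional assignment), the new default model can still be overridden per invocation. A minimal demonstration with a throwaway Makefile — `show-model` is a demo target, not one from the repo:

```shell
# Make's ?= assigns only when the variable is not already set, so both the
# command line and the environment can override the default.
printf 'OLLAMA_MODEL ?= qwen2.5:7b\nshow-model:\n\t@echo $(OLLAMA_MODEL)\n' > /tmp/demo.mk
make -f /tmp/demo.mk show-model                           # prints the default
make -f /tmp/demo.mk show-model OLLAMA_MODEL=llama3.1:8b  # override wins
```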

README.md

Lines changed: 2 additions & 2 deletions
@@ -101,7 +101,7 @@ Compass implements an **8-component architecture** with:
 - **Recommendation Engine** - Traffic profiling, model scoring, capacity planning
 - **Deployment Automation** - YAML generation and Kubernetes deployment
 - **Knowledge Base** - Benchmarks, SLO templates, model catalog
-- **LLM Backend** - Ollama (llama3.1:8b) for conversational AI
+- **LLM Backend** - Ollama (qwen2.5:7b) for conversational AI and business context extraction
 - **Orchestration** - Multi-step workflow coordination
 - **Inference Observability** - Real-time deployment monitoring

@@ -127,7 +127,7 @@ See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for detailed system design.
 |-----------|-----------|
 | Backend | FastAPI, Pydantic |
 | Frontend | Streamlit |
-| LLM | Ollama (llama3.1:8b) |
+| LLM | Ollama (qwen2.5:7b) |
 | Data | **PostgreSQL (Phase 2)**, psycopg2, JSON (Phase 1 - deprecated) |
 | YAML Generation | Jinja2 templates |
 | Kubernetes | KIND (local), KServe v0.13.0 |

backend/src/llm/ollama_client.py

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@
 class OllamaClient:
     """Client for interacting with Ollama LLM service."""

-    def __init__(self, model: str = "llama3.1:8b", host: str | None = None):
+    def __init__(self, model: str = "qwen2.5:7b", host: str | None = None):
         """
         Initialize Ollama client.

backend/src/recommendation/model_evaluator.py

Lines changed: 63 additions & 23 deletions
@@ -1,24 +1,50 @@
-"""Model recommendation engine."""
+"""Model recommendation engine.
+
+INTEGRATION NOTE (Yuval's Quality Scoring):
+This module integrates use-case-specific quality scores from Artificial Analysis
+benchmarks (weighted_scores CSVs). Quality scoring uses actual benchmark data
+(MMLU-Pro, LiveCodeBench, IFBench, etc.) rather than model size heuristics.
+
+Andre's latency/throughput benchmarks from PostgreSQL are kept as-is.
+The final recommendation combines: Yuval's quality + Andre's latency/cost/complexity.
+"""

 import logging
+from typing import Optional

 from ..context_intent.schema import DeploymentIntent
 from ..knowledge_base.model_catalog import ModelCatalog, ModelInfo

 logger = logging.getLogger(__name__)

+# Try to import use-case quality scorer (Yuval's contribution)
+try:
+    from .usecase_quality_scorer import score_model_quality, get_quality_scorer
+    USE_CASE_QUALITY_AVAILABLE = True
+    logger.info("Use-case quality scorer loaded (Artificial Analysis benchmarks)")
+except ImportError:
+    USE_CASE_QUALITY_AVAILABLE = False
+    logger.warning("Use-case quality scorer not available, using size-based heuristics")

-class ModelEvaluator:
-    """Evaluate models for deployment intent and calculate accuracy scores."""

-    def __init__(self, catalog: ModelCatalog | None = None):
+class ModelEvaluator:
+    """Evaluate models for deployment intent and calculate accuracy scores.
+
+    Quality Scoring (updated):
+    - If use-case quality data is available: Uses Artificial Analysis benchmark scores
+      weighted by use case (e.g., code_completion uses LiveCodeBench 35%, SciCode 30%)
+    - Fallback: Uses model size heuristics if quality data unavailable
+    """
+
+    def __init__(self, catalog: "Optional[ModelCatalog]" = None):
         """
         Initialize model evaluator.

         Args:
             catalog: Model catalog (creates default if not provided)
         """
         self.catalog = catalog or ModelCatalog()
+        self._quality_scorer = get_quality_scorer() if USE_CASE_QUALITY_AVAILABLE else None

     def score_model(self, model: ModelInfo, intent: DeploymentIntent) -> float:
         """
@@ -33,40 +59,54 @@ def score_model(self, model: ModelInfo, intent: DeploymentIntent) -> float:
         """
         score = 0.0

-        # 1. Use case match (40 points)
-        if intent.use_case in model.recommended_for:
-            score += 40
-        elif any(task in model.supported_tasks for task in ["chat", "instruction_following"]):
-            score += 20  # Generic capability
+        # 1. Use case quality match (50 points) - ENHANCED with Artificial Analysis data
+        quality_score = self._get_usecase_quality_score(model.name, intent.use_case)
+        score += 50 * (quality_score / 100)  # Normalize to 50 points max

-        # 2. Domain specialization match (20 points)
-        domain_overlap = set(intent.domain_specialization) & set(model.domain_specialization)
-        if domain_overlap:
-            score += 20 * (len(domain_overlap) / len(intent.domain_specialization))
+        # 2. Domain specialization match (15 points)
+        if intent.domain_specialization:
+            domain_overlap = set(intent.domain_specialization) & set(model.domain_specialization)
+            if domain_overlap:
+                score += 15 * (len(domain_overlap) / len(intent.domain_specialization))

-        # 3. Latency requirement vs model size (20 points)
-        # Smaller models are better for low latency
+        # 3. Latency requirement vs model size (20 points) - Andre's logic preserved
         size_score = self._score_model_size_for_latency(
             model.size_parameters, intent.latency_requirement
         )
         score += 20 * size_score

-        # 4. Budget constraint (10 points)
-        # Smaller models are more cost-effective
+        # 4. Budget constraint (10 points) - Andre's logic preserved
         budget_score = self._score_model_for_budget(model.size_parameters, intent.budget_constraint)
         score += 10 * budget_score

-        # 5. Context length requirement (10 points)
-        # Longer context is better for some use cases
-        if intent.use_case in ["summarization", "qa_retrieval"]:
+        # 5. Context length requirement (5 points)
+        if intent.use_case in ["summarization", "qa_retrieval", "document_analysis_rag",
+                               "long_document_summarization", "research_legal_analysis"]:
             if model.context_length >= 32000:
-                score += 10
-            elif model.context_length >= 8192:
                 score += 5
+            elif model.context_length >= 8192:
+                score += 2.5

-        logger.debug(f"Scored {model.name}: {score:.1f}")
+        logger.debug(f"Scored {model.name}: {score:.1f} (quality: {quality_score:.1f})")
         return score

+    def _get_usecase_quality_score(self, model_name: str, use_case: str) -> float:
+        """
+        Get use-case-specific quality score from Artificial Analysis benchmarks.
+
+        Args:
+            model_name: Model name
+            use_case: Use case identifier
+
+        Returns:
+            Quality score 0-100 (from weighted benchmarks or fallback heuristic)
+        """
+        if self._quality_scorer:
+            return self._quality_scorer.get_quality_score(model_name, use_case)
+
+        # Fallback to simple heuristic if quality scorer not available
+        return 60.0  # Default moderate score
+
     def _score_model_size_for_latency(self, size_str: str, latency_requirement: str) -> float:
         """
         Score model size appropriateness for latency requirement.
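The new point allocation in `score_model` (50 quality + 15 domain + 20 size-vs-latency + 10 budget + 5 context, 100 max) can be exercised in isolation. This is a simplified sketch, not the module's real signature: the helper results are passed in as 0-1 fractions instead of being computed from catalog and intent data:

```python
# Simplified sketch of score_model's point buckets (max 100). Helper results
# (domain overlap, size fit, budget fit) are supplied as 0-1 fractions here.
LONG_CONTEXT_USE_CASES = {
    "summarization", "qa_retrieval", "document_analysis_rag",
    "long_document_summarization", "research_legal_analysis",
}


def score_model_sketch(quality, domain_overlap, size_fit, budget_fit,
                       use_case, context_length):
    score = 50 * (quality / 100)   # 1. use-case quality (benchmark-backed)
    score += 15 * domain_overlap   # 2. domain specialization match
    score += 20 * size_fit         # 3. model size vs latency requirement
    score += 10 * budget_fit       # 4. budget constraint
    if use_case in LONG_CONTEXT_USE_CASES:  # 5. context length (5 max)
        if context_length >= 32000:
            score += 5
        elif context_length >= 8192:
            score += 2.5
    return score
```

A perfect model for a long-context use case reaches exactly 100, which is why the diff rebalances the old 40/20/20/10/10 buckets to 50/15/20/10/5.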

backend/src/recommendation/solution_scorer.py

Lines changed: 45 additions & 6 deletions
@@ -1,17 +1,29 @@
 """Solution scoring for multi-criteria recommendation ranking.

 Scores deployment configurations on 4 criteria (0-100 scale):
-- Accuracy: Model capability based on parameter count
-- Price: Cost efficiency (inverse of cost, normalized)
-- Latency: SLO compliance and headroom
+- Accuracy/Quality: Model capability (from Artificial Analysis benchmarks or param count fallback)
+- Price: Cost efficiency (inverse of cost, normalized)
+- Latency: SLO compliance and headroom (from Andre's PostgreSQL benchmarks)
 - Complexity: Deployment simplicity (fewer GPUs = simpler)
+
+INTEGRATION NOTE:
+- Quality scoring: Uses Yuval's weighted_scores CSVs (Artificial Analysis benchmarks)
+- Latency/Price/Complexity: Uses Andre's scoring logic and benchmark data
 """

 import logging
 import re
+from typing import Optional

 logger = logging.getLogger(__name__)

+# Try to import use-case quality scorer
+try:
+    from .usecase_quality_scorer import score_model_quality
+    USE_CASE_QUALITY_AVAILABLE = True
+except ImportError:
+    USE_CASE_QUALITY_AVAILABLE = False
+

 class SolutionScorer:
     """Score deployment configurations on 4 criteria (0-100 scale)."""
@@ -55,9 +67,36 @@ class SolutionScorer:
         "complexity": 0.10,
     }

-    def score_accuracy(self, model_size_str: str) -> int:
+    def score_accuracy(self, model_size_str: str, model_name: Optional[str] = None,
+                       use_case: Optional[str] = None) -> int:
+        """
+        Score model accuracy/quality.
+
+        Priority:
+        1. Use-case specific benchmark score (Artificial Analysis data) if available
+        2. Fallback to model size-based heuristic (Andre's original logic)
+
+        Args:
+            model_size_str: Model size string (e.g., "8B", "70B", "8x7B")
+            model_name: Optional model name for use-case-specific scoring
+            use_case: Optional use case for benchmark-based scoring
+
+        Returns:
+            Score 0-100
+        """
+        # Try use-case-specific quality scoring first (Yuval's contribution)
+        if USE_CASE_QUALITY_AVAILABLE and model_name and use_case:
+            quality_score = score_model_quality(model_name, use_case)
+            if quality_score > 0:
+                logger.debug(f"Quality score for {model_name} ({use_case}): {quality_score:.1f}")
+                return int(quality_score)
+
+        # Fallback to size-based heuristic (Andre's original logic)
+        return self._score_accuracy_by_size(model_size_str)
+
+    def _score_accuracy_by_size(self, model_size_str: str) -> int:
         """
-        Score model accuracy based on parameter count tier.
+        Score model accuracy based on parameter count tier (fallback).

         Args:
             model_size_str: Model size string (e.g., "8B", "70B", "8x7B")
@@ -200,7 +239,7 @@ def score_balanced(
         price_score: int,
         latency_score: int,
         complexity_score: int,
-        weights: dict[str, float] | None = None,
+        weights: Optional[dict] = None,
     ) -> float:
         """
         Calculate weighted composite score.
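Both scoring modules guard the new scorer behind a try/except import so the size heuristic still works when the quality data or module is absent. The pattern, reduced to a generic sketch — the imported module name here is deliberately nonexistent so this sketch always exercises the fallback branch:

```python
import logging

logger = logging.getLogger(__name__)

# Generic sketch of the graceful-fallback import used by solution_scorer.py
# and model_evaluator.py. "nonexistent_quality_scorer" is a fake module name,
# so the except branch runs and the fallback path is taken.
try:
    from nonexistent_quality_scorer import score_model_quality
    USE_CASE_QUALITY_AVAILABLE = True
except ImportError:
    USE_CASE_QUALITY_AVAILABLE = False
    logger.warning("Quality scorer unavailable; falling back to size heuristic")


def accuracy_score(model_name, use_case, size_based_fallback):
    """Prefer the benchmark-backed quality score; fall back to the heuristic."""
    if USE_CASE_QUALITY_AVAILABLE:
        q = score_model_quality(model_name, use_case)
        if q > 0:  # 0 is treated as "model not found in benchmark data"
            return int(q)
    return size_based_fallback
```

The double fallback matters: the import can fail (optional dependency missing), and even when it succeeds a model may be absent from the CSVs, signalled here by a zero score.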
