Commit 4be89e3

Merge pull request #37 from anfredette/merge-yuvalluria
Merge merge-yuvalluria branch into main

2 parents d45fe71 + 3340d3f commit 4be89e3

26 files changed (+8673 −2540 lines)

CLAUDE.md

Lines changed: 71 additions & 21 deletions
@@ -23,10 +23,18 @@ This repository contains the architecture design for **Compass**, an open-source
   - Entity-relationship diagrams for data models

 - **backend/**: Python backend implementation
-  - Component modules: context_intent, recommendation, knowledge_base, orchestration, api
-  - LLM integration with Ollama client
-  - FastAPI REST endpoints with CORS support
-  - Pydantic schemas for type safety
+  - **api/**: FastAPI REST endpoints with CORS support
+  - **context_intent/**: Intent extraction, traffic profiles, Pydantic schemas
+  - **recommendation/**: Multi-criteria scoring and ranking
+    - `solution_scorer.py`: 4-dimension scoring (accuracy, price, latency, complexity)
+    - `model_evaluator.py`: Use-case fit scoring
+    - `usecase_quality_scorer.py`: Artificial Analysis benchmark integration
+    - `ranking_service.py`: 5 ranked list generation
+    - `capacity_planner.py`: GPU capacity planning with SLO filtering
+  - **knowledge_base/**: Data access (benchmark database, JSON catalogs)
+  - **orchestration/**: Workflow coordination
+  - **deployment/**: Jinja2 templates for KServe/vLLM YAML generation
+  - **llm/**: Ollama client for intent extraction

 - **ui/**: Streamlit UI
   - Chat interface for conversational requirement gathering
@@ -35,11 +43,18 @@ This repository contains the architecture design for **Compass**, an open-source
   - Action buttons for YAML generation and deployment
   - Monitoring dashboard with cluster status, SLO compliance, and inference testing

-- **data/**: Synthetic benchmark and catalog data for POC
-  - benchmarks.json: 24 model+GPU combinations with vLLM performance data
-  - model_catalog.json: 10 approved models with metadata
-  - slo_templates.json: 7 use case templates
-  - demo_scenarios.json: 3 test scenarios
+- **data/**: Benchmark and catalog data
+  - **model_catalog.json**: 47 curated models with task/domain metadata
+  - **slo_templates.json**: 9 use case templates with SLO targets
+  - **benchmarks/models/**: Model benchmark data
+    - `opensource_all_benchmarks.csv`: 204 open-source models from Artificial Analysis
+    - `model_pricing.csv`: GPU pricing data
+  - **business_context/use_case/**: Use-case specific quality scoring
+    - `weighted_scores/`: 9 CSV files with pre-ranked models per use case
+    - `configs/`: Use case configuration files (weights, SLOs, workloads)
+    - `USE_CASE_METHODOLOGY.md`: Explains benchmark weighting strategy
+  - **benchmarks_BLIS.json**: Latency/throughput benchmarks from BLIS simulator (loaded into PostgreSQL)
+  - **demo_scenarios.json**: 3 test scenarios

 ## Architecture Key Concepts

@@ -75,13 +90,14 @@ Compass is structured as a layered architecture:

 **Core Engines** (Vertical - Backend Services):
 1. **Intent & Specification Engine** - Transform conversation into complete deployment spec
-   - LLM-powered intent extraction (Ollama llama3.1:8b)
+   - LLM-powered intent extraction (Ollama qwen2.5:7b)
    - Use case → traffic profile mapping (4 GuideLLM standards)
    - SLO template lookup and specification generation
 2. **Recommendation Engine** - Find optimal model + GPU configurations
-   - Model selection and ranking
+   - Multi-criteria scoring (accuracy, price, latency, complexity)
    - Capacity planning (GPU count, deployment topology)
-   - SLO compliance filtering
+   - SLO compliance filtering with near-miss tolerance
+   - Ranked lists generation (5 views: best accuracy, lowest cost, etc.)
 3. **Deployment Engine** - Generate and deploy Kubernetes configs
    - YAML generation (Jinja2 templates)
    - K8s deployment lifecycle management
@@ -99,11 +115,41 @@ Compass is structured as a layered architecture:
 - **vLLM Simulator** - GPU-free development and testing

 ### Critical Data Collections (Knowledge Base)
-- **Model Benchmarks** (PostgreSQL): TTFT/ITL/E2E/throughput for (model, GPU, traffic_profile) combinations
+- **Model Benchmarks** (PostgreSQL): TTFT/ITL/E2E/throughput benchmarks for (model, GPU, tensor_parallel) combinations (source: BLIS simulator)
 - **Use Case SLO Templates** (JSON): 9 use cases mapped to 4 GuideLLM traffic profiles with experience-driven SLO targets
-- **Model Catalog** (JSON): 40 curated, approved models with task/domain metadata
+- **Model Catalog** (JSON): 47 curated, approved models with task/domain metadata
+- **Model Quality Scores** (CSV): Use-case specific scores from Artificial Analysis benchmarks (204 models)
+- **Use Case Configs** (JSON): Benchmark weights, SLO targets, and workload profiles per use case
 - **Deployment Outcomes** (PostgreSQL, future): Actual performance data for feedback loop

+### Solution Ranking System
+
+The recommendation engine uses **multi-criteria scoring** to rank configurations:
+
+**4 Scoring Dimensions** (each 0-100 scale):
+1. **Accuracy/Quality**: Use-case specific model capability from Artificial Analysis benchmarks
+   - Source: `data/business_context/use_case/weighted_scores/*.csv`
+   - Fallback: Parameter count heuristic if model not in benchmark data
+2. **Price**: Cost efficiency (inverse of monthly cost, normalized)
+3. **Latency**: SLO compliance and headroom from performance benchmark database
+4. **Complexity**: Deployment simplicity (fewer GPUs = higher score)
+
+**Default Weights**: 40% accuracy, 40% price, 10% latency, 10% complexity
+
+**5 Ranked Views**:
+- `best_accuracy`: Sorted by model capability
+- `lowest_cost`: Sorted by price efficiency
+- `lowest_latency`: Sorted by SLO headroom
+- `simplest`: Sorted by deployment complexity
+- `balanced`: Sorted by weighted composite score
+
+**Key Files**:
+- `backend/src/recommendation/solution_scorer.py` - Calculates 4 scores
+- `backend/src/recommendation/model_evaluator.py` - Legacy accuracy scoring (use-case fit)
+- `backend/src/recommendation/usecase_quality_scorer.py` - Artificial Analysis benchmark scoring
+- `backend/src/recommendation/ranking_service.py` - Generates 5 ranked lists
+- `backend/src/recommendation/capacity_planner.py` - Orchestrates scoring during capacity planning
+
 ## Working with This Repository

 ### When Modifying Architecture Documents
@@ -149,10 +195,11 @@ Compass is structured as a layered architecture:
 ### Common Editing Patterns

 **Adding a new use case template**:
-1. Add to Intent & Specification Engine's USE_CASE_TEMPLATES in docs/ARCHITECTURE.md
-2. Add corresponding entry to data/slo_templates.json
-3. Update Knowledge Base → Use Case SLO Templates schema in docs/ARCHITECTURE.md
-4. Update examples if relevant
+1. Add corresponding entry to `data/slo_templates.json`
+2. Create weighted scores CSV in `data/business_context/use_case/weighted_scores/`
+3. Add use case to `UseCaseQualityScorer.USE_CASE_FILES` in `usecase_quality_scorer.py`
+4. Update `USE_CASE_METHODOLOGY.md` with benchmark weighting rationale
+5. Update docs/ARCHITECTURE.md if needed

 **Adding a new SLO metric**:
 1. Update DeploymentIntent schema in Intent & Specification Engine (docs/ARCHITECTURE.md)
@@ -213,17 +260,20 @@ Signed-off-by: Your Name <your.email@example.com>

 - **Current Implementation Status**:
   - ✅ Project structure with synthetic data and LLM client
-  - ✅ Core recommendation engine (intent extraction, traffic profiling, model recommendation, capacity planning)
+  - ✅ Core recommendation engine (intent extraction, traffic profiling, capacity planning)
+  - ✅ Multi-criteria solution ranking with 4 scoring dimensions
+  - ✅ Use-case specific quality scoring from Artificial Analysis benchmarks
+  - ✅ 5 ranked recommendation views (best accuracy, lowest cost, etc.)
   - ✅ Orchestration workflow and FastAPI backend
   - ✅ Streamlit UI with chat interface, recommendation display, and editable specifications
   - ✅ YAML generation (KServe/vLLM/HPA/ServiceMonitor) and deployment automation
   - ✅ KIND cluster support with KServe installation
   - ✅ Kubernetes deployment automation and real cluster status monitoring
   - ✅ vLLM simulator for GPU-free development
   - ✅ Inference testing UI with end-to-end deployment validation
-- The Knowledge Base schemas are critical - any implementation must support all 7 collections
+- The Knowledge Base schemas are critical - any implementation must support all collections
 - SLO-driven capacity planning is the core differentiator - don't simplify this away
-- Use synthetic data in data/ directory for POC; production would use a database (e.g., PostgreSQL)
+- Use data in data/ directory for POC; production uses PostgreSQL for latency benchmarks
 - Benchmarks use vLLM default configuration with dynamic batching (no fixed batch_size)

 ## Simulator Mode vs Real vLLM
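The weighted composite and the five ranked views described in the CLAUDE.md changes (4 dimensions, 40/40/10/10 default weights) can be sketched standalone. Function and dictionary names here are illustrative only, not the repository's actual API:

```python
# Illustrative sketch of the 4-dimension weighted composite and the 5 ranked
# views; the real logic lives in solution_scorer.py and ranking_service.py.
DEFAULT_WEIGHTS = {"accuracy": 0.40, "price": 0.40, "latency": 0.10, "complexity": 0.10}


def balanced_score(scores, weights=None):
    """Weighted composite of the 4 per-dimension scores (each 0-100)."""
    w = weights or DEFAULT_WEIGHTS
    return sum(scores[dim] * w[dim] for dim in w)


def ranked_views(configs):
    """Build the 5 ranked views. Every dimension is scored so higher = better
    (price score is cost efficiency, complexity score is simplicity)."""
    by_dim = {
        "best_accuracy": "accuracy",
        "lowest_cost": "price",
        "lowest_latency": "latency",
        "simplest": "complexity",
    }
    views = {name: sorted(configs, key=lambda c: c["scores"][dim], reverse=True)
             for name, dim in by_dim.items()}
    views["balanced"] = sorted(configs, key=lambda c: balanced_score(c["scores"]),
                               reverse=True)
    return views
```

Note how a single per-dimension score table drives all five views: four are plain sorts on one dimension, and `balanced` reuses the same scores through the weighted sum.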

Makefile

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@ SIMULATOR_IMAGE ?= vllm-simulator
 SIMULATOR_TAG ?= latest
 SIMULATOR_FULL_IMAGE := $(REGISTRY)/$(REGISTRY_ORG)/$(SIMULATOR_IMAGE):$(SIMULATOR_TAG)

-OLLAMA_MODEL ?= llama3.1:8b
+OLLAMA_MODEL ?= qwen2.5:7b
 KIND_CLUSTER_NAME ?= compass-poc

 BACKEND_DIR := backend
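Because the Makefile uses `?=` (conditional assignment), the new default model can still be overridden per invocation. A minimal demonstration with a throwaway Makefile — `show-model` is a demo target, not one from the repo:

```shell
# Make's ?= assigns only when the variable is not already set, so both the
# command line and the environment can override the default.
printf 'OLLAMA_MODEL ?= qwen2.5:7b\nshow-model:\n\t@echo $(OLLAMA_MODEL)\n' > /tmp/demo.mk
make -f /tmp/demo.mk show-model                           # prints the default
make -f /tmp/demo.mk show-model OLLAMA_MODEL=llama3.1:8b  # override wins
```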

README.md

Lines changed: 2 additions & 2 deletions
@@ -101,7 +101,7 @@ Compass implements an **8-component architecture** with:
 - **Recommendation Engine** - Traffic profiling, model scoring, capacity planning
 - **Deployment Automation** - YAML generation and Kubernetes deployment
 - **Knowledge Base** - Benchmarks, SLO templates, model catalog
-- **LLM Backend** - Ollama (llama3.1:8b) for conversational AI
+- **LLM Backend** - Ollama (qwen2.5:7b) for conversational AI and business context extraction
 - **Orchestration** - Multi-step workflow coordination
 - **Inference Observability** - Real-time deployment monitoring

@@ -127,7 +127,7 @@ See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for detailed system design.
 |-----------|-----------|
 | Backend | FastAPI, Pydantic |
 | Frontend | Streamlit |
-| LLM | Ollama (llama3.1:8b) |
+| LLM | Ollama (qwen2.5:7b) |
 | Data | **PostgreSQL (Phase 2)**, psycopg2, JSON (Phase 1 - deprecated) |
 | YAML Generation | Jinja2 templates |
 | Kubernetes | KIND (local), KServe v0.13.0 |

backend/src/llm/ollama_client.py

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@
 class OllamaClient:
     """Client for interacting with Ollama LLM service."""

-    def __init__(self, model: str = "llama3.1:8b", host: str | None = None):
+    def __init__(self, model: str = "qwen2.5:7b", host: str | None = None):
         """
         Initialize Ollama client.

backend/src/recommendation/model_evaluator.py

Lines changed: 63 additions & 23 deletions
@@ -1,24 +1,50 @@
-"""Model recommendation engine."""
+"""Model recommendation engine.
+
+INTEGRATION NOTE (Yuval's Quality Scoring):
+This module integrates use-case-specific quality scores from Artificial Analysis
+benchmarks (weighted_scores CSVs). Quality scoring uses actual benchmark data
+(MMLU-Pro, LiveCodeBench, IFBench, etc.) rather than model size heuristics.
+
+Andre's latency/throughput benchmarks from PostgreSQL are kept as-is.
+The final recommendation combines: Yuval's quality + Andre's latency/cost/complexity.
+"""

 import logging
+from typing import Optional

 from ..context_intent.schema import DeploymentIntent
 from ..knowledge_base.model_catalog import ModelCatalog, ModelInfo

 logger = logging.getLogger(__name__)

+# Try to import use-case quality scorer (Yuval's contribution)
+try:
+    from .usecase_quality_scorer import score_model_quality, get_quality_scorer
+    USE_CASE_QUALITY_AVAILABLE = True
+    logger.info("Use-case quality scorer loaded (Artificial Analysis benchmarks)")
+except ImportError:
+    USE_CASE_QUALITY_AVAILABLE = False
+    logger.warning("Use-case quality scorer not available, using size-based heuristics")

-class ModelEvaluator:
-    """Evaluate models for deployment intent and calculate accuracy scores."""

-    def __init__(self, catalog: ModelCatalog | None = None):
+class ModelEvaluator:
+    """Evaluate models for deployment intent and calculate accuracy scores.
+
+    Quality Scoring (updated):
+    - If use-case quality data is available: Uses Artificial Analysis benchmark scores
+      weighted by use case (e.g., code_completion uses LiveCodeBench 35%, SciCode 30%)
+    - Fallback: Uses model size heuristics if quality data unavailable
+    """
+
+    def __init__(self, catalog: "Optional[ModelCatalog]" = None):
         """
         Initialize model evaluator.

         Args:
             catalog: Model catalog (creates default if not provided)
         """
         self.catalog = catalog or ModelCatalog()
+        self._quality_scorer = get_quality_scorer() if USE_CASE_QUALITY_AVAILABLE else None

     def score_model(self, model: ModelInfo, intent: DeploymentIntent) -> float:
         """
@@ -33,40 +59,54 @@ def score_model(self, model: ModelInfo, intent: DeploymentIntent) -> float:
         """
         score = 0.0

-        # 1. Use case match (40 points)
-        if intent.use_case in model.recommended_for:
-            score += 40
-        elif any(task in model.supported_tasks for task in ["chat", "instruction_following"]):
-            score += 20  # Generic capability
+        # 1. Use case quality match (50 points) - ENHANCED with Artificial Analysis data
+        quality_score = self._get_usecase_quality_score(model.name, intent.use_case)
+        score += 50 * (quality_score / 100)  # Normalize to 50 points max

-        # 2. Domain specialization match (20 points)
-        domain_overlap = set(intent.domain_specialization) & set(model.domain_specialization)
-        if domain_overlap:
-            score += 20 * (len(domain_overlap) / len(intent.domain_specialization))
+        # 2. Domain specialization match (15 points)
+        if intent.domain_specialization:
+            domain_overlap = set(intent.domain_specialization) & set(model.domain_specialization)
+            if domain_overlap:
+                score += 15 * (len(domain_overlap) / len(intent.domain_specialization))

-        # 3. Latency requirement vs model size (20 points)
-        # Smaller models are better for low latency
+        # 3. Latency requirement vs model size (20 points) - Andre's logic preserved
         size_score = self._score_model_size_for_latency(
             model.size_parameters, intent.latency_requirement
         )
         score += 20 * size_score

-        # 4. Budget constraint (10 points)
-        # Smaller models are more cost-effective
+        # 4. Budget constraint (10 points) - Andre's logic preserved
         budget_score = self._score_model_for_budget(model.size_parameters, intent.budget_constraint)
         score += 10 * budget_score

-        # 5. Context length requirement (10 points)
-        # Longer context is better for some use cases
-        if intent.use_case in ["summarization", "qa_retrieval"]:
+        # 5. Context length requirement (5 points)
+        if intent.use_case in ["summarization", "qa_retrieval", "document_analysis_rag",
+                               "long_document_summarization", "research_legal_analysis"]:
             if model.context_length >= 32000:
-                score += 10
-            elif model.context_length >= 8192:
                 score += 5
+            elif model.context_length >= 8192:
+                score += 2.5

-        logger.debug(f"Scored {model.name}: {score:.1f}")
+        logger.debug(f"Scored {model.name}: {score:.1f} (quality: {quality_score:.1f})")
         return score

+    def _get_usecase_quality_score(self, model_name: str, use_case: str) -> float:
+        """
+        Get use-case-specific quality score from Artificial Analysis benchmarks.
+
+        Args:
+            model_name: Model name
+            use_case: Use case identifier
+
+        Returns:
+            Quality score 0-100 (from weighted benchmarks or fallback heuristic)
+        """
+        if self._quality_scorer:
+            return self._quality_scorer.get_quality_score(model_name, use_case)
+
+        # Fallback to simple heuristic if quality scorer not available
+        return 60.0  # Default moderate score
+
     def _score_model_size_for_latency(self, size_str: str, latency_requirement: str) -> float:
         """
         Score model size appropriateness for latency requirement.
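The new point allocation in `score_model` (50 quality + 15 domain + 20 size-vs-latency + 10 budget + 5 context, 100 max) can be exercised in isolation. This is a simplified sketch, not the module's real signature: the helper results are passed in as 0-1 fractions instead of being computed from catalog and intent data:

```python
# Simplified sketch of score_model's point buckets (max 100). Helper results
# (domain overlap, size fit, budget fit) are supplied as 0-1 fractions here.
LONG_CONTEXT_USE_CASES = {
    "summarization", "qa_retrieval", "document_analysis_rag",
    "long_document_summarization", "research_legal_analysis",
}


def score_model_sketch(quality, domain_overlap, size_fit, budget_fit,
                       use_case, context_length):
    score = 50 * (quality / 100)   # 1. use-case quality (benchmark-backed)
    score += 15 * domain_overlap   # 2. domain specialization match
    score += 20 * size_fit         # 3. model size vs latency requirement
    score += 10 * budget_fit       # 4. budget constraint
    if use_case in LONG_CONTEXT_USE_CASES:  # 5. context length (5 max)
        if context_length >= 32000:
            score += 5
        elif context_length >= 8192:
            score += 2.5
    return score
```

A perfect model for a long-context use case reaches exactly 100, which is why the diff rebalances the old 40/20/20/10/10 buckets to 50/15/20/10/5.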

backend/src/recommendation/solution_scorer.py

Lines changed: 45 additions & 6 deletions
@@ -1,17 +1,29 @@
 """Solution scoring for multi-criteria recommendation ranking.

 Scores deployment configurations on 4 criteria (0-100 scale):
-- Accuracy: Model capability based on parameter count
-- Price: Cost efficiency (inverse of cost, normalized)
-- Latency: SLO compliance and headroom
+- Accuracy/Quality: Model capability (from Artificial Analysis benchmarks or param count fallback)
+- Price: Cost efficiency (inverse of cost, normalized)
+- Latency: SLO compliance and headroom (from Andre's PostgreSQL benchmarks)
 - Complexity: Deployment simplicity (fewer GPUs = simpler)
+
+INTEGRATION NOTE:
+- Quality scoring: Uses Yuval's weighted_scores CSVs (Artificial Analysis benchmarks)
+- Latency/Price/Complexity: Uses Andre's scoring logic and benchmark data
 """

 import logging
 import re
+from typing import Optional

 logger = logging.getLogger(__name__)

+# Try to import use-case quality scorer
+try:
+    from .usecase_quality_scorer import score_model_quality
+    USE_CASE_QUALITY_AVAILABLE = True
+except ImportError:
+    USE_CASE_QUALITY_AVAILABLE = False
+

 class SolutionScorer:
     """Score deployment configurations on 4 criteria (0-100 scale)."""
@@ -55,9 +67,36 @@ class SolutionScorer:
         "complexity": 0.10,
     }

-    def score_accuracy(self, model_size_str: str) -> int:
+    def score_accuracy(self, model_size_str: str, model_name: Optional[str] = None,
+                       use_case: Optional[str] = None) -> int:
+        """
+        Score model accuracy/quality.
+
+        Priority:
+        1. Use-case specific benchmark score (Artificial Analysis data) if available
+        2. Fallback to model size-based heuristic (Andre's original logic)
+
+        Args:
+            model_size_str: Model size string (e.g., "8B", "70B", "8x7B")
+            model_name: Optional model name for use-case-specific scoring
+            use_case: Optional use case for benchmark-based scoring
+
+        Returns:
+            Score 0-100
+        """
+        # Try use-case-specific quality scoring first (Yuval's contribution)
+        if USE_CASE_QUALITY_AVAILABLE and model_name and use_case:
+            quality_score = score_model_quality(model_name, use_case)
+            if quality_score > 0:
+                logger.debug(f"Quality score for {model_name} ({use_case}): {quality_score:.1f}")
+                return int(quality_score)
+
+        # Fallback to size-based heuristic (Andre's original logic)
+        return self._score_accuracy_by_size(model_size_str)
+
+    def _score_accuracy_by_size(self, model_size_str: str) -> int:
         """
-        Score model accuracy based on parameter count tier.
+        Score model accuracy based on parameter count tier (fallback).

         Args:
             model_size_str: Model size string (e.g., "8B", "70B", "8x7B")
@@ -200,7 +239,7 @@ def score_balanced(
         price_score: int,
         latency_score: int,
         complexity_score: int,
-        weights: dict[str, float] | None = None,
+        weights: Optional[dict] = None,
     ) -> float:
         """
         Calculate weighted composite score.
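Both scoring modules guard the new scorer behind a try/except import so the size heuristic still works when the quality data or module is absent. The pattern, reduced to a generic sketch — the imported module name here is deliberately nonexistent so this sketch always exercises the fallback branch:

```python
import logging

logger = logging.getLogger(__name__)

# Generic sketch of the graceful-fallback import used by solution_scorer.py
# and model_evaluator.py. "nonexistent_quality_scorer" is a fake module name,
# so the except branch runs and the fallback path is taken.
try:
    from nonexistent_quality_scorer import score_model_quality
    USE_CASE_QUALITY_AVAILABLE = True
except ImportError:
    USE_CASE_QUALITY_AVAILABLE = False
    logger.warning("Quality scorer unavailable; falling back to size heuristic")


def accuracy_score(model_name, use_case, size_based_fallback):
    """Prefer the benchmark-backed quality score; fall back to the heuristic."""
    if USE_CASE_QUALITY_AVAILABLE:
        q = score_model_quality(model_name, use_case)
        if q > 0:  # 0 is treated as "model not found in benchmark data"
            return int(q)
    return size_based_fallback
```

The double fallback matters: the import can fail (optional dependency missing), and even when it succeeds a model may be absent from the CSVs, signalled here by a zero score.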
