This document describes the end-to-end recommendation flow implemented in the NeuralNav codebase. It traces data from user input through to ranked deployment recommendations.
The recommendation flow follows a configuration-first approach: rather than pre-filtering models, the system queries the benchmark database for every (model, GPU) configuration that meets the SLO targets, then scores each one on four criteria.
```
User Message
    ↓
Intent Extraction (LLM)
    ↓
Traffic Profile + SLO Targets (from templates)
    ↓
Query PostgreSQL for SLO-compliant configurations
    ↓
Score each configuration (accuracy, price, latency, complexity)
    ↓
Generate 5 ranked lists
    ↓
Return best recommendation or all ranked lists
```
| Endpoint | Purpose | Returns |
|---|---|---|
| POST /api/v1/recommend | Simple recommendation | Single best config with YAML |
| POST /api/v1/ranked-recommend-from-spec | Multi-criteria ranking | 5 ranked lists (10 configs each) |
| POST /api/v1/re-recommend | Re-run with edited specs | Single best config |
| POST /api/v1/regenerate-and-recommend | Regenerate profile from intent | Single best config |
Entry Point: src/planner/api/routes/
File: src/planner/intent_extraction/extractor.py
The IntentExtractor uses an LLM (Ollama qwen2.5:7b) to parse the user's natural language request into structured deployment intent.
Input: User message (e.g., "I need a chatbot for 1000 users, low latency is critical")
Output: DeploymentIntent object containing:
- use_case: Mapped to one of 9 supported use cases
- user_count: Number of concurrent users
- latency_requirement: very_high, high, medium, low
- budget_constraint: strict, moderate, flexible, none
- domain_specialization: Optional list of domains
- experience_class: instant, conversational, interactive, deferred, batch
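The fields above can be sketched as a dataclass. This is a hypothetical shape based only on the field list in this document; the actual types in the codebase may differ:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DeploymentIntent:
    """Illustrative sketch of the extracted intent (field names from this doc)."""
    use_case: str                      # one of the 9 supported use cases
    user_count: int                    # number of concurrent users
    latency_requirement: str           # very_high | high | medium | low
    budget_constraint: str             # strict | moderate | flexible | none
    experience_class: str              # instant | conversational | interactive | deferred | batch
    domain_specialization: Optional[List[str]] = None


# Example intent for the chatbot scenario used later in this document.
intent = DeploymentIntent(
    use_case="chatbot_conversational",
    user_count=1000,
    latency_requirement="high",
    budget_constraint="moderate",
    experience_class="conversational",
)
```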
Key Function:
```python
intent = intent_extractor.extract_intent(user_message, conversation_history)
intent = intent_extractor.infer_missing_fields(intent)
```

File: src/planner/specification/traffic_profile.py
The TrafficProfileGenerator maps the use case to a GuideLLM traffic profile and calculates SLO targets.
Input: DeploymentIntent
Output:
- TrafficProfile: prompt_tokens, output_tokens, expected_qps
- SLOTargets: ttft_p95_target_ms, itl_p95_target_ms, e2e_p95_target_ms
Data Source: data/configuration/slo_templates.json
Traffic Profiles (aligned with GuideLLM):
| Use Case | Prompt Tokens | Output Tokens |
|---|---|---|
| chatbot_conversational | 512 | 256 |
| code_completion | 512 | 256 |
| code_generation_detailed | 1024 | 1024 |
| translation | 1024 | 1024 |
| content_generation | 512 | 256 |
| summarization_short | 512 | 256 |
| document_analysis_rag | 4096 | 512 |
| long_document_summarization | 10240 | 1536 |
| research_legal_analysis | 10240 | 1536 |
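As a sketch, the use-case → profile mapping above can be modeled as a plain dict lookup. The dict below is a hypothetical in-memory version of the table; the real values live in data/configuration/slo_templates.json and the generator's actual structure is not shown here:

```python
# Illustrative mapping of use case -> (prompt_tokens, output_tokens),
# copied from the traffic profile table above.
TRAFFIC_PROFILES = {
    "chatbot_conversational": (512, 256),
    "code_completion": (512, 256),
    "code_generation_detailed": (1024, 1024),
    "translation": (1024, 1024),
    "content_generation": (512, 256),
    "summarization_short": (512, 256),
    "document_analysis_rag": (4096, 512),
    "long_document_summarization": (10240, 1536),
    "research_legal_analysis": (10240, 1536),
}


def lookup_profile(use_case: str) -> tuple:
    """Return (prompt_tokens, output_tokens) for a supported use case."""
    return TRAFFIC_PROFILES[use_case]
```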
Key Functions:
```python
traffic_profile = traffic_generator.generate_profile(intent)
slo_targets = traffic_generator.generate_slo_targets(intent)
```

File: src/planner/knowledge_base/benchmarks.py
The BenchmarkRepository queries PostgreSQL for all (model, GPU, tensor_parallel) configurations that meet SLO targets for the traffic profile.
Input: Traffic profile and SLO targets
Output: List of BenchmarkData objects with latency/throughput metrics
Key Query: find_configurations_meeting_slo()
- Matches exact traffic profile (prompt_tokens, output_tokens)
- Filters by p95 SLO targets (TTFT, ITL, E2E)
- Uses window functions to select highest QPS per configuration
- Returns one benchmark per unique (model, hardware, hardware_count) combination
Data Source: exported_summaries table in PostgreSQL (loaded from data/benchmarks/performance/benchmarks_BLIS.json)
Near-Miss Tolerance: When include_near_miss=True, SLO thresholds are relaxed by 20% to include configurations that nearly meet targets.
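A minimal sketch of the 20% relaxation (the function name and signature are assumptions; only the tolerance value comes from the text above):

```python
def relax_slo_targets(ttft_ms: float, itl_ms: float, e2e_ms: float,
                      tolerance: float = 0.20) -> tuple:
    """Widen each p95 SLO threshold by `tolerance` for near-miss matching."""
    factor = 1.0 + tolerance
    return (ttft_ms * factor, itl_ms * factor, e2e_ms * factor)
```

For example, the chatbot targets used later in this document (150 ms TTFT, 25 ms ITL, 7000 ms E2E) relax to roughly 180 ms, 30 ms, and 8400 ms.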
File: src/planner/recommendation/config_finder.py
The ConfigFinder.plan_all_capacities() method processes each benchmark configuration and calculates four scores.
Input:
- Traffic profile and SLO targets
- Deployment intent
- Model evaluator (for accuracy scoring)
Output: List of DeploymentRecommendation objects with ConfigurationScores
For each benchmark configuration:
- Calculate replicas needed to handle expected QPS (with 20% headroom)
- Build GPU config (tensor_parallel from benchmark, replicas from QPS calculation)
- Calculate cost from GPU type and count
- Score on 4 dimensions (see below)
- Create DeploymentRecommendation with scores attached
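The replica step can be sketched as below. The 20% headroom comes from the text; the exact formula and function name in the codebase are assumptions:

```python
import math


def replicas_needed(expected_qps: float, qps_per_replica: float,
                    headroom: float = 0.20) -> int:
    """Replicas required to serve expected_qps with headroom to spare."""
    if qps_per_replica <= 0:
        raise ValueError("qps_per_replica must be positive")
    return max(1, math.ceil(expected_qps * (1.0 + headroom) / qps_per_replica))
```

With the worked example's 9 QPS and a configuration benchmarked at 6 QPS per replica, this yields 2 replicas (10.8 effective QPS / 6).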
Key Function:
```python
all_configs = capacity_planner.plan_all_capacities(
    traffic_profile=traffic_profile,
    slo_targets=slo_targets,
    intent=intent,
    model_evaluator=model_evaluator,
    include_near_miss=True,
)
```

Files:
- src/planner/recommendation/scorer.py - Calculates 4 scores
- src/planner/recommendation/quality/usecase_scorer.py - Benchmark-based quality scoring
Primary Source: Use-case specific quality scores from Artificial Analysis benchmarks
Data Files: data/benchmarks/accuracy/weighted_scores/*.csv
The UseCaseQualityScorer loads pre-calculated weighted scores for each use case. Each use case has different benchmark weights (e.g., code_completion weights LiveCodeBench 35%, SciCode 30%).
Fallback: If the model is not found in the benchmark data, the system uses ModelEvaluator.score_model(), which considers:
- Use case quality match (50 points)
- Domain specialization (15 points)
- Latency-appropriate model size (20 points)
- Budget-appropriate model size (10 points)
- Context length (5 points)
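The fallback components above add up to a 100-point scale, which keeps fallback scores comparable to the benchmark-based scores (the dict below is purely illustrative):

```python
# Maximum points per fallback component, from the list above.
FALLBACK_COMPONENTS = {
    "use_case_quality_match": 50,
    "domain_specialization": 15,
    "latency_appropriate_model_size": 20,
    "budget_appropriate_model_size": 10,
    "context_length": 5,
}

MAX_FALLBACK_SCORE = sum(FALLBACK_COMPONENTS.values())  # 100-point scale
```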
Key Function:
```python
accuracy_score = model_evaluator.score_model(model, intent)
```

Formula: `100 * (max_cost - config_cost) / (max_cost - min_cost)`
Normalized inverse cost across all viable configurations. Cheapest = 100, most expensive = 0.
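A direct sketch of that normalization; the guard for the degenerate case where every config costs the same is an assumption:

```python
def score_price(config_cost: float, min_cost: float, max_cost: float) -> float:
    """Normalized inverse cost: cheapest config scores 100, priciest scores 0."""
    if max_cost == min_cost:
        return 100.0  # all viable configs cost the same (assumed handling)
    return 100.0 * (max_cost - config_cost) / (max_cost - min_cost)
```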
Key Function:
```python
price_score = scorer.score_price(cost_per_month, min_cost, max_cost)
```

Based on the ratio of predicted latency to SLO target (the worst metric determines the status):
| Ratio | Score | SLO Status |
|---|---|---|
| ≤ 1.0 | 90-100 | compliant (bonus for headroom) |
| 1.0-1.2 | 70-89 | near_miss |
| > 1.2 | 0-69 | exceeds |
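Only the band edges come from the table above; the status mapping can be sketched as follows (how the score is interpolated within each band is not specified here):

```python
def latency_band(worst_ratio: float) -> str:
    """Map the worst predicted/target latency ratio to an SLO status."""
    if worst_ratio <= 1.0:
        return "compliant"   # scores 90-100, with a bonus for headroom
    if worst_ratio <= 1.2:
        return "near_miss"   # scores 70-89
    return "exceeds"         # scores 0-69
```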
Key Function:
```python
latency_score, slo_status = scorer.score_latency(
    predicted_ttft, predicted_itl, predicted_e2e,
    target_ttft, target_itl, target_e2e,
)
```

Based on total GPU count:
| GPU Count | Score |
|---|---|
| 1 | 100 |
| 2 | 90 |
| 4 | 75 |
| 8 | 60 |
| >8 | Linear decay (min 40) |
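A sketch of the table above. Only the anchor points and the 40-point floor are from the table; the decay slope past 8 GPUs and the handling of counts between anchors are assumptions:

```python
def score_complexity(gpu_count: int) -> float:
    """Fewer GPUs -> simpler deployment -> higher score (anchors from the table)."""
    anchors = {1: 100.0, 2: 90.0, 4: 75.0, 8: 60.0}
    if gpu_count in anchors:
        return anchors[gpu_count]
    if gpu_count > 8:
        # Linear decay past 8 GPUs, floored at 40 (decay rate assumed).
        return max(40.0, 60.0 - 2.5 * (gpu_count - 8))
    # Intermediate counts (3, 5-7): fall back to the next-lower anchor
    # (purely illustrative; the codebase may interpolate differently).
    return anchors[max(k for k in anchors if k < gpu_count)]
```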
Key Function:
```python
complexity_score = scorer.score_complexity(gpu_count)
```

Weighted composite of all four scores:
```python
balanced_score = (
    accuracy_score * 0.40 +
    price_score * 0.40 +
    latency_score * 0.10 +
    complexity_score * 0.10
)
```

Custom weights can be provided via the API (0-10 scale, normalized to percentages).
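The 0-10 → percentage normalization can be sketched as dividing each weight by the total (the function name is an assumption):

```python
def normalize_weights(weights: dict) -> dict:
    """Convert 0-10 user weights into fractions that sum to 1.0."""
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one weight must be non-zero")
    return {name: value / total for name, value in weights.items()}
```

For example, the weights `{"accuracy": 4, "price": 4, "latency": 1, "complexity": 1}` used in the ranking example below normalize to the default 0.40/0.40/0.10/0.10 split.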
File: src/planner/recommendation/analyzer.py
The Analyzer generates 5 ranked lists from scored configurations.
Input: List of scored DeploymentRecommendations, optional filters
Output: Dict with 5 keys, each containing top 10 configurations
Filters Applied:
- min_accuracy: Exclude configs with accuracy < threshold
- max_cost: Exclude configs with monthly cost > ceiling
5 Ranked Views:
| View | Sorted By |
|---|---|
best_accuracy |
Accuracy score (descending) |
lowest_cost |
Price score (descending) |
lowest_latency |
Latency score (descending) |
simplest |
Complexity score (descending) |
balanced |
Weighted composite score (descending) |
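The filter-then-sort step can be sketched as below, with each config represented as a plain dict of its scores (the field names are assumptions for illustration):

```python
def rank_views(configs: list, min_accuracy: float, max_cost: float,
               top_n: int = 10) -> dict:
    """Filter scored configs, then sort them into the five views listed above."""
    viable = [c for c in configs
              if c["accuracy"] >= min_accuracy and c["cost_per_month"] <= max_cost]
    sort_keys = {
        "best_accuracy": "accuracy",
        "lowest_cost": "price",
        "lowest_latency": "latency",
        "simplest": "complexity",
        "balanced": "balanced",
    }
    return {view: sorted(viable, key=lambda c: c[key], reverse=True)[:top_n]
            for view, key in sort_keys.items()}
```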
Key Function:
```python
ranked_lists = ranking_service.generate_ranked_lists(
    configurations=all_configs,
    min_accuracy=70,
    max_cost=5000,
    top_n=10,
    weights={"accuracy": 4, "price": 4, "latency": 1, "complexity": 1},
)
```

File: src/planner/orchestration/workflow.py
The RecommendationWorkflow orchestrates all steps and returns the appropriate response.
For /api/v1/recommend:
- Returns single best configuration (highest balanced score)
- Includes top 3 alternatives
- Auto-generates YAML files
For /api/v1/ranked-recommend-from-spec:
- Returns RankedRecommendationsResponse with all 5 ranked lists
- Includes the specification (intent, traffic_profile, slo_targets)
- Reports total configs evaluated and configs remaining after filters
| File | Description | Used By |
|---|---|---|
| data/configuration/slo_templates.json | 9 use case templates with SLO targets | TrafficProfileGenerator |
| data/configuration/model_catalog.json | 47 curated models with metadata | ModelCatalog, ModelEvaluator |
| data/benchmarks/performance/benchmarks_BLIS.json | Latency benchmarks (loaded to PostgreSQL) | BenchmarkRepository |
| data/benchmarks/accuracy/weighted_scores/*.csv | 9 use-case quality score files | UseCaseQualityScorer |
| data/archive/usecase_weights.json | Benchmark weight definitions per use case | Documentation |
| Class | File | Responsibility |
|---|---|---|
| RecommendationWorkflow | orchestration/workflow.py | Orchestrate end-to-end flow |
| IntentExtractor | intent_extraction/extractor.py | Parse user message to intent |
| TrafficProfileGenerator | specification/traffic_profile.py | Generate traffic profile and SLO targets |
| BenchmarkRepository | knowledge_base/benchmarks.py | Query PostgreSQL for benchmarks |
| ConfigFinder | recommendation/config_finder.py | Find viable configs, calculate scores |
| Scorer | recommendation/scorer.py | Calculate 4 scores |
| UseCaseQualityScorer | recommendation/quality/usecase_scorer.py | Benchmark-based quality scores |
| Analyzer | recommendation/analyzer.py | Filter and sort into 5 ranked lists |
| ModelCatalog | knowledge_base/model_catalog.py | Model metadata and GPU pricing |
```
User Request
     │
     ▼
┌─────────────────────┐
│  API Route Handler  │  (/api/v1/recommend or /api/v1/ranked-recommend-from-spec)
└──────────┬──────────┘
           │
           ▼
┌──────────────────────────────────────┐
│ RecommendationWorkflow               │
│   .generate_recommendation() or      │
│   .generate_ranked_recommendations() │
└──────────┬───────────────────────────┘
           │
           ├──► IntentExtractor.extract_intent()
           │      └──► Ollama LLM (qwen2.5:7b)
           │
           ├──► TrafficProfileGenerator.generate_profile()
           │      └──► SLOTemplateRepository (slo_templates.json)
           │
           ├──► TrafficProfileGenerator.generate_slo_targets()
           │
           ▼
┌──────────────────────────┐
│ CapacityPlanner          │
│   .plan_all_capacities() │
└──────────┬───────────────┘
           │
           ├──► BenchmarkRepository.find_configurations_meeting_slo()
           │      └──► PostgreSQL (exported_summaries table)
           │
           ├──► For each config:
           │      ├──► ModelCatalog.get_model() (lookup metadata)
           │      ├──► ModelEvaluator.score_model() (accuracy)
           │      │      └──► UseCaseQualityScorer (Artificial Analysis data)
           │      ├──► SolutionScorer.score_latency()
           │      ├──► SolutionScorer.score_complexity()
           │      └──► ModelCatalog.calculate_gpu_cost()
           │
           ├──► SolutionScorer.score_price() (after min/max known)
           │    SolutionScorer.score_balanced()
           │
           ▼
┌────────────────────────────────┐
│ RankingService                 │
│   .generate_ranked_lists()     │
└──────────┬─────────────────────┘
           │
           ├──► Apply filters (min_accuracy, max_cost)
           ├──► Recalculate balanced scores (if custom weights)
           └──► Sort into 5 ranked views
           │
           ▼
┌────────────────────────────────┐
│ API Response                   │
│ (DeploymentRecommendation or   │
│  RankedRecommendationsResponse)│
└────────────────────────────────┘
```
Request:
```json
{
  "message": "I need a chatbot for 1000 users with low latency",
  "min_accuracy": 50,
  "max_cost": 5000,
  "include_near_miss": true
}
```

Extracted Intent:
- use_case: chatbot_conversational
- user_count: 1000
- latency_requirement: high
- experience_class: conversational
Generated Profile:
- prompt_tokens: 512
- output_tokens: 256
- expected_qps: 9
SLO Targets (p95):
- TTFT: 150ms
- ITL: 25ms
- E2E: 7000ms
Example Configuration Scores:
Granite 3.1 8B on 1x H100:
- Accuracy: 72 (from weighted_scores CSV)
- Price: 85 (relatively inexpensive)
- Latency: 95 (well under SLO)
- Complexity: 100 (single GPU)
- Balanced: 82.3
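Plugging the example's component scores into the default balanced weights:

```python
# Worked example using the default balanced weights (0.40 / 0.40 / 0.10 / 0.10).
accuracy, price, latency, complexity = 72, 85, 95, 100
balanced = (
    accuracy * 0.40 +     # 28.8
    price * 0.40 +        # 34.0
    latency * 0.10 +      # 9.5
    complexity * 0.10     # 10.0
)
```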
- Parametric performance models: Train regression models to predict latency for arbitrary traffic profiles (not just the 4 GuideLLM profiles)
- Multi-QPS benchmark selection: Currently selects the highest QPS meeting SLO; future versions may offer QPS-specific recommendations
- GPU availability scoring: Factor in procurement constraints and lead times
- Feedback loop: Use actual deployment performance to improve future recommendations