Commit 5058469 — Merge pull request #31 from anfredette/solution-ranking
docs: recommendation ranking options document

1 file changed: docs/Compass Solution Ranking.md (+288 lines)

## Ranking and Presenting Compass Recommendations

### Executive Summary

The Compass project aims to recommend optimal LLM and GPU configurations that satisfy users' specific use cases and priorities. This document outlines a strategic framework for ranking and presenting configuration options based on multiple decision criteria, acknowledging that users have varying priorities and benefit from understanding the tradeoffs between competing objectives.

**Priority Hierarchy:**

- **Primary Criteria**: Accuracy and Price are the most important factors for most users
- **Secondary Criteria**: Latency Performance (SLO headroom) and Operational Complexity provide additional differentiation among viable options
- **Note**: Since all recommendations are filtered to meet SLO requirements, latency becomes a secondary factor for ranking among compliant options

***

### Primary Optimization Criteria

**Note**: Accuracy and Price are considered the primary ranking criteria, while Latency Performance and Operational Complexity are secondary factors that help differentiate among options that meet basic requirements.

#### 1. **Accuracy (Model Quality)**

**Definition**: The capability of the LLM to perform the intended task effectively.

**Compass Approach**:

- Current: Uses model quality indicators from catalog metadata
- Planned: Integration with benchmark-based metrics (MMLU, HumanEval, MT-Bench, etc.)

**Key Tradeoff**: Larger, more accurate models have higher latency and cost. Benchmark data captures this relationship for informed decision-making.

***

#### 2. **Price (Total Cost of Ownership)**

**Definition**: The financial cost of deploying and operating the configuration.

**Deployment Models**:

- **Cloud GPU rental**: Hourly/monthly rates from cloud providers (AWS, GCP, Azure)
- **On-premise GPU purchase**: Capital expenditure + operational costs (power, cooling, maintenance)
- **Existing GPU infrastructure**: Utilization-based costing for already-owned hardware

**Compass Approach**: Current code supports cloud GPU pricing. On-premise and existing-infrastructure cost models are planned.

***

#### 3. **Latency Performance** (Secondary)

**Definition**: Response time characteristics measured through SLO metrics:

- **TTFT** (Time to First Token) - Critical for interactive applications
- **ITL** (Inter-Token Latency) - Important for streaming responses
- **E2E** (End-to-End Latency) - Overall response time

**Compass Approach**: Uses 95th-percentile (p95) targets from benchmark data to filter configurations that meet user-specified SLO requirements.

**Role in Ranking**: Since all recommended configurations already meet SLO targets, latency becomes a secondary differentiator. Configurations with better SLO headroom (e.g., 120ms TTFT vs. a 150ms target) may be preferred for additional reliability margin, but this is less critical than accuracy and price differences.

***

#### 4. **Operational Complexity** (Secondary)

**Definition**: The difficulty of deploying and managing the configuration.

**Key Factors**:

- Number of GPU instances (fewer is simpler)
- Infrastructure coordination (multi-node vs. single-node)
- Deployment topology (tensor parallelism vs. replicas)

**Example Tradeoff**: 2x H100 GPUs may be preferred over 8x L4 GPUs even at higher cost due to reduced networking complexity and management overhead.

**Compass Approach**: Balances tensor parallelism and replica scaling to minimize operational burden while meeting SLO targets.

**Role in Ranking**: Complexity can be an important differentiator when choosing between configurations with similar accuracy and cost profiles. Simpler deployments are generally more reliable and easier to troubleshoot.

***

### Configuration Variants

Some factors affect recommendations but are treated as **configuration variants** rather than independent ranking criteria:

#### **Quantization Options**

Quantization (FP16, INT8, INT4) creates multiple deployment options for each model-GPU combination:

**Impact on Primary Criteria**:

- **Accuracy**: Lower precision may reduce quality (benchmark-dependent)
- **Price**: Reduced memory requirements → fewer GPUs needed
- **Latency**: Faster inference due to smaller compute requirements
- **Complexity**: Simpler deployment with a smaller memory footprint

**Compass Approach**: Present quantization as selectable options within each recommendation, with clear indication of tradeoffs based on benchmark data.

**Example**:

```
Recommended: Llama-3-70B on 2x H100
├─ FP16 (Standard): $450/month, TTFT=120ms, Reference accuracy
└─ INT8 (Optimized): $290/month, TTFT=95ms, ~1-2% accuracy reduction
```

***

### Secondary Considerations

The following factors may be addressed in future releases based on team priorities:

- **GPU Availability**: Procurement constraints, lead times, and regional availability
- **Reliability**: Uptime SLAs, redundancy options, and failover capabilities

*These are documented for completeness but not prioritized for initial implementation.*

***

### Scoring and Ranking Framework

#### Scoring Each Configuration (0-100 Scale)

Each viable configuration receives scores across the four primary criteria:

**1. Accuracy Score** (0-100):

- Based on model capability tier or benchmark scores
- Example: 7B model = 60, 70B model = 85, 405B model = 95
- Minimum threshold filter (e.g., user requires score ≥ 70)

**2. Price Score** (0-100):

- Inverse of cost: lower cost = higher score
- Normalized: `100 * (max_cost - config_cost) / (max_cost - min_cost)`
- Maximum cost filter (e.g., user has a $500/month budget ceiling)

**3. Latency Score** (0-100):

- Composite of TTFT, ITL, and E2E performance vs. targets
- Configurations meeting all SLOs score 90-100
- Near-miss configurations (up to ~20% over SLO) score 70-89
- Configurations further over SLO score proportionally lower

**4. Complexity Score** (0-100):

- Based on GPU count and deployment topology
- Example: 1 GPU = 100, 2 GPUs = 90, 4 GPUs = 75, 8+ GPUs = 60
- Factors: Single-node (simpler) vs. multi-node, tensor parallelism overhead
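
The price and complexity scores above can be sketched in a few lines. This is a minimal illustration, not Compass code: the function names are hypothetical, the price formula is the normalization given above, the complexity tiers come from the example, and the treatment of unlisted GPU counts (e.g., 3 GPUs) is an assumption.

```python
def price_score(config_cost: float, min_cost: float, max_cost: float) -> float:
    """Inverse-normalized cost: cheapest candidate scores 100, most expensive 0."""
    if max_cost == min_cost:
        return 100.0  # all candidates cost the same
    return 100.0 * (max_cost - config_cost) / (max_cost - min_cost)


def complexity_score(gpu_count: int) -> float:
    """Tiered score from the example above: fewer GPUs = simpler deployment."""
    if gpu_count >= 8:
        return 60.0
    tiers = {1: 100.0, 2: 90.0, 4: 75.0}
    # Assumption: unlisted counts (e.g., 3) fall back to the nearest lower tier
    for count in sorted(tiers, reverse=True):
        if gpu_count >= count:
            return tiers[count]
    return 100.0
```

Both functions return values on the same 0-100 scale, so they can feed directly into a weighted composite or single-criterion sort.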

#### Ranking Strategies

**Primary Approach: Single-Criterion Ranking**

Show the top 5 configurations for each optimization priority:

1. **Best Accuracy** - Sort by accuracy score (descending), filter by cost ceiling
2. **Lowest Cost** - Sort by price score (descending), filter by minimum accuracy
3. **Lowest Latency** - Sort by latency score (descending), filter by cost and accuracy
4. **Simplest** - Sort by complexity score (descending), filter by cost and accuracy
5. **Balanced** - Sort by composite score using equal weights (25% per criterion)

**Handling the "Balanced" Recommendation**:

- **Weighted composite score** allows flexible prioritization of different criteria
- **Default weights**: Can be set based on general user preferences (e.g., 40% accuracy, 40% price, 10% latency, 10% complexity to reflect primary vs. secondary criteria)
- **User-adjustable weights**: Allow users to customize weights based on their specific priorities
- **Priority-based weighting**: Alternatively, derive weights from higher-level user priorities (e.g., "Quality first" → higher accuracy weight, "Budget conscious" → higher price weight)
- **Pareto frontier approach**: An alternative method that finds configurations where no other option is better on all criteria (more complex, but well-founded mathematically)
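
The weighted-composite option can be sketched as follows. The 40/40/10/10 defaults are the ones suggested above; the `Config` structure and field names are illustrative assumptions, not an existing Compass API.

```python
from dataclasses import dataclass


@dataclass
class Config:
    name: str
    accuracy: float    # 0-100 criterion scores per the framework above
    price: float
    latency: float
    complexity: float


# Default weights reflecting primary (accuracy, price) vs. secondary criteria
DEFAULT_WEIGHTS = {"accuracy": 0.40, "price": 0.40, "latency": 0.10, "complexity": 0.10}


def balanced_score(cfg: Config, weights=DEFAULT_WEIGHTS) -> float:
    """Weighted composite of the four criterion scores (still on a 0-100 scale)."""
    return (weights["accuracy"] * cfg.accuracy
            + weights["price"] * cfg.price
            + weights["latency"] * cfg.latency
            + weights["complexity"] * cfg.complexity)


def top_n_balanced(configs, n=5, weights=DEFAULT_WEIGHTS):
    """Rank candidates by composite score, descending; keep the top N."""
    return sorted(configs, key=lambda c: balanced_score(c, weights), reverse=True)[:n]
```

User-adjustable and priority-based weighting reduce to passing a different `weights` dict, so all three composite options share the same scoring path.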

#### Filtering Logic

**Hard Constraints** (applied before ranking):

- Minimum accuracy score (user-specified or default)
- Maximum cost ceiling (optional user constraint)
- SLO compliance (optional: include/exclude near-miss configurations)

**Example Workflow**:

```
1. Filter all viable configs by minimum accuracy ≥ 70
2. Filter by maximum cost ≤ $500/month (if specified)
3. Score remaining configs on all 4 criteria
4. Generate ranked lists for each optimization criterion:
   - Best Accuracy (top 5)
   - Lowest Cost (top 5)
   - Lowest Latency (top 5)
   - Simplest (top 5)
   - Balanced (top 5, using weighted composite score)
5. Present as ordered lists with selectable views (tabs, dropdown, or list filters)
6. Optionally display graphical representations (e.g., Pareto frontier chart)
```

**Display Options**:

- **Ordered Lists**: Primary display method showing the top N configurations for each ranking criterion
- **Pareto Frontier Chart**: Graphical visualization plotting configurations in 2D/3D space (e.g., Cost vs. Accuracy, with point size representing complexity)
- **Interactive Filtering**: Allow users to adjust weights and see rankings update in real time
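
The Pareto frontier mentioned above reduces to a simple dominance check. A sketch, assuming each configuration is a dict of the four 0-100 criterion scores with higher always better (function names are illustrative):

```python
CRITERIA = ("accuracy", "price", "latency", "complexity")


def dominates(a: dict, b: dict) -> bool:
    """True if a is at least as good as b on every criterion and strictly better on one."""
    return (all(a[c] >= b[c] for c in CRITERIA)
            and any(a[c] > b[c] for c in CRITERIA))


def pareto_frontier(configs: list) -> list:
    """Keep only configurations that no other configuration dominates."""
    return [c for c in configs
            if not any(dominates(other, c) for other in configs if other is not c)]
```

Frontier members are exactly the points worth plotting in the chart view: every dominated configuration is strictly worse than some alternative on all axes.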

#### Near-SLO Options Integration

Include configurations that slightly exceed SLO targets:

- Score latency as 70-89 (vs. 90-100 for SLO-compliant)
- Clearly mark with warning indicators
- Highlight cost/accuracy benefits compared to SLO-compliant alternatives

**Example Display**:

```
🎯 Lowest Cost (Top 5):

1. ✅ Llama-3-8B on 1x L4 - $120/month (Score: 100) [Meets SLOs]
2. ✅ Llama-3-8B on 1x A100 - $180/month (Score: 95) [Meets SLOs]
3. ⚠️ Llama-3-70B on 2x A100 - $290/month (Score: 85) [TTFT +10% over SLO]
4. ✅ Llama-3-70B on 2x H100 - $450/month (Score: 70) [Meets SLOs]
5. ✅ Llama-3-405B on 4x H100 - $900/month (Score: 40) [Meets SLOs]
```
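
One possible latency-scoring rule consistent with the bands above (90-100 for SLO-compliant, 70-89 for near-misses up to 20% over target). The exact mapping inside each band is an assumption for illustration, not a Compass decision:

```python
def latency_score(measured_ms: float, target_ms: float) -> float:
    """Map a measured p95 latency vs. its SLO target onto the 0-100 bands above."""
    ratio = measured_ms / target_ms
    if ratio <= 1.0:
        # Meets SLO: 90-100, with more headroom scoring higher
        return min(100.0, 90.0 + 10.0 * (1.0 - ratio))
    if ratio <= 1.2:
        # Near miss (up to 20% over SLO): 70-89, marked with a warning in the UI
        return 89.0 - (ratio - 1.0) / 0.2 * 19.0
    # Further over SLO: degrade proportionally toward 0
    return max(0.0, 70.0 - (ratio - 1.2) * 100.0)
```

Under this rule, the ⚠️ entry in the display above (TTFT 10% over its target) lands mid-band, while fully compliant entries stay at 90 or above.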

***

### Implementation Phasing

**Phase 1**: Single-criterion ranking

- Implement 0-100 scoring for all 4 criteria
- Support minimum-accuracy and maximum-cost filters
- Show top 5 for "Lowest Cost" (primary view)
- Show top 5 for the other criteria (Accuracy, Latency, Simplest, Balanced) as alternative views

**Phase 2**: Multi-priority views

- Add tabs/buttons for all 5 ranking modes (Accuracy, Cost, Latency, Simplest, Balanced)
- Include near-SLO configurations with clear warnings
- Show top 5 for each mode (consistent across all views)

**Phase 3**: Interactive refinement and visualization

- User-adjustable weights for the balanced score (slider controls or direct weight input)
- Visual Pareto frontier chart for multi-dimensional tradeoff exploration
  - 2D charts: Cost vs. Accuracy (most common)
  - 3D charts: Cost vs. Accuracy vs. Complexity (advanced view)
  - Interactive point selection to view full configuration details
- Sensitivity analysis (e.g., "How does the ranking change if the budget increases to $600?")
- Priority-driven weight selection (e.g., user selects "Quality first" and the system sets weights automatically)

***

### Summary: Decision Framework

**Ranking Criteria**:

1. **Accuracy** (Primary) - Model capability (metadata → benchmark scores)
2. **Price** (Primary) - TCO across deployment models (cloud, on-prem, existing)
3. **Latency** (Secondary) - SLO headroom beyond the compliance threshold (TTFT, ITL, E2E at p95)
4. **Operational Complexity** (Secondary) - GPU count and deployment topology

**Configuration Variants**:

- Quantization (FP16/INT8/INT4) as selectable options within recommendations

**Secondary Considerations**:

- GPU availability and reliability (future releases)

***

### Key Decisions for Team Discussion

1. **Scoring Scale**: Use 0-100 or 0-1 normalization?
   - **Recommendation**: 0-100 scale for better interpretability ("85/100" vs. "0.85")
   - Easier to explain to users and to debug during development

2. **Balanced Score Calculation**: How should the "Balanced" recommendation be determined?
   - **Option A**: Default weights reflecting primary vs. secondary criteria (e.g., 40% accuracy, 40% price, 10% latency, 10% complexity)
   - **Option B**: User-adjustable weights via UI controls (sliders or direct input)
   - **Option C**: Priority-driven weights derived from high-level user preferences ("Quality first", "Budget conscious", etc.)
   - **Option D**: Pareto frontier - mathematically rigorous but more complex
   - **Recommendation**: Start with Option A in Phase 1; add Options B and C in Phase 3

3. **Number of Options per View**: Show the top 5 for each ranking criterion?
   - **Recommendation**: Yes - provides good variety without overwhelming users
   - UI can display them as an expandable list or tabbed views

4. **Minimum Accuracy Threshold**: Should there be a default minimum, or should it always be user-specified?
   - **Option A**: No default; show all configurations
   - **Option B**: Default minimum based on use case (e.g., production = 70, development = 50)
   - **Recommendation**: Option B - prevents showing clearly inadequate models

5. **Near-SLO Configurations**: Include by default or require opt-in?
   - **Recommendation**: Include by default with clear ⚠️ warnings
   - Significant cost savings justify showing them, but they must be clearly marked
   - Allow users to toggle "Hide near-miss options" if desired

6. **Multi-Cost-Model Support**: How should cloud vs. on-prem vs. existing GPU pricing be handled?
   - **Option A**: Cloud GPU pricing only (standardized rates)
   - **Option B**: Also support user-provided cost parameters for on-prem/existing infrastructure
   - **Option C (Future)**: Compare cloud vs. on-prem vs. existing GPU options
   - Scoring logic remains the same; only the cost input source changes

7. **Accuracy Scoring Method**: Model tiers vs. benchmark scores?
   - **Phase 1 (Current Approach)**: Simple capability tiers (7B = 60, 70B = 85, 405B = 95)
   - **Phase 2 (WIP)**: Integrate actual benchmark scores (MMLU, HumanEval normalized to 0-100)
   - Allows smooth evolution without changing the scoring framework

8. **Visualization Approaches**: How should recommendations be presented graphically?
   - **Primary**: Ordered lists for each ranking criterion (simplest, always available)
   - **Phase 3**: Pareto frontier charts for multi-dimensional tradeoff exploration
     - 2D scatter plots: Cost vs. Accuracy (most intuitive)
     - Interactive elements: click points to view full configuration details
     - Point styling: size/color to represent secondary criteria (complexity, latency)
   - **Benefits**: Graphical views help users understand the tradeoff space and identify optimal regions
   - **Note**: Lists remain the primary interface; charts supplement them for advanced users