## Ranking and Presenting Compass Recommendations

### Executive Summary

The Compass project aims to recommend optimal LLM and GPU configurations that satisfy users' specific use cases and priorities. This document outlines a strategic framework for ranking and presenting configuration options based on multiple decision criteria, acknowledging that users have varying priorities and may benefit from understanding tradeoffs between competing objectives.

**Priority Hierarchy:**
- **Primary Criteria**: Accuracy and Price are the most important factors for most users
- **Secondary Criteria**: Latency Performance (SLO headroom) and Operational Complexity provide additional differentiation among viable options
- **Note**: Since all recommendations are filtered to meet SLO requirements, latency becomes a secondary factor for ranking among compliant options

***

### Primary Optimization Criteria

**Note**: Accuracy and Price are considered the primary ranking criteria, while Latency Performance and Operational Complexity are secondary factors that help differentiate among options that meet basic requirements.

#### 1. **Accuracy (Model Quality)**

**Definition**: The capability of the LLM to perform the intended task effectively.

**Compass Approach**:
- Current: Uses model quality indicators from catalog metadata
- Planned: Integration with benchmark-based metrics (MMLU, HumanEval, MT-Bench, etc.)

**Key Tradeoff**: Larger, more accurate models have higher latency and cost. Benchmark data captures this relationship for informed decision-making.

***

#### 2. **Price (Total Cost of Ownership)**

**Definition**: The financial cost of deploying and operating the configuration.

**Deployment Models**:
- **Cloud GPU rental**: Hourly/monthly rates from cloud providers (AWS, GCP, Azure)
- **On-premise GPU purchase**: Capital expenditure + operational costs (power, cooling, maintenance)
- **Existing GPU infrastructure**: Utilization-based costing for already-owned hardware

**Compass Approach**: Current code supports cloud GPU pricing. On-premise and existing infrastructure cost models are planned.

***

#### 3. **Latency Performance** (Secondary)

**Definition**: Response time characteristics measured through SLO metrics:
- **TTFT** (Time to First Token) - Critical for interactive applications
- **ITL** (Inter-Token Latency) - Important for streaming responses
- **E2E** (End-to-End Latency) - Overall response time

**Compass Approach**: Uses p95 targets from benchmark data to filter configurations that meet user-specified SLO requirements.

**Role in Ranking**: Since all recommended configurations already meet SLO targets, latency becomes a secondary differentiator. Configurations with better SLO headroom (e.g., 120ms TTFT vs. 150ms target) may be preferred for additional reliability margin, but this is less critical than accuracy and price differences.

***

#### 4. **Operational Complexity** (Secondary)

**Definition**: The difficulty of deploying and managing the configuration.

**Key Factors**:
- Number of GPU instances (fewer is simpler)
- Infrastructure coordination (multi-node vs. single-node)
- Deployment topology (tensor parallelism vs. replicas)

**Example Tradeoff**: 2x H100 GPUs may be preferred over 8x L4 GPUs even at higher cost due to reduced networking complexity and management overhead.

**Compass Approach**: Balances tensor parallelism and replica scaling to minimize operational burden while meeting SLO targets.

**Role in Ranking**: Complexity can be an important differentiator when choosing between configurations with similar accuracy and cost profiles. Simpler deployments are generally more reliable and easier to troubleshoot.

***

### Configuration Variants

Some factors affect recommendations but are treated as **configuration variants** rather than independent ranking criteria:

#### **Quantization Options**

Quantization (FP16, INT8, INT4) creates multiple deployment options for each model-GPU combination:

**Impact on Primary Criteria**:
- **Accuracy**: Lower precision may reduce quality (benchmark-dependent)
- **Price**: Reduced memory requirements → fewer GPUs needed
- **Latency**: Faster inference due to smaller compute requirements
- **Complexity**: Simpler deployment with smaller memory footprint

**Compass Approach**: Present quantization as selectable options within each recommendation, with clear indication of tradeoffs based on benchmark data.

**Example**:
```
Recommended: Llama-3-70B on 2x H100
├─ FP16 (Standard): $450/month, TTFT=120ms, Reference accuracy
└─ INT8 (Optimized): $290/month, TTFT=95ms, ~1-2% accuracy reduction
```
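
One way to make "selectable options within each recommendation" concrete is a small data model attaching quantization variants to a recommendation. This is an illustrative sketch, not the actual Compass schema: the class names, fields, and `pick_variant` helper are hypothetical, and the figures are taken from the example above.

```python
from dataclasses import dataclass, field

@dataclass
class QuantVariant:
    precision: str             # "FP16", "INT8", "INT4"
    monthly_cost_usd: float
    ttft_ms: float
    accuracy_delta_pct: float  # change vs. the FP16 reference (0.0 for FP16)

@dataclass
class Recommendation:
    model: str
    gpu_config: str
    variants: list[QuantVariant] = field(default_factory=list)

rec = Recommendation(
    model="Llama-3-70B",
    gpu_config="2x H100",
    variants=[
        QuantVariant("FP16", 450, 120, 0.0),
        QuantVariant("INT8", 290, 95, -1.5),
    ],
)

def pick_variant(rec: Recommendation, max_accuracy_loss_pct: float = 2.0) -> QuantVariant:
    """Default to the cheapest variant whose accuracy loss stays within tolerance."""
    ok = [v for v in rec.variants if abs(v.accuracy_delta_pct) <= max_accuracy_loss_pct]
    return min(ok, key=lambda v: v.monthly_cost_usd)
```

With the 2% default tolerance the INT8 variant wins on cost; tightening the tolerance to 1% would fall back to FP16.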

***

### Secondary Considerations

The following factors may be addressed in future releases based on team priorities:

- **GPU Availability**: Procurement constraints, lead times, and regional availability
- **Reliability**: Uptime SLAs, redundancy options, and failover capabilities

*These are documented for completeness but not prioritized for initial implementation.*

***

### Scoring and Ranking Framework

#### Scoring Each Configuration (0-100 Scale)

Each viable configuration receives scores across the four primary criteria:

**1. Accuracy Score** (0-100):
- Based on model capability tier or benchmark scores
- Example: 7B model = 60, 70B model = 85, 405B model = 95
- Minimum threshold filter (e.g., user requires score ≥ 70)

**2. Price Score** (0-100):
- Inverse of cost: Lower cost = higher score
- Normalized: `100 * (max_cost - config_cost) / (max_cost - min_cost)`
- Maximum cost filter (e.g., user has $500/month budget ceiling)
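
As a minimal sketch, the normalization above maps directly into code; the guard for `max_cost == min_cost` (a single candidate, or identically priced candidates) is an assumption the formula itself does not spell out:

```python
def price_score(cost: float, min_cost: float, max_cost: float) -> float:
    """Normalize cost onto 0-100; the cheapest candidate scores 100."""
    if max_cost == min_cost:  # degenerate case: all candidates cost the same
        return 100.0
    return 100.0 * (max_cost - cost) / (max_cost - min_cost)
```

For example, with candidates ranging from $120 to $900/month, a $290 configuration scores roughly 78.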

**3. Latency Score** (0-100):
- Composite of TTFT, ITL, and E2E performance vs. targets
- Configurations meeting all SLOs score ≥ 90
- Near-miss configurations (10-20% over SLO) score 70-89
- Configurations further over SLO score proportionally lower
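
One possible reading of this banding, sketched per metric (the exact band edges and the sub-70 decay are design choices, not fixed by the text above):

```python
def latency_score(measured_ms: float, target_ms: float) -> float:
    """Score a single latency metric (TTFT, ITL, or E2E) against its p95 target."""
    ratio = measured_ms / target_ms
    if ratio <= 1.0:
        # SLO-compliant: 90 at exactly the target, rising toward 100 with headroom
        return 90.0 + 10.0 * (1.0 - ratio)
    if ratio <= 1.2:
        # near-miss band (up to 20% over target): linear from 89 down to 70
        return 89.0 - 19.0 * (ratio - 1.0) / 0.2
    # beyond the near-miss band: proportionally lower
    return max(0.0, 70.0 * 1.2 / ratio)
```

The composite latency score could then be the mean (or, more conservatively, the minimum) of the three per-metric scores.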

**4. Complexity Score** (0-100):
- Based on GPU count and deployment topology
- Example: 1 GPU = 100, 2 GPUs = 90, 4 GPUs = 75, 8+ GPUs = 60
- Factors: Single-node (simpler) vs. multi-node, tensor parallelism overhead
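
The tier examples above can be captured as simple lookup tables. The values are the placeholders from the text, and the fallback for GPU counts not listed (3, 5-7) is an assumption:

```python
# Accuracy tiers from the example above (placeholders until benchmark scores land)
ACCURACY_TIERS = {"7B": 60, "70B": 85, "405B": 95}

# Complexity by GPU count, from the example above
COMPLEXITY_BY_GPU_COUNT = {1: 100, 2: 90, 4: 75}

def complexity_score(gpu_count: int) -> int:
    """Fewer GPUs means a simpler deployment and a higher score."""
    if gpu_count >= 8:
        return 60
    # unlisted counts (3, 5-7) conservatively default to the 4-GPU tier
    return COMPLEXITY_BY_GPU_COUNT.get(gpu_count, 75)
```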

#### Ranking Strategies

**Primary Approach: Single-Criterion Ranking**

Show top 5 configurations for each optimization priority:

1. **Best Accuracy** - Sort by accuracy score (descending), filter by cost ceiling
2. **Lowest Cost** - Sort by price score (descending), filter by minimum accuracy
3. **Lowest Latency** - Sort by latency score (descending), filter by cost and accuracy
4. **Simplest** - Sort by complexity score (descending), filter by cost and accuracy
5. **Balanced** - Sort by weighted composite score (default weights discussed below)

**Handling the "Balanced" Recommendation**:
- **Weighted composite score** allows flexible prioritization of different criteria
- **Default weights**: Can be set based on general user preferences (e.g., 40% accuracy, 40% price, 10% latency, 10% complexity to reflect primary vs. secondary criteria)
- **User-adjustable weights**: Allow users to customize weights based on their specific priorities
- **Priority-based weighting**: Alternatively, derive weights from higher-level user priorities (e.g., "Quality first" → higher accuracy weight, "Budget conscious" → higher price weight)
- **Pareto frontier approach**: Alternative method that keeps only configurations no other option beats on all criteria (more complex to present, but avoids committing to a single weighting)
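
The weighted-composite option might look like the following sketch. The 40/40/10/10 defaults come from the bullet above, while the preset names are hypothetical examples of priority-based weighting:

```python
# Default weights reflecting primary vs. secondary criteria
DEFAULT_WEIGHTS = {"accuracy": 0.40, "price": 0.40, "latency": 0.10, "complexity": 0.10}

# Hypothetical presets for priority-based weighting
PRESETS = {
    "quality_first":    {"accuracy": 0.60, "price": 0.20, "latency": 0.10, "complexity": 0.10},
    "budget_conscious": {"accuracy": 0.20, "price": 0.60, "latency": 0.10, "complexity": 0.10},
}

def balanced_score(scores: dict, weights: dict = DEFAULT_WEIGHTS) -> float:
    """Weighted composite of the four 0-100 criterion scores."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(weights[k] * scores[k] for k in weights)
```

Swapping in a preset shifts the ranking without touching the scoring code, which is what makes priority-driven weight selection cheap to add later.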

#### Filtering Logic

**Hard Constraints** (Applied before ranking):
- Minimum accuracy score (user-specified or default)
- Maximum cost ceiling (optional user constraint)
- SLO compliance (optional: include/exclude near-miss configurations)

**Example Workflow**:
```
1. Filter all viable configs by minimum accuracy ≥ 70
2. Filter by maximum cost ≤ $500/month (if specified)
3. Score remaining configs on all 4 criteria
4. Generate ranked lists for each optimization criterion:
   - Best Accuracy (top 5)
   - Lowest Cost (top 5)
   - Lowest Latency (top 5)
   - Simplest (top 5)
   - Balanced (top 5, using weighted composite score)
5. Present as ordered lists with selectable views (tabs, dropdown, or list filters)
6. Optionally display graphical representations (e.g., Pareto frontier chart)
```
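
The workflow above could be sketched end to end as follows. The `configs` records, field names, and fixed balanced weights are illustrative assumptions; real inputs would carry the precomputed 0-100 scores from the scoring section:

```python
def ranked_views(configs, min_accuracy=70, max_cost=None, top_n=5):
    """Apply hard constraints, then build one top-N list per ranking criterion."""
    # Steps 1-2: hard filters
    viable = [c for c in configs if c["accuracy"] >= min_accuracy]
    if max_cost is not None:
        viable = [c for c in viable if c["cost_usd"] <= max_cost]

    # Step 4: one ranked list per criterion (scores are assumed precomputed)
    def top(key):
        return sorted(viable, key=key, reverse=True)[:top_n]

    def balanced(c):
        return (0.4 * c["accuracy"] + 0.4 * c["price"]
                + 0.1 * c["latency"] + 0.1 * c["complexity"])

    return {
        "best_accuracy":  top(lambda c: c["accuracy"]),
        "lowest_cost":    top(lambda c: c["price"]),
        "lowest_latency": top(lambda c: c["latency"]),
        "simplest":       top(lambda c: c["complexity"]),
        "balanced":       top(balanced),
    }
```

Each returned list maps directly onto one of the selectable views in step 5.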

**Display Options**:
- **Ordered Lists**: Primary display method showing top N configurations for each ranking criterion
- **Pareto Frontier Chart**: Graphical visualization plotting configurations in 2D/3D space (e.g., Cost vs. Accuracy, with point size representing complexity)
- **Interactive Filtering**: Allow users to adjust weights and see rankings update in real time
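
For the chart, the frontier itself is cheap to compute. A minimal sketch over the cost/accuracy plane, assuming each config carries `cost_usd` and `accuracy` fields (hypothetical names; the O(n²) scan is fine for a handful of candidates):

```python
def pareto_frontier(configs):
    """Keep configs not dominated on (cost, accuracy): a config is dropped if
    another is no more expensive and no less accurate, with one strict win."""
    def dominates(b, a):
        return (b["cost_usd"] <= a["cost_usd"] and b["accuracy"] >= a["accuracy"]
                and (b["cost_usd"] < a["cost_usd"] or b["accuracy"] > a["accuracy"]))
    return [a for a in configs
            if not any(dominates(b, a) for b in configs if b is not a)]
```

Frontier points become the plotted curve; dominated points can be grayed out so users see why they rank lower.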

#### Near-SLO Options Integration

Include configurations that slightly exceed SLO targets:
- Score latency as 70-89 (vs. 90-100 for SLO-compliant)
- Clearly mark with warning indicators
- Highlight cost/accuracy benefits compared to SLO-compliant alternatives

**Example Display**:
```
🎯 Lowest Cost (Top 5):

1. ✅ Llama-3-8B on 1x L4 - $120/month (Score: 100) [Meets SLOs]
2. ✅ Llama-3-8B on 1x A100 - $180/month (Score: 95) [Meets SLOs]
3. ⚠️ Llama-3-70B on 2x A100 - $290/month (Score: 85) [TTFT +10% over SLO]
4. ✅ Llama-3-70B on 2x H100 - $450/month (Score: 70) [Meets SLOs]
5. ✅ Llama-3-405B on 4x H100 - $900/month (Score: 40) [Meets SLOs]
```

***

### Implementation Phasing

**Phase 1**: Single-criterion ranking
- Implement 0-100 scoring for all 4 criteria
- Support minimum accuracy and maximum cost filters
- Show top 5 for "Lowest Cost" (primary view)
- Show top 5 for other criteria (Accuracy, Latency, Simplest, Balanced) as alternative views

**Phase 2**: Multi-priority views
- Add tabs/buttons for all 5 ranking modes (Accuracy, Cost, Latency, Simplest, Balanced)
- Include near-SLO configurations with clear warnings
- Show top 5 for each mode (consistent across all views)

**Phase 3**: Interactive refinement and visualization
- User-adjustable weights for the balanced score (slider controls or direct weight input)
- Visual Pareto frontier chart for multi-dimensional tradeoff exploration
  - 2D charts: Cost vs. Accuracy (most common)
  - 3D charts: Cost vs. Accuracy vs. Complexity (advanced view)
  - Interactive point selection to view full configuration details
- Sensitivity analysis (e.g., "How does the ranking change if the budget increases to $600?")
- Priority-driven weight selection (e.g., user selects "Quality first" and the system sets weights automatically)

***

### Summary: Decision Framework

**Ranking Criteria**:
1. **Accuracy** (Primary) - Model capability (metadata → benchmark scores)
2. **Price** (Primary) - TCO across deployment models (cloud, on-prem, existing)
3. **Latency** (Secondary) - SLO headroom beyond the compliance threshold (TTFT, ITL, E2E at p95)
4. **Operational Complexity** (Secondary) - GPU count and deployment topology

**Configuration Variants**:
- Quantization (FP16/INT8/INT4) as selectable options within recommendations

**Secondary Considerations**:
- GPU availability and reliability (future releases)

***

### Key Decisions for Team Discussion

1. **Scoring Scale**: Use 0-100 or 0-1 normalization?
   - **Recommendation**: 0-100 scale for better interpretability ("85/100" vs. "0.85")
   - Easier to explain to users and debug during development

2. **Balanced Score Calculation**: How to determine the "Balanced" recommendation?
   - **Option A**: Default weights reflecting primary vs. secondary criteria (e.g., 40% accuracy, 40% price, 10% latency, 10% complexity)
   - **Option B**: User-adjustable weights via UI controls (sliders or direct input)
   - **Option C**: Priority-driven weights derived from high-level user preferences ("Quality first", "Budget conscious", etc.)
   - **Option D**: Pareto frontier - avoids committing to fixed weights, but is more complex to present
   - **Recommendation**: Start with Option A for Phase 1, add Option B in Phase 3

3. **Number of Options per View**: Show top 5 for each ranking criterion?
   - **Recommendation**: Yes - provides good variety without overwhelming users
   - UI can display as an expandable list or tabbed views

4. **Minimum Accuracy Threshold**: Should there be a default minimum, or always user-specified?
   - **Option A**: No default, show all configurations
   - **Option B**: Default minimum based on use case (e.g., production = 70, development = 50)
   - **Recommendation**: Option B - prevents showing clearly inadequate models

5. **Near-SLO Configurations**: Include by default or require opt-in?
   - **Recommendation**: Include by default with clear ⚠️ warnings
   - Significant cost savings justify showing them, but they must be clearly marked
   - Allow users to toggle "Hide near-miss options" if desired

6. **Multi-Cost Model Support**: How to handle cloud vs. on-prem vs. existing GPU pricing?
   - **Option A**: Cloud GPU pricing only (standardized rates)
   - **Option B**: Also support user-provided cost parameters for on-prem/existing infrastructure
   - **Option C (Future)**: Compare cloud vs. on-prem vs. existing GPU options
   - Scoring logic remains the same; only the cost input source changes

7. **Accuracy Scoring Method**: Model tiers vs. benchmark scores?
   - **Phase 1 (Current Approach)**: Simple capability tiers (7B=60, 70B=85, 405B=95)
   - **Phase 2 (WIP)**: Integrate actual benchmark scores (MMLU, HumanEval normalized to 0-100)
   - Allows smooth evolution without changing the scoring framework

8. **Visualization Approaches**: How to present recommendations graphically?
   - **Primary**: Ordered lists for each ranking criterion (simplest, always available)
   - **Phase 3**: Pareto frontier charts for multi-dimensional tradeoff exploration
     - 2D scatter plots: Cost vs. Accuracy (most intuitive)
     - Interactive elements: Click points to view full configuration details
     - Point styling: Size/color to represent secondary criteria (complexity, latency)
   - **Benefits**: Graphical views help users understand the tradeoff space and identify optimal regions
   - **Note**: Lists remain the primary interface; charts supplement them for advanced users