|
3 | 3 | ## High Priority |
4 | 4 |
|
5 | 5 | ### [Done] 1. Fix Token Usage Tracking |
6 | | -- [ ] Pull token info from endpoint response (Gemini model answering questions) |
| 6 | +- [x] Pull token info from endpoint response (Gemini model answering questions) |
7 | 7 | - Currently showing 0 tokens for `api_input_tokens` and `api_output_tokens` |
8 | 8 | - Need to extract from response object and store in evaluation results |
9 | 9 | - Affects accurate cost estimation |
10 | | -- [ ] Re-run cost estimation script after fix |
| 10 | +- [x] Re-run cost estimation script after fix |
11 | 11 | ```bash |
12 | | - python scripts/calculate_cost_estimate_multi.py eval_output/latest --comparison |
| 12 | + $ python scripts/show_cost.py |
| 13 | + |
13 | 14 | ``` |
| 15 | +- [ ] Fix existing bugs caused by Claude
14 | 16 |
|
15 | 17 | ### 2. Clean Up Repository |
16 | 18 | - [ ] Remove old evaluation output files from repository |
|
38 | 40 | - `ADVERSARIAL_CONTEXT_INJECTION_TESTS.md` |
39 | 41 | - `JUDGE_LLM_CONSISTENCY_TESTS.md` |
40 | 42 |
|
41 | | -## High Priority - Pattern Fix Loop Extensions |
42 | | - |
43 | | -### Scaling to Production (Multi-Agent Architecture) |
44 | | - |
45 | | -**Summary:** Scale pattern fix loop from POC (3-5 tickets, single agent) to production (20+ tickets, multi-agent with dynamic loading). See `docs/PATTERN_FIX_LOOP_SCALING.md` for full details. |
46 | | - |
47 | | -**Roadmap:** |
48 | | -- **Phase 1 (POC):** ✅ Single agent, 3-5 tickets, sequential (current) |
49 | | -- **Phase 2 (Specialists):** Coordinator + specialist agents (Solr, Prompt, Validator), improves iteration quality |
50 | | -- **Phase 3 (Parallel):** Add parallel ticket validation, scale to 10-15 tickets, 5-10x speedup |
51 | | -- **Phase 4 (Learning):** Cross-pattern learning, scale to 20+ tickets, learn from past successes |
52 | | - |
53 | | -**Key Techniques:** |
54 | | -- **Multi-agent decomposition:** Baseline → Specialist → Validator → Aggregator (each fresh context) |
55 | | -- **Dynamic loading:** Agents read from `.diagnostics/` files instead of accumulating context |
56 | | -- **Parallel validation:** Validate N tickets simultaneously (N validators in background) |
57 | | -- **External storage:** Iteration history, pattern metadata stored externally (not in context) |
58 | | -- **Specialist agents:** SolrExpert (4K context), PromptExpert (4K), VarianceAnalyzer (3K) |
59 | | - |
60 | | -**Benefits:** |
61 | | -- Context per agent stays ~3-5K tokens (vs single agent growing to 12K+) |
62 | | -- Parallel validation: 10 tickets in 60s instead of 600s |
63 | | -- Specialist expertise improves suggestion quality |
64 | | -- Cross-pattern learning reduces iterations needed |
65 | | - |
66 | | -**Implementation Priority:** Medium (after POC proves concept) |
67 | | - |
68 | | -### Variance Detection and Auto-Fix (Future Enhancement) |
69 | | - |
70 | | -- [ ] **Add Variance-Aware Agent** - High Impact, Medium Effort |
71 | | - - [ ] Implement `VarianceAnalyzer` class in `scripts/okp_mcp_variance_analyzer.py` |
72 | | - - Analyze variance across multiple stability runs |
73 | | - - Diagnose root cause (bad ground truth vs retrieval variance vs prompt ambiguity) |
74 | | - - Suggest specific fixes based on diagnosis |
75 | | - - [ ] Add variance analysis to pattern fix loop Phase 4 |
76 | | - - Currently: Calculates variance and reports if > 0.05 |
77 | | - - Enhancement: Auto-diagnose cause and suggest fix |
78 | | - - [ ] Implement variance detection capabilities: |
79 | | - - ✅ Compare responses for semantic similarity (bad ground truth detection) |
80 | | - - ✅ Compare retrieved URLs for order variance (retrieval variance detection) |
81 | | - - ✅ Detect response style variance (prompt ambiguity detection) |
82 | | - - ✅ Apply appropriate fix based on diagnosis |
83 | | - - [ ] See: `docs/VARIANCE_SOLUTIONS.md` for diagnostic framework |
84 | | - - **Impact:** Agents can automatically detect and fix unstable answers |
85 | | - - **Current State:** Agents can only see single-run metrics, cannot detect variance |
86 | | - - **Priority:** Medium (useful but not critical for POC) |
87 | | - |
88 | | -- [ ] **Add Semantic Answer Correctness (If High Variance)** - Medium Impact, Low Effort |
89 | | - - [ ] Implement hybrid answer_correctness metric in `src/lightspeed_evaluation/core/metrics/custom/semantic_answer.py` |
90 | | - - Use sentence transformers for semantic similarity (fast, deterministic, no wording bias) |
91 | | - - Combine with LLM judge for borderline cases (semantic similarity 0.50-0.85) |
92 | | - - Weighted scoring: `0.6 * llm_score + 0.4 * semantic_score` |
93 | | - - [ ] Add to VarianceAnalyzer for detecting wording-based variance: |
94 | | - - If semantic_similarity > 0.90 but answer_correctness variance > 0.02 → bad ground truth |
95 | | - - Auto-generate more specific expected_response using LLM |
96 | | - - [ ] Add configuration option to switch between: |
97 | | - - `answer_correctness_mode: "llm"` (current, good for absolute scoring) |
98 | | - - `answer_correctness_mode: "semantic"` (embedding-based, no wording bias) |
99 | | - - `answer_correctness_mode: "hybrid"` (best of both) |
100 | | - - **When to use:** If stability checks show high variance (>0.05) due to wording differences |
101 | | - - **Benefits:** Reduces variance from semantically identical but differently worded answers |
102 | | - - **Tradeoff:** Pure semantic similarity less precise on factual correctness |
103 | | - - **Priority:** Low (only implement if variance becomes major issue) |
104 | | - |
105 | | -## High Priority - RAG Testing Improvements for RHEL 10 |
106 | | - |
107 | | -### Do These First (This Week) |
108 | | - |
109 | | -- [ ] **1. Add RHEL Version-Aware Metrics** (1 hour) - High Impact, Low Effort |
110 | | - - [ ] Create `src/lightspeed_evaluation/core/metrics/custom/version_accuracy.py` |
111 | | - - Validates RHEL version accuracy in contexts and responses |
112 | | - - Checks if target version is in contexts |
113 | | - - Detects wrong version in response |
114 | | - - Calculates target version ratio in contexts |
115 | | - - [ ] Add to `config/system.yaml` metrics_metadata |
116 | | - - [ ] Set threshold to 0.8 |
117 | | - - **Impact:** Directly measures what we care about - are we retrieving the right version? |
118 | | - |
119 | | -- [ ] **2. Create RHEL 10-Specific Test Suite** (2 hours) - High Impact, Medium Effort |
120 | | - - [ ] Create `config/rhel10_focused_tests.yaml` |
121 | | - - [ ] Include test categories: |
122 | | - - New features (bootc, performance improvements) |
123 | | - - Version-specific configuration |
124 | | - - Migration and upgrade paths |
125 | | - - Common administrative tasks |
126 | | - - Troubleshooting scenarios |
127 | | - - Package management (DNF5) |
128 | | - - Security (SELinux) |
129 | | - - **Impact:** Focused test coverage on primary use case |
130 | | - |
131 | | -- [ ] **3. Add Version Markers to Test Data** (1 hour) - Medium Impact, Low Effort |
132 | | - - [ ] Update all test YAML files with: |
133 | | - - `target_version: "10"` |
134 | | - - `version_strictness: "required|preferred|mixed"` |
135 | | - - `expected_version_in_response: "10"` |
136 | | - - `expected_version_in_contexts: ["10"]` |
137 | | - - `forbidden_versions: ["8", "9"]` |
138 | | - - [ ] Create validator in evaluation pipeline |
139 | | - - **Impact:** Explicit pass/fail criteria for version correctness |
140 | | - |
141 | | -### Do These Next (Next 2 Weeks) |
142 | | - |
143 | | -- [ ] **4. Add Context Quality Metrics** (3 hours) - High Impact, Low Effort |
144 | | - - [ ] Create `src/lightspeed_evaluation/core/metrics/custom/context_validation.py` |
145 | | - - [ ] Implement `ContextVersionPurityMetric` |
146 | | - - Measure percentage of contexts matching target version |
147 | | - - [ ] Implement `ContextRecencyMetric` |
148 | | - - Check if contexts are from recent documentation |
149 | | - - Flag old documentation (>2 years) |
150 | | - - **Impact:** Better understanding of WHY context_precision is only 42.9% |
151 | | - - **Related:** Currently context_precision pass rate is 42.9%, need better validation |
152 | | - |
153 | | -- [ ] **5. Create Regression Test Suite from Current Failures** (2 hours) - High Impact, Medium Effort |
154 | | - - [ ] Create `scripts/create_regression_suite.py` |
155 | | - - Extract questions with scores below 0.5 |
156 | | - - Group by conversation and track worst metrics |
157 | | - - Generate `config/regression_tests.yaml` |
158 | | - - [ ] Track these specific questions over time |
159 | | - - **Impact:** Systematic tracking of problematic questions |
160 | | - - **Note:** Use this after each major evaluation run to build regression dataset |
161 | | - |
162 | | -- [ ] **6. Add Gemini 2.5 Flash-Specific Optimizations** (2 hours) - Medium Impact, Medium Effort |
163 | | - - [ ] Create `config/gemini_optimized_system.yaml` |
164 | | - - [ ] Configure Gemini-specific parameters: |
165 | | - - `top_p: 0.95` |
166 | | - - `top_k: 40` |
167 | | - - Safety settings for technical documentation |
168 | | - - [ ] Add structured prompt templates |
169 | | - - System prompt emphasizing RHEL version awareness |
170 | | - - Instruction to only use provided documentation |
171 | | - - [ ] Use same model for judge LLM for consistency |
172 | | - - **Impact:** Better alignment with Gemini's strengths |
173 | | - |
174 | | -### Do Eventually (Longer Term) |
175 | | - |
176 | | -- [ ] **7. Create Golden Dataset with Human Validation** - High Impact, High Effort |
177 | | - - [ ] Select 20 critical RHEL 10 questions (most common user queries) |
178 | | - - [ ] Manually validate/write perfect expected responses |
179 | | - - [ ] Have RHEL experts review |
180 | | - - [ ] Create `config/golden_rhel10_tests.yaml` with: |
181 | | - - `quality_level: "gold"` |
182 | | - - `expert_validated: true` |
183 | | - - `validation_date` and `validator` fields |
184 | | - - `gold_standard_response` (expert written) |
185 | | - - `required_facts` (must-have information) |
186 | | - - `forbidden_statements` (common misconceptions) |
187 | | - - [ ] Use as high-confidence regression suite |
188 | | - - **Impact:** High-confidence baseline for measuring improvements |
189 | | - |
190 | | -- [ ] **8. Add Failure Mode Detection** (3 hours) - Medium Impact, Low Effort |
191 | | - - [ ] Create `src/lightspeed_evaluation/core/metrics/custom/failure_modes.py` |
192 | | - - [ ] Implement `FailureModeDetector` to catch: |
193 | | - - Version hallucination (query version ≠ response version) |
194 | | - - Empty/refusal responses |
195 | | - - Context ignored (response contradicts context) |
196 | | - - Over-generic responses |
197 | | - - Wrong doc type (KB article instead of documentation) |
198 | | - - **Impact:** Better root cause analysis of failures |
199 | | - |
200 | | -- [ ] **9. Add Cost Tracking and Optimization** - Low Impact, Low Effort |
201 | | - - [ ] Add cost tracking to evaluation pipeline: |
202 | | - - Track API calls, input/output tokens |
203 | | - - Calculate estimated cost using Gemini 2.5 Flash pricing |
204 | | - - $0.075 per 1M input tokens |
205 | | - - $0.30 per 1M output tokens |
206 | | - - [ ] Add to evaluation output and summary reports |
207 | | - - **Impact:** Better budget management for testing |
208 | | - |
209 | | -- [ ] **10. Implement Continuous Testing Dashboard** - Low Impact, High Effort |
210 | | - - [ ] Create `scripts/generate_dashboard_data.py` |
211 | | - - [ ] Build web dashboard to track: |
212 | | - - Pass rate trends by metric over time |
213 | | - - Cost per successful evaluation |
214 | | - - Common failure patterns |
215 | | - - Version accuracy over time |
216 | | - - Per-question performance |
217 | | - - [ ] Host at `http://localhost:8000/dashboard` |
218 | | - - **Impact:** Long-term visibility into testing trends |
219 | | - |
220 | 43 | ## Medium Priority |
221 | 44 |
|
222 | 45 | ### 4. Implement New Ragas Metrics |
@@ -297,22 +120,6 @@ Following specs created this week: |
297 | 120 | - [x] `scripts/calculate_cost_estimate.py` |
298 | 121 | - [x] `scripts/calculate_cost_estimate_multi.py` |
299 | 122 |
|
300 | | -## Expected Impact from RAG Testing Improvements |
301 | | - |
302 | | -With items 1-6 implemented: |
303 | | -- **Better visibility** into version correctness (currently blind spot) |
304 | | -- **Higher confidence** in test results (know WHY things fail) |
305 | | -- **Faster debugging** of failures (failure mode detection) |
306 | | -- **Lower costs** from focused testing (RHEL 10 specific suite) |
307 | | -- **60% → 75%+ pass rate** expected on RHEL 10 temporal questions |
308 | | -- **Reduced variance** in results (better test data quality) |
309 | | - |
310 | | -### Current Baseline (from version filtering analysis) |
311 | | -- Temporal test pass rate: 60% (was 40% before version filtering) |
312 | | -- Context precision: 42.9% pass rate (needs improvement) |
313 | | -- Faithfulness: 42.9% pass rate (was 14.3% before filtering) |
314 | | -- Version accuracy: Not currently measured (Item #1 will add this) |
315 | | - |
316 | 123 | ## Notes |
317 | 124 |
|
318 | 125 | ### Testing Philosophy |
|