Commit ee4d294

Author: Elle Mackey
Message: remove HEAL related parts of todo
Parent: b47c8cf

1 file changed

Lines changed: 5 additions & 198 deletions

TODO.md
@@ -3,14 +3,16 @@
 ## High Priority
 
 ### [Done]1. Fix Token Usage Tracking
-- [ ] Pull token info from endpoint response (Gemini model answering questions)
+- [x] Pull token info from endpoint response (Gemini model answering questions)
 - Currently showing 0 tokens for `api_input_tokens` and `api_output_tokens`
 - Need to extract from response object and store in evaluation results
 - Affects accurate cost estimation
-- [ ] Re-run cost estimation script after fix
+- [x] Re-run cost estimation script after fix
 ```bash
-python scripts/calculate_cost_estimate_multi.py eval_output/latest --comparison
+$ python scripts/show_cost.py
+
 ```
+[] Fix existing bugs caused by claude
 
 ### 2. Clean Up Repository
 - [ ] Remove old evaluation output files from repository
@@ -38,185 +40,6 @@
 - `ADVERSARIAL_CONTEXT_INJECTION_TESTS.md`
 - `JUDGE_LLM_CONSISTENCY_TESTS.md`
 
-## High Priority - Pattern Fix Loop Extensions
-
-### Scaling to Production (Multi-Agent Architecture)
-
-**Summary:** Scale pattern fix loop from POC (3-5 tickets, single agent) to production (20+ tickets, multi-agent with dynamic loading). See `docs/PATTERN_FIX_LOOP_SCALING.md` for full details.
-
-**Roadmap:**
-- **Phase 1 (POC):** ✅ Single agent, 3-5 tickets, sequential (current)
-- **Phase 2 (Specialists):** Coordinator + specialist agents (Solr, Prompt, Validator), improves iteration quality
-- **Phase 3 (Parallel):** Add parallel ticket validation, scale to 10-15 tickets, 5-10x speedup
-- **Phase 4 (Learning):** Cross-pattern learning, scale to 20+ tickets, learn from past successes
-
-**Key Techniques:**
-- **Multi-agent decomposition:** Baseline → Specialist → Validator → Aggregator (each fresh context)
-- **Dynamic loading:** Agents read from `.diagnostics/` files instead of accumulating context
-- **Parallel validation:** Validate N tickets simultaneously (N validators in background)
-- **External storage:** Iteration history, pattern metadata stored externally (not in context)
-- **Specialist agents:** SolrExpert (4K context), PromptExpert (4K), VarianceAnalyzer (3K)
-
-**Benefits:**
-- Context per agent stays ~3-5K tokens (vs single agent growing to 12K+)
-- Parallel validation: 10 tickets in 60s instead of 600s
-- Specialist expertise improves suggestion quality
-- Cross-pattern learning reduces iterations needed
-
-**Implementation Priority:** Medium (after POC proves concept)
-
-### Variance Detection and Auto-Fix (Future Enhancement)
-
-- [ ] **Add Variance-Aware Agent** - High Impact, Medium Effort
-- [ ] Implement `VarianceAnalyzer` class in `scripts/okp_mcp_variance_analyzer.py`
-- Analyze variance across multiple stability runs
-- Diagnose root cause (bad ground truth vs retrieval variance vs prompt ambiguity)
-- Suggest specific fixes based on diagnosis
-- [ ] Add variance analysis to pattern fix loop Phase 4
-- Currently: Calculates variance and reports if > 0.05
-- Enhancement: Auto-diagnose cause and suggest fix
-- [ ] Implement variance detection capabilities:
-- ✅ Compare responses for semantic similarity (bad ground truth detection)
-- ✅ Compare retrieved URLs for order variance (retrieval variance detection)
-- ✅ Detect response style variance (prompt ambiguity detection)
-- ✅ Apply appropriate fix based on diagnosis
-- [ ] See: `docs/VARIANCE_SOLUTIONS.md` for diagnostic framework
-- **Impact:** Agents can automatically detect and fix unstable answers
-- **Current State:** Agents can only see single-run metrics, cannot detect variance
-- **Priority:** Medium (useful but not critical for POC)
-
-- [ ] **Add Semantic Answer Correctness (If High Variance)** - Medium Impact, Low Effort
-- [ ] Implement hybrid answer_correctness metric in `src/lightspeed_evaluation/core/metrics/custom/semantic_answer.py`
-- Use sentence transformers for semantic similarity (fast, deterministic, no wording bias)
-- Combine with LLM judge for borderline cases (semantic similarity 0.50-0.85)
-- Weighted scoring: `0.6 * llm_score + 0.4 * semantic_score`
-- [ ] Add to VarianceAnalyzer for detecting wording-based variance:
-- If semantic_similarity > 0.90 but answer_correctness variance > 0.02 → bad ground truth
-- Auto-generate more specific expected_response using LLM
-- [ ] Add configuration option to switch between:
-- `answer_correctness_mode: "llm"` (current, good for absolute scoring)
-- `answer_correctness_mode: "semantic"` (embedding-based, no wording bias)
-- `answer_correctness_mode: "hybrid"` (best of both)
-- **When to use:** If stability checks show high variance (>0.05) due to wording differences
-- **Benefits:** Reduces variance from semantically identical but differently worded answers
-- **Tradeoff:** Pure semantic similarity less precise on factual correctness
-- **Priority:** Low (only implement if variance becomes major issue)
-
-## High Priority - RAG Testing Improvements for RHEL 10
-
-### Do These First (This Week)
-
-- [ ] **1. Add RHEL Version-Aware Metrics** (1 hour) - High Impact, Low Effort
-- [ ] Create `src/lightspeed_evaluation/core/metrics/custom/version_accuracy.py`
-- Validates RHEL version accuracy in contexts and responses
-- Checks if target version is in contexts
-- Detects wrong version in response
-- Calculates target version ratio in contexts
-- [ ] Add to `config/system.yaml` metrics_metadata
-- [ ] Set threshold to 0.8
-- **Impact:** Directly measures what we care about - are we retrieving the right version?
-
-- [ ] **2. Create RHEL 10-Specific Test Suite** (2 hours) - High Impact, Medium Effort
-- [ ] Create `config/rhel10_focused_tests.yaml`
-- [ ] Include test categories:
-- New features (bootc, performance improvements)
-- Version-specific configuration
-- Migration and upgrade paths
-- Common administrative tasks
-- Troubleshooting scenarios
-- Package management (DNF5)
-- Security (SELinux)
-- **Impact:** Focused test coverage on primary use case
-
-- [ ] **3. Add Version Markers to Test Data** (1 hour) - Medium Impact, Low Effort
-- [ ] Update all test YAML files with:
-- `target_version: "10"`
-- `version_strictness: "required|preferred|mixed"`
-- `expected_version_in_response: "10"`
-- `expected_version_in_contexts: ["10"]`
-- `forbidden_versions: ["8", "9"]`
-- [ ] Create validator in evaluation pipeline
-- **Impact:** Explicit pass/fail criteria for version correctness
-
-### Do These Next (Next 2 Weeks)
-
-- [ ] **4. Add Context Quality Metrics** (3 hours) - High Impact, Low Effort
-- [ ] Create `src/lightspeed_evaluation/core/metrics/custom/context_validation.py`
-- [ ] Implement `ContextVersionPurityMetric`
-- Measure percentage of contexts matching target version
-- [ ] Implement `ContextRecencyMetric`
-- Check if contexts are from recent documentation
-- Flag old documentation (>2 years)
-- **Impact:** Better understanding of WHY context_precision is only 42.9%
-- **Related:** Currently context_precision pass rate is 42.9%, need better validation
-
-- [ ] **5. Create Regression Test Suite from Current Failures** (2 hours) - High Impact, Medium Effort
-- [ ] Create `scripts/create_regression_suite.py`
-- Extract questions with scores below 0.5
-- Group by conversation and track worst metrics
-- Generate `config/regression_tests.yaml`
-- [ ] Track these specific questions over time
-- **Impact:** Systematic tracking of problematic questions
-- **Note:** Use this after each major evaluation run to build regression dataset
-
-- [ ] **6. Add Gemini 2.5 Flash-Specific Optimizations** (2 hours) - Medium Impact, Medium Effort
-- [ ] Create `config/gemini_optimized_system.yaml`
-- [ ] Configure Gemini-specific parameters:
-- `top_p: 0.95`
-- `top_k: 40`
-- Safety settings for technical documentation
-- [ ] Add structured prompt templates
-- System prompt emphasizing RHEL version awareness
-- Instruction to only use provided documentation
-- [ ] Use same model for judge LLM for consistency
-- **Impact:** Better alignment with Gemini's strengths
-
-### Do Eventually (Longer Term)
-
-- [ ] **7. Create Golden Dataset with Human Validation** - High Impact, High Effort
-- [ ] Select 20 critical RHEL 10 questions (most common user queries)
-- [ ] Manually validate/write perfect expected responses
-- [ ] Have RHEL experts review
-- [ ] Create `config/golden_rhel10_tests.yaml` with:
-- `quality_level: "gold"`
-- `expert_validated: true`
-- `validation_date` and `validator` fields
-- `gold_standard_response` (expert written)
-- `required_facts` (must-have information)
-- `forbidden_statements` (common misconceptions)
-- [ ] Use as high-confidence regression suite
-- **Impact:** High-confidence baseline for measuring improvements
-
-- [ ] **8. Add Failure Mode Detection** (3 hours) - Medium Impact, Low Effort
-- [ ] Create `src/lightspeed_evaluation/core/metrics/custom/failure_modes.py`
-- [ ] Implement `FailureModeDetector` to catch:
-- Version hallucination (query version ≠ response version)
-- Empty/refusal responses
-- Context ignored (response contradicts context)
-- Over-generic responses
-- Wrong doc type (KB article instead of documentation)
-- **Impact:** Better root cause analysis of failures
-
-- [ ] **9. Add Cost Tracking and Optimization** - Low Impact, Low Effort
-- [ ] Add cost tracking to evaluation pipeline:
-- Track API calls, input/output tokens
-- Calculate estimated cost using Gemini 2.5 Flash pricing
-- $0.075 per 1M input tokens
-- $0.30 per 1M output tokens
-- [ ] Add to evaluation output and summary reports
-- **Impact:** Better budget management for testing
-
-- [ ] **10. Implement Continuous Testing Dashboard** - Low Impact, High Effort
-- [ ] Create `scripts/generate_dashboard_data.py`
-- [ ] Build web dashboard to track:
-- Pass rate trends by metric over time
-- Cost per successful evaluation
-- Common failure patterns
-- Version accuracy over time
-- Per-question performance
-- [ ] Host at `http://localhost:8000/dashboard`
-- **Impact:** Long-term visibility into testing trends
-
 ## Medium Priority
 
 ### 4. Implement New Ragas Metrics
@@ -297,22 +120,6 @@ Following specs created this week:
 - [x] `scripts/calculate_cost_estimate.py`
 - [x] `scripts/calculate_cost_estimate_multi.py`
 
-## Expected Impact from RAG Testing Improvements
-
-With items 1-6 implemented:
-- **Better visibility** into version correctness (currently blind spot)
-- **Higher confidence** in test results (know WHY things fail)
-- **Faster debugging** of failures (failure mode detection)
-- **Lower costs** from focused testing (RHEL 10 specific suite)
-- **60% → 75%+ pass rate** expected on RHEL 10 temporal questions
-- **Reduced variance** in results (better test data quality)
-
-### Current Baseline (from version filtering analysis)
-- Temporal test pass rate: 60% (was 40% before version filtering)
-- Context precision: 42.9% pass rate (needs improvement)
-- Faithfulness: 42.9% pass rate (was 14.3% before filtering)
-- Version accuracy: Not currently measured (Item #1 will add this)
-
 ## Notes
 
 ### Testing Philosophy

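For readers following the token-tracking item this commit marks as done, the cost arithmetic it feeds can be sketched as below. This is a minimal illustration, not the contents of `scripts/show_cost.py` or the `calculate_cost_estimate*` scripts, which are not shown in this diff. The `TokenUsage` class and `estimate_cost` function are hypothetical names. The field names follow the `api_input_tokens` and `api_output_tokens` columns mentioned in the TODO, and the default rates are the per-million-token figures quoted in the removed item 9, which may not match current pricing.

```python
from dataclasses import dataclass
from typing import Iterable

# Rates quoted in the removed TODO item 9 (USD per 1M tokens).
# Assumption: illustrative defaults only, not verified current pricing.
INPUT_RATE_PER_M = 0.075
OUTPUT_RATE_PER_M = 0.30


@dataclass
class TokenUsage:
    """Hypothetical record; field names mirror the columns named in the TODO."""
    api_input_tokens: int
    api_output_tokens: int


def estimate_cost(
    usages: Iterable[TokenUsage],
    input_rate: float = INPUT_RATE_PER_M,
    output_rate: float = OUTPUT_RATE_PER_M,
) -> float:
    """Return the estimated USD cost for a collection of per-call usages."""
    usages = list(usages)
    total_in = sum(u.api_input_tokens for u in usages)
    total_out = sum(u.api_output_tokens for u in usages)
    # Rates are per one million tokens, so scale the weighted sum down.
    return (total_in * input_rate + total_out * output_rate) / 1_000_000


if __name__ == "__main__":
    runs = [TokenUsage(1_200_000, 300_000), TokenUsage(800_000, 200_000)]
    print(f"Estimated cost: ${estimate_cost(runs):.4f}")
```

If both token fields are recorded as 0, as the TODO describes before the fix, this estimate is 0 regardless of how many calls were made, which is why the item notes that the bug affects cost estimation.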