Commit 2b98985

Author: Elle Mackey
stashing data from running random test questions from Arin's Docta project - analysis on where queries have the highest failure rates in eval, etc to identify weaknesses in system and plan fixes
1 parent 9cdaf8d commit 2b98985

6 files changed

Lines changed: 989 additions & 0 deletions

generality/ANALYSIS_SUMMARY.md

Lines changed: 142 additions & 0 deletions
@@ -0,0 +1,142 @@
# Generality Test Failure Analysis Summary

## Test Overview
- **Questions**: 20 RHEL 10 benchmark questions
- **Runs**: 3 consecutive evaluations
- **Metric analyzed**: `custom:answer_correctness`
- **Overall failure rate**: ~48% (29 of 60 evaluations, about 10 of 20 per run)

## Key Findings

### 1. Query Characteristics Strongly Predict Failure ⚠️

**Misspellings** (Hypothesis 1a):
- **72.7%** of failed questions have misspelled queries
- Only **22.2%** of passed questions have misspellings
- **+50.5 percentage-point difference**, the strongest predictor of failure

**Query Length** (Hypothesis 1b):
- Failed: 91% SHORT/MEDIUM queries (avg 86 chars)
- Passed: 67% LONG queries (avg 235 chars)
- **-57.6 percentage-point difference** in the share of LONG queries (failed vs. passed)

**Perfect Grammar**:
- **0%** of failed questions have perfect grammar
- **33.3%** of passed questions have perfect grammar

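The gaps quoted above are differences between two group rates, so they are percentage points rather than relative percentages. A minimal sketch of that arithmetic follows. The summary reports only percentages, so the group counts used here (8 of 11 failed and 2 of 9 passed questions for misspellings, 1 of 11 and 6 of 9 for LONG queries) are assumptions chosen to reproduce them, and `rate_gap_pp` is a hypothetical helper, not part of the committed scripts.

```python
def rate_gap_pp(failed_hits, failed_total, passed_hits, passed_total):
    """Percentage-point gap between the failed-group and passed-group rates."""
    failed_rate = 100 * failed_hits / failed_total
    passed_rate = 100 * passed_hits / passed_total
    return round(failed_rate - passed_rate, 1)


# Assumed counts (not stated in the summary) that reproduce the reported rates.
misspelling_gap = rate_gap_pp(8, 11, 2, 9)   # 72.7% vs 22.2%
long_query_gap = rate_gap_pp(1, 11, 6, 9)    # 9.1% vs 66.7%

print(misspelling_gap)  # 50.5
print(long_query_gap)   # -57.6
```

With only 20 questions, each gap rests on a handful of queries, so it is worth reporting the raw counts next to the percentages.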
### 2. RAG_BYPASS Patterns

**Successful RAG_BYPASS** (passed without context):
- 4/31 passed evaluations (12.9%): the model's parametric knowledge was sufficient
- Example: Q13 (UBI containers) passed without context in Run 3

**Failed RAG_BYPASS** (failed without context):
- 5/29 failed evaluations (17.2%): parametric knowledge was insufficient
- **Critical finding**: Q03, Q05, and Q13 had **no tool calls logged**
  - The API routing layer decided not to use RAG at all
  - This is not a "missing docs" issue: the tools were never invoked

### 3. Tool Usage Analysis 🔍

**Tools WERE called** for 17/20 questions:
- `search_portal` (OKP MCP) - used for retrieval
- `get_document` (OKP MCP) - used for document fetching
- `mcp_list_tools` - OKP server initialization

**Tools NOT called** for 3/20 questions:
- Q03: "How do I go about submmiting feedback through Jira?" (misspelled)
- Q05: "How does dual RAiD provide redudancy in an active/passive configeration?" (misspelled)
- Q13: "As someone managing... UBI-based..." (LONG query, passed in Run 3)

**Why no tools?**
- The API routing layer decided the query could be answered without RAG
- Possible cause: the questions looked simple or generic
- Possible cause: misspellings made the query too unclear for retrieval

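One way to list the questions that never reached retrieval is to group evaluation rows by question and keep those whose `contexts` field is empty in every run. The sketch below treats an empty `contexts` value as a proxy for "no tools called", which is an assumption. The column names come from `analyze_contexts.py`, while the sample rows, the `gen_q03`-style IDs, and the `never_retrieved` helper are hypothetical.

```python
from collections import defaultdict


def is_empty_context(ctx):
    """Apply the same emptiness rule as analyze_contexts.py: null, '' or '[]'."""
    return ctx is None or ctx in ("", "[]")


def never_retrieved(rows):
    """Return the IDs whose evaluations all have empty contexts."""
    flags_by_id = defaultdict(list)
    for row in rows:
        flags_by_id[row["conversation_group_id"]].append(
            is_empty_context(row.get("contexts"))
        )
    return sorted(cid for cid, flags in flags_by_id.items() if all(flags))


# Hypothetical rows for illustration only, not data from the evaluation runs.
rows = [
    {"conversation_group_id": "gen_q02", "contexts": '["doc a", "doc b"]', "result": "FAIL"},
    {"conversation_group_id": "gen_q03", "contexts": "[]", "result": "FAIL"},
    {"conversation_group_id": "gen_q03", "contexts": None, "result": "FAIL"},
    {"conversation_group_id": "gen_q13", "contexts": "", "result": "FAIL"},
    {"conversation_group_id": "gen_q13", "contexts": '["doc c"]', "result": "PASS"},
]

print(never_retrieved(rows))  # ['gen_q03']
```

A question that appears in this list for all three runs is a candidate for the routing investigation. One that has context in at least one run points to run-to-run variance in the routing decision instead.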
### 4. Context Quality Issues (82.8% of failures)

**Most failures (24/29) HAD contexts retrieved**, which points to one or more of:
1. Wrong or irrelevant documents retrieved
2. Retrieved docs carrying deprecation warnings (see the Q02 example)
3. The LLM failing to synthesize a correct answer from the provided context
4. A ground truth mismatch

**Example - Q02** (failed WITH context):
- Query: "How can I report Red Hat documentation errors using Jira?"
- Tools: `search_portal` called successfully
- Contexts: 7 documents retrieved (with deprecation warnings)
- Result: still FAILED despite having context

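To check how often deprecation warnings appear in retrieved contexts, the `contexts` field can be parsed and each document scanned for a marker string. This is a rough sketch under two assumptions: `contexts` holds a JSON list, as `analyze_contexts.py` expects, and a warning can be detected by the substring `deprecat`. The actual wording of the warnings is not shown in the analysis, and `deprecation_share` is a hypothetical helper.

```python
import json


def deprecation_share(contexts_json, marker="deprecat"):
    """Return (flagged, total) for documents whose text contains the marker."""
    docs = json.loads(contexts_json)
    flagged = sum(1 for doc in docs if marker in str(doc).lower())
    return flagged, len(docs)


# Hypothetical contexts payload for illustration; real documents will differ.
sample = json.dumps([
    {"title": "Reporting issues", "text": "Warning: this procedure is deprecated."},
    {"title": "Using Jira", "text": "Open the project and create an issue."},
    "Deprecation notice: the feedback form has moved.",
])

print(deprecation_share(sample))  # (2, 3)
```

Comparing this share between passed and failed evaluations would show whether deprecation warnings are actually associated with failures or are simply common in the corpus.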
## Root Cause Breakdown

| Root Cause | % of Failures | Evidence |
|------------|---------------|----------|
| Misspellings | 72.7% of failed questions | Q03, Q05, Q06, Q08, Q12, Q16, Q19, Q20 |
| Short/medium queries | 90.9% of failed questions | Combined with misspellings |
| No RAG invoked | 17.2% of failed evaluations (5/29) | Q03, Q05, Q13 (no tool calls) |
| Poor context quality | 82.8% of failed evaluations (24/29) | Had context but still failed |

## Specific Failed Questions

### Consistently Failed (All 3 Runs)
- **Q03**: "submmiting feedback" - no tools called, misspelled
- **Q05**: "RAiD...redudancy...configeration" - no tools called, multiple misspellings
- **Q02**: "report... errors using Jira" - tools called, context retrieved, still failed
- **Q08**: "pcp-zero-conf package" - tools called, context retrieved, still failed

### Variable Failures
- **Q13**: failed in Runs 1 and 2 (no tools), passed in Run 3 (parametric knowledge)
- **Q06**: "samba-bgq service" (misspelled: `bgq` should be `bgqd`)
- **Q12**: "recomendation" (misspelled)

## Hypotheses Validation

### ✅ Hypothesis 1: Query Characteristics
**CONFIRMED** - Strong correlation between failures and:
- Misspellings (+50.5 percentage points among failures)
- Short queries (LONG queries are 57.6 percentage points less common among failures)
- Poor grammar

### ⚠️ Hypothesis 2: Missing OKP Docs
**PARTIALLY CONFIRMED** - But nuanced:
- Docs are **NOT** missing from OKP Solr
- Rather, API routing doesn't invoke RAG for certain queries
- When RAG IS invoked (17/20 questions), documents ARE retrieved
- The issue is more about:
  1. **Routing decision** (3 questions never triggered a search)
  2. **Context quality** (24 of 29 failed evaluations had docs but still failed)

## Recommendations

1. **Improve Spelling Correction**
   - Implement query preprocessing to fix common misspellings
   - Test: "submmiting" → "submitting", "RAiD" → "RAID"

2. **Investigate RAG Routing Logic**
   - Why did Q03 and Q05 not trigger RAG?
   - Are misspellings preventing RAG invocation?
   - Consider lowering the threshold for RAG engagement

3. **Context Quality**
   - 82.8% of failures HAD context, so why didn't it help?
   - Are the retrieved docs relevant?
   - Are deprecation warnings confusing the LLM?

4. **Query Expansion**
   - Short queries may need expansion or rephrasing
   - Test: expand "AD trust FIPS mode, what do?" into a proper question

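The spelling-correction recommendation could be prototyped with the standard library before committing to a dedicated spell checker. The sketch below uses `difflib.get_close_matches` to snap each word to a small domain vocabulary. The vocabulary, the `0.8` cutoff, and the `correct_query` helper are all assumptions for illustration; a real vocabulary would more plausibly be derived from the document corpus.

```python
import difflib
import re

# Hypothetical domain vocabulary; a real one would be built from the docs.
VOCAB = ["submitting", "feedback", "redundancy", "configuration",
         "recommendation", "RAID"]


def correct_query(query, vocab=VOCAB, cutoff=0.8):
    """Replace each word with its closest vocabulary entry, if one is close enough."""
    canonical = {word.lower(): word for word in vocab}

    def fix(match):
        word = match.group(0)
        key = word.lower()
        if key in canonical:
            # Exact match ignoring case: restore canonical casing ("RAiD" -> "RAID").
            return canonical[key]
        close = difflib.get_close_matches(key, list(canonical), n=1, cutoff=cutoff)
        return canonical[close[0]] if close else word

    return re.sub(r"[A-Za-z]+", fix, query)


print(correct_query("How do I go about submmiting feedback through Jira?"))
print(correct_query(
    "How does dual RAiD provide redudancy in an active/passive configeration?"
))
```

A high cutoff keeps ordinary words untouched at the cost of missing heavier misspellings. Short product tokens such as `bgq` versus `bgqd` are a poor fit for similarity matching and may need an explicit alias table instead.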
## Files Generated

- `analyze_failures.py` - Query characteristics analysis
- `analyze_contexts.py` - Context retrieval analysis (RAG_BYPASS)
- `analyze_tool_calls.py` - Tool invocation analysis
- `failure_analysis.json` - Detailed failure data
- `context_analysis.json` - Context availability stats

## Next Steps

1. Test with spelling-corrected queries
2. Investigate why certain queries bypass RAG
3. Examine quality of retrieved contexts for failed questions
4. Compare ground truth vs. actual responses for context-rich failures

generality/analyze_contexts.py

Lines changed: 173 additions & 0 deletions
@@ -0,0 +1,173 @@
#!/usr/bin/env python3
"""
Analyze whether contexts are being returned from OKP MCP.

If contexts are empty/missing, it suggests either:
1. OKP MCP is not being called
2. OKP database doesn't have relevant documents
3. Query routing/retrieval is failing
"""

import json
from pathlib import Path

import pandas as pd


def pct(part, total):
    """Return part/total as a percentage, or 0.0 when total is zero."""
    return 100 * part / total if total else 0.0


def analyze_contexts(csv_path: Path, run_name: str = ""):
    """Analyze contexts in evaluation results."""

    # Load the CSV
    df = pd.read_csv(csv_path)

    # Filter for answer_correctness metric
    ac_df = df[df['metric_identifier'] == 'custom:answer_correctness'].copy()

    # A context counts as present unless it is null, '' or an empty JSON list
    ac_df['has_context'] = ac_df['contexts'].notna() & (ac_df['contexts'] != '') & (ac_df['contexts'] != '[]')

    print("=" * 80)
    print(f"CONTEXTS ANALYSIS {run_name}")
    print("=" * 80)

    print(f"\nTotal answer_correctness evaluations: {len(ac_df)}")
    print(f"With contexts: {ac_df['has_context'].sum()}")
    print(f"Without contexts: {(~ac_df['has_context']).sum()}")

    # Break down by pass/fail
    print("\n" + "=" * 80)
    print("CONTEXTS BY RESULT")
    print("=" * 80)

    results_summary = {}
    for result in ['PASS', 'FAIL']:
        subset = ac_df[ac_df['result'] == result]
        with_ctx = int(subset['has_context'].sum())
        total = len(subset)

        results_summary[result] = {
            'total': total,
            'with_contexts': with_ctx,
            'without_contexts': total - with_ctx,
        }

        # pct() avoids a division by zero when a run has no PASS or no FAIL rows
        print(f"\n{result}:")
        print(f" Total: {total}")
        print(f" With contexts: {with_ctx} ({pct(with_ctx, total):.1f}%)")
        print(f" Without contexts: {total - with_ctx} ({pct(total - with_ctx, total):.1f}%)")

    # Show sample contexts for failed questions
    print("\n" + "=" * 80)
    print("SAMPLE FAILED QUESTIONS - CONTEXT CHECK")
    print("=" * 80)

    failed = ac_df[ac_df['result'] == 'FAIL']
    for _, row in failed.head(5).iterrows():
        conv_id = row['conversation_group_id']
        # Extract question number
        q_num = conv_id.split('_q')[-1] if '_q' in conv_id else '?'

        print(f"\nQ{q_num} - {conv_id} (FAIL):")
        print(f" Query: {row['query'][:100]}{'...' if len(row['query']) > 100 else ''}")

        ctx = row['contexts']
        if pd.isna(ctx) or ctx == '' or ctx == '[]':
            print(" Contexts: ❌ EMPTY/NULL - No docs retrieved!")
        else:
            # Try to parse as JSON to see structure
            try:
                ctx_data = json.loads(ctx) if isinstance(ctx, str) else ctx
                if isinstance(ctx_data, list):
                    print(f" Contexts: ✓ {len(ctx_data)} documents retrieved")
                    if ctx_data:
                        # Show first doc preview
                        first_doc = str(ctx_data[0])[:200]
                        print(f" Preview: {first_doc}...")
                else:
                    print(f" Contexts: ? Unexpected type: {type(ctx_data)}")
            except (ValueError, TypeError) as e:
                # json.JSONDecodeError is a subclass of ValueError
                print(f" Contexts: ? Parse error: {e}")
                print(f" Raw: {str(ctx)[:100]}...")

    return results_summary


def main():
    """Analyze contexts across all runs."""
    current_dir = Path.cwd()
    if current_dir.name != 'generality':
        current_dir = Path(__file__).parent

    all_results = {}

    # Analyze each run
    for run_num in [1, 2, 3]:
        run_dir = current_dir / f"run{run_num}"
        # Sort so the chosen file is deterministic when several match
        csv_files = sorted(run_dir.glob("*_detailed.csv"))

        if csv_files:
            print(f"\n{'='*80}")
            print(f"RUN {run_num}")
            print(f"{'='*80}")
            results = analyze_contexts(csv_files[0], f"- Run {run_num}")
            all_results[f"run{run_num}"] = results

    # Summary across all runs
    print("\n" + "=" * 80)
    print("SUMMARY ACROSS ALL RUNS")
    print("=" * 80)

    total_fails = sum(r['FAIL']['total'] for r in all_results.values())
    total_fails_no_ctx = sum(r['FAIL']['without_contexts'] for r in all_results.values())

    total_pass = sum(r['PASS']['total'] for r in all_results.values())
    total_pass_no_ctx = sum(r['PASS']['without_contexts'] for r in all_results.values())

    # pct() keeps the summary from crashing when no run directories were found
    print("\nFAILED questions:")
    print(f" Total: {total_fails}")
    print(f" Without contexts: {total_fails_no_ctx} ({pct(total_fails_no_ctx, total_fails):.1f}%)")

    print("\nPASSED questions:")
    print(f" Total: {total_pass}")
    print(f" Without contexts: {total_pass_no_ctx} ({pct(total_pass_no_ctx, total_pass):.1f}%)")

    print("\n" + "=" * 80)
    print("CONCLUSIONS")
    print("=" * 80)

    print("\n📊 RAG_BYPASS Analysis:")
    print(f" PASSED without context (successful RAG_BYPASS): {total_pass_no_ctx}/{total_pass} ({pct(total_pass_no_ctx, total_pass):.1f}%)")
    print(" - Model used parametric knowledge successfully")

    print(f"\n FAILED without context (missing docs): {total_fails_no_ctx}/{total_fails} ({pct(total_fails_no_ctx, total_fails):.1f}%)")
    print(" - OKP retrieval failed, AND parametric knowledge insufficient")

    if total_fails_no_ctx > 0:
        print("\n⚠️ Failed questions with NO contexts suggest:")
        print(" - Documents don't exist in OKP Solr database")
        print(" - Query routing/retrieval is failing")
        print(" - Misspellings preventing successful search")

    fails_with_ctx = total_fails - total_fails_no_ctx
    if fails_with_ctx > 0:
        print(f"\n⚠️ Failed questions WITH contexts ({fails_with_ctx}/{total_fails}, {pct(fails_with_ctx, total_fails):.1f}%):")
        print(" Issue is likely:")
        print(" - Context quality/relevance (wrong docs retrieved)")
        print(" - LLM answer generation from provided context")
        print(" - Ground truth mismatch")

    # Save results (convert int64 to int for JSON serialization)
    output_file = current_dir / "context_analysis.json"
    serializable_results = {}
    for run, data in all_results.items():
        serializable_results[run] = {}
        for result, stats in data.items():
            serializable_results[run][result] = {k: int(v) for k, v in stats.items()}

    with open(output_file, 'w') as f:
        json.dump(serializable_results, f, indent=2)

    print(f"\n✓ Detailed results saved to: {output_file}")


if __name__ == "__main__":
    main()
