|
| 1 | +# Generality Test Failure Analysis Summary |
| 2 | + |
| 3 | +## Test Overview |
| 4 | +- **Questions**: 20 RHEL 10 benchmark questions |
| 5 | +- **Runs**: 3 consecutive evaluations |
| 6 | +- **Metric analyzed**: custom:answer_correctness |
| 7 | +- **Overall failure rate**: ~50% (10/20 per run) |
| 8 | + |
| 9 | +## Key Findings |
| 10 | + |
| 11 | +### 1. Query Characteristics Strongly Predict Failure ⚠️ |
| 12 | + |
| 13 | +**Misspellings** (Hypothesis 1a): |
| 14 | +- **72.7%** of failed questions have MISSPELLED queries |
| 15 | +- Only **22.2%** of passed questions have misspellings |
| 16 | +- **+50.5 percentage-point difference** - strongest predictor of failure |
| 17 | + |
| 18 | +**Query Length** (Hypothesis 1b): |
| 19 | +- Failed: 91% SHORT/MEDIUM queries (avg 86 chars) |
| 20 | +- Passed: 67% LONG queries (avg 235 chars) |
| 21 | +- **-57.6 percentage-point difference** in the share of LONG queries (9% of failed vs. 67% of passed) |
| 22 | + |
| 23 | +**Perfect Grammar**: |
| 24 | +- **0%** of failed questions have perfect grammar |
| 25 | +- **33.3%** of passed questions have perfect grammar |
| 26 | + |
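The differences above are gaps between two shares, measured in percentage points. The short check below reproduces them. The underlying counts are assumptions inferred from the reported percentages (11 questions in the failed group and 9 in the passed group, with 8 and 2 misspelled, 1 and 6 LONG, 0 and 3 with perfect grammar); the summary itself states only the percentages.

```python
def share(count: int, total: int) -> float:
    """Return count/total expressed as a percentage."""
    return 100.0 * count / total


# Assumed group sizes, inferred from the reported percentages
# (8/11 = 72.7%, 2/9 = 22.2%); they are not stated in the summary.
FAILED, PASSED = 11, 9

# Each gap is (share among failed) minus (share among passed).
misspelled_gap = share(8, FAILED) - share(2, PASSED)
long_gap = share(1, FAILED) - share(6, PASSED)
grammar_gap = share(0, FAILED) - share(3, PASSED)

print(f"Misspelled:      {misspelled_gap:+.1f} points")
print(f"LONG queries:    {long_gap:+.1f} points")
print(f"Perfect grammar: {grammar_gap:+.1f} points")
```

With groups this small, a single question moves a share by roughly 9 to 11 points, so the gaps are best read as indicative rather than precise.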
| 27 | +### 2. RAG_BYPASS Patterns |
| 28 | + |
| 29 | +**Successful RAG_BYPASS** (Passed without context): |
| 30 | +- 4/31 passed evaluations (12.9%) - the model's parametric knowledge was sufficient |
| 31 | +- Example: Q13 (UBI containers) - passed WITHOUT context |
| 32 | + |
| 33 | +**Failed RAG_BYPASS** (Failed without context): |
| 34 | +- 5/29 failed evaluations (17.2%) - parametric knowledge was insufficient |
| 35 | +- **Critical finding**: Q03, Q05, Q13 had **NO tool calls logged** |
| 36 | + - API routing decided not to use RAG at all |
| 37 | + - Not a "missing docs" issue - tools were never invoked |
| 38 | + |
| 39 | +### 3. Tool Usage Analysis 🔍 |
| 40 | + |
| 41 | +**Tools WERE called** for 17/20 questions: |
| 42 | +- `search_portal` (OKP MCP) - used for retrieval |
| 43 | +- `get_document` (OKP MCP) - used for document fetching |
| 44 | +- `mcp_list_tools` - OKP server initialization |
| 45 | + |
| 46 | +**Tools NOT called** for 3/20 questions: |
| 47 | +- Q03: "How do I go about submmiting feedback through Jira?" (misspelled) |
| 48 | +- Q05: "How does dual RAiD provide redudancy in an active/passive configeration?" (misspelled) |
| 49 | +- Q13: "As someone managing... UBI-based..." (LONG query, passed in Run 3) |
| 50 | + |
| 51 | +**Why no tools?** |
| 52 | +- The API routing layer decided the query could be answered without RAG |
| 53 | +- Possibly: the questions looked simple or generic |
| 54 | +- Possibly: misspellings made the query too unclear for retrieval |
| 55 | + |
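The split between evaluations that used retrieval and those that bypassed it can be sketched as a small tally. This is only an illustration: the record layout (`question`, `run`, `passed`, `tool_calls`) is a hypothetical schema, not the actual log format, and the three sample records merely mirror cases described in this summary (Q02 failing with `search_portal` called, Q03 failing with no tool calls, Q13 passing in Run 3 with no tool calls).

```python
from collections import Counter

# Hypothetical evaluation records; the field names are assumptions,
# not the real schema of the evaluation logs.
records = [
    {"question": "Q02", "run": 1, "passed": False, "tool_calls": ["search_portal"]},
    {"question": "Q03", "run": 1, "passed": False, "tool_calls": []},
    {"question": "Q13", "run": 3, "passed": True, "tool_calls": []},
]


def classify(record):
    """Label one evaluation by routing (RAG vs. RAG_BYPASS) and outcome."""
    route = "RAG" if record["tool_calls"] else "RAG_BYPASS"
    outcome = "passed" if record["passed"] else "failed"
    return route, outcome


# Count evaluations per (route, outcome) pair.
counts = Counter(classify(r) for r in records)

# Questions with at least one evaluation in which no tool was invoked.
no_tool_questions = sorted({r["question"] for r in records if not r["tool_calls"]})

for (route, outcome), n in sorted(counts.items()):
    print(f"{route:10s} {outcome:6s} {n}")
print("No tool calls:", ", ".join(no_tool_questions))
```

Tallying per evaluation rather than per question keeps cases like Q13, whose outcome varies between runs, from being collapsed into a single label.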
| 56 | +### 4. Context Quality Issues (82.8% of failures) |
| 57 | + |
| 58 | +**Most failures (24/29) HAD contexts retrieved**, suggesting: |
| 59 | +1. Wrong/irrelevant documents retrieved |
| 60 | +2. Retrieved docs had deprecation warnings (see Q02 example) |
| 61 | +3. The LLM couldn't synthesize a correct answer from the provided context |
| 62 | +4. Ground truth mismatch |
| 63 | + |
| 64 | +**Example - Q02** (Failed WITH context): |
| 65 | +- Query: "How can I report Red Hat documentation errors using Jira?" |
| 66 | +- Tools: `search_portal` called successfully |
| 67 | +- Contexts: 7 documents retrieved (with deprecation warnings) |
| 68 | +- Result: Still FAILED - despite having context |
| 69 | + |
| 70 | +## Root Cause Breakdown |
| 71 | + |
| 72 | +| Root Cause | % of Failures | Evidence | |
| 73 | +|------------|---------------|----------| |
| 74 | +| Misspellings | 72.7% | Q03, Q05, Q06, Q08, Q12, Q16, Q19, Q20 | |
| 75 | +| Short queries | 90.9% | SHORT/MEDIUM queries, often combined with misspellings | |
| 76 | +| No RAG invoked | 17.2% | Q03, Q05, Q13 (no tool calls) | |
| 77 | +| Poor context quality | 82.8% | Had context but still failed | |
| 78 | + |
| 79 | +## Specific Failed Questions |
| 80 | + |
| 81 | +### Consistently Failed (All 3 Runs) |
| 82 | +- **Q03**: "submmiting feedback" - No tools called, misspelled |
| 83 | +- **Q05**: "RAiD...redudancy...configeration" - No tools called, multiple misspellings |
| 84 | +- **Q02**: "report... errors using Jira" - Tools called, context retrieved, still failed |
| 85 | +- **Q08**: "pcp-zero-conf package" - Tools called, context retrieved, still failed |
| 86 | + |
| 87 | +### Variable Failures |
| 88 | +- **Q13**: Failed in Run 1 & 2 (no tools), Passed in Run 3 (parametric knowledge) |
| 89 | +- **Q06**: "samba-bgq service" (misspelled; should be "samba-bgqd") |
| 90 | +- **Q12**: "recomendation" (misspelled) |
| 91 | + |
| 92 | +## Hypotheses Validation |
| 93 | + |
| 94 | +### ✅ Hypothesis 1: Query Characteristics |
| 95 | +**CONFIRMED** - Strong correlation between failures and: |
| 96 | +- Misspellings (+50.5 percentage points among failures) |
| 97 | +- Short queries (-57.6 percentage points for LONG queries among failures) |
| 98 | +- Poor grammar |
| 99 | + |
| 100 | +### ⚠️ Hypothesis 2: Missing OKP Docs |
| 101 | +**PARTIALLY CONFIRMED** - But nuanced: |
| 102 | +- **NOT** that docs are missing from OKP Solr |
| 103 | +- Rather: API routing doesn't invoke RAG for certain queries |
| 104 | +- When RAG IS invoked (17/20 questions), documents ARE retrieved |
| 105 | +- Issue is more about: |
| 106 | + 1. **Routing decision** (3 questions never searched) |
| 107 | + 2. **Context quality** (24 of 29 failed evaluations had docs but still failed) |
| 108 | + |
| 109 | +## Recommendations |
| 110 | + |
| 111 | +1. **Improve Spelling Correction** |
| 112 | + - Implement query preprocessing to fix common misspellings |
| 113 | + - Test: "submmiting" → "submitting", "RAiD" → "RAID" |
| 114 | + |
| 115 | +2. **Investigate RAG Routing Logic** |
| 116 | + - Why did Q03 and Q05 not trigger RAG? |
| 117 | + - Are misspellings preventing RAG invocation? |
| 118 | + - Consider lowering threshold for RAG engagement |
| 119 | + |
| 120 | +3. **Context Quality** |
| 121 | + - 82.8% of failures HAD context - why didn't it help? |
| 122 | + - Are retrieved docs relevant? |
| 123 | + - Are deprecation warnings confusing the LLM? |
| 124 | + |
| 125 | +4. **Query Expansion** |
| 126 | + - Short queries may need expansion/rephrasing |
| 127 | + - Test: expand "AD trust FIPS mode, what do?" into a well-formed question |
| 128 | + |
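As a starting point for the spelling-correction recommendation, here is a minimal sketch of query preprocessing built on Python's standard `difflib`. It is not the pipeline's actual implementation: the vocabulary list, the `correct_query` helper, and the 0.8 similarity cutoff are all illustrative assumptions, and a real vocabulary would more plausibly be derived from the indexed documentation.

```python
import difflib
import re

# Hypothetical domain vocabulary with canonical casing. In practice this
# would be built from the document corpus rather than hard-coded.
VOCABULARY = [
    "submitting", "feedback", "redundancy", "configuration",
    "recommendation", "RAID", "Jira",
]
_CANONICAL = {word.lower(): word for word in VOCABULARY}


def correct_token(token: str, cutoff: float = 0.8) -> str:
    """Map a token to its closest vocabulary entry, or return it unchanged."""
    key = token.lower()
    if key in _CANONICAL:
        # Known term: only normalize the casing (e.g. "RAiD" -> "RAID").
        return _CANONICAL[key]
    matches = difflib.get_close_matches(key, list(_CANONICAL), n=1, cutoff=cutoff)
    return _CANONICAL[matches[0]] if matches else token


def correct_query(query: str) -> str:
    """Correct each alphabetic token; punctuation and spacing are preserved."""
    return re.sub(r"[A-Za-z]+", lambda m: correct_token(m.group(0)), query)


if __name__ == "__main__":
    print(correct_query("How do I go about submmiting feedback through Jira?"))
```

Matching against a closed vocabulary with a fairly high cutoff is a deliberate choice here: it leaves unknown words alone instead of guessing, which matters for product names and package identifiers that a general-purpose spell checker might "fix" incorrectly.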
| 129 | +## Files Generated |
| 130 | + |
| 131 | +- `analyze_failures.py` - Query characteristics analysis |
| 132 | +- `analyze_contexts.py` - Context retrieval analysis (RAG_BYPASS) |
| 133 | +- `analyze_tool_calls.py` - Tool invocation analysis |
| 134 | +- `failure_analysis.json` - Detailed failure data |
| 135 | +- `context_analysis.json` - Context availability stats |
| 136 | + |
| 137 | +## Next Steps |
| 138 | + |
| 139 | +1. Test with spelling-corrected queries |
| 140 | +2. Investigate why certain queries bypass RAG |
| 141 | +3. Examine quality of retrieved contexts for failed questions |
| 142 | +4. Compare ground truth vs. actual responses for context-rich failures |