emac-E
diff --git a/generality/ANALYSIS_SUMMARY.md b/generality/ANALYSIS_SUMMARY.md
Lines changed: 142 additions & 0 deletions
diff --git a/generality/analyze_contexts.py b/generality/analyze_contexts.py
Lines changed: 173 additions & 0 deletions
@@ -0,0 +1,142 @@
+# Generality Test Failure Analysis Summary
+
+## Test Overview
+- **Questions**: 20 RHEL 10 benchmark questions
+- **Runs**: 3 consecutive evaluations
+- **Metric analyzed**: custom:answer_correctness
+- **Overall failure rate**: ~50% (about 10/20 per run; 29 of 60 evaluations failed across the 3 runs)
+
+## Key Findings
+
+### 1. Query Characteristics Strongly Predict Failure ⚠️
+
+**Misspellings** (Hypothesis 1a):
+- **72.7%** of failed questions have MISSPELLED queries
+- Only **22.2%** of passed questions have misspellings
+- **+50.5 percentage-point difference** - one of the two strongest predictors of failure
+
+**Query Length** (Hypothesis 1b):
+- Failed: 91% SHORT/MEDIUM queries (avg 86 chars)
+- Passed: 67% LONG queries (avg 235 chars)
+- **-57.6 percentage-point difference** in the share of LONG queries (failed vs. passed)
+
+**Perfect Grammar**:
+- **0%** of failed questions have perfect grammar
+- **33.3%** of passed questions have perfect grammar
+
+### 2. RAG_BYPASS Patterns
+
+**Successful RAG_BYPASS** (Passed without context):
+- 4/31 passed evaluations (12.9%) - Model's parametric knowledge was sufficient
+- Examples: Q13 (UBI containers) - passed WITHOUT context
+
+**Failed RAG_BYPASS** (Failed without context):
+- 5/29 failed evaluations (17.2%) - Parametric knowledge insufficient
+- **Critical finding**: Q03, Q05, Q13 had **NO tool calls logged**
+  - API routing decided not to use RAG at all
+  - Not a "missing docs" issue - tools were never invoked
+
+### 3. Tool Usage Analysis 🔍
+
+**Tools WERE called** for 17/20 questions:
+- `search_portal` (OKP MCP) - used for retrieval
+- `get_document` (OKP MCP) - used for document fetching
+- `mcp_list_tools` - OKP server initialization
+
+**Tools NOT called** for 3/20 questions:
+- Q03: "How do I go about submmiting feedback through Jira?" (misspelled)
+- Q05: "How does dual RAiD provide redudancy in an active/passive configeration?" (misspelled)
+- Q13: "As someone managing... UBI-based..." (LONG query, passed in Run 3)
+
+**Why no tools?**
+- The API routing layer apparently decided the query could be answered without RAG
+- Possibly: Simple/generic questions
+- Possibly: Misspellings made query too unclear for retrieval
+
+### 4. Context Quality Issues (82.8% of failures)
+
+**Most failed evaluations (24/29) HAD contexts retrieved**, suggesting one or more of:
+1. Wrong/irrelevant documents retrieved
+2. Retrieved docs had deprecation warnings (see Q02 example)
+3. LLM couldn't synthesize correct answer from provided context
+4. Ground truth mismatch
+
+**Example - Q02** (Failed WITH context):
+- Query: "How can I report Red Hat documentation errors using Jira?"
+- Tools: `search_portal` called successfully
+- Contexts: 7 documents retrieved (with deprecation warnings)
+- Result: Still FAILED - despite having context
+
+## Root Cause Breakdown
+
+| Root Cause | % of Failures | Evidence |
+|------------|---------------|----------|
+| Misspellings | 72.7% of failed questions (8/11) | Q03, Q05, Q06, Q08, Q12, Q16, Q19, Q20 |
+| Short/medium queries | 90.9% of failed questions | Combined with misspellings |
+| No RAG invoked | 17.2% of failed evaluations (5/29) | Q03, Q05, Q13 (no tool calls) |
+| Poor context quality | 82.8% of failed evaluations (24/29) | Had context but still failed |
+
+## Specific Failed Questions
+
+### Consistently Failed (All 3 Runs)
+- **Q03**: "submmiting feedback" - No tools called, misspelled
+- **Q05**: "RAiD...redudancy...configeration" - No tools called, multiple misspellings
+- **Q02**: "report... errors using Jira" - Tools called, context retrieved, still failed
+- **Q08**: "pcp-zero-conf package" - Tools called, context retrieved, still failed
+
+### Variable Failures
+- **Q13**: Failed in Run 1 & 2 (no tools), Passed in Run 3 (parametric knowledge)
+- **Q06**: "samba-bgq service" (misspelled as bgq, should be bgqd)
+- **Q12**: "recomendation" (misspelled)
+
+## Hypotheses Validation
+
+### ✅ Hypothesis 1: Query Characteristics
+**CONFIRMED** - Strong correlation between failures and:
+- Misspellings (+50.5 percentage points in failures)
+- Short queries (-57.6 percentage points for LONG in failures)
+- Poor grammar
+
+### ⚠️ Hypothesis 2: Missing OKP Docs
+**PARTIALLY CONFIRMED** - But nuanced:
+- **NOT** that docs are missing from OKP Solr
+- Rather: API routing doesn't invoke RAG for certain queries
+- When RAG IS invoked (17/20 questions), documents ARE retrieved
+- Issue is more about:
+  1. **Routing decision** (3 questions never searched)
+  2. **Context quality** (24 failed evaluations had docs but still failed)
+
+## Recommendations
+
+1. **Improve Spelling Correction**
+   - Implement query preprocessing to fix common misspellings
+   - Test: "submmiting" → "submitting", "RAiD" → "RAID"
+
+2. **Investigate RAG Routing Logic**
+   - Why did Q03, Q05 not trigger RAG?
+   - Are misspellings preventing RAG invocation?
+   - Consider lowering threshold for RAG engagement
+
+3. **Context Quality**
+   - 82.8% of failures HAD context - why didn't it help?
+   - Are retrieved docs relevant?
+   - Are deprecation warnings confusing the LLM?
+
+4. **Query Expansion**
+   - Short queries may need expansion/rephrasing
+   - Test: expand "AD trust FIPS mode, what do?" into a fully formed question
+
+## Files Generated
+
+- `analyze_failures.py` - Query characteristics analysis
+- `analyze_contexts.py` - Context retrieval analysis (RAG_BYPASS)
+- `analyze_tool_calls.py` - Tool invocation analysis
+- `failure_analysis.json` - Detailed failure data
+- `context_analysis.json` - Context availability stats
+
+## Next Steps
+
+1. Test with spelling-corrected queries
+2. Investigate why certain queries bypass RAG
+3. Examine quality of retrieved contexts for failed questions
+4. Compare ground truth vs. actual responses for context-rich failures
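The first next step (testing with spelling-corrected queries) could start from something like the sketch below. It is an illustration under stated assumptions, not the project's preprocessing: `DOMAIN_VOCAB`, `ACRONYMS`, and `correct_query` are hypothetical names, the vocabulary is a hand-picked sample, and the standard-library `difflib.get_close_matches` stands in for whatever spell checker is eventually chosen.

```python
import difflib
import re

# Hypothetical sample vocabulary; a real list would be derived from the corpus.
DOMAIN_VOCAB = ["submitting", "redundancy", "configuration", "recommendation", "feedback"]
# Assumed acronym table, used to normalize odd casing such as "RAiD".
ACRONYMS = {"raid": "RAID", "fips": "FIPS"}


def correct_query(query: str, cutoff: float = 0.8) -> str:
    """Replace near-miss words with their closest vocabulary entry.

    Words that are not close to any vocabulary entry are left untouched.
    Corrected words come back lowercase, which is a simplification.
    """

    def fix(match: re.Match) -> str:
        word = match.group(0)
        lower = word.lower()
        if lower in ACRONYMS:
            return ACRONYMS[lower]
        if lower in DOMAIN_VOCAB:
            return word
        close = difflib.get_close_matches(lower, DOMAIN_VOCAB, n=1, cutoff=cutoff)
        return close[0] if close else word

    # Only runs of letters are rewritten; punctuation and spacing are preserved.
    return re.sub(r"[A-Za-z]+", fix, query)


if __name__ == "__main__":
    print(correct_query("How do I go about submmiting feedback through Jira?"))
    print(correct_query(
        "How does dual RAiD provide redudancy in an active/passive configeration?"
    ))
```

A high `cutoff` keeps the corrector conservative, so ordinary words such as "through" or "passive" are not rewritten. Whether corrected queries actually change the pass rate still has to be measured by re-running the evaluation.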
@@ -0,0 +1,173 @@
+#!/usr/bin/env python3
+"""
+Analyze whether contexts are being returned from OKP MCP.
+
+If contexts are empty/missing, it suggests either:
+1. OKP MCP is not being called
+2. OKP database doesn't have relevant documents
+3. Query routing/retrieval is failing
+"""
+
+import json
+from pathlib import Path
+
+import pandas as pd
+
+
+def analyze_contexts(csv_path: Path, run_name: str = ""):
+    """Analyze contexts in evaluation results."""
+
+    # Load the CSV
+    df = pd.read_csv(csv_path)
+
+    # Filter for answer_correctness metric
+    ac_df = df[df['metric_identifier'] == 'custom:answer_correctness'].copy()
+
+    # Check if contexts is null/empty
+    ac_df['has_context'] = ac_df['contexts'].notna() & (ac_df['contexts'] != '') & (ac_df['contexts'] != '[]')
+
+    print("=" * 80)
+    print(f"CONTEXTS ANALYSIS {run_name}")
+    print("=" * 80)
+
+    print(f"\nTotal answer_correctness evaluations: {len(ac_df)}")
+    print(f"With contexts: {ac_df['has_context'].sum()}")
+    print(f"Without contexts: {(~ac_df['has_context']).sum()}")
+
+    # Break down by pass/fail
+    print("\n" + "=" * 80)
+    print("CONTEXTS BY RESULT")
+    print("=" * 80)
+
+    results_summary = {}
+    for result in ['PASS', 'FAIL']:
+        subset = ac_df[ac_df['result'] == result]
+        with_ctx = subset['has_context'].sum()
+        total = len(subset)
+
+        results_summary[result] = {
+            'total': total,
+            'with_contexts': with_ctx,
+            'without_contexts': total - with_ctx,
+        }
+
+        print(f"\n{result}:")
+        print(f"  Total: {total}")
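+        # Guard against a result with no rows (avoids ZeroDivisionError)
+        pct_with = 100 * with_ctx / total if total else 0.0
+        pct_without = 100 * (total - with_ctx) / total if total else 0.0
+        print(f"  With contexts: {with_ctx} ({pct_with:.1f}%)")
+        print(f"  Without contexts: {total - with_ctx} ({pct_without:.1f}%)")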
+
+    # Show sample contexts for failed questions
+    print("\n" + "=" * 80)
+    print("SAMPLE FAILED QUESTIONS - CONTEXT CHECK")
+    print("=" * 80)
+
+    failed = ac_df[ac_df['result'] == 'FAIL']
+    for _, row in failed.head(5).iterrows():
+        conv_id = str(row['conversation_group_id'])
+        # Extract question number
+        q_num = conv_id.split('_q')[-1] if '_q' in conv_id else '?'
+
+        print(f"\nQ{q_num} - {conv_id} (FAIL):")
+        query = str(row['query'])  # str() tolerates a missing (NaN) query
+        print(f"  Query: {query[:100]}{'...' if len(query) > 100 else ''}")
+
+        ctx = row['contexts']
+        if pd.isna(ctx) or ctx == '' or ctx == '[]':
+            print(f"  Contexts: ❌ EMPTY/NULL - No docs retrieved!")
+        else:
+            # Try to parse as JSON to see structure
+            try:
+                ctx_data = json.loads(ctx) if isinstance(ctx, str) else ctx
+                if isinstance(ctx_data, list):
+                    print(f"  Contexts: ✓ {len(ctx_data)} documents retrieved")
+                    if ctx_data:
+                        # Show first doc preview
+                        first_doc = str(ctx_data[0])[:200]
+                        print(f"  Preview: {first_doc}...")
+                else:
+                    print(f"  Contexts: ? Unexpected type: {type(ctx_data)}")
+            except json.JSONDecodeError as e:
+                print(f"  Contexts: ? Parse error: {e}")
+                print(f"  Raw: {str(ctx)[:100]}...")
+
+    return results_summary
+
+
+def main():
+    """Analyze contexts across all runs."""
+    current_dir = Path.cwd()
+    if current_dir.name != 'generality':
+        current_dir = Path(__file__).parent
+
+    all_results = {}
+
+    # Analyze each run
+    for run_num in [1, 2, 3]:
+        run_dir = current_dir / f"run{run_num}"
+        # Sort so the chosen file is deterministic when several CSVs match
+        csv_files = sorted(run_dir.glob("*_detailed.csv"))
+
+        if csv_files:
+            print(f"\n{'='*80}")
+            print(f"RUN {run_num}")
+            print(f"{'='*80}")
+            results = analyze_contexts(csv_files[0], f"- Run {run_num}")
+            all_results[f"run{run_num}"] = results
+
+    # Summary across all runs
+    print("\n" + "=" * 80)
+    print("SUMMARY ACROSS ALL RUNS")
+    print("=" * 80)
+
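+    total_fails = sum(r['FAIL']['total'] for r in all_results.values())
+    total_fails_no_ctx = sum(r['FAIL']['without_contexts'] for r in all_results.values())
+
+    total_pass = sum(r['PASS']['total'] for r in all_results.values())
+    total_pass_no_ctx = sum(r['PASS']['without_contexts'] for r in all_results.values())
+
+    # The percentages below divide by these totals; bail out if either is zero
+    # (for example, when no run directory contained a *_detailed.csv file).
+    if total_fails == 0 or total_pass == 0:
+        print("\nNot enough PASS/FAIL rows to summarize across runs.")
+        return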
+
+    print(f"\nFAILED questions:")
+    print(f"  Total: {total_fails}")
+    print(f"  Without contexts: {total_fails_no_ctx} ({100*total_fails_no_ctx/total_fails:.1f}%)")
+
+    print(f"\nPASSED questions:")
+    print(f"  Total: {total_pass}")
+    print(f"  Without contexts: {total_pass_no_ctx} ({100*total_pass_no_ctx/total_pass:.1f}%)")
+
+    print("\n" + "=" * 80)
+    print("CONCLUSIONS")
+    print("=" * 80)
+
+    print(f"\n📊 RAG_BYPASS Analysis:")
+    print(f"   PASSED without context (successful RAG_BYPASS): {total_pass_no_ctx}/{total_pass} ({100*total_pass_no_ctx/total_pass:.1f}%)")
+    print(f"   - Model used parametric knowledge successfully")
+
+    print(f"\n   FAILED without context (failed RAG_BYPASS): {total_fails_no_ctx}/{total_fails} ({100*total_fails_no_ctx/total_fails:.1f}%)")
+    print("   - No context returned (retrieval failed or was never invoked), AND parametric knowledge insufficient")
+
+    if total_fails_no_ctx > 0:
+        print("\n⚠️  Failed questions with NO contexts suggest:")
+        print("   - Documents don't exist in OKP Solr database")
+        print("   - Query routing/retrieval is failing")
+        print("   - Misspellings preventing successful search")
+
+    fails_with_ctx = total_fails - total_fails_no_ctx
+    if fails_with_ctx > 0:
+        print(f"\n⚠️  Failed questions WITH contexts ({fails_with_ctx}/{total_fails}, {100*fails_with_ctx/total_fails:.1f}%):")
+        print("   Issue is likely:")
+        print("   - Context quality/relevance (wrong docs retrieved)")
+        print("   - LLM answer generation from provided context")
+        print("   - Ground truth mismatch")
+
+    # Save results (convert int64 to int for JSON serialization)
+    output_file = current_dir / "context_analysis.json"
+    serializable_results = {}
+    for run, data in all_results.items():
+        serializable_results[run] = {}
+        for result, stats in data.items():
+            serializable_results[run][result] = {k: int(v) for k, v in stats.items()}
+
+    with open(output_file, 'w') as f:
+        json.dump(serializable_results, f, indent=2)
+
+    print(f"\n✓ Detailed results saved to: {output_file}")
+
+
+if __name__ == "__main__":
+    main()
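The cross-run percentages printed by `main()` can be checked without pandas or the run CSVs. The sketch below is a standalone illustration, not part of the script: `safe_pct` and `bypass_summary` are hypothetical helpers, and the input folds the totals reported in `ANALYSIS_SUMMARY.md` (31 passes, 29 failures) into a single assumed `"combined"` entry, because the per-run breakdown is not shown there.

```python
from typing import Dict

Stats = Dict[str, int]


def safe_pct(part: int, total: int) -> float:
    """Percentage of part in total; returns 0.0 when total is 0 instead of raising."""
    return 100.0 * part / total if total else 0.0


def bypass_summary(all_results: Dict[str, Dict[str, Stats]]) -> Dict[str, float]:
    """Aggregate per-run stats (context_analysis.json layout) into percentages."""
    totals = {"PASS": 0, "FAIL": 0}
    no_ctx = {"PASS": 0, "FAIL": 0}
    for run_stats in all_results.values():
        for result in totals:
            stats = run_stats.get(result, {})
            totals[result] += stats.get("total", 0)
            no_ctx[result] += stats.get("without_contexts", 0)
    return {
        "pass_without_ctx_pct": safe_pct(no_ctx["PASS"], totals["PASS"]),
        "fail_without_ctx_pct": safe_pct(no_ctx["FAIL"], totals["FAIL"]),
        "fail_with_ctx_pct": safe_pct(totals["FAIL"] - no_ctx["FAIL"], totals["FAIL"]),
    }


# Reported totals, placed under one assumed key; real data has run1/run2/run3.
reported = {
    "combined": {
        "PASS": {"total": 31, "with_contexts": 27, "without_contexts": 4},
        "FAIL": {"total": 29, "with_contexts": 24, "without_contexts": 5},
    }
}

if __name__ == "__main__":
    summary = bypass_summary(reported)
    print({name: round(value, 1) for name, value in summary.items()})
```

Routing every ratio through one zero-safe helper also means an empty input (no CSVs found) yields zeros rather than a `ZeroDivisionError`.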