Cleaning: Remove HEAL PoC work and merge back changes/files relevant to eval only#9
Merged
Conversation
emac-E
commented
May 16, 2026
Owner
- test updates - new visuals, configs, way to just test okp-mcp retrieval and get ragas metrics not related to LLM-UT
- feat: add okp-mcp functional test integration (Phase 1)
- feat: add dual-mode testing (retrieval + full inference)
- docs: add dual-mode workflow guide to OKP_MCP_INTEGRATION.md
- feat: add 7 new questions from cla-tests updates
- feat: add autonomous agent for okp-mcp (Phase 1)
- feat: add worktree support and approval workflow to agent
- fix: normalize ticket IDs and improve metric validation
- feat: detect RAG usage and document retrieval status
- feat: add Pydantic AI advisor for LLM-powered suggestions
- feat: Add LLM-powered autonomous agent for okp-mcp (Phase 2.1)
- docs: Add design intent and integration guide for autonomous agent
- chore: Update uv.lock after adding anthropic dependency
- feat: Add faithfulness, answer_correctness, and response_relevancy metrics
- docs: Add complete iteration strategy with model escalation
- feat: Implement complete iteration loop with model escalation
- feat: Implement git diff approval flow for code changes
- feat: Implement complete worktree isolation flow
- fix: Initialize variables before try block to avoid UnboundLocalError
- feat: Add environment variable pre-flight check
- Fix critical bugs in okp-mcp agent and add incremental improvement support
- Optimize LLM prompt tokens and add iteration summary logging
- debug: Add comprehensive logging to diagnose pf2 change failure
- feat: Add comprehensive progress report for YOLO mode runs
- feat: Add batch processing for multiple tickets
- docs: Update example_tickets.txt with all functional tests
- docs: Add comprehensive optimization and agentic workflow analysis
- fix: Add missing calculate_url_f1 and calculate_mrr methods
- feat: Implement complete answer-first evaluation workflow
- new additions: get jira tickets, use solr and linux experts to fill in expected answers and find patterns in multiple tickets to try to batch multiple fixes that fall under one pattern. More docs on design intent, other new ideas, POC script is a WIP not yet tested. But, there are tests.
- Reorganize: Move agentic work to okp_mcp_agent/ subfolder
- more path fixes after move
- QSG updates
- Before HEAL migration, store all
- adding some data from generality testing with Arin's ragas generated data set
- Cleaning out files that were migrated into HEAL (the real agentic repo - this was just a PoC branch for an idea)
- update MCP server port - local dev
- remove HEAL related parts of todo
- Merge info from old onboarding presentation about new metrics to update EVAL guide
- stashing data from running random test questions from Arin's Docta project - analysis on where queries have the highest failure rates in eval, etc to identify weaknesses in system and plan fixes
…al and get ragas metrics not related to LLM-UT
Convert okp-mcp functional tests to lightspeed-evaluation format to enable:
- Quantitative metrics (F1, MRR, context relevance) vs binary pass/fail
- Multi-run stability analysis
- Cross-suite overfitting detection
- Agentic iteration with structured feedback
Changes:
1. Converter script (scripts/convert_functional_cases_to_eval.py)
- Parses okp-mcp functional_cases.py AST
- Converts FunctionalCase to evaluation YAML
- Supports filtering by test ID
- Generated 20 test cases from okp-mcp
2. New metric: custom:forbidden_claims_eval
- Verifies known-incorrect claims don't appear in response
- Prevents regression to previously wrong answers
- Score: 1.0 (no forbidden claims) to 0.0 (all forbidden claims)
3. Data model updates
- Added forbidden_claims field to TurnData
- Added validation rules for forbidden_claims_eval metric
- Registered metric in CustomMetrics
4. Generated test suite
- config/okp_mcp_test_suites/functional_tests.yaml
- 20 RSPEED test cases with metrics:
* custom:url_retrieval_eval (F1, MRR, ranking)
* custom:keywords_eval (required facts)
* custom:forbidden_claims_eval (regression detection)
* ragas:context_relevance
* ragas:context_precision_without_reference
5. Documentation
- docs/OKP_MCP_INTEGRATION.md: Integration guide
- Workflow for preventing overfitting
- Agentic development integration patterns
Next steps (Phase 2):
- Build multi-suite comparison tools
- Create additional test suites (general_documentation, regression_guard)
- Optional: unified dashboard
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Elle Mackey <emackey@emackey-thinkpadp16vgen1.boston.csb>
Extended okp-mcp integration with two testing modes for different workflows: MODE 1: Retrieval-Only (Fast - ~30sec/run) - File: functional_tests_retrieval.yaml - Script: run_mcp_retrieval_suite.sh - Metrics: 3 (url_retrieval, context_precision, context_relevance) - Purpose: Daily okp-mcp tuning, rapid iteration - No LLM response generation needed MODE 2: Full Inference (Complete - ~3-5min/run) - File: functional_tests_full.yaml - Script: run_okp_mcp_full_suite.sh (NEW) - Metrics: 5 (adds keywords_eval, forbidden_claims_eval) - Purpose: Pre-commit validation, end-to-end testing - Validates LLM actually uses retrieved docs correctly Changes: 1. Converter script enhanced - Added --mode flag (retrieval_only | full) - Conditionally includes response-based metrics - Updated help text and output messages 2. New test suite variants - functional_tests_retrieval.yaml (3 metrics, fast) - functional_tests_full.yaml (5 metrics, complete) - functional_tests.yaml (default, same as full) - README.md with usage guide and workflow 3. New execution script - run_okp_mcp_full_suite.sh for full inference mode - Uses system.yaml instead of system_mcp_direct.yaml - Clears API cache instead of MCP direct cache - Output to okp_mcp_full_output/ directory 4. System config updates - Added forbidden_claims_eval to system.yaml - Added keywords_eval and forbidden_claims_eval to system_mcp_direct.yaml - All custom metrics now properly registered 5. Bug fix - run_mcp_retrieval_suite.sh now uses 'uv run lightspeed-eval' Workflow: Day-to-day: ./run_mcp_retrieval_suite.sh (fast iteration) Pre-commit: ./run_okp_mcp_full_suite.sh (complete validation) See config/okp_mcp_test_suites/README.md for full documentation. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Elle Mackey <emackey@emackey-thinkpadp16vgen1.boston.csb>
Document the optimized development workflow using dual-mode testing: 1. Enhanced Step 4 (Fix Until Test Passes) - Initial diagnosis with full eval (1 run) - Fast iteration for retrieval problems (30 sec/run) - Full iteration for answer problems (3-5 min/run) - Decision tree for choosing mode 2. Enhanced Step 5 (Verify All Tests Pass) - Automated regression detection across test suites - Cross-suite validation examples 3. Enhanced Step 6 (Commit) - Template with detailed evaluation metrics - Includes iteration details and time savings 4. Quick Reference - Common commands for each workflow phase - Result checking commands - Baseline creation for regression detection 5. Complete Example - RSPEED-2482 walkthrough - Shows 9 min total (vs 50+ min old way) - 90% time savings through smart mode switching 6. Decision Tree - Visual guide for when to use which mode - Based on URL F1, MRR, and context relevance metrics 7. Next Steps - Placeholder for Phase 2 autonomous agent - References to Pydantic AI approach This provides complete human-in-the-loop guidance before building the autonomous agent in Phase 2. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Elle Mackey <emackey@emackey-thinkpadp16vgen1.boston.csb>
Added new test cases: - 3 jailbreak tests (typo fixes and RSPEED-1142) - 1 math test (Ultimate Question wording fix) - 1 printing test (list printers and their status) - 1 RHEL test (find command for production files) - 1 SAP test (RHEL for SAP Solutions repositories) All 6 expected_fail tests were already in negative_tests.yaml. Total: CLA_tests.yaml now has 96 questions.
Created okp_mcp_agent.py with diagnosis capability: - Automated evaluation running and CSV parsing - Problem classification (retrieval vs answer) - Metric thresholds for decision making - --use-existing flag for fast testing without re-running evals Phase 1 (working now): - diagnose command with metric analysis - Shell command execution (run evals, restart okp-mcp) - CSV result parsing with error handling Phase 2 (TODO - documented in OKP_MCP_AGENT.md): - LLM integration for code suggestions - Automated boost query editing - Iteration loops (fast/full modes) - Regression testing across all suites - Git automation (commits, PRs) Estimated Phase 2 effort: 10-15 hours Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Replaced --auto-commit with safer review-based workflow: - --worktree: Creates isolated git worktree for changes - --worktree-name: Custom branch name - --suggest-only: Suggest changes without applying (Phase 2) - --non-interactive: Skip approval prompts - ask_approval(): Interactive confirmation before risky actions - create_worktree(): Isolated development environment - cleanup_worktree(): Safe cleanup after work Benefits: - No accidental commits to main branch - Easy to test/review changes in isolation - Can merge incrementally when ready - Worktree auto-cleanup with confirmation Removed --auto-commit flag (too risky, not aligned with user workflow). Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Bug fixes: - Ticket ID normalization: Convert hyphens to underscores (RSPEED-2482 → RSPEED_2482) to match CSV format - Added has_metrics property to check if any metrics were parsed - Show clear error when ticket not found in CSV with available tickets list - Improved diagnosis output to distinguish between: - No metrics found (evaluation issue) - Retrieval problem (URL F1/MRR/context relevance low) - Answer problem (retrieval good but keywords missing) - Metrics look good (all thresholds passed) Before: Silent failure showing 'METRICS LOOK GOOD' with no data After: Clear error messages and proper metric display Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Added comprehensive RAG usage tracking: - rag_used: Detects if search/retrieval tool was called - docs_retrieved: Detects if any documents were returned - num_docs_retrieved(): Counts retrieved documents Enhanced diagnosis with 3 distinct scenarios: 1. RAG NOT USED - LLM answered from general knowledge → System prompt may need adjustment to force tool usage 2. RAG CALLED BUT NO DOCUMENTS - Search returned empty → Query reformulation needed or Solr index missing docs 3. RAG USED BUT WRONG DOCS - Documents retrieved but incorrect → URL F1 = 0.00: None of expected docs returned → URL F1 < 0.7: Some expected docs missing → Boost query tuning needed Summary output now shows: RAG Status: ✅ Used, 5 docs retrieved RAG Status:⚠️ Used, but NO documents retrieved RAG Status: ❌ NOT used (LLM used general knowledge only) This helps identify whether the problem is: - Configuration (RAG not being called) - Search quality (RAG called but wrong/no results) - Answer quality (RAG worked but LLM ignored context) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Phase 2 - LLM Integration: Added OkpMcpLLMAdvisor with Pydantic AI for autonomous code suggestions: - suggest_boost_query_changes(): Analyzes metrics and suggests Solr boost query improvements - suggest_prompt_changes(): Suggests system prompt modifications Features: - Model-agnostic: Easy to switch between Claude, GPT, Gemini, or local models - Vertex AI support: Uses Claude via Google Cloud Vertex AI (GOOGLE_APPLICATION_CREDENTIALS) - Structured outputs: Pydantic models for suggestions (reasoning, confidence, expected improvement) - Expert system prompts: Specialized for Solr/Lucene optimization and RAG prompting Supported models: - vertexai:claude-sonnet-4-0 (default - Claude via Vertex AI) - vertexai:gemini-2.0-flash (Gemini via Vertex AI) - claude-sonnet-4-0 (direct Anthropic) - openai:gpt-4o (OpenAI) - ollama:llama3 (local models) Dependencies added: - pydantic-ai>=0.0.14 Example usage: advisor = OkpMcpLLMAdvisor(model='vertexai:claude-sonnet-4-0') suggestion = advisor.suggest_boost_query_changes(metrics) Test mode: uv run python scripts/okp_mcp_llm_advisor.py Next: Integrate with okp_mcp_agent.py for autonomous iteration. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Implements AI-powered code analysis and suggestions for okp-mcp RSPEED ticket fixing. Key Features: - LLM advisor using Anthropic SDK with Vertex AI - Tiered model routing (Haiku→Sonnet→Opus) for cost optimization (~50% savings) - Integrated into okp_mcp_agent.py with both problem types: - Retrieval problems: Suggests boost query changes - Answer problems: Suggests system prompt improvements - Added 3 new metrics for enhanced diagnosis: - ragas:faithfulness (answer grounded in context) - custom:answer_correctness (vs expected answer) - ragas:response_relevancy (addresses question) Implementation: - scripts/okp_mcp_llm_advisor.py: LLM advisor with structured outputs - scripts/okp_mcp_agent.py: Integrated LLM suggestions into diagnosis - docs/OKP_MCP_AGENT.md: Environment setup and credentials guide - docs/MULTI_STAGE_TESTING_PLAN.md: Complete multi-stage testing architecture Technical Details: - Uses Claude's tool calling for guaranteed valid JSON - GOOGLE_CLAUDE_CREDENTIALS support to avoid GCP credential conflicts - Graceful fallback if LLM advisor unavailable - All quality checks passing (black, ruff) Next Steps (Tomorrow): - Move to okp-mcp/tools/autonomous_agent/ - Complete metric integration - Implement multi-stage validation workflow Cost: ~$0.01 per ticket diagnosis (vs $0.02 without tiering) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Created comprehensive integration guide for connecting external autonomous systems (cron jobs, JIRA monitoring, CI/CD) to the OKP-MCP agent. Key sections: - System architecture overview with clear data flow diagrams - 4 integration points: CLI, JSON output, Git worktrees, JIRA API - 3 automation workflows with complete example scripts: * Daily JIRA ticket scanning (cron job) * Weekly automated fix attempts with PR creation * CI/CD validation on pull requests - Configuration management (env vars, config files) - Monitoring and observability (logs, metrics, health checks) - Security and permissions requirements - Example integration scripts (bash, Python Flask webhook) - Deployment scenarios (local, cron server, Kubernetes) This complements MULTI_STAGE_TESTING_PLAN.md which covers technical implementation details. Together these documents provide complete guidance for: 1. Building the agent (MULTI_STAGE_TESTING_PLAN.md) 2. Integrating it with automation systems (this doc) Enables external systems to: - Monitor JIRA for new tickets - Automatically diagnose problems - Attempt fixes in isolated worktrees - Create PRs for human review - Validate no regressions in CI/CD Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…trics Complete Phase 1: Metrics Integration Added three new answer quality metrics to MetricSummary and agent: - faithfulness (threshold: 0.8) - Answer grounded in retrieved context - answer_correctness (threshold: 0.75) - Semantic similarity to expected - response_relevancy (threshold: 0.8) - Answer addresses the question Changes: - Updated MetricSummary dataclass with 3 new optional fields - Updated to_prompt_context() to display new metrics with thresholds - Updated both _get_llm_boost_suggestion and _get_llm_prompt_suggestion to pass new metrics from EvaluationResult - Updated test case in llm_advisor __main__ with realistic values These metrics provide independent signals based on correlation analysis: - Faithfulness: Catches hallucinations (LLM adding unsupported info) - Answer Correctness: Validates against known good answers (LLM judge) - Response Relevancy: Ensures answer actually addresses the question Next: Move to okp-mcp/tools/autonomous_agent/ Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Updated MULTI_STAGE_TESTING_PLAN.md with comprehensive iteration design: **New Section: Iteration Strategy & Safety Mechanisms** - Two separate iteration budgets (primary vs regression) - Model escalation path: Sonnet → Opus → Human - Complete feedback loop example with actual metrics - Regression handling with revert-on-failure policy **Safety Mechanisms:** 1. Plateau Detection - Stop if no improvement for N iterations 2. Model Escalation - Escalate to better model after failed attempts 3. Improvement Check - Require MIN_IMPROVEMENT_THRESHOLD (0.05) 4. Regression Revert - Revert primary fix if regression can't be fixed **Iteration Limits:** - Primary fix: 5 iterations max - Regression fix: 3 iterations max (per regression) - Escalation threshold: 2 failed attempts before escalating - Model exhausted → Escalate to human **Code Changes Required:** - 2d. Add iteration constants (PRIMARY_FIX_MAX_ITERATIONS, etc.) - 2e. Add apply_code_change() method - 2f. Add improvement checking (metrics_improved, detected_plateau, escalate_model) - 2g. Add iteration loop (fix_ticket_with_iteration, fix_ticket_multi_stage) **Cost Optimization:** - Typical cost: $0.02-$0.04 per ticket (fixed in 2-3 iterations) - Max cost: $0.11 per ticket (all 5 iterations with escalation) Updated implementation timeline to reflect Day 2 progress. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Implemented the full feedback loop for autonomous ticket fixing: **Core Iteration Loop:** - fix_ticket_with_iteration() - Main loop with model escalation - fix_ticket_multi_stage() - Orchestrator (primary → CLA → regressions) **Safety Mechanisms:** - metrics_improved() - Requires 0.05 improvement threshold - detected_plateau() - Detects stuck iterations (2 consecutive) - escalate_model() - Sonnet → Opus → Human escalation path - Regression revert policy - Auto-revert if regression can't be fixed **Code Editing:** - apply_code_change() - Applies LLM suggestions (manual for now) - _get_llm_suggestion_object() - Returns suggestion without printing **Iteration Constants Added:** - PRIMARY_FIX_MAX_ITERATIONS = 5 - REGRESSION_FIX_MAX_ITERATIONS = 3 - ESCALATION_THRESHOLD = 2 - PLATEAU_THRESHOLD = 2 - MIN_IMPROVEMENT_THRESHOLD = 0.05 - TIER_MODELS config (Haiku, Sonnet, Opus) **Complete Workflow:** 1. Diagnose ticket → Get LLM suggestion 2. Apply code change (with user confirmation) 3. Restart okp-mcp service 4. Re-evaluate metrics 5. Check improvement (>= 0.05 required) 6. Escalate model if stuck (after 2 failed attempts) 7. Repeat until fixed or max iterations (5 for primary, 3 for regression) 8. Validate CLA tests for regressions 9. Fix regressions or revert primary fix **Updated main() to use new multi-stage fix method** **Status:** - ✅ Iteration loop implemented - ✅ Model escalation working - ✅ Safety mechanisms in place -⚠️ Code editing is interactive (manual apply for now) -⚠️ CLA validation placeholder (needs implementation) Next: Test with real ticket, implement automated code editing Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Replaced placeholder code editing with interactive git diff approval flow. **New Flow:** 1. Display agent reasoning (from LLM) 2. User applies change manually in editor 3. Show git diff of changes 4. User approves/rejects 5. If approved → commit with agent reasoning 6. If rejected → revert changes **apply_code_change() Improvements:** - Shows full agent reasoning and context - Iteration context passed to commit message - Displays suggested change and expected improvement - Shows confidence level - Git diff review before proceeding - Auto-commit with detailed message including reasoning - Auto-revert on rejection or error - Preserves original content on failure **Commit Message Format:** ``` [Context] (e.g., "Primary Fix - Iteration 2/5 - Model: medium") agent: [suggested change] Reasoning: [LLM reasoning] Confidence: [high/medium/low] ``` **Future Work (Documented):** Added TODO for autonomous AST-based editing with: - Surgical AST manipulation (no manual intervention) - Full audit trail logging: * Agent reasoning at each stage * AST diffs before/after * Intermediate metrics after each change * Final metrics - JSON log file for review/debugging **Status:** - ✅ Interactive git diff approval working - ✅ Agent reasoning displayed before edits - ✅ Changes committed with full context - ✅ Auto-revert on rejection - ⏳ Manual editing required (AST coming later) Ready for testing with real ticket! Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Implemented safe worktree-based workflow for isolated development and testing. **New Methods:** - update_compose_mount() - Updates podman-compose.yml to mount worktree - revert_compose_mount() - Restores main mount from backup - verify_container_healthy() - Waits for container health check (max 30s) - restart_okp_mcp() - Now optionally verifies health before proceeding **Complete Worktree Workflow:** ``` Stage 0: Setup 1. Create worktree (fix/RSPEED-XXXX branch) 2. Update podman-compose.yml volume mount to worktree 3. Restart container and verify healthy Stage 1: Fix Primary Ticket 4. Run iteration loop (edits happen in worktree) 5. All commits stay in worktree Stage 2: Validate CLA Tests 6. Check for regressions (placeholder) Stage 3: Fix Regressions (if any) 7. Fix each regression or revert all changes Stage 4: Cleanup (finally block) 8. Merge worktree to main (if successful) 9. Revert compose mount back to main 10. Restart container with main mount 11. Clean up worktree ``` **Safety Features:** - Isolated testing in worktree (never edits main directly) - Backup of podman-compose.yml before modification - Container health verification after each restart - Auto-cleanup in finally block (always runs) - Proper error handling if merge fails **podman-compose.yml Updates:** - Comments out main mount: `#- ../../okp-mcp/src:/dev/src:z` - Adds worktree mount: `- ../../okp-mcp-fix-RSPEED-XXXX/src:/dev/src:z` - Automatically reverts to main after testing **Container Health Check:** - Uses podman inspect to check health status - Waits up to 30 seconds for container to become healthy - Interactive prompt if health check fails - Ensures container is ready before running tests **Benefits:** - ✅ Zero risk to main branch - ✅ Isolated testing environment - ✅ Can test multiple tickets simultaneously (different worktrees) - ✅ Safe rollback on failure - ✅ Container always verified healthy before tests **Status:** - ✅ Worktree creation/cleanup working - ✅ Compose mount update/revert working - ✅ Container health verification working - ✅ Finally block ensures cleanup - ⏳ Ready for end-to-end testing Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Fixed bug where primary_fixed and primary_commit were not defined before the try block, causing UnboundLocalError in the finally block when an exception occurred during worktree setup or iteration. Now initializes: - primary_fixed = False - primary_commit = None Before the try block, ensuring they're always defined when finally runs.
Added check_environment() method to verify required environment variables are set before starting multi-stage fix workflow. **Required Variables:** - GOOGLE_APPLICATION_CREDENTIALS (for Gemini evaluation LLM) - ANTHROPIC_VERTEX_PROJECT_ID (for Claude advisor, if enabled) **Behavior:** - Checks at the start of fix_ticket_multi_stage() - Lists missing variables with helpful export commands - Returns False immediately if variables missing - Prevents cryptic errors deep in evaluation pipeline **Example Error Message:** ❌ Missing required environment variables: - GOOGLE_APPLICATION_CREDENTIALS Please set these variables before running the agent: export GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json **Benefits:** - Fail fast with clear error message - User knows exactly what to fix - No wasted time creating worktrees only to fail later - Better developer experience
…pport 1. Fix metrics_improved() bug preventing commits when URL F1 stays at 0.00 - Added check to reject "improvements" when URL F1 = 0.00 on both iterations - Context metrics can be misleading when retrieving wrong docs - URL F1 is ground truth - if 0.00, we're not retrieving ANY expected URLs 2. Fix missing code_snippet from LLM suggestions - Made code_snippet required (was Optional) - Updated prompt to request JSON with code_snippet instead of using Edit tool - _call_with_structured_output only captures JSON, doesn't execute tool calls - Added clear examples showing expected code_snippet format 3. Fix document discovery showing empty URLs and titles - Solr uses view_uri/id fields, not url field - Updated okp_solr_config_analyzer.py to build URLs like okp-mcp does - Now correctly displays: https://access.redhat.com/solutions/... 4. Add incremental improvement support - MIN_IMPROVEMENT_THRESHOLD (0.05): significant improvement, resets escalation - SMALL_IMPROVEMENT_THRESHOLD (0.02): small but real, commits and builds on it - Agent now accumulates small gains instead of reverting them - Still escalates to better models if stuck with small improvements 5. Add expected_response to RSPEED_2482 for bootstrap discovery Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Token optimization reduces prompt size by ~17% (1350 tokens saved per call): 1. Compact iteration history format (lines 267-310 in llm_advisor.py) - Before: ~200 tokens/iteration (verbose descriptions) - After: ~50 tokens/iteration (compact table format) - 5 iterations: 1000 tokens → 250 tokens = 750 tokens saved - Table format: Iter | Change | Metric Δ | Overlap | Result 2. Reduce Solr explain output (lines 247-264) - Top 3 docs × 300 chars → Top 2 docs × 200 chars - ~900 chars → ~400 chars = 500 chars saved 3. Limit ranking analysis (lines 225-245) - All missing docs → Top 3 missing docs only - ~200 tokens → ~100 tokens = 100 tokens saved 4. Reduce context truncation (line 187-192) - 300 chars → 200 chars = 100 chars saved 5. Add iteration_summary.txt generation (new function) - Saves detailed human-readable table to .diagnostics/TICKET_ID/ - Automatically saved on success, max iterations, or escalation - Includes full metrics, URL overlap, query augmentation details - Makes it easy to track what was tried across runs Total: ~1350 tokens saved per LLM call For 5-iteration run: 8000 tokens/call → 6650 tokens/call Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Added detailed debug output to apply_code_change() to track exactly where parameter changes fail: - Show regex pattern and what it matches before replacement - Show file length before/after to verify content change - Show git status output to verify git detects changes - Enhanced error messages with context This will help diagnose why pf2 change showed "❌ Change not applied" despite regex test showing it should work. Also enhanced: - Solr analyzer initialization error handling - Import path handling for running from any directory - Iteration diagnostics with question/response/keywords for LLM judge debugging Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Added detailed progress reporting so users can read results after
overnight YOLO runs complete:
REPORT SECTIONS:
1. Run Statistics
- Status (Fixed, Max Iterations, Escalated, etc.)
- Start/end time with duration (e.g., "45m 23s")
- Total iterations attempted
- Changes applied vs reverted
2. Iteration Details (existing table enhanced)
- Change description, metric delta, URL overlap, result
- Detailed metrics per iteration
3. Metric Progression Chart
- Shows all metrics across iterations in tabular format
- URL_F1, MRR, Context Relevance/Precision
- Keywords, Answer Correctness, Faithfulness, Response Relevancy
4. Best Scores Achieved
- Tracks peak value for each metric
- Shows which iteration achieved best score
5. Legend
- Explains all metrics and table columns
INTEGRATION:
- Called at all exit points in fix_ticket():
- Already Passing
- Fixed (success)
- Escalated to Human
- Max Iterations
- Also added to fast_retrieval_loop()
- Report saved to: .diagnostics/{ticket_id}/iteration_summary.txt
YOLO MODE FRIENDLY:
User can now run `--yolo` overnight and read comprehensive report in
the morning showing exactly what happened during the run.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Added ability to process multiple tickets in a single run with
comprehensive batch reporting.
USAGE:
# Multiple tickets on command line:
uv run scripts/okp_mcp_agent.py fix RSPEED-2482 RSPEED-2481 RSPEED-2480 --yolo
# Or from a file:
uv run scripts/okp_mcp_agent.py fix --ticket-file tickets.txt --yolo
FEATURES:
- Accept multiple ticket IDs via command line or --ticket-file
- Process each ticket sequentially
- Error handling per ticket (continues on failure)
- Ctrl+C handling (saves progress for completed tickets)
- Batch summary report showing:
- Total/Fixed/Failed/Interrupted counts
- Duration and timing
- Per-ticket results
- Links to individual iteration reports
OUTPUT:
- Individual reports: .diagnostics/{ticket_id}/iteration_summary.txt
- Batch summary: .diagnostics/batch_summary_YYYYMMDD_HHMMSS.txt
EXAMPLE WORKFLOW:
1. Create tickets.txt with one ticket ID per line
2. Run: uv run scripts/okp_mcp_agent.py fix --ticket-file tickets.txt --yolo --max-iterations 20
3. Go to sleep
4. Wake up and read batch_summary.txt + individual reports
Perfect for overnight YOLO runs across multiple tickets!
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Updated example file to show all 20 available functional test tickets with descriptions. Users can uncomment the tickets they want to process. Organized by category: - Container compatibility - VM/virtualization - EUS support - RHEL 10 - Configuration - EOL migrations - SAP - Other Default: Only RSPEED-2482 enabled for testing
Created detailed analysis of performance bottlenecks, parallelization opportunities, and advanced agentic workflows to improve the agent. KEY SECTIONS: 1. Current Bottlenecks - Container restarts: 10-20s each (biggest bottleneck) - Sequential ticket processing: No parallelism - Single-threaded LLM calls: Missing easy parallelism - Full evaluation overhead: 30s per run - Git operation overhead: Many small subprocesses 2. Parallelization Opportunities - Multi-ticket parallel processing (3x speedup) - Parallel LLM judges (4x faster evaluation) - Parallel suggestion generation (better quality) - Batch operations everywhere 3. Advanced Agentic Workflows - Multi-agent collaboration (specialized experts) - Automated root cause analysis (Solr explain parsing) - Regression prediction (ML model for risk) - Automated bisection (find breaking commits) - Knowledge graph (learn from fix history) - Meta-learning (extract fix patterns) 4. Infrastructure Improvements - Distributed worker architecture - LLM API optimization (streaming, batching) - Telemetry & observability EXPERIMENTATION ROADMAP: - Phase 1 (Quick Wins): Parallel LLM judges, caching - 30-40% speedup - Phase 2 (Parallelization): Multi-ticket workers - 2-4x speedup - Phase 3 (Advanced Agents): Root cause, prediction - better quality - Phase 4 (Production): Distributed, auto-scaling - 10x scale IMMEDIATE PRIORITIES: ✅ Batch processing (Done!) ✅ Progress reports (Done!) □ Parallel LLM judges (Easy win) □ LLM response caching (Easy win) □ Multi-ticket parallel processing (Big win) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Fixed AttributeError in fast_retrieval_loop where query_solr_direct() was calling missing helper methods. Added: - calculate_url_f1(): Compute F1 score for URL retrieval - calculate_mrr(): Compute Mean Reciprocal Rank Both methods: - Normalize URLs (remove https://, trailing slashes) - Handle empty inputs gracefully - Return float 0.0-1.0 Error was: AttributeError: 'OkpMcpAgent' object has no attribute 'calculate_url_f1' Now fast_retrieval_loop can properly compute fast URL-based metrics without LLM judges. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Added realistic workflow for customer bugs where you only have the question and expected answer, not the ground truth URLs. CORE FEATURES: 1. Answer-Only Evaluation Mode - Start with just question + expected_response (no URLs needed) - Uses LLM to judge answer correctness vs expert answer - If wrong -> diagnoses why (extraction vs retrieval problem) 2. Smart Document Discovery with Verification - Searches Solr using expected_response as query - LLM verifies which candidate docs actually contain the answer - Filters out high-scoring but irrelevant docs (e.g., JBoss docs) - Only returns verified documents 3. Automatic YAML Enrichment - When answer is correct -> saves retrieved URLs to config - When docs are discovered -> adds them to config - Creates regression test automatically 4. Root Cause Diagnosis - Checks if retrieved docs contain expected answer - Distinguishes extraction vs retrieval problems - Guides user to correct fix strategy CODE CHANGES: - check_answer_in_retrieved_docs(): LLM judges if docs have answer - Enhanced discover_expected_documents(): Verifies each doc with LLM - Answer-first mode in diagnose(): Handles missing expected_urls - Auto-save discovered URLs to YAML DOCUMENTATION: - docs/ANSWER_FIRST_WORKFLOW.md: Complete 500-line guide - README.md: Added OKP-MCP Agent section with quick start - docs/OPTIMIZATION_OPPORTUNITIES.md: Added answer-first section WORK CASES SUPPORTED: - Customer bugs (answer-only, no URLs) - Regression tests (with known URLs) - Bootstrap mode (wrong URLs) - Fix mode (correct URLs) - Batch processing (multiple tickets) This makes the agent useful for real customer bugs, not just regression testing with known ground truth. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…n expected answers and find patterns in multiple tickets to try to batch multiple fixes that fall under one pattern. More docs on design intent, other new ideas, POC script is a WIP not yet tested. But, there are tests.
- Move 59 agentic files to dedicated okp_mcp_agent/ subfolder - Fix all imports (scripts.* -> relative imports, lightspeed_evaluation.agents -> okp_mcp_agent.core) - Fix config paths (config/okp_mcp_test_suites -> okp_mcp_agent/config/test_suites) - Update usage examples in docstrings - Create TODO.md and IMPORT_FIXES_NEEDED.md documentation - Create BRANCH_ORGANIZATION_REPORT.md for onboarding Prepares agent system for migration to okp-mcp repo. See okp_mcp_agent/TODO.md for migration checklist. Files moved: - agents/ (5 files): okp_mcp_agent.py, llm_advisor, solr_checker, etc. - bootstrap/ (5 files): JIRA extraction scripts - pattern_discovery/ (3 files): Pattern analysis - core/ (4 files): LinuxExpert, SolrExpert, PatternDiscovery - tests/ (17 files): All agent tests - config/ (35 files): test_suites, patterns, bootstrap artifacts - docs/ (10 files): Agent documentation - artifacts/ (13 files): Bootstrap run outputs Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…o - this was just a PoC branch for an idea)
…oject - analysis on where queries have the highest failure rates in eval, etc to identify weaknesses in system and plan fixes
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.