feat(agents): split chat agent into task-specific agents with prompt profiles

Ovtcharov · Ovtcharov · commit f69345c04368 · 2026-05-07T00:26:56.000-07:00
Before: a monolithic ChatAgent with a 13K-token system prompt caused a 95s
TTFT for a simple "Hi!" on Gemma-4-E4B. Eval scenarios timed out at 610s.

After: 5 focused agents (chat, doc, file, data, web) + lite variants,
each with a lean prompt profile. TTFT drops from 95s to 0.12s (chat)
and 3-10s (doc). Eval pass rate: 89% judged (34/38), avg score 9.4/10.

Agent architecture:
- chat: conversation only, ~2K tokens, no tools
- doc: RAG + file search, ~5K tokens, hallucination prevention
- file: filesystem ops + discovery, ~4K tokens
- data: CSV/Excel analysis with scratchpad, ~3K tokens
- web: browser tools, ~2K tokens
- Each has a -lite variant using a ~4B model for low-memory hardware
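
A minimal sketch of how the profiles listed above might be represented and
looked up. PromptProfile, BASE_PROFILES and get_profile are hypothetical names
used for illustration, not the actual contents of src/gaia/agents/registry.py.
The token figures are the approximate ones quoted in the list, and the sketch
assumes a -lite variant reuses its base profile and only swaps the model.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PromptProfile:
    """Hypothetical container for one agent's prompt settings."""
    name: str
    approx_prompt_tokens: int  # rough system-prompt size quoted in the commit message
    focus: str
    lite: bool = False  # True for the -lite variant (~4B model)


# Illustrative table only; values are the approximate figures from the list above.
BASE_PROFILES = {
    "chat": PromptProfile("chat", 2000, "conversation only, no tools"),
    "doc": PromptProfile("doc", 5000, "RAG + file search"),
    "file": PromptProfile("file", 4000, "filesystem ops + discovery"),
    "data": PromptProfile("data", 3000, "CSV/Excel analysis with scratchpad"),
    "web": PromptProfile("web", 2000, "browser tools"),
}


def get_profile(agent_type: str) -> PromptProfile:
    """Return the profile for names such as 'doc' or 'doc-lite'."""
    lite = agent_type.endswith("-lite")
    base_name = agent_type[: -len("-lite")] if lite else agent_type
    if base_name not in BASE_PROFILES:
        raise ValueError(f"unknown agent type: {agent_type!r}")
    base = BASE_PROFILES[base_name]
    if lite:
        # Assumption: the lite variant keeps the base prompt and changes only the model.
        return replace(base, name=agent_type, lite=True)
    return base
```

With this layout, get_profile("doc") yields the ~5K-token profile and
get_profile("web-lite") yields the same profile flagged as lite.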

Eval framework updates:
- Per-scenario agent_type field in YAML (overrides --agent-type CLI)
- Latency validation: warns when TTFT > 30s
- Preserve eval sessions for review (no delete_session)
- Increased startup timeout 120s → 240s for Windows
- Fixed shutil.which("claude") for Windows .cmd resolution
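
The first two items above can be sketched as follows. This is an illustration
under stated assumptions, not the code in src/gaia/eval/runner.py:
resolve_agent_type, ttft_warning and the "chat" fallback are hypothetical, and
the scenario is assumed to be an already-parsed YAML mapping.

```python
from typing import Optional

TTFT_WARN_SECONDS = 30.0  # threshold stated in the commit message


def resolve_agent_type(
    scenario: dict,
    cli_agent_type: Optional[str] = None,
    default: str = "chat",  # assumption: the fallback is not stated in the commit
) -> str:
    """Pick the agent for a scenario; the per-scenario field wins over the CLI flag."""
    return scenario.get("agent_type") or cli_agent_type or default


def ttft_warning(scenario_id: str, ttft_seconds: float) -> Optional[str]:
    """Return a warning string when time-to-first-token exceeds the threshold."""
    if ttft_seconds > TTFT_WARN_SECONDS:
        return (
            f"WARNING: {scenario_id}: TTFT {ttft_seconds:.1f}s "
            f"exceeds {TTFT_WARN_SECONDS:.0f}s"
        )
    return None
```

For example, a scenario containing agent_type: doc resolves to "doc" even when
--agent-type chat is passed, and the 95s TTFT quoted above would produce a
warning while 0.12s would not.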
diff --git a/eval/scenarios/adversarial/empty_file.yaml b/eval/scenarios/adversarial/empty_file.yaml
@@ -1,6 +1,7 @@
 id: empty_file
 name: "Empty File Handling"
 category: adversarial
+agent_type: doc
 severity: medium
 description: |
   User asks the agent to index and read a completely empty file. Agent must
diff --git a/eval/scenarios/adversarial/large_document.yaml b/eval/scenarios/adversarial/large_document.yaml
@@ -1,6 +1,7 @@
 id: large_document
 name: "Buried Fact in Large Document"
 category: adversarial
+agent_type: doc
 severity: high
 description: |
   A specific fact is buried deep within a large document. Tests whether the
diff --git a/eval/scenarios/adversarial/topic_switch.yaml b/eval/scenarios/adversarial/topic_switch.yaml
@@ -1,6 +1,7 @@
 id: topic_switch
 name: "Rapid Topic Switch"
 category: adversarial
+agent_type: doc
 severity: medium
 description: |
   User rapidly switches topics between two different documents across four turns.
diff --git a/eval/scenarios/captured/captured_eval_cross_turn_file_recall.yaml b/eval/scenarios/captured/captured_eval_cross_turn_file_recall.yaml
@@ -1,6 +1,7 @@
 id: captured_eval_cross_turn_file_recall
 name: "Captured: Cross-Turn File Recall"
 category: captured
+agent_type: doc
 description: 'Captured from session: Eval: cross_turn_file_recall'
 note: "Subset of cross_turn_file_recall (2 of 3 turns captured from a real session)"
 persona: casual_user
diff --git a/eval/scenarios/captured/captured_eval_smart_discovery.yaml b/eval/scenarios/captured/captured_eval_smart_discovery.yaml
@@ -1,6 +1,7 @@
 id: captured_eval_smart_discovery
 name: "Captured: Smart Document Discovery"
 category: captured
+agent_type: file
 description: 'Captured from session: Eval: smart_discovery'
 persona: casual_user
 setup:
diff --git a/eval/scenarios/context_retention/conversation_summary.yaml b/eval/scenarios/context_retention/conversation_summary.yaml
@@ -1,6 +1,7 @@
 id: conversation_summary
 name: "5-Turn Conversation Summary"
 category: context_retention
+agent_type: doc
 severity: medium
 description: |
   A 5-turn conversation that tests the agent's ability to accumulate facts across
diff --git a/eval/scenarios/context_retention/cross_turn_file_recall.yaml b/eval/scenarios/context_retention/cross_turn_file_recall.yaml
@@ -1,6 +1,7 @@
 id: cross_turn_file_recall
 name: "Cross-Turn File Recall"
 category: context_retention
+agent_type: doc
 severity: critical
 description: |
   User indexes a document in Turn 1, then asks about its content in Turn 2
diff --git a/eval/scenarios/context_retention/multi_doc_context.yaml b/eval/scenarios/context_retention/multi_doc_context.yaml
@@ -1,6 +1,7 @@
 id: multi_doc_context
 name: "Multi-Document Context"
 category: context_retention
+agent_type: doc
 severity: high
 description: |
   Two documents are indexed simultaneously. Agent must answer questions from each
diff --git a/eval/scenarios/context_retention/pronoun_resolution.yaml b/eval/scenarios/context_retention/pronoun_resolution.yaml
@@ -1,6 +1,7 @@
 id: pronoun_resolution
 name: "Pronoun Resolution"
 category: context_retention
+agent_type: doc
 severity: critical
 description: |
   User asks follow-up questions using pronouns ("it", "that policy").
diff --git a/eval/scenarios/error_recovery/file_not_found.yaml b/eval/scenarios/error_recovery/file_not_found.yaml
@@ -1,6 +1,7 @@
 id: file_not_found
 name: "File Not Found -- Helpful Error"
 category: error_recovery
+agent_type: doc
 severity: medium
 description: |
   User asks to read a nonexistent file. Agent must report the error gracefully
diff --git a/eval/scenarios/error_recovery/search_empty_fallback.yaml b/eval/scenarios/error_recovery/search_empty_fallback.yaml
@@ -1,6 +1,7 @@
 id: search_empty_fallback
 name: "Search Empty -- Fallback Strategy"
 category: error_recovery
+agent_type: doc
 severity: high
 description: |
   No documents are pre-indexed. Agent must discover and index a file on its own.
diff --git a/eval/scenarios/error_recovery/vague_request_clarification.yaml b/eval/scenarios/error_recovery/vague_request_clarification.yaml
@@ -1,6 +1,7 @@
 id: vague_request_clarification
 name: "Vague Request -- Clarification"
 category: error_recovery
+agent_type: doc
 severity: medium
 description: |
   Two documents are indexed. User makes an ambiguous request ("summarize the
diff --git a/eval/scenarios/personality/concise_response.yaml b/eval/scenarios/personality/concise_response.yaml
@@ -1,6 +1,7 @@
 id: concise_response
 name: "Concise Response -- Short Greeting"
 category: personality
+agent_type: chat
 severity: medium
 description: |
   User sends a short greeting. Agent should respond concisely (1-2 sentences)
diff --git a/eval/scenarios/personality/honest_limitation.yaml b/eval/scenarios/personality/honest_limitation.yaml
@@ -1,6 +1,7 @@
 id: honest_limitation
 name: "Honest Limitation Admission"
 category: personality
+agent_type: doc
 severity: medium
 description: |
   User asks about information that is NOT in the indexed document (employee count).
diff --git a/eval/scenarios/personality/no_sycophancy.yaml b/eval/scenarios/personality/no_sycophancy.yaml
@@ -1,6 +1,7 @@
 id: no_sycophancy
 name: "No Sycophancy -- Pushback on Wrong Claims"
 category: personality
+agent_type: doc
 severity: medium
 description: |
   User asserts a factually incorrect claim based on the indexed document.
diff --git a/eval/scenarios/rag_quality/budget_query.yaml b/eval/scenarios/rag_quality/budget_query.yaml
@@ -1,6 +1,7 @@
 id: budget_query
 name: "Budget Document Query"
 category: rag_quality
+agent_type: doc
 severity: medium
 description: |
   Agent must retrieve budget facts from a structured markdown table document.
diff --git a/eval/scenarios/rag_quality/cross_section_rag.yaml b/eval/scenarios/rag_quality/cross_section_rag.yaml
@@ -1,6 +1,7 @@
 id: cross_section_rag
 name: "Cross-Section RAG Synthesis"
 category: rag_quality
+agent_type: doc
 severity: high
 description: |
   Agent must retrieve facts from different sections of the same document and
diff --git a/eval/scenarios/rag_quality/csv_analysis.yaml b/eval/scenarios/rag_quality/csv_analysis.yaml
@@ -1,6 +1,7 @@
 id: csv_analysis
 name: "CSV Aggregation and Analysis"
 category: rag_quality
+agent_type: data
 severity: high
 description: |
   Tests the agent's ability to perform aggregation and analysis on CSV data.
diff --git a/eval/scenarios/rag_quality/hallucination_resistance.yaml b/eval/scenarios/rag_quality/hallucination_resistance.yaml
@@ -1,6 +1,7 @@
 id: hallucination_resistance
 name: "Hallucination Resistance"
 category: rag_quality
+agent_type: doc
 severity: critical
 description: |
   Agent must admit when information is NOT in the indexed document.
diff --git a/eval/scenarios/rag_quality/negation_handling.yaml b/eval/scenarios/rag_quality/negation_handling.yaml
@@ -1,6 +1,7 @@
 id: negation_handling
 name: "Negation Handling"
 category: rag_quality
+agent_type: doc
 severity: high
 description: |
   Tests whether the agent correctly interprets negation in source documents.
diff --git a/eval/scenarios/rag_quality/simple_factual_rag.yaml b/eval/scenarios/rag_quality/simple_factual_rag.yaml
@@ -1,6 +1,7 @@
 id: simple_factual_rag
 name: "Simple Factual RAG"
 category: rag_quality
+agent_type: doc
 severity: critical
 description: |
   Direct fact lookup from a financial report.
diff --git a/eval/scenarios/rag_quality/table_extraction.yaml b/eval/scenarios/rag_quality/table_extraction.yaml
@@ -1,6 +1,7 @@
 id: table_extraction
 name: "HTML Table Extraction"
 category: rag_quality
+agent_type: doc
 severity: high
 description: |
   Agent must correctly parse and extract structured data from an HTML comparison
diff --git a/eval/scenarios/real_world/alphabet_10k_2024.yaml b/eval/scenarios/real_world/alphabet_10k_2024.yaml
@@ -1,6 +1,7 @@
 id: alphabet_10k_2024
 name: "Alphabet Inc 10-K FY2024 Earnings Analysis"
 category: real_world
+agent_type: doc
 severity: high
 description: |
   Agent must retrieve financial metrics from Alphabet's 2024 annual report excerpt.
diff --git a/eval/scenarios/real_world/apache_license_20.yaml b/eval/scenarios/real_world/apache_license_20.yaml
@@ -1,6 +1,7 @@
 id: apache_license_20
 name: "Apache License 2.0 Legal Clause Lookup"
 category: real_world
+agent_type: doc
 severity: medium
 description: |
   Agent must retrieve precise legal clause details from the Apache License 2.0 text.
diff --git a/eval/scenarios/real_world/attention_transformer_paper.yaml b/eval/scenarios/real_world/attention_transformer_paper.yaml
@@ -1,6 +1,7 @@
 id: attention_transformer_paper
 name: "Attention Is All You Need Research Paper Lookup"
 category: real_world
+agent_type: doc
 severity: medium
 description: |
   Agent must retrieve facts from the "Attention Is All You Need" Transformer paper.
diff --git a/eval/scenarios/real_world/bls_employment_dec2025.yaml b/eval/scenarios/real_world/bls_employment_dec2025.yaml
@@ -1,6 +1,7 @@
 id: bls_employment_dec2025
 name: "BLS Employment Situation December 2025"
 category: real_world
+agent_type: doc
 severity: medium
 description: |
   Agent must retrieve labor market statistics from the BLS December 2025 employment
diff --git a/eval/scenarios/real_world/cdc_flu_2023_2024.yaml b/eval/scenarios/real_world/cdc_flu_2023_2024.yaml
@@ -1,6 +1,7 @@
 id: cdc_flu_2023_2024
 name: "CDC 2023-2024 Influenza Season Statistics"
 category: real_world
+agent_type: doc
 severity: medium
 description: |
   Agent must retrieve public health statistics from the CDC 2023-2024 influenza
diff --git a/eval/scenarios/real_world/company_financials_xlsx.yaml b/eval/scenarios/real_world/company_financials_xlsx.yaml
@@ -1,6 +1,7 @@
 id: company_financials_xlsx
 name: "Meridian Technology Solutions FY2024 Financial Spreadsheet"
 category: real_world
+agent_type: data
 severity: high
 description: |
   Agent must parse a multi-sheet Excel spreadsheet (Income Statement + Balance Sheet)
diff --git a/eval/scenarios/real_world/department_budget_xlsx.yaml b/eval/scenarios/real_world/department_budget_xlsx.yaml
@@ -1,6 +1,7 @@
 id: department_budget_xlsx
 name: "Meridian Technology Solutions Department Budget FY2024"
 category: real_world
+agent_type: data
 severity: high
 description: |
   Agent must analyze a three-sheet Excel budget file (Budget vs Actual, Headcount, Summary)
diff --git a/eval/scenarios/real_world/fed_rate_nov2024.yaml b/eval/scenarios/real_world/fed_rate_nov2024.yaml
@@ -1,6 +1,7 @@
 id: fed_rate_nov2024
 name: "Federal Reserve Rate Decision November 2024"
 category: real_world
+agent_type: doc
 severity: high
 description: |
   Agent must retrieve precise monetary policy details from the November 2024 FOMC
diff --git a/eval/scenarios/real_world/gdpr_article17_erasure.yaml b/eval/scenarios/real_world/gdpr_article17_erasure.yaml
@@ -1,6 +1,7 @@
 id: gdpr_article17_erasure
 name: "GDPR Article 17 Right to Erasure"
 category: real_world
+agent_type: doc
 severity: high
 description: |
   Agent must retrieve regulatory compliance details from GDPR Article 17 (Right to Erasure).
diff --git a/eval/scenarios/real_world/github_terms_of_service.yaml b/eval/scenarios/real_world/github_terms_of_service.yaml
@@ -1,6 +1,7 @@
 id: github_terms_of_service
 name: "GitHub Terms of Service Policy Lookup"
 category: real_world
+agent_type: doc
 severity: medium
 description: |
   Agent must retrieve platform policy details from the GitHub Terms of Service excerpt.
diff --git a/eval/scenarios/real_world/mit_license.yaml b/eval/scenarios/real_world/mit_license.yaml
@@ -1,6 +1,7 @@
 id: mit_license
 name: "MIT License Short-Form Legal Parsing"
 category: real_world
+agent_type: doc
 severity: medium
 description: |
   Agent must parse the MIT License text and answer compliance questions.
diff --git a/eval/scenarios/real_world/nist_csf2_framework.yaml b/eval/scenarios/real_world/nist_csf2_framework.yaml
@@ -1,6 +1,7 @@
 id: nist_csf2_framework
 name: "NIST Cybersecurity Framework 2.0 Standards Lookup"
 category: real_world
+agent_type: doc
 severity: medium
 description: |
   Agent must retrieve facts about NIST CSF 2.0 from the framework overview document.
diff --git a/eval/scenarios/real_world/product_inventory_xlsx.yaml b/eval/scenarios/real_world/product_inventory_xlsx.yaml
@@ -1,6 +1,7 @@
 id: product_inventory_xlsx
 name: "Product Inventory Multi-Sheet Cross-Reference"
 category: real_world
+agent_type: data
 severity: high
 description: |
   Agent must query a three-sheet Excel inventory file (Inventory, Price List, Lookup)
diff --git a/eval/scenarios/real_world/python311_release_notes.yaml b/eval/scenarios/real_world/python311_release_notes.yaml
@@ -1,6 +1,7 @@
 id: python311_release_notes
 name: "Python 3.11 Release Notes Lookup"
 category: real_world
+agent_type: doc
 severity: medium
 description: |
   Agent must retrieve facts from the Python 3.11 What's New document.
diff --git a/eval/scenarios/real_world/raspberry_pi4_datasheet.yaml b/eval/scenarios/real_world/raspberry_pi4_datasheet.yaml
@@ -1,6 +1,7 @@
 id: raspberry_pi4_datasheet
 name: "Raspberry Pi 4 Product Datasheet Lookup"
 category: real_world
+agent_type: doc
 severity: medium
 description: |
   Agent must retrieve product specifications from the Raspberry Pi 4 Model B datasheet.
diff --git a/eval/scenarios/real_world/rfc7231_http_spec.yaml b/eval/scenarios/real_world/rfc7231_http_spec.yaml
@@ -1,6 +1,7 @@
 id: rfc7231_http_spec
 name: "RFC 7231 HTTP Semantics Spec Lookup"
 category: real_world
+agent_type: doc
 severity: medium
 description: |
   Agent must retrieve precise facts from RFC 7231 (HTTP/1.1 Semantics and Content).
diff --git a/eval/scenarios/real_world/treasury_fy2024_budget.yaml b/eval/scenarios/real_world/treasury_fy2024_budget.yaml
@@ -1,6 +1,7 @@
 id: treasury_fy2024_budget
 name: "US Treasury FY2024 Budget Results"
 category: real_world
+agent_type: doc
 severity: medium
 description: |
   Agent must retrieve fiscal policy data from the US Treasury FY2024 budget results report.
diff --git a/eval/scenarios/real_world/us_labor_statistics_xlsx.yaml b/eval/scenarios/real_world/us_labor_statistics_xlsx.yaml
@@ -1,6 +1,7 @@
 id: us_labor_statistics_xlsx
 name: "US Labor Statistics 2024 Spreadsheet Multi-Sheet Analysis"
 category: real_world
+agent_type: data
 severity: high
 description: |
   Agent must analyze a two-sheet Excel spreadsheet (Monthly Unemployment + Industry Breakdown)
diff --git a/eval/scenarios/real_world/usb20_spec_lookup.yaml b/eval/scenarios/real_world/usb20_spec_lookup.yaml
@@ -1,6 +1,7 @@
 id: usb20_spec_lookup
 name: "USB 2.0 Specification Hardware Lookup"
 category: real_world
+agent_type: doc
 severity: medium
 description: |
   Agent must retrieve precise numeric values from the USB 2.0 specification overview.
diff --git a/eval/scenarios/tool_selection/known_path_read.yaml b/eval/scenarios/tool_selection/known_path_read.yaml
@@ -1,6 +1,7 @@
 id: known_path_read
 name: "Known Path -- Use read_file Directly"
 category: tool_selection
+agent_type: doc
 severity: high
 description: |
   User provides an exact file path. Agent should read the file directly using
diff --git a/eval/scenarios/tool_selection/multi_step_plan.yaml b/eval/scenarios/tool_selection/multi_step_plan.yaml
@@ -1,6 +1,7 @@
 id: multi_step_plan
 name: "Multi-Step Plan -- Complex Request"
 category: tool_selection
+agent_type: doc
 severity: medium
 description: |
   User makes a compound request requiring the agent to retrieve multiple facts
diff --git a/eval/scenarios/tool_selection/no_tools_needed.yaml b/eval/scenarios/tool_selection/no_tools_needed.yaml
@@ -1,6 +1,7 @@
 id: no_tools_needed
 name: "No Tools -- General Knowledge"
 category: tool_selection
+agent_type: doc
 severity: high
 description: |
   No documents are indexed. User asks simple general-knowledge and arithmetic
diff --git a/eval/scenarios/tool_selection/smart_discovery.yaml b/eval/scenarios/tool_selection/smart_discovery.yaml
@@ -1,6 +1,7 @@
 id: smart_discovery
 name: "Smart Discovery"
 category: tool_selection
+agent_type: doc
 severity: critical
 description: |
   No documents are pre-indexed. User asks about PTO policy.
diff --git a/eval/scenarios/vision/screenshot_capture.yaml b/eval/scenarios/vision/screenshot_capture.yaml
@@ -1,6 +1,7 @@
 id: screenshot_capture
 name: "Screenshot Tool -- Capture and Report"
 category: vision
+agent_type: chat
 severity: medium
 description: |
   Tests that the take_screenshot tool is registered and working in ChatAgent.
diff --git a/eval/scenarios/vision/sd_graceful_degradation.yaml b/eval/scenarios/vision/sd_graceful_degradation.yaml
@@ -1,6 +1,7 @@
 id: sd_graceful_degradation
 name: "SD Tool -- Graceful Degradation"
 category: vision
+agent_type: chat
 severity: medium
 description: |
   Tests that the ChatAgent handles image generation requests gracefully —
diff --git a/eval/scenarios/vision/vlm_graceful_degradation.yaml b/eval/scenarios/vision/vlm_graceful_degradation.yaml
@@ -1,6 +1,7 @@
 id: vlm_graceful_degradation
 name: "VLM Tool -- Graceful Degradation"
 category: vision
+agent_type: chat
 severity: medium
 description: |
   Tests that the ChatAgent's VLM tools (analyze_image, answer_question_about_image)
diff --git a/eval/scenarios/web_system/clipboard_tools.yaml b/eval/scenarios/web_system/clipboard_tools.yaml
@@ -1,6 +1,7 @@
 id: clipboard_tools
 name: "Clipboard Tools -- Graceful Degradation"
 category: web_system
+agent_type: chat
 severity: low
 description: |
   Tests clipboard read/write tools. These gracefully degrade if pyperclip is not installed.
diff --git a/eval/scenarios/web_system/desktop_notification.yaml b/eval/scenarios/web_system/desktop_notification.yaml
@@ -1,6 +1,7 @@
 id: desktop_notification
 name: "Desktop Notification Tool"
 category: web_system
+agent_type: chat
 severity: low
 description: |
   Tests that notify_desktop tool is registered and handles gracefully whether
diff --git a/eval/scenarios/web_system/fetch_webpage.yaml b/eval/scenarios/web_system/fetch_webpage.yaml
diff --git a/eval/scenarios/web_system/list_windows.yaml b/eval/scenarios/web_system/list_windows.yaml
diff --git a/eval/scenarios/web_system/system_info.yaml b/eval/scenarios/web_system/system_info.yaml
diff --git a/eval/scenarios/web_system/text_to_speech.yaml b/eval/scenarios/web_system/text_to_speech.yaml
diff --git a/src/gaia/agents/chat/agent.py b/src/gaia/agents/chat/agent.py
diff --git a/src/gaia/agents/registry.py b/src/gaia/agents/registry.py
diff --git a/src/gaia/eval/runner.py b/src/gaia/eval/runner.py