amd · itomek-amd · Apr 2, 2026
@@ -1075,6 +1075,16 @@ Evaluate results:
 gaia eval -f ./results/experiment.json
 ```
 
+Run agent eval benchmark:
+```bash
+gaia eval agent                                    # Run all scenarios
+gaia eval agent --category mcp_reliability         # Run MCP reliability scenarios only
+gaia eval agent --scenario mcp_simple_tool_call    # Run a single scenario
+gaia eval agent --iterations 5                     # Run each scenario 5 times for reliability measurement
+gaia eval agent --fix                              # Auto-fix failures with Claude Code
+gaia eval agent --compare run1/scorecard.json run2/scorecard.json  # Compare runs
+```
+
 Generate report:
 ```bash
 gaia report -d ./eval_results

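The `--iterations` flag in the README hunk above runs each scenario several times to measure reliability. The diff does not show how those runs are aggregated, so the following is only a minimal sketch of one way to summarize them. The `outcomes` mapping and the `pass_rate` helper are hypothetical; they are not part of the `gaia` CLI or its scorecard format.

```python
from typing import Dict, List


def pass_rate(outcomes: Dict[str, List[bool]]) -> Dict[str, float]:
    """Return the fraction of passing iterations per scenario.

    `outcomes` maps a scenario id to one boolean per iteration
    (True = pass). This shape is an assumption for illustration only.
    """
    rates: Dict[str, float] = {}
    for scenario_id, runs in outcomes.items():
        if not runs:
            # Skip scenarios with no recorded iterations to avoid dividing by zero.
            continue
        rates[scenario_id] = sum(runs) / len(runs)
    return rates


if __name__ == "__main__":
    example = {
        "mcp_simple_tool_call": [True, True, False, True, True],
        "mcp_no_tool_needed": [True, True, True, True, True],
    }
    for scenario_id, rate in pass_rate(example).items():
        print(f"{scenario_id}: {rate:.0%}")
```

A per-scenario rate like this makes flaky scenarios visible, which a single run cannot show.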
@@ -0,0 +1,45 @@
+id: mcp_conditional_tool
+name: "Conditional Tool Selection -- Context-Dependent Choice"
+category: mcp_reliability
+severity: high
+description: |
+  Agent is given two turns: one where a tool is needed, and one where it is not.
+  The agent must correctly decide when to use tools based on context.
+  Tests adaptive tool selection across turns.
+
+persona: power_user
+
+setup:
+  index_documents:
+    - corpus_doc: employee_handbook
+      path: "eval/corpus/documents/employee_handbook.md"
+
+turns:
+  - turn: 1
+    objective: "Ask a question that requires searching the indexed document"
+    user_message: "How many PTO days do first-year employees get according to the handbook?"
+    ground_truth:
+      doc_id: employee_handbook
+      fact_id: pto_days
+      expected_answer: "15 days"
+    success_criteria: |
+      Agent uses a search or query tool to look up the answer in the indexed
+      employee handbook and returns "15 days" (or equivalent).
+      FAIL if agent answers without using any tool.
+      FAIL if agent returns the wrong number.
+
+  - turn: 2
+    objective: "Ask a follow-up that does NOT require a tool"
+    user_message: "Is 15 days more or less than the industry average of 10 days?"
+    ground_truth:
+      expected_behavior: "Agent answers using simple reasoning from context, without making another tool call"
+    success_criteria: |
+      Agent correctly states that 15 days is more than 10 days using basic reasoning.
+      PASS if agent answers correctly WITHOUT calling any tools (the information
+      is already in the conversation context).
+      PARTIAL PASS if agent re-queries the document but still answers correctly.
+      FAIL if agent gives a wrong comparison.
+
+expected_outcome: |
+  Agent uses tools in turn 1 (needed) and avoids tools in turn 2 (not needed),
+  demonstrating context-aware tool selection.
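The turn 2 criteria in `mcp_conditional_tool` describe a three-way outcome. As a rough sketch, they could be written as a small decision function. The function name and its inputs are hypothetical, and the real judging is presumably done by the eval harness rather than by code like this.

```python
def grade_followup_turn(answer_correct: bool, tool_calls: int) -> str:
    """Grade a turn whose answer is already available in the conversation.

    Mirrors the written criteria: a wrong comparison fails, a correct
    answer with no tool calls passes, and a correct answer that
    re-queried the document is a partial pass.
    """
    if not answer_correct:
        return "FAIL"  # a wrong comparison fails regardless of tool use
    if tool_calls == 0:
        return "PASS"  # answered from conversation context alone
    return "PARTIAL PASS"  # correct, but made an unnecessary tool call


if __name__ == "__main__":
    print(grade_followup_turn(answer_correct=True, tool_calls=0))
    print(grade_followup_turn(answer_correct=True, tool_calls=1))
    print(grade_followup_turn(answer_correct=False, tool_calls=0))
```

Checking correctness first keeps the ordering unambiguous: tool use only affects the grade once the answer itself is right.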
@@ -0,0 +1,58 @@
+id: mcp_multi_doc_tools
+name: "Multi-Document Tool Orchestration"
+category: mcp_reliability
+severity: high
+description: |
+  Agent must retrieve facts from two different indexed documents and synthesize
+  them. Tests the agent's ability to make multiple tool calls across different
+  data sources and combine the results coherently.
+
+persona: data_analyst
+
+setup:
+  index_documents:
+    - corpus_doc: employee_handbook
+      path: "eval/corpus/documents/employee_handbook.md"
+    - corpus_doc: acme_q3_report
+      path: "eval/corpus/documents/acme_q3_report.md"
+
+turns:
+  - turn: 1
+    objective: "Ask for PTO policy from the employee handbook"
+    user_message: "How many PTO days do first-year employees get?"
+    ground_truth:
+      doc_id: employee_handbook
+      fact_id: pto_days
+      expected_answer: "15 days"
+    success_criteria: |
+      Agent queries the employee handbook and returns "15 days".
+      FAIL if agent returns the wrong number or does not use tools.
+
+  - turn: 2
+    objective: "Ask for financial data from a different document"
+    user_message: "Now tell me what Acme Corp's Q3 revenue was."
+    ground_truth:
+      doc_id: acme_q3_report
+      fact_id: q3_revenue
+      expected_answer: "$14.2 million"
+    success_criteria: |
+      Agent queries the Q3 report (a different document than turn 1) and returns
+      the correct revenue figure. Must use a tool to retrieve this.
+      FAIL if agent confuses data between the two documents.
+      FAIL if agent returns the wrong revenue figure.
+
+  - turn: 3
+    objective: "Ask agent to combine information from both documents"
+    user_message: "Give me a one-sentence summary combining the PTO policy and the Q3 revenue figure."
+    ground_truth:
+      expected_behavior: "Agent combines facts from both prior turns into a coherent sentence"
+    success_criteria: |
+      Agent produces a sentence mentioning both 15 PTO days and $14.2 million revenue.
+      PASS if both facts are present and correctly attributed.
+      FAIL if either fact is wrong or missing.
+      Agent may or may not use tools for this turn (both are acceptable since
+      the information was already retrieved).
+
+expected_outcome: |
+  Agent makes tool calls across two different documents, correctly retrieves
+  facts from each, and synthesizes them in a final response.
@@ -0,0 +1,32 @@
+id: mcp_two_step_chain
+name: "Two-Step Tool Chain -- Sequential Dependency"
+category: mcp_reliability
+severity: critical
+description: |
+  Agent is asked to perform a task that requires two sequential tool calls,
+  where the second call depends on the result of the first. Tests multi-step
+  tool orchestration reliability.
+
+persona: power_user
+
+setup:
+  index_documents: []
+
+turns:
+  - turn: 1
+    objective: "Ask agent to find a file and then read its contents"
+    user_message: "Look through the eval/corpus/documents directory, find the file about Q3 financials, and tell me the key revenue figures."
+    ground_truth:
+      expected_behavior: "Agent first calls a directory listing or file search tool, then calls a file reading tool on the discovered file"
+    success_criteria: |
+      Agent performs at least two tool calls in sequence:
+      1. First: a directory listing, file search, or browse tool to find the Q3 report
+      2. Second: a file reading or indexing tool to retrieve content from the found file
+      PASS if both tool calls succeed and the agent reports revenue information.
+      FAIL if agent only makes one tool call and guesses the rest.
+      FAIL if agent makes no tool calls.
+      FAIL if agent hallucinates file names without checking the directory.
+
+expected_outcome: |
+  Agent chains two tool calls: first discovers the file, then reads it.
+  Demonstrates reliable multi-step tool orchestration.
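The sequential dependency in `mcp_two_step_chain` (discover the file first, then read it) can be checked against an ordered list of tool names. This is a sketch under assumptions: the trace format is invented for illustration, and the tool names in the example are placeholders, not a confirmed list of the agent's tools.

```python
from typing import List, Set


def chains_discover_then_read(
    trace: List[str], discover: Set[str], read: Set[str]
) -> bool:
    """Return True if a discovery call appears before a read call.

    `trace` is the ordered list of tool names the agent called.
    `discover` and `read` are the tool names counted as each step.
    """
    seen_discovery = False
    for tool_name in trace:
        if tool_name in read and seen_discovery:
            return True
        if tool_name in discover:
            seen_discovery = True
    return False


if __name__ == "__main__":
    # Placeholder tool names, assumed for this example only.
    discover_tools = {"browse_files", "search_files"}
    read_tools = {"read_file"}
    print(chains_discover_then_read(
        ["browse_files", "read_file"], discover_tools, read_tools))
    print(chains_discover_then_read(
        ["read_file"], discover_tools, read_tools))
```

A read call with no earlier discovery call returns False, which corresponds to the "hallucinates file names without checking the directory" failure case.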
@@ -0,0 +1,33 @@
+id: mcp_error_handling
+name: "Error Handling -- Tool Call Failure Recovery"
+category: mcp_reliability
+severity: high
+description: |
+  Agent is asked to perform an action that will trigger a tool error
+  (e.g., accessing a nonexistent resource). Tests whether the agent handles
+  the error gracefully and reports it to the user.
+
+persona: casual_user
+
+setup:
+  index_documents: []
+
+turns:
+  - turn: 1
+    objective: "Ask agent to access a resource that does not exist"
+    user_message: "Can you read the file at /nonexistent/path/fake_report.pdf?"
+    ground_truth:
+      expected_behavior: "Agent attempts the tool call, receives an error, and reports it gracefully without crashing"
+    success_criteria: |
+      Agent attempts to call a file-reading tool with the given path.
+      When the tool returns an error (file not found), the agent reports the
+      error gracefully to the user.
+      PASS if agent attempts the tool call AND gracefully reports the failure.
+      PARTIAL PASS (score 7-8) if agent correctly states the file does not exist
+      without attempting the tool call.
+      FAIL if agent crashes, hangs, or fabricates content from the nonexistent file.
+      FAIL if agent enters an infinite retry loop.
+
+expected_outcome: |
+  Agent handles the tool error gracefully, informs the user the file was not
+  found, and remains responsive for further requests.
@@ -0,0 +1,34 @@
+id: mcp_search_with_context
+name: "Search with Context -- Query and Filter"
+category: mcp_reliability
+severity: high
+description: |
+  Agent is asked to search for information with specific context constraints.
+  Tests the agent's ability to translate contextual requirements into
+  appropriate tool parameters.
+
+persona: data_analyst
+
+setup:
+  index_documents:
+    - corpus_doc: acme_q3_report
+      path: "eval/corpus/documents/acme_q3_report.md"
+
+turns:
+  - turn: 1
+    objective: "Ask a question that requires the agent to search within an indexed document"
+    user_message: "What was the Q3 revenue for Acme Corp?"
+    ground_truth:
+      doc_id: acme_q3_report
+      fact_id: q3_revenue
+      expected_answer: "$14.2 million"
+    success_criteria: |
+      Agent queries the indexed document and returns the correct Q3 revenue figure.
+      The agent should use a search or query tool to retrieve the answer.
+      PASS if agent returns $14.2 million (or equivalent) and used a tool to find it.
+      FAIL if agent hallucinates a different number.
+      FAIL if agent says it cannot find the information despite the document being indexed.
+
+expected_outcome: |
+  Agent uses its search/query tools to retrieve the answer from the indexed
+  document, demonstrating reliable tool-assisted information retrieval.
@@ -0,0 +1,30 @@
+id: mcp_structured_params
+name: "Structured Parameters -- Multi-field Tool Call"
+category: mcp_reliability
+severity: critical
+description: |
+  Agent is asked to perform an action requiring an MCP tool call with multiple
+  structured parameters. Tests whether the agent correctly extracts and formats
+  multiple parameters from a natural language request.
+
+persona: power_user
+
+setup:
+  index_documents: []
+
+turns:
+  - turn: 1
+    objective: "Ask the agent to search for files matching criteria"
+    user_message: "Search for files containing the word 'revenue' in the eval/corpus/documents directory."
+    ground_truth:
+      expected_behavior: "Agent calls a search or file-browsing tool with both a query/pattern parameter and a directory path parameter"
+    success_criteria: |
+      Agent calls a tool with at least two parameters: a search term (containing
+      "revenue" or similar) and a path/directory parameter pointing to the corpus.
+      PASS if the tool call includes both parameters correctly.
+      FAIL if the agent calls a tool with only one parameter or wrong parameters.
+      FAIL if agent does not call any tool.
+
+expected_outcome: |
+  Agent correctly extracts multiple parameters from the natural language request
+  and passes them as structured arguments to the MCP tool.
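The `mcp_structured_params` criteria require two distinct parameters, a search term and a directory, without naming them. One way to check this without assuming parameter names is to inspect the argument values. The helper below is a hypothetical sketch, and the argument dictionaries in the example are invented.

```python
from typing import Any, Dict


def has_term_and_path(
    arguments: Dict[str, Any], term: str, directory: str
) -> bool:
    """Check that one argument carries the term and another the directory.

    Parameter names are not assumed; only string values are inspected.
    The two matches must come from different parameters.
    """
    term_keys = set()
    path_keys = set()
    for key, value in arguments.items():
        if not isinstance(value, str):
            continue
        if term.lower() in value.lower():
            term_keys.add(key)
        if directory.rstrip("/") in value:
            path_keys.add(key)
    # Require a term parameter and a path parameter that are distinct.
    return any(t != p for t in term_keys for p in path_keys)


if __name__ == "__main__":
    call = {"query": "revenue", "path": "eval/corpus/documents"}
    print(has_term_and_path(call, "revenue", "eval/corpus/documents"))
```

Requiring distinct keys rejects a call that packs both pieces into one free-text argument, which matches the "only one parameter" failure case.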
@@ -0,0 +1,29 @@
+id: mcp_introspection
+name: "Tool Introspection -- List Available Tools"
+category: mcp_reliability
+severity: critical
+description: |
+  Agent is asked what tools it has available. Tests whether the agent can
+  introspect its own capabilities and report them accurately.
+
+persona: power_user
+
+setup:
+  index_documents: []
+
+turns:
+  - turn: 1
+    objective: "Ask the agent what MCP tools it has available"
+    user_message: "What tools do you have access to? List them for me."
+    ground_truth:
+      expected_behavior: "Agent lists its available tools, including any MCP-provided tools"
+    success_criteria: |
+      Agent provides a list of tools it can use. The list should include at least
+      some tool names that are real and available in the system.
+      FAIL if agent claims to have no tools at all.
+      FAIL if agent lists only generic capabilities without naming specific tools.
+      PASS if agent mentions MCP tools or specific tool names from its registry.
+
+expected_outcome: |
+  Agent accurately reports its available tools, demonstrating awareness of its
+  MCP tool capabilities.
@@ -0,0 +1,36 @@
+id: mcp_no_tool_needed
+name: "No Tool Needed -- Tool Restraint"
+category: mcp_reliability
+severity: high
+description: |
+  Agent is asked a simple question that should NOT trigger any MCP tool call.
+  Tests that the agent exercises restraint and does not call tools unnecessarily.
+
+persona: casual_user
+
+setup:
+  index_documents: []
+
+turns:
+  - turn: 1
+    objective: "Ask a simple general knowledge question"
+    user_message: "What is the capital of Japan?"
+    ground_truth:
+      expected_answer: "Tokyo"
+    success_criteria: |
+      Agent answers "Tokyo" directly without calling any MCP tools.
+      FAIL if agent calls any MCP tool (search, file read, web fetch, etc.)
+      to answer this trivial question.
+
+  - turn: 2
+    objective: "Ask a simple arithmetic question"
+    user_message: "What is 25 times 4?"
+    ground_truth:
+      expected_answer: "100"
+    success_criteria: |
+      Agent answers "100" directly without calling any tools.
+      FAIL if any tool is invoked for basic arithmetic.
+
+expected_outcome: |
+  Agent correctly identifies that no tools are needed and answers from
+  general knowledge. Zero MCP tool calls across both turns.
@@ -0,0 +1,29 @@
+id: mcp_single_param
+name: "Single Parameter Tool Call"
+category: mcp_reliability
+severity: critical
+description: |
+  Agent is asked to perform an action that requires calling an MCP tool with
+  a single parameter. Tests parameter extraction and formatting.
+
+persona: power_user
+
+setup:
+  index_documents: []
+
+turns:
+  - turn: 1
+    objective: "Ask the agent to read a specific file using a tool"
+    user_message: "Can you read the file at eval/corpus/documents/employee_handbook.md?"
+    ground_truth:
+      expected_behavior: "Agent calls a file reading or browsing tool with the specified path as parameter"
+    success_criteria: |
+      Agent calls a tool (e.g., read_file, browse_files, or similar) with the
+      file path parameter correctly set to the requested path.
+      PASS if the tool is called with the correct path and returns file content.
+      FAIL if agent hallucinates file content without calling a tool.
+      FAIL if agent calls a tool but with a wrong or missing path parameter.
+
+expected_outcome: |
+  Agent correctly formats a single parameter and passes it to the appropriate
+  file-reading MCP tool.
@@ -0,0 +1,27 @@
+id: mcp_simple_tool_call
+name: "Simple MCP Tool Call -- No Parameters"
+category: mcp_reliability
+severity: critical
+description: |
+  Agent is asked to check the system status, which requires calling an MCP tool
+  with no parameters. Tests basic tool invocation reliability.
+
+persona: power_user
+
+setup:
+  index_documents: []
+
+turns:
+  - turn: 1
+    objective: "Ask the agent to check the current system status"
+    user_message: "What is the current system status? Use the system tools to check."
+    ground_truth:
+      expected_behavior: "Agent calls a system status or health check MCP tool and returns the result"
+    success_criteria: |
+      Agent calls an MCP tool (e.g., system_status, get_status, or similar) and returns
+      a coherent response describing system state.
+      FAIL if agent says it cannot use tools or ignores the request.
+      FAIL if agent hallucinates a response without calling any tool.
+
+expected_outcome: |
+  Agent identifies and calls the appropriate system status MCP tool and returns the result.
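All ten scenario files in this diff share the same top-level keys and per-turn keys. A structural check over an already-parsed scenario (a plain `dict`) might look like the sketch below. The key lists are inferred only from the files shown here, so they are assumptions rather than a documented schema, and `scenario_problems` is a hypothetical helper.

```python
from typing import Any, Dict, List

# Keys observed in every scenario file in this diff (inferred, not a spec).
REQUIRED_KEYS = (
    "id", "name", "category", "severity", "description",
    "persona", "setup", "turns", "expected_outcome",
)
REQUIRED_TURN_KEYS = (
    "objective", "user_message", "ground_truth", "success_criteria",
)


def scenario_problems(scenario: Dict[str, Any]) -> List[str]:
    """Return a list of structural problems; an empty list means none found."""
    problems = [
        f"missing key: {key}" for key in REQUIRED_KEYS if key not in scenario
    ]
    turns = scenario.get("turns") or []
    for position, turn in enumerate(turns, start=1):
        # Turn numbers in the files count up from 1 without gaps.
        if turn.get("turn") != position:
            problems.append(
                f"turn {position}: expected turn number {position}, "
                f"got {turn.get('turn')}"
            )
        for key in REQUIRED_TURN_KEYS:
            if key not in turn:
                problems.append(f"turn {position}: missing key: {key}")
    return problems


if __name__ == "__main__":
    incomplete = {"id": "mcp_example", "turns": [{"turn": 1}]}
    for problem in scenario_problems(incomplete):
        print(problem)
```

Returning every problem at once, rather than stopping at the first, makes it easier to fix a new scenario file in one pass.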