10 changes: 10 additions & 0 deletions docs/reference/cli.mdx
@@ -1075,6 +1075,16 @@ Evaluate results:
gaia eval -f ./results/experiment.json
```

Run agent eval benchmark:
```bash
gaia eval agent # Run all scenarios
gaia eval agent --category mcp_reliability # Run MCP reliability scenarios only
gaia eval agent --scenario mcp_simple_tool_call # Run a single scenario
gaia eval agent --iterations 5 # Run each scenario 5 times for reliability measurement
gaia eval agent --fix # Auto-fix failures with Claude Code
gaia eval agent --compare run1/scorecard.json run2/scorecard.json # Compare runs
```
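The `--iterations` flag repeats each scenario so that reliability can be measured across runs. As a rough illustration of what such a measurement could look like, the sketch below computes a per-scenario pass rate. The `pass_rate` helper and the `"PASS"`/`"FAIL"` status strings are assumptions made for this example and are not taken from the gaia codebase or its scorecard format.

```python
from collections import defaultdict


def pass_rate(results):
    """Return {scenario_id: fraction of iterations that passed}.

    `results` is a list of (scenario_id, status) pairs. The status strings
    are hypothetical; the real scorecard format may differ.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for scenario_id, status in results:
        totals[scenario_id] += 1
        if status == "PASS":
            passes[scenario_id] += 1
    return {sid: passes[sid] / totals[sid] for sid in totals}


# Five iterations of one scenario, one of which failed.
runs = [("mcp_simple_tool_call", s) for s in ["PASS", "PASS", "FAIL", "PASS", "PASS"]]
rates = pass_rate(runs)
```

A pass rate below 1.0 for a scenario marked `critical` would be a natural place to start investigating.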

Generate report:
```bash
gaia report -d ./eval_results
45 changes: 45 additions & 0 deletions eval/scenarios/mcp_reliability/complex_conditional_tool.yaml
@@ -0,0 +1,45 @@
id: mcp_conditional_tool
name: "Conditional Tool Selection -- Context-Dependent Choice"
category: mcp_reliability
severity: high
description: |
  Agent is given two turns: one where a tool is needed, and one where it is not.
  The agent must correctly decide when to use tools based on context.
  Tests adaptive tool selection across turns.

persona: power_user

setup:
  index_documents:
    - corpus_doc: employee_handbook
      path: "eval/corpus/documents/employee_handbook.md"

turns:
  - turn: 1
    objective: "Ask a question that requires searching the indexed document"
    user_message: "How many PTO days do first-year employees get according to the handbook?"
    ground_truth:
      doc_id: employee_handbook
      fact_id: pto_days
      expected_answer: "15 days"
    success_criteria: |
      Agent uses a search or query tool to look up the answer in the indexed
      employee handbook and returns "15 days" (or equivalent).
      FAIL if agent answers without using any tool.
      FAIL if agent returns the wrong number.

  - turn: 2
    objective: "Ask a follow-up that does NOT require a tool"
    user_message: "Is 15 days more or less than the industry average of 10 days?"
    ground_truth:
      expected_behavior: "Agent answers using simple reasoning from context, without making another tool call"
    success_criteria: |
      Agent correctly states that 15 days is more than 10 days using basic reasoning.
      PASS if agent answers correctly WITHOUT calling any tools (the information
      is already in the conversation context).
      PARTIAL PASS if agent re-queries the document but still answers correctly.
      FAIL if agent gives a wrong comparison.

expected_outcome: |
  Agent uses tools in turn 1 (needed) and avoids tools in turn 2 (not needed),
  demonstrating context-aware tool selection.
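The turn 2 criteria in this scenario amount to a three-way decision on answer correctness and tool usage. A minimal sketch of that decision is given below, assuming the judge already knows whether the comparison was correct and how many tool calls the agent made. The function name and its inputs are hypothetical and do not describe how the eval harness actually scores turns.

```python
def grade_followup_turn(answer_correct: bool, tool_calls: int) -> str:
    """Map the turn 2 success criteria onto PASS / PARTIAL PASS / FAIL.

    Hypothetical helper: a wrong comparison always fails; a correct answer
    passes fully only when no tool was called.
    """
    if not answer_correct:
        return "FAIL"
    return "PASS" if tool_calls == 0 else "PARTIAL PASS"


verdicts = [
    grade_followup_turn(True, 0),   # answered from context
    grade_followup_turn(True, 1),   # re-queried the handbook, still correct
    grade_followup_turn(False, 0),  # wrong comparison
]
```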
58 changes: 58 additions & 0 deletions eval/scenarios/mcp_reliability/complex_multi_doc_tools.yaml
@@ -0,0 +1,58 @@
id: mcp_multi_doc_tools
name: "Multi-Document Tool Orchestration"
category: mcp_reliability
severity: high
description: |
  Agent must retrieve facts from two different indexed documents and synthesize
  them. Tests the agent's ability to make multiple tool calls across different
  data sources and combine the results coherently.

persona: data_analyst

setup:
  index_documents:
    - corpus_doc: employee_handbook
      path: "eval/corpus/documents/employee_handbook.md"
    - corpus_doc: acme_q3_report
      path: "eval/corpus/documents/acme_q3_report.md"

turns:
  - turn: 1
    objective: "Ask for PTO policy from the employee handbook"
    user_message: "How many PTO days do first-year employees get?"
    ground_truth:
      doc_id: employee_handbook
      fact_id: pto_days
      expected_answer: "15 days"
    success_criteria: |
      Agent queries the employee handbook and returns "15 days".
      FAIL if agent returns the wrong number or does not use tools.

  - turn: 2
    objective: "Ask for financial data from a different document"
    user_message: "Now tell me what Acme Corp's Q3 revenue was."
    ground_truth:
      doc_id: acme_q3_report
      fact_id: q3_revenue
      expected_answer: "$14.2 million"
    success_criteria: |
      Agent queries the Q3 report (a different document than in turn 1) and returns
      the correct revenue figure. Must use a tool to retrieve this.
      FAIL if agent confuses data between the two documents.
      FAIL if agent returns the wrong revenue figure.

  - turn: 3
    objective: "Ask agent to combine information from both documents"
    user_message: "Give me a one-sentence summary combining the PTO policy and the Q3 revenue figure."
    ground_truth:
      expected_behavior: "Agent combines facts from both prior turns into a coherent sentence"
    success_criteria: |
      Agent produces a sentence mentioning both 15 PTO days and $14.2 million revenue.
      PASS if both facts are present and correctly attributed.
      FAIL if either fact is wrong or missing.
      Tool use is optional for this turn: calling tools or answering from context
      are both acceptable, since the information was already retrieved.

expected_outcome: |
  Agent makes tool calls across two different documents, correctly retrieves
  facts from each, and synthesizes them in a final response.
32 changes: 32 additions & 0 deletions eval/scenarios/mcp_reliability/complex_two_step_chain.yaml
@@ -0,0 +1,32 @@
id: mcp_two_step_chain
name: "Two-Step Tool Chain -- Sequential Dependency"
category: mcp_reliability
severity: critical
description: |
  Agent is asked to perform a task that requires two sequential tool calls,
  where the second call depends on the result of the first. Tests multi-step
  tool orchestration reliability.

persona: power_user

setup:
  index_documents: []

turns:
  - turn: 1
    objective: "Ask agent to find a file and then read its contents"
    user_message: "Look through the eval/corpus/documents directory, find the file about Q3 financials, and tell me the key revenue figures."
    ground_truth:
      expected_behavior: "Agent first calls a directory listing or file search tool, then calls a file reading tool on the discovered file"
    success_criteria: |
      Agent performs at least two tool calls in sequence:
      1. First: a directory listing, file search, or browse tool to find the Q3 report
      2. Second: a file reading or indexing tool to retrieve content from the found file
      PASS if both tool calls succeed and the agent reports revenue information.
      FAIL if agent only makes one tool call and guesses the rest.
      FAIL if agent makes no tool calls.
      FAIL if agent hallucinates file names without checking the directory.

expected_outcome: |
  Agent chains two tool calls: first discovers the file, then reads it.
  Demonstrates reliable multi-step tool orchestration.
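The ordering requirement in this scenario (a discovery call first, then a read call) can be checked mechanically against a tool-call trace. The sketch below assumes the trace is available as an ordered list of tool names. `browse_files` and `read_file` are the example names mentioned in the single-parameter scenario; the other names in the two sets, and the helper itself, are hypothetical.

```python
# Tool name sets are illustrative; the real registry may use different names.
DISCOVERY_TOOLS = {"list_directory", "search_files", "browse_files"}
READ_TOOLS = {"read_file", "index_document"}


def has_discover_then_read(tool_names):
    """Return True if some discovery call is followed later by a read call."""
    seen_discovery = False
    for name in tool_names:
        if name in DISCOVERY_TOOLS:
            seen_discovery = True
        elif name in READ_TOOLS and seen_discovery:
            return True
    return False


chained = has_discover_then_read(["browse_files", "read_file"])
read_only = has_discover_then_read(["read_file"])
wrong_order = has_discover_then_read(["read_file", "browse_files"])
```

A check like this only covers call order. Whether the read call targets the file that the discovery call actually returned would still need to be judged from the call arguments.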
33 changes: 33 additions & 0 deletions eval/scenarios/mcp_reliability/moderate_error_handling.yaml
@@ -0,0 +1,33 @@
id: mcp_error_handling
name: "Error Handling -- Tool Call Failure Recovery"
category: mcp_reliability
severity: high
description: |
  Agent is asked to perform an action that will trigger a tool error
  (e.g., accessing a nonexistent resource). Tests whether the agent handles
  the error gracefully and reports it to the user.

persona: casual_user

setup:
  index_documents: []

turns:
  - turn: 1
    objective: "Ask agent to access a resource that does not exist"
    user_message: "Can you read the file at /nonexistent/path/fake_report.pdf?"
    ground_truth:
      expected_behavior: "Agent attempts the tool call, receives an error, and reports it gracefully without crashing"
    success_criteria: |
      Agent attempts to call a file-reading tool with the given path.
      When the tool returns an error (file not found), the agent reports the
      error gracefully to the user.
      PASS if agent attempts the tool call AND gracefully reports the failure.
      PARTIAL PASS (score 7-8) if agent correctly states the file does not exist
      without attempting the tool call.
      FAIL if agent crashes, hangs, or fabricates content from the nonexistent file.
      FAIL if agent enters an infinite retry loop.

expected_outcome: |
  Agent handles the tool error gracefully, informs the user the file was not
  found, and remains responsive for further requests.
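The criteria for this scenario combine several observations about a single turn. One way to make the precedence between them explicit is sketched below. All parameter names are hypothetical, and the `max_retries` cap is an assumption standing in for "infinite retry loop", which cannot be observed directly in a finite trace.

```python
def grade_error_handling(attempted_tool_call, reported_not_found,
                         fabricated_content, retry_count, max_retries=3):
    """Map the error-handling criteria onto PASS / PARTIAL PASS / FAIL.

    Hypothetical helper. FAIL conditions are checked first so that a
    fabricated answer can never be rescued by an attempted tool call.
    """
    if fabricated_content or retry_count > max_retries:
        return "FAIL"
    if not reported_not_found:
        return "FAIL"
    return "PASS" if attempted_tool_call else "PARTIAL PASS"


graceful = grade_error_handling(True, True, False, retry_count=0)
no_attempt = grade_error_handling(False, True, False, retry_count=0)
made_up = grade_error_handling(True, True, True, retry_count=0)
looping = grade_error_handling(True, True, False, retry_count=10)
```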
34 changes: 34 additions & 0 deletions eval/scenarios/mcp_reliability/moderate_search_context.yaml
@@ -0,0 +1,34 @@
id: mcp_search_with_context
name: "Search with Context -- Query and Filter"
category: mcp_reliability
severity: high
description: |
  Agent is asked to search for information with specific context constraints.
  Tests the agent's ability to translate contextual requirements into
  appropriate tool parameters.

persona: data_analyst

setup:
  index_documents:
    - corpus_doc: acme_q3_report
      path: "eval/corpus/documents/acme_q3_report.md"

turns:
  - turn: 1
    objective: "Ask a question that requires the agent to search within an indexed document"
    user_message: "What was the Q3 revenue for Acme Corp?"
    ground_truth:
      doc_id: acme_q3_report
      fact_id: q3_revenue
      expected_answer: "$14.2 million"
    success_criteria: |
      Agent queries the indexed document and returns the correct Q3 revenue figure.
      The agent should use a search or query tool to retrieve the answer.
      PASS if agent returns $14.2 million (or equivalent) and used a tool to find it.
      FAIL if agent hallucinates a different number.
      FAIL if agent says it cannot find the information despite the document being indexed.

expected_outcome: |
  Agent uses its search/query tools to retrieve the answer from the indexed
  document, demonstrating reliable tool-assisted information retrieval.
30 changes: 30 additions & 0 deletions eval/scenarios/mcp_reliability/moderate_structured_params.yaml
@@ -0,0 +1,30 @@
id: mcp_structured_params
name: "Structured Parameters -- Multi-Field Tool Call"
category: mcp_reliability
severity: critical
description: |
  Agent is asked to perform an action requiring an MCP tool call with multiple
  structured parameters. Tests whether the agent correctly extracts and formats
  multiple parameters from a natural language request.

persona: power_user

setup:
  index_documents: []

turns:
  - turn: 1
    objective: "Ask the agent to search for files matching criteria"
    user_message: "Search for files containing the word 'revenue' in the eval/corpus/documents directory."
    ground_truth:
      expected_behavior: "Agent calls a search or file-browsing tool with both a query/pattern parameter and a directory path parameter"
    success_criteria: |
      Agent calls a tool with at least two parameters: a search term (containing
      "revenue" or similar) and a path/directory parameter pointing to the corpus.
      PASS if the tool call includes both parameters correctly.
      FAIL if the agent calls a tool with only one parameter or wrong parameters.
      FAIL if agent does not call any tool.

expected_outcome: |
  Agent correctly extracts multiple parameters from the natural language request
  and passes them as structured arguments to the MCP tool.
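Because the scenario does not fix the tool or its parameter names, a check for "both parameters present" has to look at argument values rather than keys. The sketch below assumes a tool call's arguments are available as a plain dictionary. The helper and the example key names `pattern` and `path` are hypothetical.

```python
def has_query_and_path(arguments, term="revenue", directory="eval/corpus/documents"):
    """Return True if one argument carries the search term and a different
    argument carries the directory path.

    Hypothetical helper: values are matched by substring, so parameter
    names do not matter.
    """
    term_keys = set()
    dir_keys = set()
    for key, value in arguments.items():
        text = str(value)
        if term.lower() in text.lower():
            term_keys.add(key)
        if directory in text.replace("\\", "/"):
            dir_keys.add(key)
    # Require two distinct parameters, as the success criteria do.
    return any(t != d for t in term_keys for d in dir_keys)


both = has_query_and_path({"pattern": "revenue", "path": "eval/corpus/documents"})
term_only = has_query_and_path({"pattern": "revenue"})
path_only = has_query_and_path({"path": "eval/corpus/documents"})
```

Substring matching is deliberately lenient here; it accepts absolute paths that end in the corpus directory and patterns such as `*revenue*`.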
29 changes: 29 additions & 0 deletions eval/scenarios/mcp_reliability/simple_introspection.yaml
@@ -0,0 +1,29 @@
id: mcp_introspection
name: "Tool Introspection -- List Available Tools"
category: mcp_reliability
severity: critical
description: |
  Agent is asked what tools it has available. Tests whether the agent can
  introspect its own capabilities and report them accurately.

persona: power_user

setup:
  index_documents: []

turns:
  - turn: 1
    objective: "Ask the agent what MCP tools it has available"
    user_message: "What tools do you have access to? List them for me."
    ground_truth:
      expected_behavior: "Agent lists its available tools, including any MCP-provided tools"
    success_criteria: |
      Agent provides a list of tools it can use. The list should include at least
      some tool names that are real and available in the system.
      FAIL if agent claims to have no tools at all.
      FAIL if agent lists only generic capabilities without naming specific tools.
      PASS if agent mentions MCP tools or specific tool names from its registry.

expected_outcome: |
  Agent accurately reports its available tools, demonstrating awareness of its
  MCP tool capabilities.
36 changes: 36 additions & 0 deletions eval/scenarios/mcp_reliability/simple_no_tool_needed.yaml
@@ -0,0 +1,36 @@
id: mcp_no_tool_needed
name: "No Tool Needed -- Tool Restraint"
category: mcp_reliability
severity: high
description: |
  Agent is asked a simple question that should NOT trigger any MCP tool call.
  Tests that the agent exercises restraint and does not call tools unnecessarily.

persona: casual_user

setup:
  index_documents: []

turns:
  - turn: 1
    objective: "Ask a simple general knowledge question"
    user_message: "What is the capital of Japan?"
    ground_truth:
      expected_answer: "Tokyo"
    success_criteria: |
      Agent answers "Tokyo" directly without calling any MCP tools.
      FAIL if agent calls any MCP tool (search, file read, web fetch, etc.)
      to answer this trivial question.

  - turn: 2
    objective: "Ask a simple arithmetic question"
    user_message: "What is 25 times 4?"
    ground_truth:
      expected_answer: "100"
    success_criteria: |
      Agent answers "100" directly without calling any tools.
      FAIL if any tool is invoked for basic arithmetic.

expected_outcome: |
  Agent correctly identifies that no tools are needed and answers from
  general knowledge. Zero MCP tool calls across both turns.
29 changes: 29 additions & 0 deletions eval/scenarios/mcp_reliability/simple_single_param.yaml
@@ -0,0 +1,29 @@
id: mcp_single_param
name: "Single Parameter Tool Call"
category: mcp_reliability
severity: critical
description: |
  Agent is asked to perform an action that requires calling an MCP tool with
  a single parameter. Tests parameter extraction and formatting.

persona: power_user

setup:
  index_documents: []

turns:
  - turn: 1
    objective: "Ask the agent to read a specific file using a tool"
    user_message: "Can you read the file at eval/corpus/documents/employee_handbook.md?"
    ground_truth:
      expected_behavior: "Agent calls a file reading or browsing tool with the specified path as parameter"
    success_criteria: |
      Agent calls a tool (e.g., read_file, browse_files, or similar) with the
      file path parameter correctly set to the requested path.
      PASS if the tool is called with the correct path and returns file content.
      FAIL if agent hallucinates file content without calling a tool.
      FAIL if agent calls a tool but with a wrong or missing path parameter.

expected_outcome: |
  Agent correctly formats a single parameter and passes it to the appropriate
  file-reading MCP tool.
27 changes: 27 additions & 0 deletions eval/scenarios/mcp_reliability/simple_tool_call.yaml
@@ -0,0 +1,27 @@
id: mcp_simple_tool_call
name: "Simple MCP Tool Call -- No Parameters"
category: mcp_reliability
severity: critical
description: |
  Agent is asked to check the system status, which requires calling an MCP tool
  with no parameters. Tests basic tool invocation reliability.

persona: power_user

setup:
  index_documents: []

turns:
  - turn: 1
    objective: "Ask the agent to check the current system status"
    user_message: "What is the current system status? Use the system tools to check."
    ground_truth:
      expected_behavior: "Agent calls a system status or health check MCP tool and returns the result"
    success_criteria: |
      Agent calls an MCP tool (e.g., system_status, get_status, or similar) and returns
      a coherent response describing system state.
      FAIL if agent says it cannot use tools or ignores the request.
      FAIL if agent hallucinates a response without calling any tool.

expected_outcome: |
  Agent identifies and calls the appropriate system status MCP tool and returns
  the result.