Summary
Create an evaluation framework with question sets to systematically test whether AI agents can effectively use falcon-mcp tools to answer realistic cybersecurity questions. This is supplemental to E2E testing.
Problem
There's no systematic way to measure:
- Whether AI agents can successfully use the tools
- Tool description quality and clarity
- FQL documentation effectiveness
- Response format usability
- Overall MCP server quality
Without evaluation metrics, improvements are based on intuition rather than data.
Proposed Solution
Create evaluation question sets following MCP evaluation best practices:
Question Requirements
Each question must be:
- READ-ONLY: Only uses search/get operations, no modifications
- INDEPENDENT: Doesn't depend on other questions
- STABLE: Answer won't change over time (or is clearly time-bound)
- COMPLEX: Requires 3-10+ tool calls to answer
- REALISTIC: Reflects real cybersecurity workflows
Question Distribution
10 questions per module across 10 modules = 100 total questions
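These requirements and the per-module count lend themselves to an automated sanity check. The sketch below is an assumption about supporting tooling, not part of falcon-mcp: it flags questions that look like they request modifications using an illustrative (not exhaustive) keyword list, and verifies the 10-question count.

```python
# Hypothetical sanity check over (question, answer) pairs; the keyword list
# and the 10-question count mirror the requirements above.
MUTATING_KEYWORDS = ("delete", "update", "create", "quarantine", "remediate")
QUESTIONS_PER_MODULE = 10


def check_question_set(pairs: list[tuple[str, str]]) -> list[str]:
    """Return a list of problems found in a module's question set."""
    problems = []
    if len(pairs) != QUESTIONS_PER_MODULE:
        problems.append(f"expected {QUESTIONS_PER_MODULE} questions, found {len(pairs)}")
    for question, answer in pairs:
        lowered = question.lower()
        if any(word in lowered for word in MUTATING_KEYWORDS):
            problems.append(f"possibly not read-only: {question[:60]!r}")
        if not answer.strip():
            problems.append(f"missing answer for: {question[:60]!r}")
    return problems
```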
Example Questions:
<!-- Detections Module -->
<qa_pair>
<question>Find all Windows hosts with critical severity detections (severity >= 80)
in the last 7 days. What hostname has the most detections?</question>
<answer>WEB-PROD-03</answer>
</qa_pair>
<!-- Intel Module -->
<qa_pair>
<question>Which threat actor tracked by CrowdStrike has the most associated
MITRE ATT&CK techniques? Provide the actor name.</question>
<answer>SCATTERED SPIDER</answer>
</qa_pair>
<!-- Cross-Module -->
<qa_pair>
<question>Find hosts with both open detections (severity >= 70) AND critical
vulnerabilities (CVSS >= 9.0). How many unique hosts match both criteria?</question>
<answer>7</answer>
</qa_pair>
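A small loader could consume these files. The sketch below assumes each `*_evaluation.xml` wraps its entries in a single root element (named `<evaluation>` here purely for illustration); only the `<qa_pair>`, `<question>`, and `<answer>` tags come from the examples above, the rest of the schema is open.

```python
# Sketch of a question-set loader; the <evaluation> root element is an assumption.
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from pathlib import Path


@dataclass
class QAPair:
    question: str
    answer: str


def load_question_set(path: Path) -> list[QAPair]:
    root = ET.parse(path).getroot()
    return [
        QAPair(
            question=(node.findtext("question") or "").strip(),
            answer=(node.findtext("answer") or "").strip(),
        )
        for node in root.iter("qa_pair")
    ]
```

With that shape, something like `load_question_set(Path("evaluations/detections_evaluation.xml"))` would feed the runner one module at a time.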
Evaluation Metrics
- Accuracy: % of questions answered correctly
- Efficiency: Average tool calls per question
- Duration: Time to complete evaluation
- Failure modes: Which tools/patterns cause failures
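One way to aggregate these four metrics is from per-question records. The `EvalResult` fields below are assumptions about what the harness would log for each question, not an existing falcon-mcp API.

```python
# Hypothetical per-question record and aggregation of the metrics listed above
# (accuracy, efficiency, duration, failure modes).
from collections import Counter
from dataclasses import dataclass


@dataclass
class EvalResult:
    question: str
    correct: bool
    tool_calls: int
    seconds: float
    failed_tool: str | None = None  # name of the tool that errored, if any


def summarize(results: list[EvalResult]) -> dict:
    total = len(results) or 1  # avoid division by zero on an empty run
    return {
        "accuracy": sum(r.correct for r in results) / total,
        "avg_tool_calls": sum(r.tool_calls for r in results) / total,
        "total_seconds": sum(r.seconds for r in results),
        "failure_modes": Counter(
            r.failed_tool for r in results if r.failed_tool
        ).most_common(),
    }
```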
Files to Create
/evaluations/detections_evaluation.xml
/evaluations/incidents_evaluation.xml
/evaluations/hosts_evaluation.xml
/evaluations/intel_evaluation.xml
/evaluations/spotlight_evaluation.xml
/evaluations/discover_evaluation.xml
/evaluations/cloud_evaluation.xml
/evaluations/serverless_evaluation.xml
/evaluations/sensor_usage_evaluation.xml
/evaluations/idp_evaluation.xml
.github/workflows/evaluation.yml (optional CI integration)
Acceptance Criteria
- Question set files exist for all 10 modules, with 10 questions each (100 total)
- Every question is read-only, independent, stable, complex (3-10+ tool calls), and realistic
- An evaluation run reports accuracy, average tool calls per question, duration, and failure modes
- (Optional) evaluation runs in CI via .github/workflows/evaluation.yml