[Feature Request]: Create Evaluation Framework #233

@carlosmmatos

Description

Summary

Create an evaluation framework with question sets to systematically test whether AI agents can effectively use falcon-mcp tools to answer realistic cybersecurity questions. This is supplemental to E2E testing.

Problem

There's no systematic way to measure:

  • Whether AI agents can successfully use the tools
  • Tool description quality and clarity
  • FQL documentation effectiveness
  • Response format usability
  • Overall MCP server quality

Without evaluation metrics, improvements are based on intuition rather than data.

Proposed Solution

Create evaluation question sets following MCP evaluation best practices:

Question Requirements

Each question must be:

  • READ-ONLY: Only uses search/get operations, no modifications
  • INDEPENDENT: Doesn't depend on other questions
  • STABLE: Answer won't change over time (or is clearly time-bound)
  • COMPLEX: Requires 3-10+ tool calls to answer
  • REALISTIC: Reflects real cybersecurity workflows

Question Distribution

10 questions per module across 10 modules = 100 total questions

Example Questions:

<!-- Detections Module -->
<qa_pair>
    <question>Find all Windows hosts with critical severity detections (severity >= 80) 
    in the last 7 days. What hostname has the most detections?</question>
    <answer>WEB-PROD-03</answer>
</qa_pair>

<!-- Intel Module -->
<qa_pair>
    <question>Which threat actor tracked by CrowdStrike has the most associated 
    MITRE ATT&CK techniques? Provide the actor name.</question>
    <answer>SCATTERED SPIDER</answer>
</qa_pair>

<!-- Cross-Module -->
<qa_pair>
    <question>Find hosts with both open detections (severity >= 70) AND critical 
    vulnerabilities (CVSS >= 9.0). How many unique hosts match both criteria?</question>
    <answer>7</answer>
</qa_pair>
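
The per-module files listed below could collect qa_pair entries like the ones above under a single root element. A minimal loader sketch in Python, assuming a hypothetical root element wrapping the qa_pair entries (the final schema may differ):

import xml.etree.ElementTree as ET

def load_questions(path: str) -> list[dict]:
    """Parse qa_pair entries from an evaluation XML file."""
    # Assumes qa_pair elements like the examples above, wrapped in a single
    # root element (the root's name is an assumption, not a fixed schema).
    root = ET.parse(path).getroot()
    questions = []
    for pair in root.iter("qa_pair"):
        questions.append({
            "question": " ".join(pair.findtext("question", default="").split()),
            "answer": pair.findtext("answer", default="").strip(),
        })
    return questions

# Usage (hypothetical path):
# questions = load_questions("evaluations/detections_evaluation.xml")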

Evaluation Metrics

  • Accuracy: % of questions answered correctly
  • Efficiency: Average tool calls per question
  • Duration: Time to complete evaluation
  • Failure modes: Which tools/patterns cause failures
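
These metrics could be aggregated from per-question results along the following lines; the QuestionResult fields and summarize helper are illustrative sketches, not existing falcon-mcp code:

from dataclasses import dataclass, field

@dataclass
class QuestionResult:
    correct: bool
    tool_calls: int
    duration_s: float
    failed_tools: list[str] = field(default_factory=list)

def summarize(results: list[QuestionResult]) -> dict:
    """Roll per-question results up into the metrics listed above."""
    total = len(results) or 1  # avoid division by zero on an empty run
    failure_modes: dict[str, int] = {}
    for r in results:
        for tool in r.failed_tools:
            failure_modes[tool] = failure_modes.get(tool, 0) + 1
    return {
        "accuracy": sum(r.correct for r in results) / total,
        "avg_tool_calls": sum(r.tool_calls for r in results) / total,
        "total_duration_s": sum(r.duration_s for r in results),
        "failure_modes": failure_modes,
    }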

Files to Create

  • /evaluations/detections_evaluation.xml
  • /evaluations/incidents_evaluation.xml
  • /evaluations/hosts_evaluation.xml
  • /evaluations/intel_evaluation.xml
  • /evaluations/spotlight_evaluation.xml
  • /evaluations/discover_evaluation.xml
  • /evaluations/cloud_evaluation.xml
  • /evaluations/serverless_evaluation.xml
  • /evaluations/sensor_usage_evaluation.xml
  • /evaluations/idp_evaluation.xml
  • .github/workflows/evaluation.yml (optional CI integration)

Acceptance Criteria

  • 100+ evaluation questions created across all modules
  • Questions cover single-module and cross-module scenarios
  • Answers verified against real CrowdStrike environment
  • Evaluation runner script functional
  • Documentation on running evaluations
  • Target: 70%+ accuracy for passing evaluation
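
A rough shape for the evaluation runner, reusing the load_questions and summarize sketches above; ask_agent is a placeholder for whatever drives the AI agent against the falcon-mcp server, and its return shape is an assumption:

from pathlib import Path

PASS_THRESHOLD = 0.70  # 70%+ accuracy target from the acceptance criteria

def run_evaluations(eval_dir: str, ask_agent) -> bool:
    """Run every module's question set and report whether the target is met.

    ask_agent(question) stands in for the agent harness; here it is assumed
    to return (answer, tool_calls, duration_s, failed_tools).
    """
    results = []
    for xml_file in sorted(Path(eval_dir).glob("*_evaluation.xml")):
        for qa in load_questions(str(xml_file)):
            answer, tool_calls, duration_s, failed_tools = ask_agent(qa["question"])
            results.append(QuestionResult(
                # Naive exact-match scoring; a real runner would likely need
                # normalization or an LLM judge for free-form answers.
                correct=answer.strip().lower() == qa["answer"].strip().lower(),
                tool_calls=tool_calls,
                duration_s=duration_s,
                failed_tools=failed_tools,
            ))
    summary = summarize(results)
    print(summary)
    return summary["accuracy"] >= PASS_THRESHOLD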

Labels

enhancement (New feature or request), mcp-compliance (MCP protocol compliance improvements), testing (Testing improvements and frameworks)
