[Feature Request]: Create Evaluation Framework #233

@carlosmmatos

Description

Summary

Create an evaluation framework with question sets to systematically test whether AI agents can effectively use falcon-mcp tools to answer realistic cybersecurity questions. This is supplemental to E2E testing.

Problem

There's no systematic way to measure:

  • Whether AI agents can successfully use the tools
  • Tool description quality and clarity
  • FQL documentation effectiveness
  • Response format usability
  • Overall MCP server quality

Without evaluation metrics, improvements are based on intuition rather than data.

Proposed Solution

Create evaluation question sets following MCP evaluation best practices:

Question Requirements

Each question must be:

  • READ-ONLY: Only uses search/get operations, no modifications
  • INDEPENDENT: Doesn't depend on other questions
  • STABLE: Answer won't change over time (or is clearly time-bound)
  • COMPLEX: Requires 3-10+ tool calls to answer
  • REALISTIC: Reflects real cybersecurity workflows

Question Distribution

10 questions per module across 10 modules = 100 total questions

Example Questions:

<!-- Detections Module -->
<qa_pair>
    <question>Find all Windows hosts with critical severity detections (severity >= 80) 
    in the last 7 days. What hostname has the most detections?</question>
    <answer>WEB-PROD-03</answer>
</qa_pair>

<!-- Intel Module -->
<qa_pair>
    <question>Which threat actor tracked by CrowdStrike has the most associated 
    MITRE ATT&CK techniques? Provide the actor name.</question>
    <answer>SCATTERED SPIDER</answer>
</qa_pair>

<!-- Cross-Module -->
<qa_pair>
    <question>Find hosts with both open detections (severity >= 70) AND critical 
    vulnerabilities (CVSS >= 9.0). How many unique hosts match both criteria?</question>
    <answer>7</answer>
</qa_pair>
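
The per-module files listed below could collect qa_pair entries like the ones above under a single root element. A minimal loader sketch in Python, assuming a hypothetical root element wrapping the qa_pair entries (the final schema may differ):

import xml.etree.ElementTree as ET

def load_questions(path: str) -> list[dict]:
    """Parse qa_pair entries from an evaluation XML file."""
    # Assumes qa_pair elements like the examples above, wrapped in a single
    # root element (the root's name is an assumption, not a fixed schema).
    root = ET.parse(path).getroot()
    questions = []
    for pair in root.iter("qa_pair"):
        questions.append({
            "question": " ".join(pair.findtext("question", default="").split()),
            "answer": pair.findtext("answer", default="").strip(),
        })
    return questions

# Usage (hypothetical path):
# questions = load_questions("evaluations/detections_evaluation.xml")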

Evaluation Metrics

  • Accuracy: % of questions answered correctly
  • Efficiency: Average tool calls per question
  • Duration: Time to complete evaluation
  • Failure modes: Which tools/patterns cause failures
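
These metrics could be aggregated from per-question results along the following lines; the QuestionResult fields and summarize helper are illustrative sketches, not existing falcon-mcp code:

from dataclasses import dataclass, field

@dataclass
class QuestionResult:
    correct: bool
    tool_calls: int
    duration_s: float
    failed_tools: list[str] = field(default_factory=list)

def summarize(results: list[QuestionResult]) -> dict:
    """Roll per-question results up into the metrics listed above."""
    total = len(results) or 1  # avoid division by zero on an empty run
    failure_modes: dict[str, int] = {}
    for r in results:
        for tool in r.failed_tools:
            failure_modes[tool] = failure_modes.get(tool, 0) + 1
    return {
        "accuracy": sum(r.correct for r in results) / total,
        "avg_tool_calls": sum(r.tool_calls for r in results) / total,
        "total_duration_s": sum(r.duration_s for r in results),
        "failure_modes": failure_modes,
    }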

Files to Create

  • /evaluations/detections_evaluation.xml
  • /evaluations/incidents_evaluation.xml
  • /evaluations/hosts_evaluation.xml
  • /evaluations/intel_evaluation.xml
  • /evaluations/spotlight_evaluation.xml
  • /evaluations/discover_evaluation.xml
  • /evaluations/cloud_evaluation.xml
  • /evaluations/serverless_evaluation.xml
  • /evaluations/sensor_usage_evaluation.xml
  • /evaluations/idp_evaluation.xml
  • .github/workflows/evaluation.yml (optional CI integration)

Acceptance Criteria

  • 100+ evaluation questions created across all modules
  • Questions cover single-module and cross-module scenarios
  • Answers verified against real CrowdStrike environment
  • Evaluation runner script functional
  • Documentation on running evaluations
  • Target: 70%+ accuracy for passing evaluation
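
A rough shape for the evaluation runner, reusing the load_questions and summarize sketches above; ask_agent is a placeholder for whatever drives the AI agent against the falcon-mcp server, and its return shape is an assumption:

from pathlib import Path

PASS_THRESHOLD = 0.70  # 70%+ accuracy target from the acceptance criteria

def run_evaluations(eval_dir: str, ask_agent) -> bool:
    """Run every module's question set and report whether the target is met.

    ask_agent(question) stands in for the agent harness; here it is assumed
    to return (answer, tool_calls, duration_s, failed_tools).
    """
    results = []
    for xml_file in sorted(Path(eval_dir).glob("*_evaluation.xml")):
        for qa in load_questions(str(xml_file)):
            answer, tool_calls, duration_s, failed_tools = ask_agent(qa["question"])
            results.append(QuestionResult(
                # Naive exact-match scoring; a real runner would likely need
                # normalization or an LLM judge for free-form answers.
                correct=answer.strip().lower() == qa["answer"].strip().lower(),
                tool_calls=tool_calls,
                duration_s=duration_s,
                failed_tools=failed_tools,
            ))
    summary = summarize(results)
    print(summary)
    return summary["accuracy"] >= PASS_THRESHOLD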

Labels

enhancement (New feature or request), mcp-compliance (MCP protocol compliance improvements), testing (Testing improvements and frameworks)
