- New ToolCallRecord model with rich metadata (name, args, duration, MCP server)
- Enhanced SessionExecutor to capture tool call arguments, duration, and MCP server name
- ReportGenerator class using Copilot SDK (LlmJudge pattern) for LLM-based report generation
- ReportTemplate with generalized markdown template for benchmark reports
- --report flag on run command to generate reports after benchmark execution
- report <path> subcommand to generate reports from existing log files
- Updated ExecutionResult, BenchmarkResult, ValidationContext to use ToolCallRecord
- Updated README with new CLI options and project structure

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move the report template from an inline C# string constant to Reporting/report-template.md. The file is copied to the output directory at build time and loaded at runtime via ReportTemplate.LoadTemplate(). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Extract duplicated prompt construction into BuildPrompt helper
- Use Task.WhenAll for parallel log file reads
- Inline scenarioData variable into return statement
- Remove unused cancellationToken from CallLlmAsync
- Simplify MCP server name extraction to single expression

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
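A minimal sketch of the parallel log read mentioned in this commit, using only standard .NET calls. The helper name `ReadLogsAsync` and the temp-file demo are illustrative assumptions, not the code from the PR.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical helper: starts one read per path, then awaits them together.
// Task.WhenAll returns the results in the same order as the input tasks.
static Task<string[]> ReadLogsAsync(IEnumerable<string> paths) =>
    Task.WhenAll(paths.Select(path => File.ReadAllTextAsync(path)));

// Demo with two throwaway files in a temp directory.
var dir = Directory.CreateDirectory(
    Path.Combine(Path.GetTempPath(), Path.GetRandomFileName())).FullName;
var paths = new[] { "a", "b" }
    .Select(name => Path.Combine(dir, $"{name}-benchmark-log.json"))
    .ToArray();
for (var i = 0; i < paths.Length; i++)
{
    File.WriteAllText(paths[i], $"log {i}");
}

var logs = await ReadLogsAsync(paths);
Console.WriteLine(string.Join(Environment.NewLine, logs));

Directory.Delete(dir, recursive: true);
```

Compared with awaiting each file in a loop, all reads are in flight at once, which mainly helps when there are many log files.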
Instead of manually mapping every property into anonymous objects, just pass the BenchmarkResult directly and let the serializer handle it. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
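The difference described here can be sketched with System.Text.Json. The `BenchmarkResult` record below is a hypothetical stand-in with made-up properties; the real model in the PR has its own shape.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

var result = new BenchmarkResult(
    "create-release-plan",
    true,
    new List<string> { "azsdk_create_release_plan" });

// Before: a hand-mapped anonymous object that must be kept in sync
// with the model whenever a property is added or renamed.
var manual = JsonSerializer.Serialize(
    new { result.ScenarioName, result.Passed, result.ToolCalls });

// After: pass the model itself and let the serializer walk its public properties.
var direct = JsonSerializer.Serialize(result);

Console.WriteLine(manual);
Console.WriteLine(direct);

// Hypothetical stand-in for the real BenchmarkResult model (assumed properties).
record BenchmarkResult(string ScenarioName, bool Passed, List<string> ToolCalls);
```

Passing the model directly also means new properties reach the report prompt without touching the generator.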
- Consolidated ReportTemplate.cs into ReportGenerator as a single class
- Template is loaded once in the constructor instead of per-call
- SystemPrompt and DefaultModel are now private constants
- Removed redundant ReportTemplate.cs file

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
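A rough sketch of the load-once pattern this commit describes. The class name `ReportGeneratorSketch`, the constructor parameter, and the prompt layout are assumptions for illustration. Only the template path `Reporting/report-template.md` and the `BuildPrompt` name come from the commits above.

```csharp
using System;
using System.IO;

// Demo setup: fake an output directory containing the copied template.
var baseDir = Directory.CreateDirectory(
    Path.Combine(Path.GetTempPath(), Path.GetRandomFileName())).FullName;
Directory.CreateDirectory(Path.Combine(baseDir, "Reporting"));
File.WriteAllText(
    Path.Combine(baseDir, "Reporting", "report-template.md"), "Benchmark Report");

// The template file is read here, once, not on every BuildPrompt call.
var generator = new ReportGeneratorSketch(baseDir);
var prompt = generator.BuildPrompt("{\"scenario\":\"create-release-plan\"}");
Console.WriteLine(prompt);

Directory.Delete(baseDir, recursive: true);

// Hypothetical sketch, not the ReportGenerator from the PR.
sealed class ReportGeneratorSketch
{
    private readonly string _template;

    public ReportGeneratorSketch(string baseDirectory)
    {
        _template = File.ReadAllText(
            Path.Combine(baseDirectory, "Reporting", "report-template.md"));
    }

    // Assumed prompt layout: template first, then the raw log data.
    public string BuildPrompt(string logJson) => $"{_template}\n\n{logJson}";
}
```

In a real application the base directory would likely be `AppContext.BaseDirectory`, since the template is copied next to the built binaries.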
The ConcurrentStack for pairing pre/post hooks was unnecessary complexity. ToolArgs is available directly on PostToolUseHookInput, and duration tracking added no value since the overall execution time is already captured by the stopwatch. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The conversation messages from the SDK already include tool names, args, and results. The LLM report generator reads these from the log files, making custom tool call tracking in SessionExecutor unnecessary. Reverted ToolCalls back to simple List<string> of tool names. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Instead of just tool names, the OnPostToolUse hook now captures ToolName, ToolArgs, and ToolResult as anonymous objects. This gives the LLM report generator richer context without a separate model class. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
Adds LLM-based benchmark report generation and expands tool-call logging to support richer analysis of benchmark runs.
Changes:
- Introduces a markdown report template and a `ReportGenerator` that uses Copilot SDK to produce filled-in reports.
- Adds CLI support for `run --report` and a new `report <path>` command to generate reports from existing logs.
- Changes tool-call tracking from a list of strings to structured `ToolCallRecord` objects.
Reviewed changes
Copilot reviewed 11 out of 12 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Reporting/report-template.md | Adds the markdown template the LLM is expected to fill. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Reporting/ReportGenerator.cs | Implements report generation via Copilot SDK using logs/results + template. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/README.md | Documents new --report option and report command usage. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Program.cs | Wires new CLI options/command and writes generated report to disk. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ValidationContext.cs | Updates tool-call type to structured records. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ToolCallRecord.cs | Introduces a new model for capturing tool calls. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ExecutionResult.cs | Updates tool-call type and related documentation. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/BenchmarkResult.cs | Updates tool-call type to structured records. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Infrastructure/Workspace.cs | Updates execution log signature to accept structured tool calls. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Infrastructure/SessionExecutor.cs | Captures tool args/results into ToolCallRecord during execution. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Azure.Sdk.Tools.Cli.Benchmarks.csproj | Copies report template to output so it can be loaded at runtime. |
| .gitignore | Ignores generated *-report.md artifacts. |
…nResult.cs Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…t-template.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Benchmark Report
Test Run:
Scenarios Executed
List every scenario that was run.
📊 Overall Statistics
🔍 Per-Scenario Results
For each scenario, provide a narrative summary of what happened during execution,
Scenario 1: create-release-plan
Description: Verify the agent calls azsdk_create_release_plan with appropriate context.
Validation Results:
What Happened:
✅ What Went Well:
❌ What Went Wrong:
🔧 Tool Usage Summary
Aggregated tool call statistics across all scenarios.
Tool Call Frequency
Tool Call Timeline
For each scenario, list the sequence of tool calls made.
📈 Duration Report
Aggregate Duration Summary
🪙 Token Usage
Per-Scenario Token Usage
Aggregate Token Usage
🔑 Areas for Improvement
Actionable suggestions based on problems discovered during testing. Each item should
Report generated on 2026-03-19T19:13:43.289Z, 1 scenario(s) across 1 total run(s).
The parameter represents the model used during the benchmark run, not the model used to generate the report (which is always DefaultModel). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ancellationToken ToolCallRecord now captures duration (from SDK pre/post hook timestamps), MCP server name (from tool name convention), and UTC timestamp. CancellationToken is threaded through GenerateAsync -> CallLlmAsync into all SDK calls (CreateSessionAsync, SendAndWaitAsync, GetMessagesAsync). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
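As a hedged illustration of the record described in this commit: the property names, the `server/tool` naming convention, and the `/` separator below are all assumptions, since the PR text does not spell out the actual convention or model shape. Duration is derived from the difference between a pre-hook and a post-hook timestamp.

```csharp
#nullable enable
using System;

// Hypothetical convention: "<server>/<tool>". Returns null when the tool name
// has no server prefix. The real separator used by the SDK may differ.
static string? ExtractMcpServerName(string toolName, char separator)
{
    var index = toolName.IndexOf(separator);
    return index > 0 ? toolName[..index] : null;
}

// Simulated pre/post hook timestamps, 250 ms apart.
var preHook = new DateTimeOffset(2026, 3, 19, 19, 13, 40, TimeSpan.Zero);
var postHook = preHook.AddMilliseconds(250);

const string toolName = "example-server/azsdk_create_release_plan";
var call = new ToolCallRecord(
    ToolName: toolName,
    ToolArgs: "{}",
    DurationMs: (postHook - preHook).TotalMilliseconds,
    McpServer: ExtractMcpServerName(toolName, '/'),
    TimestampUtc: preHook);

Console.WriteLine(call);

// Hypothetical shape; the real ToolCallRecord in the PR may differ.
record ToolCallRecord(
    string ToolName,
    string? ToolArgs,
    double DurationMs,
    string? McpServer,
    DateTimeOffset TimestampUtc);
```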
Added token count usage as well.
Need to merge the Eval Migration PR first because
Add benchmark report generation system
Summary
Adds an LLM-powered reporting system to the benchmarks framework. After running scenarios, you can now generate structured markdown reports that analyze execution results, tool usage, and areas for improvement.
What's new
Rich tool call tracking: Tool calls now capture arguments, duration (ms), MCP server name, and timestamps instead of just the tool name.
Report generation via Copilot SDK: A new `ReportGenerator` class (reusing the `LlmJudge` pattern) sends log data plus a markdown template to an LLM and gets back a filled-in report.
Two ways to generate reports:
- `run --all --report`: generate a report immediately after a benchmark run
- `report <path>`: generate a report from existing `benchmark-log.json` files
Usage
Resolves #14168 and resolves #14171