
Benchmarks Reporting #14374

Merged
jeo02 merged 27 commits into main from feature/benchmark-reports on Mar 20, 2026

Conversation


@jeo02 jeo02 commented Mar 5, 2026

Add benchmark report generation system

Summary

Adds an LLM-powered reporting system to the benchmarks framework. After running scenarios, you can now generate structured markdown reports that analyze execution results, tool usage, and areas for improvement.

What's new

Rich tool call tracking — Tool calls now capture arguments, duration (ms), MCP server name, and timestamps instead of just the tool name.

Report generation via Copilot SDK — A new `ReportGenerator` class (reusing the `LlmJudge` pattern) sends log data plus a markdown template to an LLM and gets back a filled-in report with:

  • Overall pass/fail statistics
  • Per-scenario narrative analysis
  • Tool usage frequency and timeline
  • Duration breakdown
  • Actionable improvement suggestions
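The prompt-assembly step behind this can be sketched roughly as follows. This is a minimal illustration under assumptions, not the actual `ReportGenerator` source: the `BuildPrompt` name comes from a later commit message, while the section headings inside the prompt are made up here.

```csharp
using System;
using System.Text.Json;

// Minimal sketch: combine the markdown report template with serialized
// benchmark results into a single prompt for the LLM to fill in.
static string BuildPrompt(string template, object results)
{
    var logsJson = JsonSerializer.Serialize(
        results, new JsonSerializerOptions { WriteIndented = true });

    return $"""
        Fill in the following report template using the benchmark data.

        ## Template
        {template}

        ## Benchmark data (JSON)
        {logsJson}
        """;
}

var prompt = BuildPrompt(
    "# Benchmark Report\n...",
    new[] { new { Scenario = "create-release-plan", Passed = true } });
Console.WriteLine(prompt.Contains("create-release-plan")); // True
```

The returned string would then be sent to the model in a single request, with the template acting as the output schema.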

Two ways to generate reports:

  • `run --all --report` — generate a report immediately after a benchmark run
  • `report <path>` — generate a report from existing `benchmark-log.json` files

Usage

```shell
# Run benchmarks and generate a report
dotnet run -- run --all --report

# Generate a report from existing logs
dotnet run -- report /path/to/workspace --output my-report.md
```

Resolves #14168. Also resolves #14171.

jeo02 and others added 2 commits March 5, 2026 11:12
- New ToolCallRecord model with rich metadata (name, args, duration, MCP server)
- Enhanced SessionExecutor to capture tool call arguments, duration, and MCP server name
- ReportGenerator class using Copilot SDK (LlmJudge pattern) for LLM-based report generation
- ReportTemplate with generalized markdown template for benchmark reports
- --report flag on run command to generate reports after benchmark execution
- report <path> subcommand to generate reports from existing log files
- Updated ExecutionResult, BenchmarkResult, ValidationContext to use ToolCallRecord
- Updated README with new CLI options and project structure

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move the report template from an inline C# string constant to
Reporting/report-template.md. The file is copied to the output
directory at build time and loaded at runtime via ReportTemplate.LoadTemplate().

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
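The build-time copy described above typically amounts to a one-line project-file entry. A sketch, assuming a plain `None` item is used (the path matches the commit message; the item grouping is an assumption):

```xml
<!-- Copy the report template next to the binaries so
     ReportTemplate.LoadTemplate() can read it at runtime.
     The None/CopyToOutputDirectory choice here is an assumption. -->
<ItemGroup>
  <None Include="Reporting\report-template.md" CopyToOutputDirectory="PreserveNewest" />
</ItemGroup>
```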
@github-actions github-actions bot added the azsdk-cli Issues related to Azure/azure-sdk-tools::tools/azsdk-cli label Mar 5, 2026
jeo02 and others added 10 commits March 5, 2026 13:33
- Extract duplicated prompt construction into BuildPrompt helper
- Use Task.WhenAll for parallel log file reads
- Inline scenarioData variable into return statement
- Remove unused cancellationToken from CallLlmAsync
- Simplify MCP server name extraction to single expression

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
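The parallel log-read refactor mentioned above follows the standard `Task.WhenAll` pattern: start every file read up front, then await them together instead of one at a time. A minimal self-contained sketch (the helper name and file names here are made up):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

// Kick off all reads, then await the whole batch at once.
static async Task<string[]> ReadLogsAsync(IEnumerable<string> paths)
{
    IEnumerable<Task<string>> reads = paths.Select(p => File.ReadAllTextAsync(p));
    return await Task.WhenAll(reads);
}

// Usage with throwaway temp files standing in for benchmark logs.
var dir = Directory.CreateTempSubdirectory();
var files = new List<string>();
for (var i = 0; i < 3; i++)
{
    var path = Path.Combine(dir.FullName, $"benchmark-log-{i}.json");
    File.WriteAllText(path, $"{{\"run\": {i}}}");
    files.Add(path);
}
var contents = await ReadLogsAsync(files);
Console.WriteLine(contents.Length); // 3
```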
Instead of manually mapping every property into anonymous objects,
just pass the BenchmarkResult directly and let the serializer handle it.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Consolidated ReportTemplate.cs into ReportGenerator as a single class
- Template is loaded once in the constructor instead of per-call
- SystemPrompt and DefaultModel are now private constants
- Removed redundant ReportTemplate.cs file

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The ConcurrentStack for pairing pre/post hooks was unnecessary complexity.
ToolArgs is available directly on PostToolUseHookInput, and duration
tracking added no value since the overall execution time is already
captured by the stopwatch.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The conversation messages from the SDK already include tool names, args,
and results. The LLM report generator reads these from the log files,
making custom tool call tracking in SessionExecutor unnecessary.
Reverted ToolCalls back to simple List<string> of tool names.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Instead of just tool names, the OnPostToolUse hook now captures
ToolName, ToolArgs, and ToolResult as anonymous objects. This gives
the LLM report generator richer context without a separate model class.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
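The anonymous-object capture described in the commit above can be sketched like this. The property names (`ToolName`, `ToolArgs`, `ToolResult`) come from the commit message; the hook signature is an assumption standing in for the SDK's actual `OnPostToolUse` input type.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Accumulate tool calls as anonymous objects; no dedicated model class.
var toolCalls = new List<object>();

// Hypothetical stand-in for the SDK's post-tool-use hook.
void OnPostToolUse(string toolName, JsonElement toolArgs, string toolResult)
{
    toolCalls.Add(new
    {
        ToolName = toolName,
        ToolArgs = toolArgs,
        ToolResult = toolResult,
    });
}

using var args = JsonDocument.Parse("{\"path\": \"spec/contoso\"}");
OnPostToolUse("azsdk_create_release_plan", args.RootElement.Clone(), "Succeeded");

// Anonymous objects serialize cleanly into the benchmark log.
Console.WriteLine(JsonSerializer.Serialize(toolCalls));
```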
@jeo02 jeo02 marked this pull request as ready for review March 5, 2026 22:29
@jeo02 jeo02 requested review from a team as code owners March 5, 2026 22:29
Copilot AI review requested due to automatic review settings March 5, 2026 22:29

Copilot AI left a comment


Pull request overview

Adds LLM-based benchmark report generation and expands tool-call logging to support richer analysis of benchmark runs.

Changes:

  • Introduces a markdown report template and a ReportGenerator that uses Copilot SDK to produce filled-in reports.
  • Adds CLI support for run --report and a new report <path> command to generate reports from existing logs.
  • Changes tool-call tracking from a list of strings to structured ToolCallRecord objects.

Reviewed changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated 6 comments.

| File | Description |
|---|---|
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Reporting/report-template.md | Adds the markdown template the LLM is expected to fill. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Reporting/ReportGenerator.cs | Implements report generation via Copilot SDK using logs/results + template. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/README.md | Documents new --report option and report command usage. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Program.cs | Wires new CLI options/command and writes generated report to disk. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ValidationContext.cs | Updates tool-call type to structured records. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ToolCallRecord.cs | Introduces a new model for capturing tool calls. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ExecutionResult.cs | Updates tool-call type and related documentation. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/BenchmarkResult.cs | Updates tool-call type to structured records. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Infrastructure/Workspace.cs | Updates execution log signature to accept structured tool calls. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Infrastructure/SessionExecutor.cs | Captures tool args/results into ToolCallRecord during execution. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Azure.Sdk.Tools.Cli.Benchmarks.csproj | Copies report template to output so it can be loaded at runtime. |
| .gitignore | Ignores generated *-report.md artifacts. |

jeo02 and others added 2 commits March 5, 2026 14:37
…nResult.cs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…t-template.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

jeo02 commented Mar 5, 2026

# Benchmark Report

**Test Run:** benchmark-20260319-191340
**Date:** 2026-03-19 19:13:40 UTC
**Report Generated:** 2026-03-19T19:13:43.289Z
**Model Used:** claude-opus-4.5


## Scenarios Executed

_List every scenario that was run._

| # | Scenario Name | Description | Tags | Runs |
|---|---|---|---|---|
| 1 | create-release-plan | Verify the agent calls azsdk_create_release_plan with appropriate context. | release-plan | 1 |

## 📊 Overall Statistics

| Metric | Value |
|---|---|
| Total Scenarios | 1 |
| Total Individual Runs | 1 |
| Overall Pass Rate | 100% (1/1) |
| Average Duration | 47.74s |
| Total Duration | 47.74s |

## 🔍 Per-Scenario Results

_For each scenario, provide a narrative summary of what happened during execution, what went well, and what went wrong. One subsection per scenario._

### Scenario 1: create-release-plan

**Description:** Verify the agent calls azsdk_create_release_plan with appropriate context.
**Tags:** release-plan
**Prompt:** "Create a release plan for the Contoso Widget Manager, no need to get it afterwards only create.
My setup has already been verified, do not run azsdk_verify_setup. Here is all the context you need:
TypeSpec project located at "specification/contosowidgetmanager/Contoso.WidgetManager".
Use service tree ID "a7f2b8e4-9c1d-4a3e-b6f9-2d8e5a7c3b1f",
product tree ID "f1a8c5d2-6e4b-4f7a-9c2d-8b5e1f3a6c9e",
target release timeline "December 2025",
API version "2022-11-01-preview",
SDK release type "beta",
and link it to the spec pull request "Azure/azure-rest-api-specs#38387"."
**Pass Rate:** 100% (1/1) | **Duration:** 00:00:47.7392746

**Validation Results:**

| Validator | Result | Message |
|---|---|---|
| Expected tool: azsdk_create_release_plan | ✅ Pass | All 1 expected tool(s) called in correct order with expected inputs. |

What Happened:
The agent successfully executed the create-release-plan scenario by making two tool calls. First, it called report_intent to log its intention of "Creating release plan", which completed in 20ms. Second, it called the MCP tool azure-sdk-mcp-azsdk_create_release_plan with all the required parameters exactly as specified in the prompt: the TypeSpec project path, service tree ID, product tree ID, target release month/year (December 2025), spec API version (2022-11-01-preview), SDK release type (beta), and the spec pull request URL. The tool call completed successfully in 12,059ms and returned a comprehensive release plan details object. The operation status returned was "Succeeded", and the tool indicated that a release plan already existed for the specified pull request (work item ID: 33286), suggesting the user should confirm whether to use the existing plan or force create a new one. The agent completed the task as instructed without attempting to retrieve the plan afterwards.

✅ What Went Well:

  • The agent correctly interpreted the prompt and avoided calling azsdk_verify_setup as instructed.
  • All required parameters were passed accurately to azsdk_create_release_plan including service tree ID, product tree ID, target release timeline, API version, SDK release type, and spec PR URL.
  • The agent used report_intent to communicate its action, demonstrating good observability practices.
  • The tool call successfully completed and returned a valid release plan with work item details.
  • The validator confirmed all expected tools were called in the correct order with expected inputs.

❌ What Went Wrong:

  • No issues detected. The scenario passed all validation checks.

## 🔧 Tool Usage Summary

_Aggregated tool call statistics across all scenarios._

### Tool Call Frequency

| Tool Name | MCP Server | Total Calls | Avg Duration (ms) | Scenarios Used In |
|---|---|---|---|---|
| report_intent | Built-in | 1 | 20 | create-release-plan |
| azure-sdk-mcp-azsdk_create_release_plan | Built-in | 1 | 12059 | create-release-plan |

### Tool Call Timeline

_For each scenario, list the sequence of tool calls made._

| Scenario | Tool Calls (in order) | Total Tool Calls |
|---|---|---|
| create-release-plan | report_intent → azure-sdk-mcp-azsdk_create_release_plan | 2 |

## 📈 Duration Report

| # | Scenario Name | Duration | Pass/Fail |
|---|---|---|---|
| 1 | create-release-plan | 00:00:47.7392746 | |

### Aggregate Duration Summary

| Metric | Value |
|---|---|
| Total Duration (all scenarios) | 47.74s |
| Longest Scenario | 47.74s (create-release-plan) |
| Shortest Scenario | 47.74s (create-release-plan) |
| Average Per Scenario | 47.74s |

## 🪙 Token Usage

### Per-Scenario Token Usage

| # | Scenario Name | Input Tokens | Output Tokens | Cache Read | Cache Write | Total Tokens |
|---|---|---|---|---|---|---|
| 1 | create-release-plan | 90,848 | 616 | 0 | 0 | 91,464 |

### Aggregate Token Usage

| Metric | Value |
|---|---|
| Total Input Tokens | 90,848 |
| Total Output Tokens | 616 |
| Total Cache Read Tokens | 0 |
| Total Cache Write Tokens | 0 |
| Grand Total Tokens | 91,464 |

## 🔑 Areas for Improvement

_Actionable suggestions based on problems discovered during testing. Each item should identify the problem, cite supporting evidence from test results, and propose a concrete fix or investigation._

  1. Token Efficiency — The scenario consumed 90,848 input tokens for a relatively straightforward task of creating a release plan. With a 100% pass rate and simple validation requirements, this suggests the agent may be receiving excessive context or making redundant tool queries. Investigation should focus on optimizing the prompt context size and ensuring the MCP tool responses are concise. Consider implementing prompt caching strategies to reduce repeated context transmission in future runs.

  2. Execution Duration — The azsdk_create_release_plan tool call took 12,059ms (over 12 seconds) to complete, which represents 99.8% of the total scenario duration. While the operation succeeded, this latency could impact user experience in production scenarios. Investigate whether the backend service can be optimized, or if there are opportunities to provide progress feedback during long-running operations.

  3. Duplicate Release Plan Handling — The tool returned a message indicating a release plan already exists for the specified pull request, but the agent completed without explicitly handling this case or prompting the user as suggested in the next_steps field. Future improvements should include logic to detect pre-existing release plans and either surface this to the user or implement the force-create option when appropriate.


Report generated on 2026-03-19T19:13:43.289Z — 1 scenario(s) across 1 total run(s).

jeo02 and others added 2 commits March 5, 2026 14:40
The parameter represents the model used during the benchmark run,
not the model used to generate the report (which is always DefaultModel).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ancellationToken

ToolCallRecord now captures duration (from SDK pre/post hook timestamps),
MCP server name (from tool name convention), and UTC timestamp.
CancellationToken is threaded through GenerateAsync -> CallLlmAsync
into all SDK calls (CreateSessionAsync, SendAndWaitAsync, GetMessagesAsync).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
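The record shape described in the commit above might look roughly like this. The captured fields follow the commit message (duration, MCP server derived from the tool-name convention, UTC timestamp), but the exact property names and the `<server>-<tool>` parsing rule are assumptions.

```csharp
using System;

var name = "azure-sdk-mcp-azsdk_create_release_plan";
var record = new ToolCallRecord(
    ToolName: name,
    McpServer: ToolCallRecord.ServerFromName(name),
    DurationMs: 12059,
    TimestampUtc: DateTimeOffset.UtcNow);

Console.WriteLine(record.McpServer); // azure-sdk-mcp

// Sketch of a record carrying the richer tool-call metadata.
public sealed record ToolCallRecord(
    string ToolName,
    string? McpServer,
    double DurationMs,
    DateTimeOffset TimestampUtc)
{
    // Assume an MCP tool is named "<server>-<tool>", where the tool
    // segment starts after the last '-'; plain tools have no server.
    public static string? ServerFromName(string fullName)
    {
        var idx = fullName.LastIndexOf('-');
        return idx > 0 ? fullName[..idx] : null;
    }
}
```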

jeo02 commented Mar 18, 2026

Added token usage reporting as well.


jeo02 commented Mar 18, 2026

Need to merge the Eval Migration PR first because ToolCallRecord will have merge conflicts.

@jeo02 jeo02 merged commit 5e7ed57 into main Mar 20, 2026
12 checks passed
@jeo02 jeo02 deleted the feature/benchmark-reports branch March 20, 2026 17:34