
Benchmarks Reporting #14374

Merged
jeo02 merged 27 commits into main from feature/benchmark-reports on Mar 20, 2026

Conversation


@jeo02 jeo02 commented Mar 5, 2026

Add benchmark report generation system

Summary

Adds an LLM-powered reporting system to the benchmarks framework. After running scenarios, you can now generate structured markdown reports that analyze execution results, tool usage, and areas for improvement.

What's new

Rich tool call tracking — Tool calls now capture arguments, duration (ms), MCP server name, and timestamps instead of just the tool name.

Report generation via Copilot SDK — A new `ReportGenerator` class (reusing the `LlmJudge` pattern) sends log data plus a markdown template to an LLM and gets back a filled-in report with:

  • Overall pass/fail statistics
  • Per-scenario narrative analysis
  • Tool usage frequency and timeline
  • Duration breakdown
  • Actionable improvement suggestions
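The prompt-assembly step behind this can be sketched roughly as follows. This is a minimal illustration under assumptions, not the actual `ReportGenerator` source: the `BuildPrompt` name comes from a later commit message, while the section headings inside the prompt are made up here.

```csharp
using System;
using System.Text.Json;

// Minimal sketch: combine the markdown report template with serialized
// benchmark results into a single prompt for the LLM to fill in.
static string BuildPrompt(string template, object results)
{
    var logsJson = JsonSerializer.Serialize(
        results, new JsonSerializerOptions { WriteIndented = true });

    return $"""
        Fill in the following report template using the benchmark data.

        ## Template
        {template}

        ## Benchmark data (JSON)
        {logsJson}
        """;
}

var prompt = BuildPrompt(
    "# Benchmark Report\n...",
    new[] { new { Scenario = "create-release-plan", Passed = true } });
Console.WriteLine(prompt.Contains("create-release-plan")); // True
```

The returned string would then be sent to the model in a single request, with the template acting as the output schema.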

Two ways to generate reports:

  • `run --all --report` — generate a report immediately after a benchmark run
  • `report <path>` — generate a report from existing `benchmark-log.json` files

Usage

```shell
# Run benchmarks and generate a report
dotnet run -- run --all --report

# Generate a report from existing logs
dotnet run -- report /path/to/workspace --output my-report.md
```

Resolves #14168. Also resolves #14171.

jeo02 and others added 2 commits March 5, 2026 11:12
- New ToolCallRecord model with rich metadata (name, args, duration, MCP server)
- Enhanced SessionExecutor to capture tool call arguments, duration, and MCP server name
- ReportGenerator class using Copilot SDK (LlmJudge pattern) for LLM-based report generation
- ReportTemplate with generalized markdown template for benchmark reports
- --report flag on run command to generate reports after benchmark execution
- report <path> subcommand to generate reports from existing log files
- Updated ExecutionResult, BenchmarkResult, ValidationContext to use ToolCallRecord
- Updated README with new CLI options and project structure

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move the report template from an inline C# string constant to
Reporting/report-template.md. The file is copied to the output
directory at build time and loaded at runtime via ReportTemplate.LoadTemplate().

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
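The build-time copy described above typically amounts to a one-line project-file entry. A sketch, assuming a plain `None` item is used (the path matches the commit message; the item grouping is an assumption):

```xml
<!-- Copy the report template next to the binaries so
     ReportTemplate.LoadTemplate() can read it at runtime.
     The None/CopyToOutputDirectory choice here is an assumption. -->
<ItemGroup>
  <None Include="Reporting\report-template.md" CopyToOutputDirectory="PreserveNewest" />
</ItemGroup>
```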
@github-actions github-actions bot added the azsdk-cli Issues related to Azure/azure-sdk-tools::tools/azsdk-cli label Mar 5, 2026
jeo02 and others added 10 commits March 5, 2026 13:33
- Extract duplicated prompt construction into BuildPrompt helper
- Use Task.WhenAll for parallel log file reads
- Inline scenarioData variable into return statement
- Remove unused cancellationToken from CallLlmAsync
- Simplify MCP server name extraction to single expression

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
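The parallel log-read refactor mentioned above follows the standard `Task.WhenAll` pattern: start every file read up front, then await them together instead of one at a time. A minimal self-contained sketch (the helper name and file names here are made up):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

// Kick off all reads, then await the whole batch at once.
static async Task<string[]> ReadLogsAsync(IEnumerable<string> paths)
{
    IEnumerable<Task<string>> reads = paths.Select(p => File.ReadAllTextAsync(p));
    return await Task.WhenAll(reads);
}

// Usage with throwaway temp files standing in for benchmark logs.
var dir = Directory.CreateTempSubdirectory();
var files = new List<string>();
for (var i = 0; i < 3; i++)
{
    var path = Path.Combine(dir.FullName, $"benchmark-log-{i}.json");
    File.WriteAllText(path, $"{{\"run\": {i}}}");
    files.Add(path);
}
var contents = await ReadLogsAsync(files);
Console.WriteLine(contents.Length); // 3
```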
Instead of manually mapping every property into anonymous objects,
just pass the BenchmarkResult directly and let the serializer handle it.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Consolidated ReportTemplate.cs into ReportGenerator as a single class
- Template is loaded once in the constructor instead of per-call
- SystemPrompt and DefaultModel are now private constants
- Removed redundant ReportTemplate.cs file

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The ConcurrentStack for pairing pre/post hooks was unnecessary complexity.
ToolArgs is available directly on PostToolUseHookInput, and duration
tracking added no value since the overall execution time is already
captured by the stopwatch.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The conversation messages from the SDK already include tool names, args,
and results. The LLM report generator reads these from the log files,
making custom tool call tracking in SessionExecutor unnecessary.
Reverted ToolCalls back to simple List<string> of tool names.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Instead of just tool names, the OnPostToolUse hook now captures
ToolName, ToolArgs, and ToolResult as anonymous objects. This gives
the LLM report generator richer context without a separate model class.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
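The anonymous-object capture described in the commit above can be sketched like this. The property names (`ToolName`, `ToolArgs`, `ToolResult`) come from the commit message; the hook signature is an assumption standing in for the SDK's actual `OnPostToolUse` input type.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Accumulate tool calls as anonymous objects; no dedicated model class.
var toolCalls = new List<object>();

// Hypothetical stand-in for the SDK's post-tool-use hook.
void OnPostToolUse(string toolName, JsonElement toolArgs, string toolResult)
{
    toolCalls.Add(new
    {
        ToolName = toolName,
        ToolArgs = toolArgs,
        ToolResult = toolResult,
    });
}

using var args = JsonDocument.Parse("{\"path\": \"spec/contoso\"}");
OnPostToolUse("azsdk_create_release_plan", args.RootElement.Clone(), "Succeeded");

// Anonymous objects serialize cleanly into the benchmark log.
Console.WriteLine(JsonSerializer.Serialize(toolCalls));
```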
@jeo02 jeo02 marked this pull request as ready for review March 5, 2026 22:29
@jeo02 jeo02 requested review from a team as code owners March 5, 2026 22:29
Copilot AI review requested due to automatic review settings March 5, 2026 22:29

Copilot AI left a comment


Pull request overview

Adds LLM-based benchmark report generation and expands tool-call logging to support richer analysis of benchmark runs.

Changes:

  • Introduces a markdown report template and a ReportGenerator that uses Copilot SDK to produce filled-in reports.
  • Adds CLI support for run --report and a new report <path> command to generate reports from existing logs.
  • Changes tool-call tracking from a list of strings to structured ToolCallRecord objects.

Reviewed changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated 6 comments.

| File | Description |
|---|---|
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Reporting/report-template.md | Adds the markdown template the LLM is expected to fill. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Reporting/ReportGenerator.cs | Implements report generation via Copilot SDK using logs/results + template. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/README.md | Documents new --report option and report command usage. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Program.cs | Wires new CLI options/command and writes generated report to disk. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ValidationContext.cs | Updates tool-call type to structured records. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ToolCallRecord.cs | Introduces a new model for capturing tool calls. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ExecutionResult.cs | Updates tool-call type and related documentation. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/BenchmarkResult.cs | Updates tool-call type to structured records. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Infrastructure/Workspace.cs | Updates execution log signature to accept structured tool calls. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Infrastructure/SessionExecutor.cs | Captures tool args/results into ToolCallRecord during execution. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Azure.Sdk.Tools.Cli.Benchmarks.csproj | Copies report template to output so it can be loaded at runtime. |
| .gitignore | Ignores generated *-report.md artifacts. |

jeo02 and others added 2 commits March 5, 2026 14:37
…nResult.cs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…t-template.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

jeo02 commented Mar 5, 2026

# Benchmark Report

**Test Run:** benchmark-20260319-191340
**Date:** 2026-03-19 19:13:40 UTC
**Report Generated:** 2026-03-19T19:13:43.289Z
**Model Used:** claude-opus-4.5


## Scenarios Executed

_List every scenario that was run._

| # | Scenario Name | Description | Tags | Runs |
|---|---|---|---|---|
| 1 | create-release-plan | Verify the agent calls azsdk_create_release_plan with appropriate context. | release-plan | 1 |

## 📊 Overall Statistics

| Metric | Value |
|---|---|
| Total Scenarios | 1 |
| Total Individual Runs | 1 |
| Overall Pass Rate | 100% (1/1) |
| Average Duration | 47.74s |
| Total Duration | 47.74s |

## 🔍 Per-Scenario Results

_For each scenario, provide a narrative summary of what happened during execution, what went well, and what went wrong. One subsection per scenario._

### Scenario 1: create-release-plan

**Description:** Verify the agent calls azsdk_create_release_plan with appropriate context.
**Tags:** release-plan
**Prompt:** "Create a release plan for the Contoso Widget Manager, no need to get it afterwards only create.
My setup has already been verified, do not run azsdk_verify_setup. Here is all the context you need:
TypeSpec project located at "specification/contosowidgetmanager/Contoso.WidgetManager".
Use service tree ID "a7f2b8e4-9c1d-4a3e-b6f9-2d8e5a7c3b1f",
product tree ID "f1a8c5d2-6e4b-4f7a-9c2d-8b5e1f3a6c9e",
target release timeline "December 2025",
API version "2022-11-01-preview",
SDK release type "beta",
and link it to the spec pull request "Azure/azure-rest-api-specs#38387"."
**Pass Rate:** 100% (1/1) | **Duration:** 00:00:47.7392746

**Validation Results:**

| Validator | Result | Message |
|---|---|---|
| Expected tool: azsdk_create_release_plan | ✅ Pass | All 1 expected tool(s) called in correct order with expected inputs. |

What Happened:
The agent successfully executed the create-release-plan scenario by making two tool calls. First, it called report_intent to log its intention of "Creating release plan", which completed in 20ms. Second, it called the MCP tool azure-sdk-mcp-azsdk_create_release_plan with all the required parameters exactly as specified in the prompt: the TypeSpec project path, service tree ID, product tree ID, target release month/year (December 2025), spec API version (2022-11-01-preview), SDK release type (beta), and the spec pull request URL. The tool call completed successfully in 12,059ms and returned a comprehensive release plan details object. The operation status returned was "Succeeded", and the tool indicated that a release plan already existed for the specified pull request (work item ID: 33286), suggesting the user should confirm whether to use the existing plan or force create a new one. The agent completed the task as instructed without attempting to retrieve the plan afterwards.

✅ What Went Well:

  • The agent correctly interpreted the prompt and avoided calling azsdk_verify_setup as instructed.
  • All required parameters were passed accurately to azsdk_create_release_plan including service tree ID, product tree ID, target release timeline, API version, SDK release type, and spec PR URL.
  • The agent used report_intent to communicate its action, demonstrating good observability practices.
  • The tool call successfully completed and returned a valid release plan with work item details.
  • The validator confirmed all expected tools were called in the correct order with expected inputs.

❌ What Went Wrong:

  • No issues detected. The scenario passed all validation checks.

## 🔧 Tool Usage Summary

_Aggregated tool call statistics across all scenarios._

### Tool Call Frequency

| Tool Name | MCP Server | Total Calls | Avg Duration (ms) | Scenarios Used In |
|---|---|---|---|---|
| report_intent | Built-in | 1 | 20 | create-release-plan |
| azure-sdk-mcp-azsdk_create_release_plan | Built-in | 1 | 12059 | create-release-plan |

### Tool Call Timeline

_For each scenario, list the sequence of tool calls made._

| Scenario | Tool Calls (in order) | Total Tool Calls |
|---|---|---|
| create-release-plan | report_intent → azure-sdk-mcp-azsdk_create_release_plan | 2 |

## 📈 Duration Report

| # | Scenario Name | Duration | Pass/Fail |
|---|---|---|---|
| 1 | create-release-plan | 00:00:47.7392746 | |

### Aggregate Duration Summary

| Metric | Value |
|---|---|
| Total Duration (all scenarios) | 47.74s |
| Longest Scenario | 47.74s (create-release-plan) |
| Shortest Scenario | 47.74s (create-release-plan) |
| Average Per Scenario | 47.74s |

## 🪙 Token Usage

### Per-Scenario Token Usage

| # | Scenario Name | Input Tokens | Output Tokens | Cache Read | Cache Write | Total Tokens |
|---|---|---|---|---|---|---|
| 1 | create-release-plan | 90,848 | 616 | 0 | 0 | 91,464 |

### Aggregate Token Usage

| Metric | Value |
|---|---|
| Total Input Tokens | 90,848 |
| Total Output Tokens | 616 |
| Total Cache Read Tokens | 0 |
| Total Cache Write Tokens | 0 |
| Grand Total Tokens | 91,464 |

## 🔑 Areas for Improvement

_Actionable suggestions based on problems discovered during testing. Each item should identify the problem, cite supporting evidence from test results, and propose a concrete fix or investigation._

  1. Token Efficiency — The scenario consumed 90,848 input tokens for a relatively straightforward task of creating a release plan. With a 100% pass rate and simple validation requirements, this suggests the agent may be receiving excessive context or making redundant tool queries. Investigation should focus on optimizing the prompt context size and ensuring the MCP tool responses are concise. Consider implementing prompt caching strategies to reduce repeated context transmission in future runs.

  2. Execution Duration — The azsdk_create_release_plan tool call took 12,059ms (over 12 seconds) to complete, which represents 99.8% of the total scenario duration. While the operation succeeded, this latency could impact user experience in production scenarios. Investigate whether the backend service can be optimized, or if there are opportunities to provide progress feedback during long-running operations.

  3. Duplicate Release Plan Handling — The tool returned a message indicating a release plan already exists for the specified pull request, but the agent completed without explicitly handling this case or prompting the user as suggested in the next_steps field. Future improvements should include logic to detect pre-existing release plans and either surface this to the user or implement the force-create option when appropriate.


Report generated on 2026-03-19T19:13:43.289Z — 1 scenario(s) across 1 total run(s).

jeo02 and others added 2 commits March 5, 2026 14:40
The parameter represents the model used during the benchmark run,
not the model used to generate the report (which is always DefaultModel).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ancellationToken

ToolCallRecord now captures duration (from SDK pre/post hook timestamps),
MCP server name (from tool name convention), and UTC timestamp.
CancellationToken is threaded through GenerateAsync -> CallLlmAsync
into all SDK calls (CreateSessionAsync, SendAndWaitAsync, GetMessagesAsync).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
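The record shape described in the commit above might look roughly like this. The captured fields follow the commit message (duration, MCP server derived from the tool-name convention, UTC timestamp), but the exact property names and the `<server>-<tool>` parsing rule are assumptions.

```csharp
using System;

var name = "azure-sdk-mcp-azsdk_create_release_plan";
var record = new ToolCallRecord(
    ToolName: name,
    McpServer: ToolCallRecord.ServerFromName(name),
    DurationMs: 12059,
    TimestampUtc: DateTimeOffset.UtcNow);

Console.WriteLine(record.McpServer); // azure-sdk-mcp

// Sketch of a record carrying the richer tool-call metadata.
public sealed record ToolCallRecord(
    string ToolName,
    string? McpServer,
    double DurationMs,
    DateTimeOffset TimestampUtc)
{
    // Assume an MCP tool is named "<server>-<tool>", where the tool
    // segment starts after the last '-'; plain tools have no server.
    public static string? ServerFromName(string fullName)
    {
        var idx = fullName.LastIndexOf('-');
        return idx > 0 ? fullName[..idx] : null;
    }
}
```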

jeo02 commented Mar 18, 2026

Added token usage reporting as well.


jeo02 commented Mar 18, 2026

Need to merge the Eval Migration PR first because ToolCallRecord will have merge conflicts.

@jeo02 jeo02 merged commit 5e7ed57 into main Mar 20, 2026
12 checks passed
@jeo02 jeo02 deleted the feature/benchmark-reports branch March 20, 2026 17:34