Benchmark Eval Migration by jeo02 · Pull Request #14507 · Azure/azure-sdk-tools

jeo02 · 2026-03-13T18:33:56Z

Migrated end-to-end scenarios from the eval framework into the benchmark framework.
Resolves #14469

Copilot

Pull request overview

Migrates end-to-end tool-invocation scenarios from the Azure.Sdk.Tools.Cli.Evaluations framework into the Azure.Sdk.Tools.Cli.Benchmarks framework, aligning scenario coverage with the benchmark runner/validators and removing the old evaluation harness artifacts.

Changes:

- Removed multiple evaluation scenarios/test-data traces and related evaluation-only helper/evaluator code.
- Added benchmark scenarios covering TypeSpec, GitHub PR lookup, pipeline status checks, and release-plan actions.
- Introduced benchmark-side tool-call recording models and a new ToolCallValidator to validate tool invocation sequences/inputs.
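
To make the idea of validating tool invocation sequences and inputs concrete, here is a minimal, language-neutral sketch in Python. It is not the PR's C# `ToolCallValidator`. The class shapes, the subset-style input matching, the "expected calls must appear in order" rule, and the tool names in the usage note are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class ToolCallRecord:
    """One observed tool call (hypothetical shape: name, args, result, duration)."""
    name: str
    args: Dict[str, Any] = field(default_factory=dict)
    result: Optional[str] = None
    duration_ms: float = 0.0


@dataclass
class ExpectedToolCall:
    """An expected call; expected_input is optional and matched as a subset of args."""
    name: str
    expected_input: Optional[Dict[str, Any]] = None


def matches(record: ToolCallRecord, want: ExpectedToolCall) -> bool:
    """True when the names are equal and every expected input key/value is present."""
    if record.name != want.name:
        return False
    if want.expected_input is None:
        return True
    return all(
        key in record.args and record.args[key] == value
        for key, value in want.expected_input.items()
    )


def validate_tool_calls(
    actual: List[ToolCallRecord], expected: List[ExpectedToolCall]
) -> Tuple[bool, str]:
    """Check that the expected calls occur in order (other calls may be interleaved)."""
    position = 0
    for want in expected:
        # Skip forward until a recorded call satisfies this expectation.
        while position < len(actual) and not matches(actual[position], want):
            position += 1
        if position == len(actual):
            return False, f"expected tool call '{want.name}' not found in order"
        position += 1
    return True, "all expected tool calls found in order"
```

With recorded calls `check_public_repo` followed by `run_validation` (both hypothetical tool names), expecting them in that order passes, while expecting `run_validation` first fails with a message naming `check_public_repo`.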

Reviewed changes

Copilot reviewed 37 out of 39 changed files in this pull request and generated 2 comments.

File	Description
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/TestData/example2.json	Removed evaluation chat trace test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/TestData/example.json	Removed evaluation chat trace test data.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/General/Evaluate_LinkNamespaceApprovalIssue.cs	Removed evaluation scenario test (migrated to benchmarks).
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/General/Evaluate_GetPullRequestLinkForCurrentBranch.cs	Removed evaluation scenario test (migrated to benchmarks).
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/General/Evaluate_CreateReleasePlan.cs	Removed evaluation scenario test (migrated to benchmarks).
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/AzureRestApiSpecs/Evaluate_ValidateTypespec.cs	Removed evaluation scenario test (migrated to benchmarks).
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/AzureRestApiSpecs/Evaluate_GetModifiedTypespecProjects.cs	Removed evaluation scenario test (migrated to benchmarks).
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/AzureRestApiSpecs/Evaluate_CheckSDKGenerationStatus.cs	Removed evaluation scenario test (migrated to benchmarks).
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/AzureRestApiSpecs/Evaluate_CheckPublicRepoThenValidate.cs	Removed evaluation scenario test (migrated to benchmarks).
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/AzureRestApiSpecs/Evaluate_CheckPublicRepo.cs	Removed evaluation scenario test (migrated to benchmarks).
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/AzureRestApiSpecs/AzsdkTypeSpecGeneration_Step02_TypespecValidation.cs	Removed evaluation workflow-step test (migrated to benchmarks).
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenario.cs	Simplified global setup by removing chat-completion/tool-name caching.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Models/RepositoryCategories.cs	Removed evaluation-only category constants.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Models/ExpectedToolInputEvaluatorContext.cs	Removed evaluation-only context type used by input evaluator.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Helpers/TestSetup.cs	Removed evaluation-only `ChatCompletion` construction helper.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Helpers/EvaluationHelper.cs	Removed tool-input scenario runner logic tied to evaluation harness.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Helpers/ChatCompletion.cs	Removed evaluation-only chat completion wrapper.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Evaluators/ExpectedToolInputEvaluator.cs	Removed evaluation-only tool-input evaluator.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Validation/Validators/ToolCallValidator.cs	Added benchmark validator for expected tool calls and optional input checks.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/ValidateTypespecScenario.cs	Added benchmark scenario for TypeSpec validation tool invocation.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/TypespecGenerationStep02Scenario.cs	Added benchmark scenario for TypeSpec workflow step validation behavior.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/RenameClientPropertyScenario.cs	Added POC benchmark scenario validating a specific TypeSpec edit via expected diff.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/GetModifiedTypespecProjectsScenario.cs	Added benchmark scenario for modified TypeSpec project discovery tool invocation.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/CheckPublicRepoThenValidateScenario.cs	Added benchmark scenario for public-repo check + validation tool invocation.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/CheckPublicRepoScenario.cs	Added benchmark scenario for public-repo check tool invocation.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/AddArmResourceScenario.cs	Adjusted validation patterns and compilation command in ARM resource authoring scenario.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/ReleasePlan/LinkNamespaceApprovalIssueScenario.cs	Added benchmark scenario for linking namespace approval issues.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/ReleasePlan/CreateReleasePlanScenario.cs	Added benchmark scenario for creating a release plan.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/Pipeline/CheckSdkGenerationStatusScenario.cs	Added benchmark scenario for pipeline status checking tool invocation.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/GitHub/GetPrLinkCurrentBranchScenario.cs	Added benchmark scenario for “get PR link for current branch” tool invocation.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ValidationContext.cs	Updated to store structured tool-call records instead of strings.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ToolCallRecord.cs	Added structured representation of tool calls (name/args/result/duration).
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ExpectedToolCall.cs	Added expected-tool-call model with optional expected-input matching.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ExecutionResult.cs	Updated to emit structured tool-call records.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/BenchmarkResult.cs	Updated to emit structured tool-call records.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Infrastructure/WorkspaceManager.cs	Expanded sparse-checkout defaults to include `.vscode` and `eng/common` (plus `.github`).
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Infrastructure/Workspace.cs	Updated execution log writing to accept structured tool-call records.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Infrastructure/SessionExecutor.cs	Captures tool args/results and duration into `ToolCallRecord`s via hooks.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Infrastructure/McpConfigLoader.cs	Defaults MCP server `cwd` to workspace root when not specified.
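
The `McpConfigLoader.cs` row above describes defaulting an MCP server's `cwd` to the workspace root when none is specified. A small Python sketch of that fallback follows, assuming the server configuration is a plain dictionary with an optional `cwd` key. The function name and config shape are hypothetical, not the loader's actual API.

```python
from typing import Any, Dict


def with_default_cwd(server_config: Dict[str, Any], workspace_root: str) -> Dict[str, Any]:
    """Return a copy of the config whose 'cwd' falls back to the workspace root."""
    resolved = dict(server_config)  # copy so the caller's config is not mutated
    if not resolved.get("cwd"):     # missing, None, or empty string
        resolved["cwd"] = workspace_root
    return resolved
```

An explicit `cwd` in the configuration is left untouched; only a missing or empty value is replaced.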

Comments suppressed due to low confidence (1)

tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/AddArmResourceScenario.cs:82

The CRUD-operation ContainsValidator patterns were loosened to very broad substrings (e.g., "ArmResourceCreateOrReplace" and "update is Arm"). This makes the scenario much less precise and may allow incorrect generated code to pass validation. Consider restoring the full expected operation signatures (or otherwise tightening these patterns) so the scenario actually verifies the intended CRUD operations.
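
The concern can be illustrated with a short Python sketch of substring-based validation. The two loose patterns are the ones quoted in the comment. The sample TypeSpec lines, the `Employee` resource name, and the stricter patterns are illustrative assumptions, not the scenario's actual expected output.

```python
from typing import List


def contains_all(text: str, patterns: List[str]) -> bool:
    """Substring check in the spirit of a 'contains' validator: every pattern must occur."""
    return all(pattern in text for pattern in patterns)


# Hypothetical generated output in which 'update' is bound to the wrong helper.
WRONG_OUTPUT = """
createOrUpdate is ArmResourceCreateOrReplaceSync<Employee>;
update is ArmResourceDeleteSync<Employee>;
"""

# Loose patterns quoted in the review comment.
LOOSE = ["ArmResourceCreateOrReplace", "update is Arm"]

# Stricter patterns (assumed for illustration) that pin the helper type.
STRICT = [
    "createOrUpdate is ArmResourceCreateOrReplaceAsync<Employee>",
    "update is ArmCustomPatchSync<",
]

loose_passes = contains_all(WRONG_OUTPUT, LOOSE)
strict_passes = contains_all(WRONG_OUTPUT, STRICT)
```

Here the loose patterns accept the incorrect output while the stricter ones reject it, which is the loss of precision the comment warns about.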

tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Validation/Validators/ToolCallValidator.cs

...cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/CheckPublicRepoThenValidateScenario.cs

Copilot

Pull request overview

Migrates several end-to-end evaluation scenarios (previously run with mocked tool execution) into the Benchmarks framework, adding a validator to assert expected tool calls/inputs while removing the old evaluation harness and test data.

Changes:

- Removed evaluation scenario tests, tool-input evaluator, and JSON conversation traces from Azure.Sdk.Tools.Cli.Evaluations.
- Added benchmark scenarios (TypeSpec, GitHub, pipeline, release plan) plus a new ToolCallValidator and structured tool-call recording.
- Updated benchmark infrastructure to capture tool arguments/results and expand sparse checkout to include .vscode/eng/common.
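
As a rough picture of capturing tool arguments, results, and duration through pre/post hooks, here is a self-contained Python sketch. The hook names (`on_pre_tool_use`, `on_post_tool_use`), the `call_id` correlation key, and the record fields are assumptions for illustration and do not reflect the SDK hooks that `SessionExecutor` actually uses.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple


@dataclass
class ToolCallRecord:
    """Structured record of one tool call (hypothetical field names)."""
    name: str
    args: Dict[str, Any]
    result: Optional[str] = None
    duration_ms: Optional[float] = None


class ToolCallRecorder:
    """Collects records from a pre-hook (name/args) and a post-hook (result/duration)."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter) -> None:
        self._clock = clock
        self._pending: Dict[str, Tuple[ToolCallRecord, float]] = {}
        self.records: List[ToolCallRecord] = []

    def on_pre_tool_use(self, call_id: str, name: str, args: Dict[str, Any]) -> None:
        # Record the call immediately so it is kept even if no post-hook arrives.
        record = ToolCallRecord(name=name, args=dict(args))
        self._pending[call_id] = (record, self._clock())
        self.records.append(record)

    def on_post_tool_use(self, call_id: str, result: str) -> None:
        pending = self._pending.pop(call_id, None)
        if pending is None:
            return  # post-hook without a matching pre-hook; nothing to complete
        record, started = pending
        record.result = result
        record.duration_ms = (self._clock() - started) * 1000.0
```

Injecting the clock keeps the recorder deterministic in tests; in normal use the default `time.perf_counter` supplies monotonic timestamps.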

Reviewed changes

Copilot reviewed 37 out of 39 changed files in this pull request and generated 4 comments.

File	Description
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/TestData/example2.json	Removed large JSON trace used by evaluation scenario.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/TestData/example.json	Removed large JSON trace used by evaluation scenario.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/General/Evaluate_LinkNamespaceApprovalIssue.cs	Removed evaluation scenario migrated to Benchmarks.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/General/Evaluate_GetPullRequestLinkForCurrentBranch.cs	Removed evaluation scenario migrated to Benchmarks.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/General/Evaluate_CreateReleasePlan.cs	Removed evaluation scenario migrated to Benchmarks.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/AzureRestApiSpecs/Evaluate_ValidateTypespec.cs	Removed evaluation scenario migrated to Benchmarks.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/AzureRestApiSpecs/Evaluate_GetModifiedTypespecProjects.cs	Removed evaluation scenario migrated to Benchmarks.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/AzureRestApiSpecs/Evaluate_CheckSDKGenerationStatus.cs	Removed evaluation scenario migrated to Benchmarks.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/AzureRestApiSpecs/Evaluate_CheckPublicRepoThenValidate.cs	Removed evaluation scenario migrated to Benchmarks.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/AzureRestApiSpecs/Evaluate_CheckPublicRepo.cs	Removed evaluation scenario migrated to Benchmarks.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/AzureRestApiSpecs/AzsdkTypeSpecGeneration_Step02_TypespecValidation.cs	Removed evaluation workflow step scenario migrated to Benchmarks.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenario.cs	Removed unused evaluation scaffolding fields after migration.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Models/RepositoryCategories.cs	Removed no-longer-used evaluation category constants.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Models/ExpectedToolInputEvaluatorContext.cs	Removed evaluation-only context type.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Helpers/TestSetup.cs	Removed `ChatCompletion` factory no longer used.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Helpers/EvaluationHelper.cs	Removed tool-input scenario runner and related helpers.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Helpers/ChatCompletion.cs	Removed evaluation-only mocked tool execution helper.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Evaluators/ExpectedToolInputEvaluator.cs	Removed evaluator replaced by benchmark validator.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Validation/Validators/ToolCallValidator.cs	Added benchmark validator for tool-call name/order/input allow/deny validation.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/ValidateTypespecScenario.cs	Added benchmark replacement for ValidateTypespec evaluation scenario.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/TypespecGenerationStep02Scenario.cs	Added benchmark replacement for workflow step 2 (public repo check) scenario.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/RenameClientPropertyScenario.cs	Added TypeSpec authoring POC scenario for file edit validation.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/GetModifiedTypespecProjectsScenario.cs	Added benchmark replacement for “get modified TypeSpec projects” evaluation scenario.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/CheckPublicRepoThenValidateScenario.cs	Added benchmark replacement for “check public repo then validate” evaluation scenario.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/CheckPublicRepoScenario.cs	Added benchmark replacement for “check public repo” evaluation scenario.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/AddArmResourceScenario.cs	Adjusted scenario validation patterns and compile command invocation.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/ReleasePlan/LinkNamespaceApprovalIssueScenario.cs	Added benchmark replacement for namespace approval linking evaluation scenario.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/ReleasePlan/CreateReleasePlanScenario.cs	Added benchmark replacement for release plan creation evaluation scenario.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/Pipeline/CheckSdkGenerationStatusScenario.cs	Added benchmark replacement for SDK generation pipeline status evaluation scenario.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/GitHub/GetPrLinkCurrentBranchScenario.cs	Added benchmark replacement for “get PR link for current branch” evaluation scenario.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ValidationContext.cs	Updated to carry structured tool call records.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ToolCallRecord.cs	Added structured representation for tool calls (name/args/result/duration).
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ExpectedToolCall.cs	Added expected tool call model with optional input validation.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ExecutionResult.cs	Updated to expose structured tool calls.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/BenchmarkResult.cs	Updated to expose structured tool calls.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Infrastructure/WorkspaceManager.cs	Expanded sparse checkout include set for benchmark worktrees.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Infrastructure/Workspace.cs	Updated execution logging to include structured tool calls.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Infrastructure/SessionExecutor.cs	Added tool-call capture (args/result/duration) via SDK hooks.
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Infrastructure/McpConfigLoader.cs	Defaulted MCP server `cwd` to workspace root when unspecified.

Comments suppressed due to low confidence (1)

tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/AddArmResourceScenario.cs:82

In ContainsValidator("Asset file has CRUD operations"), the patterns for create/update look truncated ("ArmResourceCreateOrReplace", "update is Arm"). This makes the validator much less strict than the others and could let the scenario pass even if the generated operations/signatures are wrong or missing. Consider restoring the full expected operation lines (e.g., including the specific helper type / generic arguments) to keep the benchmark outcome validation meaningful.

.../azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/ReleasePlan/CreateReleasePlanScenario.cs

tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Validation/Validators/ToolCallValidator.cs

...i/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/ReleasePlan/LinkNamespaceApprovalIssueScenario.cs

...cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/GetModifiedTypespecProjectsScenario.cs

tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Infrastructure/SessionExecutor.cs

tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Infrastructure/WorkspaceManager.cs

tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ExpectedToolCall.cs

tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ToolCallRecord.cs

jeo02 added 4 commits March 13, 2026 10:14

first iteration of migration

58d8311

simplify

2ddb427

structure based off tools

08f9a12

remove evaluations + remove unused

60cf5ba

jeo02 requested a review from praveenkuttappan March 13, 2026 18:33

jeo02 requested a review from a team as a code owner March 13, 2026 18:33

Copilot AI review requested due to automatic review settings March 13, 2026 18:33

github-actions bot added the azsdk-cli label (Issues related to Azure/azure-sdk-tools::tools/azsdk-cli) Mar 13, 2026

Copilot started reviewing on behalf of jeo02 March 13, 2026 18:35

Copilot AI reviewed Mar 13, 2026

tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Validation/Validators/ToolCallValidator.cs (resolved)

...cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/CheckPublicRepoThenValidateScenario.cs (outdated, resolved)

jeo02 added 2 commits March 13, 2026 13:39

path fix + copilot fix

105bf40

organize tags + copilot comment

a34ba42

jeo02 requested a review from Copilot March 13, 2026 21:37

Copilot started reviewing on behalf of jeo02 March 13, 2026 21:38

Copilot AI reviewed Mar 13, 2026

jeo02 added 6 commits March 17, 2026 10:32

Merge branch 'main' into benchmark-eval-migration

b338595

test mode

9d2e7b6

copilot nit

e256964

copilot nit

65bbfed

remove param

de7e67c

Merge branch 'main' into benchmark-eval-migration

3bed443

praveenkuttappan reviewed Mar 18, 2026

tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Infrastructure/SessionExecutor.cs (outdated, resolved)

praveenkuttappan reviewed Mar 18, 2026

tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Infrastructure/WorkspaceManager.cs (outdated, resolved)

praveenkuttappan reviewed Mar 18, 2026

tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ExpectedToolCall.cs (outdated, resolved)

praveenkuttappan reviewed Mar 18, 2026

tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ToolCallRecord.cs (resolved)

jeo02 added 4 commits March 18, 2026 15:50

dictionary

e76ad42

method for setup env

6a3f360

nit

62473d2

Merge branch 'benchmark-eval-migration' of https://github.com/jeo02/azure-sdk-tools into benchmark-eval-migration

38ff717

jeo02 enabled auto-merge (squash) March 18, 2026 22:55

jeo02 mentioned this pull request Mar 18, 2026

Benchmarks Reporting #14374

Merged

praveenkuttappan approved these changes Mar 19, 2026

jeo02 merged commit 11ba695 into Azure:main Mar 19, 2026
12 checks passed

jeo02 merged 16 commits into Azure:main from jeo02:benchmark-eval-migration
