Benchmark Eval Migration #14507

Merged
jeo02 merged 16 commits into Azure:main from jeo02:benchmark-eval-migration
Mar 19, 2026

jeo02 (Member) commented Mar 13, 2026

Migrated the end-to-end scenarios from the eval framework into the benchmark framework.
Resolves #14469

jeo02 requested a review from praveenkuttappan March 13, 2026 18:33
jeo02 requested a review from a team as a code owner March 13, 2026 18:33
Copilot AI review requested due to automatic review settings March 13, 2026 18:33
github-actions bot added the azsdk-cli label (Issues related to Azure/azure-sdk-tools::tools/azsdk-cli) Mar 13, 2026
Copilot AI (Contributor) left a comment

Pull request overview

Migrates end-to-end tool-invocation scenarios from the Azure.Sdk.Tools.Cli.Evaluations framework into the Azure.Sdk.Tools.Cli.Benchmarks framework, aligning scenario coverage with the benchmark runner/validators and removing the old evaluation harness artifacts.

Changes:

  • Removed multiple evaluation scenarios/test-data traces and related evaluation-only helper/evaluator code.
  • Added benchmark scenarios covering TypeSpec, GitHub PR lookup, pipeline status checks, and release-plan actions.
  • Introduced benchmark-side tool-call recording models and a new ToolCallValidator to validate tool invocation sequences/inputs.
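
To make the last bullet concrete, here is a minimal Python sketch of the kind of check a tool-call validator might perform: each expected call must appear, in order, among the recorded calls, and any expected inputs must match the recorded arguments. The PR's actual ToolCallValidator is written in C# and its matching rules are not shown in this conversation. The class names below mirror the PR's model names, but every field, the `matches` and `validate_tool_calls` functions, and the tool names in the usage note are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ToolCallRecord:
    """Hypothetical stand-in for a recorded tool call (name plus arguments)."""
    name: str
    args: dict[str, Any] = field(default_factory=dict)


@dataclass
class ExpectedToolCall:
    """Hypothetical expectation: a tool name plus optional expected inputs."""
    name: str
    expected_input: Optional[dict[str, Any]] = None


def matches(expected: ExpectedToolCall, actual: ToolCallRecord) -> bool:
    """A call matches when the names agree and every expected input key
    is present in the recorded arguments with an equal value."""
    if expected.name != actual.name:
        return False
    if expected.expected_input is None:
        return True  # no input constraints were requested
    return all(
        key in actual.args and actual.args[key] == value
        for key, value in expected.expected_input.items()
    )


def validate_tool_calls(
    expected: list[ExpectedToolCall], actual: list[ToolCallRecord]
) -> list[str]:
    """Return the names of expected calls that were not found.

    Expected calls are matched in order; unrelated calls may be
    interleaved. An empty result means the validation passed.
    """
    missing: list[str] = []
    position = 0  # index in `actual` where the next search starts
    for exp in expected:
        index = next(
            (i for i in range(position, len(actual)) if matches(exp, actual[i])),
            None,
        )
        if index is None:
            missing.append(exp.name)
        else:
            position = index + 1
    return missing
```

For example, with recorded calls to the hypothetical tools `check_public_repo` and then `validate_typespec`, expecting them in that order yields an empty list, while expecting them in the reverse order reports `check_public_repo` as missing.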

Reviewed changes

Copilot reviewed 37 out of 39 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/TestData/example2.json | Removed evaluation chat trace test data. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/TestData/example.json | Removed evaluation chat trace test data. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/General/Evaluate_LinkNamespaceApprovalIssue.cs | Removed evaluation scenario test (migrated to benchmarks). |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/General/Evaluate_GetPullRequestLinkForCurrentBranch.cs | Removed evaluation scenario test (migrated to benchmarks). |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/General/Evaluate_CreateReleasePlan.cs | Removed evaluation scenario test (migrated to benchmarks). |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/AzureRestApiSpecs/Evaluate_ValidateTypespec.cs | Removed evaluation scenario test (migrated to benchmarks). |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/AzureRestApiSpecs/Evaluate_GetModifiedTypespecProjects.cs | Removed evaluation scenario test (migrated to benchmarks). |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/AzureRestApiSpecs/Evaluate_CheckSDKGenerationStatus.cs | Removed evaluation scenario test (migrated to benchmarks). |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/AzureRestApiSpecs/Evaluate_CheckPublicRepoThenValidate.cs | Removed evaluation scenario test (migrated to benchmarks). |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/AzureRestApiSpecs/Evaluate_CheckPublicRepo.cs | Removed evaluation scenario test (migrated to benchmarks). |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/AzureRestApiSpecs/AzsdkTypeSpecGeneration_Step02_TypespecValidation.cs | Removed evaluation workflow-step test (migrated to benchmarks). |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenario.cs | Simplified global setup by removing chat-completion/tool-name caching. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Models/RepositoryCategories.cs | Removed evaluation-only category constants. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Models/ExpectedToolInputEvaluatorContext.cs | Removed evaluation-only context type used by input evaluator. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Helpers/TestSetup.cs | Removed evaluation-only ChatCompletion construction helper. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Helpers/EvaluationHelper.cs | Removed tool-input scenario runner logic tied to evaluation harness. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Helpers/ChatCompletion.cs | Removed evaluation-only chat completion wrapper. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Evaluators/ExpectedToolInputEvaluator.cs | Removed evaluation-only tool-input evaluator. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Validation/Validators/ToolCallValidator.cs | Added benchmark validator for expected tool calls and optional input checks. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/ValidateTypespecScenario.cs | Added benchmark scenario for TypeSpec validation tool invocation. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/TypespecGenerationStep02Scenario.cs | Added benchmark scenario for TypeSpec workflow step validation behavior. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/RenameClientPropertyScenario.cs | Added POC benchmark scenario validating a specific TypeSpec edit via expected diff. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/GetModifiedTypespecProjectsScenario.cs | Added benchmark scenario for modified TypeSpec project discovery tool invocation. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/CheckPublicRepoThenValidateScenario.cs | Added benchmark scenario for public-repo check + validation tool invocation. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/CheckPublicRepoScenario.cs | Added benchmark scenario for public-repo check tool invocation. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/AddArmResourceScenario.cs | Adjusted validation patterns and compilation command in ARM resource authoring scenario. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/ReleasePlan/LinkNamespaceApprovalIssueScenario.cs | Added benchmark scenario for linking namespace approval issues. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/ReleasePlan/CreateReleasePlanScenario.cs | Added benchmark scenario for creating a release plan. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/Pipeline/CheckSdkGenerationStatusScenario.cs | Added benchmark scenario for pipeline status checking tool invocation. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/GitHub/GetPrLinkCurrentBranchScenario.cs | Added benchmark scenario for “get PR link for current branch” tool invocation. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ValidationContext.cs | Updated to store structured tool-call records instead of strings. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ToolCallRecord.cs | Added structured representation of tool calls (name/args/result/duration). |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ExpectedToolCall.cs | Added expected-tool-call model with optional expected-input matching. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ExecutionResult.cs | Updated to emit structured tool-call records. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/BenchmarkResult.cs | Updated to emit structured tool-call records. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Infrastructure/WorkspaceManager.cs | Expanded sparse-checkout defaults to include .vscode and eng/common (plus .github). |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Infrastructure/Workspace.cs | Updated execution log writing to accept structured tool-call records. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Infrastructure/SessionExecutor.cs | Captures tool args/results and duration into ToolCallRecords via hooks. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Infrastructure/McpConfigLoader.cs | Defaults MCP server cwd to workspace root when not specified. |

Comments suppressed due to low confidence (1)

tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/AddArmResourceScenario.cs:82

  • The CRUD-operation ContainsValidator patterns were loosened to very broad substrings (e.g., "ArmResourceCreateOrReplace" and "update is Arm"). This makes the scenario much less precise and may allow incorrect generated code to pass validation. Consider restoring the full expected operation signatures (or otherwise tightening these patterns) so the scenario actually verifies the intended CRUD operations.
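
The concern above can be illustrated with a small Python sketch of substring-based validation. The two loose patterns are the ones quoted in the comment. The `contains_all` helper, the stricter patterns, and the sample output are hypothetical and are not taken from the PR's ContainsValidator or from the scenario's real expected text.

```python
def contains_all(text: str, patterns: list[str]) -> bool:
    """Pass only if every pattern occurs somewhere in the text."""
    return all(pattern in text for pattern in patterns)


# Loose patterns quoted in the review comment.
loose = ["ArmResourceCreateOrReplace", "update is Arm"]

# Hypothetical stricter patterns that pin down the full operation signature.
strict = [
    "createOrUpdate is ArmResourceCreateOrReplaceAsync<Asset>",
    "update is ArmCustomPatchAsync<Asset",
]

# Hypothetical generated output that uses the wrong templates and the
# wrong resource type.
wrong_output = (
    "createOrUpdate is ArmResourceCreateOrReplaceSync<Widget>;\n"
    "update is ArmTagsPatchSync<Widget>;\n"
)

loose_passes = contains_all(wrong_output, loose)    # True: broad substrings still match
strict_passes = contains_all(wrong_output, strict)  # False: full signatures do not match
```

In this sketch the loose patterns accept output that targets a different resource and different operation templates, which is the kind of false pass the comment warns about.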

Copilot AI (Contributor) left a comment

Pull request overview

Migrates several end-to-end evaluation scenarios (previously run with mocked tool execution) into the Benchmarks framework, adding a validator to assert expected tool calls/inputs while removing the old evaluation harness and test data.

Changes:

  • Removed evaluation scenario tests, tool-input evaluator, and JSON conversation traces from Azure.Sdk.Tools.Cli.Evaluations.
  • Added benchmark scenarios (TypeSpec, GitHub, pipeline, release plan) plus a new ToolCallValidator and structured tool-call recording.
  • Updated benchmark infrastructure to capture tool arguments/results and expand sparse checkout to include .vscode and eng/common.
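
As a rough illustration of the structured tool-call recording mentioned above, the Python sketch below pairs a pre-tool hook with a post-tool hook to capture arguments, result, and duration. The real capture lives in the C# SessionExecutor and uses SDK hooks whose names and signatures are not shown in this conversation, so `ToolCallRecorder`, `on_pre_tool_use`, `on_post_tool_use`, the `call_id` correlation key, and the field names are all assumptions.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class ToolCallRecord:
    """Hypothetical structured record: name, args, result, duration."""
    name: str
    args: dict[str, Any]
    result: Optional[Any] = None
    duration_ms: Optional[float] = None


class ToolCallRecorder:
    """Collects tool calls by pairing a pre-hook with its post-hook."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter):
        self._clock = clock
        # call_id -> (record, start time) for calls that have not finished
        self._pending: dict[str, tuple[ToolCallRecord, float]] = {}
        self.records: list[ToolCallRecord] = []

    def on_pre_tool_use(self, call_id: str, name: str, args: dict[str, Any]) -> None:
        """Record the call as soon as it starts, preserving call order."""
        record = ToolCallRecord(name=name, args=dict(args))
        self.records.append(record)
        self._pending[call_id] = (record, self._clock())

    def on_post_tool_use(self, call_id: str, result: Any) -> None:
        """Attach the result and elapsed time to the matching record."""
        entry = self._pending.pop(call_id, None)
        if entry is None:
            return  # post-hook without a matching pre-hook; ignore it
        record, started = entry
        record.result = result
        record.duration_ms = (self._clock() - started) * 1000.0
```

Appending the record in the pre-hook keeps calls in invocation order even when a tool never completes; such a record simply keeps `result` and `duration_ms` as `None`.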

Reviewed changes

Copilot reviewed 37 out of 39 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/TestData/example2.json | Removed large JSON trace used by evaluation scenario. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/TestData/example.json | Removed large JSON trace used by evaluation scenario. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/General/Evaluate_LinkNamespaceApprovalIssue.cs | Removed evaluation scenario migrated to Benchmarks. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/General/Evaluate_GetPullRequestLinkForCurrentBranch.cs | Removed evaluation scenario migrated to Benchmarks. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/General/Evaluate_CreateReleasePlan.cs | Removed evaluation scenario migrated to Benchmarks. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/AzureRestApiSpecs/Evaluate_ValidateTypespec.cs | Removed evaluation scenario migrated to Benchmarks. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/AzureRestApiSpecs/Evaluate_GetModifiedTypespecProjects.cs | Removed evaluation scenario migrated to Benchmarks. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/AzureRestApiSpecs/Evaluate_CheckSDKGenerationStatus.cs | Removed evaluation scenario migrated to Benchmarks. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/AzureRestApiSpecs/Evaluate_CheckPublicRepoThenValidate.cs | Removed evaluation scenario migrated to Benchmarks. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/AzureRestApiSpecs/Evaluate_CheckPublicRepo.cs | Removed evaluation scenario migrated to Benchmarks. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/AzureRestApiSpecs/AzsdkTypeSpecGeneration_Step02_TypespecValidation.cs | Removed evaluation workflow step scenario migrated to Benchmarks. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenario.cs | Removed unused evaluation scaffolding fields after migration. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Models/RepositoryCategories.cs | Removed no-longer-used evaluation category constants. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Models/ExpectedToolInputEvaluatorContext.cs | Removed evaluation-only context type. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Helpers/TestSetup.cs | Removed ChatCompletion factory no longer used. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Helpers/EvaluationHelper.cs | Removed tool-input scenario runner and related helpers. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Helpers/ChatCompletion.cs | Removed evaluation-only mocked tool execution helper. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Evaluators/ExpectedToolInputEvaluator.cs | Removed evaluator replaced by benchmark validator. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Validation/Validators/ToolCallValidator.cs | Added benchmark validator for tool-call name/order/input allow/deny validation. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/ValidateTypespecScenario.cs | Added benchmark replacement for ValidateTypespec evaluation scenario. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/TypespecGenerationStep02Scenario.cs | Added benchmark replacement for workflow step 2 (public repo check) scenario. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/RenameClientPropertyScenario.cs | Added TypeSpec authoring POC scenario for file edit validation. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/GetModifiedTypespecProjectsScenario.cs | Added benchmark replacement for “get modified TypeSpec projects” evaluation scenario. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/CheckPublicRepoThenValidateScenario.cs | Added benchmark replacement for “check public repo then validate” evaluation scenario. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/CheckPublicRepoScenario.cs | Added benchmark replacement for “check public repo” evaluation scenario. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/AddArmResourceScenario.cs | Adjusted scenario validation patterns and compile command invocation. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/ReleasePlan/LinkNamespaceApprovalIssueScenario.cs | Added benchmark replacement for namespace approval linking evaluation scenario. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/ReleasePlan/CreateReleasePlanScenario.cs | Added benchmark replacement for release plan creation evaluation scenario. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/Pipeline/CheckSdkGenerationStatusScenario.cs | Added benchmark replacement for SDK generation pipeline status evaluation scenario. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/GitHub/GetPrLinkCurrentBranchScenario.cs | Added benchmark replacement for “get PR link for current branch” evaluation scenario. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ValidationContext.cs | Updated to carry structured tool call records. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ToolCallRecord.cs | Added structured representation for tool calls (name/args/result/duration). |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ExpectedToolCall.cs | Added expected tool call model with optional input validation. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ExecutionResult.cs | Updated to expose structured tool calls. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/BenchmarkResult.cs | Updated to expose structured tool calls. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Infrastructure/WorkspaceManager.cs | Expanded sparse checkout include set for benchmark worktrees. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Infrastructure/Workspace.cs | Updated execution logging to include structured tool calls. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Infrastructure/SessionExecutor.cs | Added tool-call capture (args/result/duration) via SDK hooks. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Infrastructure/McpConfigLoader.cs | Defaulted MCP server cwd to workspace root when unspecified. |

Comments suppressed due to low confidence (1)

tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/AddArmResourceScenario.cs:82

  • In ContainsValidator("Asset file has CRUD operations"), the patterns for create/update look truncated ("ArmResourceCreateOrReplace", "update is Arm"). This makes the validator much less strict than the others and could let the scenario pass even if the generated operations/signatures are wrong or missing. Consider restoring the full expected operation lines (e.g., including the specific helper type / generic arguments) to keep the benchmark outcome validation meaningful.

jeo02 enabled auto-merge (squash) March 18, 2026 22:55
jeo02 mentioned this pull request Mar 18, 2026
jeo02 merged commit 11ba695 into Azure:main Mar 19, 2026
12 checks passed
Labels

azsdk-cli: Issues related to Azure/azure-sdk-tools::tools/azsdk-cli

Development

Successfully merging this pull request may close these issues.

Migrate Evaluation End to End tests

3 participants