Pull request overview
Migrates end-to-end tool-invocation scenarios from the Azure.Sdk.Tools.Cli.Evaluations framework into the Azure.Sdk.Tools.Cli.Benchmarks framework, aligning scenario coverage with the benchmark runner/validators and removing the old evaluation harness artifacts.
Changes:
- Removed multiple evaluation scenarios/test-data traces and related evaluation-only helper/evaluator code.
- Added benchmark scenarios covering TypeSpec, GitHub PR lookup, pipeline status checks, and release-plan actions.
- Introduced benchmark-side tool-call recording models and a new `ToolCallValidator` to validate tool invocation sequences/inputs.
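To make the idea of validating tool invocation sequences and inputs concrete, here is a minimal sketch in Python rather than the repo's C#. It is not the actual `ToolCallValidator`: the field names, the tool names in the usage note, and the choice to tolerate extra calls between expected ones are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ToolCallRecord:
    # Hypothetical shape of a recorded call: tool name plus its arguments.
    name: str
    args: dict = field(default_factory=dict)


@dataclass
class ExpectedToolCall:
    name: str
    # None means "only check the tool name, not its inputs" (assumed semantics).
    expected_input: Optional[dict] = None


def validate_tool_calls(recorded, expected):
    """Return (passed, message).

    Expected calls must appear in the recorded list in order; unrelated
    calls in between are tolerated (an assumption of this sketch).
    """
    position = 0
    for want in expected:
        match_index = None
        for i in range(position, len(recorded)):
            got = recorded[i]
            if got.name != want.name:
                continue
            if want.expected_input is not None and any(
                got.args.get(key) != value
                for key, value in want.expected_input.items()
            ):
                continue
            match_index = i
            break
        if match_index is None:
            return False, f"missing expected tool call: {want.name}"
        # Later expectations may only match calls after this one.
        position = match_index + 1
    return True, "all expected tool calls found in order"
```

With hypothetical tool names, `validate_tool_calls([ToolCallRecord("check_public_repo"), ToolCallRecord("validate_typespec", {"path": "specification/foo"})], [ExpectedToolCall("check_public_repo"), ExpectedToolCall("validate_typespec", {"path": "specification/foo"})])` passes, while the same records in reverse order fail because the expected order is not met.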
Reviewed changes
Copilot reviewed 37 out of 39 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/TestData/example2.json | Removed evaluation chat trace test data. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/TestData/example.json | Removed evaluation chat trace test data. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/General/Evaluate_LinkNamespaceApprovalIssue.cs | Removed evaluation scenario test (migrated to benchmarks). |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/General/Evaluate_GetPullRequestLinkForCurrentBranch.cs | Removed evaluation scenario test (migrated to benchmarks). |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/General/Evaluate_CreateReleasePlan.cs | Removed evaluation scenario test (migrated to benchmarks). |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/AzureRestApiSpecs/Evaluate_ValidateTypespec.cs | Removed evaluation scenario test (migrated to benchmarks). |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/AzureRestApiSpecs/Evaluate_GetModifiedTypespecProjects.cs | Removed evaluation scenario test (migrated to benchmarks). |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/AzureRestApiSpecs/Evaluate_CheckSDKGenerationStatus.cs | Removed evaluation scenario test (migrated to benchmarks). |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/AzureRestApiSpecs/Evaluate_CheckPublicRepoThenValidate.cs | Removed evaluation scenario test (migrated to benchmarks). |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/AzureRestApiSpecs/Evaluate_CheckPublicRepo.cs | Removed evaluation scenario test (migrated to benchmarks). |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/AzureRestApiSpecs/AzsdkTypeSpecGeneration_Step02_TypespecValidation.cs | Removed evaluation workflow-step test (migrated to benchmarks). |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenario.cs | Simplified global setup by removing chat-completion/tool-name caching. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Models/RepositoryCategories.cs | Removed evaluation-only category constants. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Models/ExpectedToolInputEvaluatorContext.cs | Removed evaluation-only context type used by input evaluator. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Helpers/TestSetup.cs | Removed evaluation-only ChatCompletion construction helper. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Helpers/EvaluationHelper.cs | Removed tool-input scenario runner logic tied to evaluation harness. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Helpers/ChatCompletion.cs | Removed evaluation-only chat completion wrapper. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Evaluators/ExpectedToolInputEvaluator.cs | Removed evaluation-only tool-input evaluator. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Validation/Validators/ToolCallValidator.cs | Added benchmark validator for expected tool calls and optional input checks. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/ValidateTypespecScenario.cs | Added benchmark scenario for TypeSpec validation tool invocation. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/TypespecGenerationStep02Scenario.cs | Added benchmark scenario for TypeSpec workflow step validation behavior. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/RenameClientPropertyScenario.cs | Added POC benchmark scenario validating a specific TypeSpec edit via expected diff. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/GetModifiedTypespecProjectsScenario.cs | Added benchmark scenario for modified TypeSpec project discovery tool invocation. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/CheckPublicRepoThenValidateScenario.cs | Added benchmark scenario for public-repo check + validation tool invocation. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/CheckPublicRepoScenario.cs | Added benchmark scenario for public-repo check tool invocation. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/AddArmResourceScenario.cs | Adjusted validation patterns and compilation command in ARM resource authoring scenario. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/ReleasePlan/LinkNamespaceApprovalIssueScenario.cs | Added benchmark scenario for linking namespace approval issues. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/ReleasePlan/CreateReleasePlanScenario.cs | Added benchmark scenario for creating a release plan. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/Pipeline/CheckSdkGenerationStatusScenario.cs | Added benchmark scenario for pipeline status checking tool invocation. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/GitHub/GetPrLinkCurrentBranchScenario.cs | Added benchmark scenario for “get PR link for current branch” tool invocation. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ValidationContext.cs | Updated to store structured tool-call records instead of strings. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ToolCallRecord.cs | Added structured representation of tool calls (name/args/result/duration). |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ExpectedToolCall.cs | Added expected-tool-call model with optional expected-input matching. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ExecutionResult.cs | Updated to emit structured tool-call records. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/BenchmarkResult.cs | Updated to emit structured tool-call records. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Infrastructure/WorkspaceManager.cs | Expanded sparse-checkout defaults to include .vscode and eng/common (plus .github). |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Infrastructure/Workspace.cs | Updated execution log writing to accept structured tool-call records. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Infrastructure/SessionExecutor.cs | Captures tool args/results and duration into ToolCallRecords via hooks. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Infrastructure/McpConfigLoader.cs | Defaults MCP server cwd to workspace root when not specified. |
Comments suppressed due to low confidence (1)
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/AddArmResourceScenario.cs:82
- The CRUD-operation `ContainsValidator` patterns were loosened to very broad substrings (e.g., `"ArmResourceCreateOrReplace"` and `"update is Arm"`). This makes the scenario much less precise and may allow incorrect generated code to pass validation. Consider restoring the full expected operation signatures (or otherwise tightening these patterns) so the scenario actually verifies the intended CRUD operations.
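The concern can be illustrated with a small Python sketch of substring matching. The two loose patterns are the ones quoted in the comment; the stricter patterns, the `Widget` resource, and the generated snippet are hypothetical and are not the scenario's actual expected lines.

```python
def contains_all(text, patterns):
    """Pass only if every pattern occurs somewhere in the text."""
    return all(pattern in text for pattern in patterns)


# Patterns quoted in the review comment.
loose_patterns = ["ArmResourceCreateOrReplace", "update is Arm"]

# Hypothetical tighter patterns pinning the full operation lines.
strict_patterns = [
    "createOrUpdate is ArmResourceCreateOrReplaceAsync<Widget>",
    "update is ArmCustomPatchSync<Widget",
]

# Hypothetical generated TypeSpec where both operations use the wrong template.
generated = """
createOrUpdate is ArmResourceCreateOrReplaceSync<Widget>;
update is ArmResourceRead<Widget>;
"""

print(contains_all(generated, loose_patterns))   # True: the wrong code still passes
print(contains_all(generated, strict_patterns))  # False: the tighter patterns catch it
```

In this sketch the broad substrings match any operation whose template name merely starts the same way, which is the kind of false pass the comment warns about.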
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Validation/Validators/ToolCallValidator.cs
...cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/CheckPublicRepoThenValidateScenario.cs
Outdated
Pull request overview
Migrates several end-to-end evaluation scenarios (previously run with mocked tool execution) into the Benchmarks framework, adding a validator to assert expected tool calls/inputs while removing the old evaluation harness and test data.
Changes:
- Removed evaluation scenario tests, tool-input evaluator, and JSON conversation traces from `Azure.Sdk.Tools.Cli.Evaluations`.
- Added benchmark scenarios (TypeSpec, GitHub, pipeline, release plan) plus a new `ToolCallValidator` and structured tool-call recording.
- Updated benchmark infrastructure to capture tool arguments/results and expand sparse checkout to include `.vscode`/`eng/common`.
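As a rough illustration of structured tool-call recording, the Python sketch below pairs a pre-call hook with a post-call hook to capture name, arguments, result, and duration. The hook names, the `call_id` correlation key, and the field names are assumptions; the actual `SessionExecutor` and the SDK hooks it uses may look quite different.

```python
import time
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ToolCallRecord:
    # Hypothetical fields mirroring "name/args/result/duration".
    name: str
    args: dict
    result: Any = None
    duration_ms: Optional[float] = None


class ToolCallRecorder:
    """Collects one record per tool call from paired pre/post hooks."""

    def __init__(self):
        self.records = []
        self._in_flight = {}

    def on_pre_tool_use(self, call_id, name, args):
        # Remember the start time and a copy of the arguments.
        record = ToolCallRecord(name=name, args=dict(args))
        self._in_flight[call_id] = (time.perf_counter(), record)

    def on_post_tool_use(self, call_id, result):
        # Complete the record once the tool returns.
        started_at, record = self._in_flight.pop(call_id)
        record.result = result
        record.duration_ms = (time.perf_counter() - started_at) * 1000.0
        self.records.append(record)
```

Keeping in-flight calls keyed by an identifier lets overlapping tool calls be recorded without mixing up their timings, which a single "last started" timestamp would not allow.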
Reviewed changes
Copilot reviewed 37 out of 39 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/TestData/example2.json | Removed large JSON trace used by evaluation scenario. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/TestData/example.json | Removed large JSON trace used by evaluation scenario. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/General/Evaluate_LinkNamespaceApprovalIssue.cs | Removed evaluation scenario migrated to Benchmarks. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/General/Evaluate_GetPullRequestLinkForCurrentBranch.cs | Removed evaluation scenario migrated to Benchmarks. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/General/Evaluate_CreateReleasePlan.cs | Removed evaluation scenario migrated to Benchmarks. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/AzureRestApiSpecs/Evaluate_ValidateTypespec.cs | Removed evaluation scenario migrated to Benchmarks. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/AzureRestApiSpecs/Evaluate_GetModifiedTypespecProjects.cs | Removed evaluation scenario migrated to Benchmarks. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/AzureRestApiSpecs/Evaluate_CheckSDKGenerationStatus.cs | Removed evaluation scenario migrated to Benchmarks. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/AzureRestApiSpecs/Evaluate_CheckPublicRepoThenValidate.cs | Removed evaluation scenario migrated to Benchmarks. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/AzureRestApiSpecs/Evaluate_CheckPublicRepo.cs | Removed evaluation scenario migrated to Benchmarks. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenarios/AzureRestApiSpecs/AzsdkTypeSpecGeneration_Step02_TypespecValidation.cs | Removed evaluation workflow step scenario migrated to Benchmarks. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Scenario.cs | Removed unused evaluation scaffolding fields after migration. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Models/RepositoryCategories.cs | Removed no-longer-used evaluation category constants. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Models/ExpectedToolInputEvaluatorContext.cs | Removed evaluation-only context type. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Helpers/TestSetup.cs | Removed ChatCompletion factory no longer used. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Helpers/EvaluationHelper.cs | Removed tool-input scenario runner and related helpers. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Helpers/ChatCompletion.cs | Removed evaluation-only mocked tool execution helper. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Evaluations/Evaluators/ExpectedToolInputEvaluator.cs | Removed evaluator replaced by benchmark validator. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Validation/Validators/ToolCallValidator.cs | Added benchmark validator for tool-call name/order/input allow/deny validation. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/ValidateTypespecScenario.cs | Added benchmark replacement for ValidateTypespec evaluation scenario. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/TypespecGenerationStep02Scenario.cs | Added benchmark replacement for workflow step 2 (public repo check) scenario. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/RenameClientPropertyScenario.cs | Added TypeSpec authoring POC scenario for file edit validation. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/GetModifiedTypespecProjectsScenario.cs | Added benchmark replacement for “get modified TypeSpec projects” evaluation scenario. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/CheckPublicRepoThenValidateScenario.cs | Added benchmark replacement for “check public repo then validate” evaluation scenario. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/CheckPublicRepoScenario.cs | Added benchmark replacement for “check public repo” evaluation scenario. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/AddArmResourceScenario.cs | Adjusted scenario validation patterns and compile command invocation. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/ReleasePlan/LinkNamespaceApprovalIssueScenario.cs | Added benchmark replacement for namespace approval linking evaluation scenario. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/ReleasePlan/CreateReleasePlanScenario.cs | Added benchmark replacement for release plan creation evaluation scenario. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/Pipeline/CheckSdkGenerationStatusScenario.cs | Added benchmark replacement for SDK generation pipeline status evaluation scenario. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/GitHub/GetPrLinkCurrentBranchScenario.cs | Added benchmark replacement for PR status evaluation scenario. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ValidationContext.cs | Updated to carry structured tool call records. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ToolCallRecord.cs | Added structured representation for tool calls (name/args/result/duration). |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ExpectedToolCall.cs | Added expected tool call model with optional input validation. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ExecutionResult.cs | Updated to expose structured tool calls. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/BenchmarkResult.cs | Updated to expose structured tool calls. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Infrastructure/WorkspaceManager.cs | Expanded sparse checkout include set for benchmark worktrees. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Infrastructure/Workspace.cs | Updated execution logging to include structured tool calls. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Infrastructure/SessionExecutor.cs | Added tool-call capture (args/result/duration) via SDK hooks. |
| tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Infrastructure/McpConfigLoader.cs | Defaulted MCP server cwd to workspace root when unspecified. |
Comments suppressed due to low confidence (1)
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/AddArmResourceScenario.cs:82
- In `ContainsValidator("Asset file has CRUD operations")`, the patterns for create/update look truncated (`"ArmResourceCreateOrReplace"`, `"update is Arm"`). This makes the validator much less strict than the others and could let the scenario pass even if the generated operations/signatures are wrong or missing. Consider restoring the full expected operation lines (e.g., including the specific helper type / generic arguments) to keep the benchmark outcome validation meaningful.
.../azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/ReleasePlan/CreateReleasePlanScenario.cs
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Validation/Validators/ToolCallValidator.cs
...i/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/ReleasePlan/LinkNamespaceApprovalIssueScenario.cs
...cli/Azure.Sdk.Tools.Cli.Benchmarks/Scenarios/TypeSpec/GetModifiedTypespecProjectsScenario.cs
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Infrastructure/SessionExecutor.cs
Outdated
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Infrastructure/WorkspaceManager.cs
Outdated
tools/azsdk-cli/Azure.Sdk.Tools.Cli.Benchmarks/Models/ExpectedToolCall.cs
Outdated
…zure-sdk-tools into benchmark-eval-migration
Migrated end-to-end scenarios from the eval framework into the benchmark framework.
Resolves #14469