Port tool-scenario benchmarks from Azure.Sdk.Tools.Cli.Evaluations to Vally (#15124)#15811
Open
helen229 wants to merge 28 commits into
Adds a new Vally eval suite under tools/azsdk-cli/Azure.Sdk.Tools.Vally/ for MCP tool / scenario evaluations, replacing the deleted Azure.Sdk.Tools.Cli.Benchmarks project (#15697).

- README documents project intent, layout, local run instructions, and how to add a new scenario.
- .vally.yaml wires the azsdk-mcp environment (stdio dotnet run against Azure.Sdk.Tools.Cli) and defines 'typespec' and 'all' suites.
- evals/check-public-repo.eval.yaml is the first ported scenario (from the deleted CheckPublicRepoScenario): verifies the agent invokes azsdk_typespec_check_project_in_public_repo for a public-repo check prompt. Lints clean via 'vally lint --eval-spec'.
- fixtures/.gitkeep reserves the per-scenario fixtures layout.

Remaining scenarios from the deleted benchmark are tracked as a checklist in the project README and in #15124.
Adds eval YAMLs for every scenario that was deleted from Azure.Sdk.Tools.Cli.Benchmarks in #15697:

- check-public-repo-then-validate
- validate-typespec
- typespec-generation-step02
- get-modified-typespec-projects (stub — needs git-repo fixture / setup hook)
- add-arm-resource (stub — needs fixtures + npx tsp compile post-check)
- create-release-plan
- link-namespace-approval-issue
- get-pr-link-current-branch
- check-sdk-generation-status

Each eval uses the built-in tool-calls grader for presence checks; the original benchmark's argument/order/forbidden/optional assertions are captured in prompt text + inline TODOs (require custom graders or upstream Vally support, documented in README). Also adds release-plan/github/pipeline suites to .vally.yaml. All 10 evals pass 'vally lint --eval-spec'.
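To make the gap between a presence check and the deferred order assertions concrete, here is a minimal Python sketch. It assumes a trajectory can be reduced to a list of tool names in invocation order; that shape and both helper names are illustrative assumptions, not Vally's grader API.

```python
from typing import Iterable, Sequence


def all_tools_called(trajectory: Sequence[str], required: Iterable[str]) -> bool:
    """Presence check: every required tool appears somewhere in the trajectory."""
    called = set(trajectory)
    return all(tool in called for tool in required)


def called_in_order(trajectory: Sequence[str], ordered: Sequence[str]) -> bool:
    """Order check: the tools appear as a subsequence of the trajectory."""
    remaining = iter(trajectory)
    # `tool in remaining` consumes the iterator up to the first match,
    # so later tools must be found after earlier ones.
    return all(tool in remaining for tool in ordered)


# Hypothetical trajectory: tool names in the order the agent invoked them.
trajectory = [
    "azsdk_typespec_check_project_in_public_repo",
    "azsdk_get_release_plan",
]
```

A presence check passes for either ordering of the two tools, while the order check only passes for the ordering that was actually observed; that difference is what the inline TODOs leave to a custom grader.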
Ports the deleted RenameClientPropertyScenario as a tool-calls-only stub. Full expected-diff grading + sparse-clone setup hook are tracked as follow-ups in the README.
This was referenced Jun 2, 2026
#15183

- Move 11 multi-step scenario evals to evals/scenarios/
- Port 9 per-tool trigger evals from jeo02/migrate-evaluations-to-vally (PR #15183) to evals/triggers/, stripped azure-sdk-mcp- prefix from graders to match bare MCP tool names
- Port Validate-EvalTools.ps1 to scripts/, retargeted at evals/triggers/ with bare-name regex
- Update .vally.yaml suites for new layout (scenarios, triggers, all)
- Update README to document the split and per-trigger-file tool coverage
- Add .gitignore for vally-results/ and results/
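The prefix stripping mentioned in this commit amounts to a one-line normalization. A minimal Python sketch, assuming the only difference between the two naming schemes is the leading `azure-sdk-mcp-` string (the function name is hypothetical):

```python
PREFIX = "azure-sdk-mcp-"


def to_bare_tool_name(name: str) -> str:
    """Drop the server prefix so the name matches what trajectories emit."""
    return name.removeprefix(PREFIX)  # requires Python 3.9+


# Names that are already bare pass through unchanged.
print(to_bare_tool_name("azure-sdk-mcp-azsdk_get_release_plan"))
print(to_bare_tool_name("azsdk_get_release_plan"))
```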
Replace per-area folders (scenarios/, triggers/) with tier-based folders. Feature area moves to a YAML tag, enabling tag-filtered suites. Add composite suites (pr-gate, nightly) and area-filtered suites in .vally.yaml. Update Validate-EvalTools.ps1 to scan evals/unit for triggers-*.eval.yaml. Refresh README and Run-LiveEvals.ps1 paths.
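The folder-as-tier, tag-as-area split can be pictured as two independent filters. The Python sketch below is only an illustration of that idea: the `EvalSpec` type, the `select` helper, and the tag values attached to each path are assumptions, and Vally's real suite filters live in `.vally.yaml`.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class EvalSpec:
    path: str             # repo-relative path to the eval YAML
    tags: FrozenSet[str]  # feature areas declared inside the YAML

    @property
    def tier(self) -> str:
        # The folder directly under evals/ is the cost tier.
        return PurePosixPath(self.path).parts[1]


def select(specs: Iterable[EvalSpec], tier: Optional[str] = None,
           tag: Optional[str] = None) -> List[str]:
    """Return paths matching the requested tier and/or area tag."""
    return [
        s.path for s in specs
        if (tier is None or s.tier == tier) and (tag is None or tag in s.tags)
    ]


SPECS = [
    EvalSpec("evals/unit/create-release-plan.eval.yaml", frozenset({"release-plan"})),
    EvalSpec("evals/scenarios/mock/check-public-repo-then-validate.eval.yaml", frozenset({"typespec"})),
    EvalSpec("evals/scenarios/live/release-planner.eval.yaml", frozenset({"release-plan"})),
]
```

Because the area is a tag rather than a folder, `select(SPECS, tag="release-plan")` cuts across tiers, which is the cross-cut that a per-area folder layout could not express.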
…Azure/azure-sdk-tools into feat/vally-tool-scenarios-15124
Drop the local-only convenience wrapper and refer directly to evals/setup/ensure-specs-clone.ps1 in docs and YAML comments. Users prime the spec clone manually and invoke 'vally eval --suite e2e'.
helen229 added a commit that referenced this pull request on Jun 4, 2026
…list (#15852) (#15854)

* Add MCP tool-coverage drift check for Azure.Sdk.Tools.Mock (#15852)
  - New eng/scripts/Get-McpToolInventory.ps1 boots the live Azure.Sdk.Tools.Cli MCP server (via 'azsdk list -o json'), enumerates the IMockToolHandler implementations under Azure.Sdk.Tools.Mock, and reports the diff in three buckets: both / live-only / mock-only.
  - Cross-references mock-tier eval YAMLs under tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/ when present; gracefully no-ops when that folder hasn't landed yet (PR #15811).
  - '-CheckOnly' exits non-zero on (a) any stale handler that no longer maps to a live tool, or (b) any tool referenced by a mock-tier eval without a handler -- intended for the CI job tracked in #15829.
  - Documents the drift workflow in Azure.Sdk.Tools.Mock/README.md so a contributor flagged by the script knows how to add a handler.

  No stale handlers detected against the current live tool set.

* Add mock handlers for remaining live MCP tools; drop eval scanning from inventory script (#15852)
  - 13 new handler files covering 63 live tools that previously fell back to the default response (APIView, Codeowners, EngSys, GitHub, Package, Pipeline, ReleasePlan, TypeSpec, Verify, Core, Example).
  - Get-McpToolInventory.ps1: pure live-vs-mock parity (removes Vally eval cross-reference); -CheckOnly fails if either bucket is non-empty.
  - README: updated sync workflow to reflect parity-only check.

* Simplify Get-McpToolInventory.ps1: no parameters, always exits non-zero on drift (#15852)

* Fix 3 release-plan handler response types to match live tools (#15852)

  Addresses Copilot review on PR #15854:
  - azsdk_get_kpi_attestation_status: ReleaseWorkflowResponse -> ReleasePlanListResponse
  - azsdk_get_service_details_by_typespec_path: ReleaseWorkflowResponse -> ProductInfoResponse
  - azsdk_update_language_exclusion_justification: ReleaseWorkflowResponse -> DefaultCommandResponse

* Drop Get-McpToolInventory.ps1 (#15852)

  Per review discussion: the script only checked that an IMockToolHandler exists with the right ToolName; it could not detect handlers that exist but just return the placeholder DefaultCommandResponse. That blind spot makes the script of limited value. A unit test in Cli.Tests is a better fit for actual drift enforcement and is tracked as a follow-up. README updated to drop the script reference.

* Update Mock README: drop reference to removed inventory script (#15852)
All 5 release-planner mock stimuli now use environment.git worktree pointing at the per-user azure-rest-api-specs cache (matching the live e2e fixture), plus a structured e2e-style prompt that supplies the Contoso fixture IDs the mock handlers expect (TypeSpec project, service/product tree IDs, work-item ID 29262). Also document the --skill-dir requirement and worker-cap caveat in README, and fix one stale path in .vally.yaml comment.
Tracking the underlying MCP boot-race root cause + fix rationale in #15948.
- Launch pre-built DLLs via 'dotnet <dll>' in both .vally.yaml files instead of 'dotnet run', so N parallel workers no longer race on Roslyn's exclusive write lock for the output DLL.
- Add 'Build MCP servers' step to eng/pipelines/skill-eval.yml so the CI runner has the DLLs ready before vally starts.
- Drop the skill-invocation grader from generate-sdk-for-existing-release-plan (no preflight reasoning step required; tools-only).
- Strip 'I'm in a checkout of azure-rest-api-specs.' preamble from prompts; the worktree already provides that context.
- Remove stray '// tools skills response' artifact in live release-planner.eval.yaml.
- README: document 'dotnet build' as a prereq; rewrite workers warning.

Validated: scenarios-mock at --workers 6 -> 5/5 stimuli pass, 0 race hits, ~4 min.
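The difference between the two launch modes can be sketched in Python as plain command construction. This is an illustration under assumptions: the helper names and any DLL or project path passed in are hypothetical, and only the `dotnet run --project` and `dotnet <dll>` invocations themselves come from the .NET CLI.

```python
from pathlib import Path
from typing import List


def launch_from_source(project: Path) -> List[str]:
    """Compile-then-run: each invocation may rebuild the output DLL."""
    return ["dotnet", "run", "--project", str(project)]


def launch_prebuilt(dll: Path) -> List[str]:
    """Run an already-built assembly: no build output is rewritten."""
    return ["dotnet", str(dll)]


def worker_commands(dll: Path, workers: int) -> List[List[str]]:
    """One launch command per parallel worker, all reading the same DLL."""
    if not dll.is_file():
        # Mirrors the documented prerequisite: build once before evaluating.
        raise FileNotFoundError(f"{dll} not found; run 'dotnet build' first")
    return [launch_prebuilt(dll) for _ in range(workers)]
```

With the build hoisted into a single up-front step, every worker only reads the assembly, so there is no shared write for parallel workers to contend on.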
Pull request overview
This PR establishes tools/azsdk-cli/Azure.Sdk.Tools.Vally as the unified home for Azure SDK MCP tool invocation evals and multi-step workflow scenarios using @microsoft/vally-cli, porting prior benchmark coverage and consolidating per-tool trigger evals into a single surface area. It also updates existing skill-eval infrastructure to launch pre-built MCP server DLLs (avoiding dotnet run/MSBuild races under parallel workers).
Changes:
- Added the new `Azure.Sdk.Tools.Vally` project structure, including Vally config, eval suites (tool triggers + workflow scenarios), and local helper scripts.
- Ported and organized trigger eval YAMLs and scenario eval YAMLs to cover tool invocation drift and multi-tool workflows.
- Updated skill-eval pipeline/config to run MCP servers via pre-built DLLs (`dotnet <dll>`) rather than `dotnet run`.
Reviewed changes
Copilot reviewed 31 out of 31 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| tools/azsdk-cli/docs/specs/8-operations-agent-eval-strategy.spec.md | Adds a design/spec document describing the eval strategy and intended suite structure. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/.vally.yaml | Defines Vally environments (mock/live MCP) and suites for running the new evals. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/.gitignore | Ignores local Vally output folders. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/README.md | Documents purpose, layout, and how to run Vally evals for tool scenarios and workflows. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/fixtures/.gitkeep | Establishes fixture folder conventions for eval inputs. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/scripts/Validate-EvalTools.ps1 | Adds a drift/coverage validator to cross-check trigger eval tool references vs server tool catalog. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/setup/ensure-specs-clone.ps1 | Adds helper to maintain a cached sparse clone of azure-rest-api-specs for scenarios needing a repo on disk. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/check-public-repo.eval.yaml | Adds a unit-tier tool-call eval for public-repo presence checks. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/validate-typespec.eval.yaml | Adds a unit-tier tool-call eval for TypeSpec validation. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/get-modified-typespec-projects.eval.yaml | Adds a unit-tier tool-call eval for listing modified TypeSpec projects. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/add-arm-resource.eval.yaml | Adds a (currently stub-like) authoring scenario expecting plan generation + edits. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/create-release-plan.eval.yaml | Adds a unit-tier tool-call eval for creating a release plan. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/link-namespace-approval-issue.eval.yaml | Adds a unit-tier tool-call eval for linking namespace approval to a release plan. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/get-pr-link-current-branch.eval.yaml | Adds a unit-tier tool-call eval for resolving PR link for current branch. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/check-sdk-generation-status.eval.yaml | Adds a unit-tier tool-call eval for pipeline status checks. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-apiview.eval.yaml | Adds trigger stimuli covering APIView-related MCP tools. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-config.eval.yaml | Adds trigger stimuli covering config/label MCP tools. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-engsys.eval.yaml | Adds trigger stimuli covering engineering-system MCP tools (logs/tests/etc.). |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-github.eval.yaml | Adds trigger stimuli covering GitHub MCP tools. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-package.eval.yaml | Adds trigger stimuli covering package generation/build/test/release MCP tools. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-pipeline.eval.yaml | Adds trigger stimuli covering pipeline MCP tools. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-releaseplan.eval.yaml | Adds trigger stimuli covering release-plan MCP tools. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-typespec.eval.yaml | Adds trigger stimuli covering TypeSpec MCP tools. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-verify.eval.yaml | Adds trigger stimuli covering setup verification MCP tool. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/workflow-scenarios/mock/check-public-repo-then-validate.eval.yaml | Adds a mock multi-tool workflow scenario (validate then public-repo check). |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/workflow-scenarios/mock/typespec-generation-step02.eval.yaml | Adds a mock workflow scenario for TypeSpec generation step 2 behavior. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/workflow-scenarios/mock/rename-client-property.eval.yaml | Adds a stub workflow scenario intended for a future expected-diff grader. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/workflow-scenarios/mock/release-planner-workflows.eval.yaml | Adds mock workflow stimuli for key release-planner scenarios. |
| tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/workflow-scenarios/live/release-planner.eval.yaml | Adds a live end-to-end scenario that creates a plan, generates SDK, and links PR back. |
| eng/pipelines/skill-eval.yml | Pre-builds MCP servers so Vally can launch pre-built DLLs (reducing parallel-run flakiness). |
| .github/skills/.vally.yaml | Updates skill eval environment config to launch MCP servers via pre-built DLLs. |
**Tool-scenario evals (this project)** — organised by the standard test pyramid under [`evals/`](evals/). The folder is the **cost tier** (and CI cadence); the feature **area** is a tag inside each YAML so cross-cuts work via `.vally.yaml` suite filters.

#### `evals/unit/` — hermetic single-tool evals (18)
Comment on lines +71 to +82

#### `evals/scenarios/` — multi-tool scenarios (4)

Multi-step prompts that exercise 2+ MCP tools end-to-end. Split into
`mock/` (hermetic, runs on PR gate) and `live/` (real DevOps / GitHub /
pipelines, runs nightly).

| Scenario | Area | Mode | Shape |
|---|---|---|---|
| [`check-public-repo-then-validate`](evals/scenarios/mock/check-public-repo-then-validate.eval.yaml) | typespec | mock | Validate, then check public-repo presence |
| [`typespec-generation-step02`](evals/scenarios/mock/typespec-generation-step02.eval.yaml) | typespec | mock | Step in the spec-PR generation flow |
| [`rename-client-property`](evals/scenarios/mock/rename-client-property.eval.yaml) | typespec | mock | Stub — needs `expected-diff` grader + sparse clone |
| [`release-planner`](evals/scenarios/live/release-planner.eval.yaml) | release-plan | **live** | Create + re-fetch a release plan, kick off SDK gen, link PR back — real DevOps test-area writes |
Comment on lines +6 to +10

This script:
  1. Runs `azsdk list` to get all registered MCP tool names from the server.
  2. Parses all `triggers-*.eval.yaml` files under the unit/ directory.
  3. Reports any eval tool references that don't exist on the server,
     and any server tools that are missing eval coverage.
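The two-way cross-check that the script header describes boils down to a pair of set differences. A minimal Python sketch of that logic follows; the regex, the function names, and the sample YAML text are assumptions for illustration, not the contents of `Validate-EvalTools.ps1`.

```python
import re
from typing import List, Set, Tuple

# Assumed pattern for bare MCP tool names such as azsdk_get_release_plan.
TOOL_REF = re.compile(r"\bazsdk_[a-z0-9_]+\b")


def referenced_tools(yaml_text: str) -> Set[str]:
    """Collect every bare tool name mentioned in an eval file's text."""
    return set(TOOL_REF.findall(yaml_text))


def drift(server_tools: Set[str], eval_tools: Set[str]) -> Tuple[List[str], List[str]]:
    """Return (unknown, uncovered) tool names, each sorted for stable output."""
    unknown = sorted(eval_tools - server_tools)    # referenced but not on the server
    uncovered = sorted(server_tools - eval_tools)  # on the server but never referenced
    return unknown, uncovered


# Hypothetical inputs: a server catalog and the text of one trigger eval.
server = {"azsdk_get_release_plan", "azsdk_get_kpi_attestation_status"}
sample = "expect: azsdk_get_release_plan\nexpect: azsdk_update_sdk_details_in_release_plan\n"
unknown, uncovered = drift(server, referenced_tools(sample))
```

The `unknown` bucket is the one that catches renames, since a renamed tool leaves a stale reference behind in the eval file.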
Comment on lines +15 to +18

.PARAMETER EvalPath
    Path to the directory containing `triggers-*.eval.yaml` files.
    Defaults to ../evals/unit relative to this script.
Comment on lines +39 to +41

if (-not $EvalPath) {
    $EvalPath = Join-Path $vallyRoot "evals/unit"
}
Comment on lines +11 to +15

# - Not an end-to-end flow (see release-planner-e2e.eval.yaml for that).
# - Does not validate argument values yet — see TODO below + #15833.
# - Does not need azure-rest-api-specs cloned; runs against the live MCP
#   server in agent-testing mode (AZSDKTOOLS_AGENT_TESTING=true, set in
#   ../../.vally.yaml).
Comment on lines +17 to +20

# How to run locally:
#   cd tools/azsdk-cli/Azure.Sdk.Tools.Vally
#   ../../../eng/skill-eval/node_modules/.bin/vally.cmd eval \
#     --eval-spec evals/unit/create-release-plan.eval.yaml --verbose
Comment on lines +16 to +19

Bound to the mock MCP — these graders only inspect skill routing and tool
selection, not real DevOps writes. The full live e2e flow lives in
evals/scenarios/live/release-planner.eval.yaml.
Comment on lines +42 to +45

git:
  type: worktree
  source: C:/Users/gaoh/.vally-cache/azure-rest-api-specs
  ref: main
Comment on lines +55 to +64

# Source is the per-user cache populated by evals/setup/ensure-specs-clone.ps1
# (idempotent shallow+sparse clone, auto-refresh every 24h).
# NOTE: hardcoded absolute path — Vally does not currently expand
# ${USERPROFILE} / env vars in env.git.source. Adjust per machine
# or replace with a CI-provided path. See upstream issue:
# https://github.com/microsoft/vally/issues (TODO: file env-var expansion)
git:
  type: worktree
  source: C:/Users/gaoh/.vally-cache/azure-rest-api-specs
  ref: main
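Since the comment says the path must be adjusted per machine, one possible workaround is to compute it outside the config. The Python sketch below shows the two pieces of logic the comment implies: deriving a per-user cache directory and deciding whether the 24h refresh window has elapsed. Both function names are hypothetical, and nothing here reflects how `ensure-specs-clone.ps1` is actually written.

```python
import time
from pathlib import Path
from typing import Optional

REFRESH_SECONDS = 24 * 60 * 60  # the 24h auto-refresh window from the comment


def specs_cache_dir(home: Optional[Path] = None) -> Path:
    """Per-user cache location, derived from the home directory."""
    return (home or Path.home()) / ".vally-cache" / "azure-rest-api-specs"


def needs_refresh(last_fetch: Optional[float], now: Optional[float] = None) -> bool:
    """True when the cache was never fetched or is at least 24h old."""
    if last_fetch is None:
        return True
    current = time.time() if now is None else now
    return current - last_fetch >= REFRESH_SECONDS


# The resulting path could be substituted into the config by a wrapper or CI step.
print(specs_cache_dir().as_posix())
```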
…ot found' lookup The create-release-plan-and-generate-sdk mock stimulus required the agent to call azsdk_update_sdk_details_in_release_plan, but neither the prompt nor the azsdk-common-prepare-release-plan skill's create flow asks for it. The agent correctly skipped the tool, and the grader flapped. The dedicated update-sdk-details-in-release-plan stimulus already covers that tool with an explicit prompt. Drop it from the create+generate grader so mock matches the live release-planner-e2e contract (create / get / generate / link). Also patch GetReleasePlanForSpecPrHandler to return a deterministic 'not found' response (ReleasePlanDetails = null). The mock previously returned a 'plan exists' result for any spec PR, pushing the agent down the update path instead of the create path that the stimulus exercises. Stimuli that target an existing plan pass the work-item ID directly and call azsdk_get_release_plan, so this is safe.
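The effect of the deterministic 'not found' response on the agent's path can be illustrated with a small Python model. Everything in it is a simplified stand-in: the real handler is the C# `GetReleasePlanForSpecPrHandler`, and the types, function names, and sample URL below are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReleasePlanLookup:
    release_plan_details: Optional[dict]


def get_release_plan_for_spec_pr(spec_pr_url: str) -> ReleasePlanLookup:
    """Mock lookup by spec PR: always 'not found', whatever the URL."""
    return ReleasePlanLookup(release_plan_details=None)


def get_release_plan(work_item_id: int) -> ReleasePlanLookup:
    """Mock lookup by work-item ID: returns a plan for the given ID."""
    return ReleasePlanLookup(release_plan_details={"workItemId": work_item_id})


def next_step(lookup: ReleasePlanLookup) -> str:
    """Path an agent would take after the lookup."""
    return "update" if lookup.release_plan_details is not None else "create"


by_pr = get_release_plan_for_spec_pr("https://example.invalid/spec-pr")  # placeholder URL
by_id = get_release_plan(29262)  # fixture work-item ID mentioned earlier in this PR
```

Keeping the spec-PR lookup empty steers the create stimulus down the create path, while stimuli that pass a work-item ID directly still resolve an existing plan.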
Closes #15124.

Stands up `Azure.Sdk.Tools.Vally` as the home for MCP-tool scenario and trigger evals, ports the legacy `Azure.Sdk.Tools.Cli.Evaluations` benchmarks, and folds in the per-tool trigger evals from #15183 so we have a single eval surface.

## What's in the PR

### New project: `tools/azsdk-cli/Azure.Sdk.Tools.Vally/`

- `.vally.yaml` — single `azsdk-mcp` environment that spawns `Azure.Sdk.Tools.Cli` via `dotnet run`; named suites for selective execution (`typespec`, `release-plan`, `github`, `pipeline`, `scenarios`, `triggers`, `all`).
- `.gitignore` — excludes local `vally-results/` and `results/`.
- `README.md` — explains how Vally evals relate to the per-skill evals under `.github/skills/`, lists scenario + trigger coverage, documents the run loop.

### `evals/scenarios/` — 11 multi-step workflow evals (the #15124 port)

Ported from `Azure.Sdk.Tools.Cli.Evaluations` and reshaped for Vally's `tool-calls` grader:

- `check-public-repo`: … `azure-rest-api-specs`?
- `check-public-repo-then-validate`
- `validate-typespec`: … `tsp` linter/validation
- `typespec-generation-step02`
- `get-modified-typespec-projects`
- `add-arm-resource`: … `azsdk_typespec_generate_authoring_plan` for an ARM resource
- `create-release-plan`
- `link-namespace-approval-issue`
- `get-pr-link-current-branch`
- `check-sdk-generation-status`
- `rename-client-property`: … `expected-diff` grader (follow-up)

### `evals/triggers/` — 9 per-tool trigger evals (ported from #15183)

One YAML per tool category; each stimulus is a single prompt expected to invoke one MCP tool. Used to catch tool-rename / description-drift regressions.

`apiview`, `config`, `engsys`, `github`, `package`, `pipeline`, `releaseplan`, `typespec`, `verify` — covering the bulk of the `azsdk_*` tool surface.

### `scripts/Validate-EvalTools.ps1` (ported from #15183)

Drift detector. Runs `azsdk list --output json` and cross-checks:

- … `evals/triggers/` exists on the running MCP server (catches renames)
- … `hello_world`, `upgrade`, codeowner helpers) are filtered out

## What's not in this PR (deliberate)

- `AZSDKTOOLS_AGENT_TESTING` toggle — currently `false`. There's a real-e2e vs. safe-replay tradeoff worth a separate discussion; flipping it would need either a second `azsdk-mcp-live` environment or a CI policy. Left for a follow-up.
- `rename-client-property` grader — still a stub awaiting a Vally `expected-diff` grader.
- … `ci.yml` under `eng/pipelines/templates` is a follow-up.

## Acknowledgements

Trigger evals + `Validate-EvalTools.ps1` ported from @jeo02's #15183 (`migrate-evaluations-to-vally`); prefixes stripped from `azure-sdk-mcp-azsdk_*` → `azsdk_*` to match the bare MCP tool names emitted in trajectories. Once this PR merges, #15183 is superseded.

## Verification

- `dotnet build` on `Azure.Sdk.Tools.Vally` — green.
- `vally eval --eval-spec evals/scenarios/check-public-repo-then-validate.eval.yaml` — runs end-to-end against the MCP server and grades against `tool-calls` (trajectory captured under `vally-results/`).
- `scripts/Validate-EvalTools.ps1` — runs against a live MCP server and produces the expected coverage report.