Port tool-scenario benchmarks from Azure.Sdk.Tools.Cli.Evaluations to Vally (#15124) #15811

Open: helen229 wants to merge 28 commits into main from feat/vally-tool-scenarios-15124

helen229 commented Jun 1, 2026

Closes #15124.

Stands up Azure.Sdk.Tools.Vally as the home for MCP-tool scenario and trigger evals, ports the legacy Azure.Sdk.Tools.Cli.Evaluations benchmarks, and folds in the per-tool trigger evals from #15183 so we have a single eval surface.

What's in the PR

New project: tools/azsdk-cli/Azure.Sdk.Tools.Vally/

  • .vally.yaml — single azsdk-mcp environment that spawns Azure.Sdk.Tools.Cli via dotnet run; named suites for selective execution (typespec, release-plan, github, pipeline, scenarios, triggers, all).
  • .gitignore — excludes local vally-results/ and results/.
  • README.md — explains how Vally evals relate to the per-skill evals under .github/skills/, lists scenario + trigger coverage, documents the run loop.

evals/scenarios/ — 11 multi-step workflow evals (the #15124 port)

Ported from Azure.Sdk.Tools.Cli.Evaluations and reshaped for Vally's tool-calls grader:

| Scenario | Shape |
|---|---|
| check-public-repo | Single-tool: is a TypeSpec project published in azure-rest-api-specs? |
| check-public-repo-then-validate | Multi-tool, ordered: validate then check |
| validate-typespec | Single-tool: tsp linter/validation |
| typespec-generation-step02 | Step in the spec-PR generation flow |
| get-modified-typespec-projects | Git-aware tool against current branch |
| add-arm-resource | Calls azsdk_typespec_generate_authoring_plan for an ARM resource |
| create-release-plan | Single-tool: create a release-plan work item |
| link-namespace-approval-issue | Link an existing approval issue to a release plan |
| get-pr-link-current-branch | Resolve the PR for the active git branch |
| check-sdk-generation-status | Pipeline status lookup |
| rename-client-property | Stub; needs expected-diff grader (follow-up) |

evals/triggers/ — 9 per-tool trigger evals (ported from #15183)

One YAML per tool category; each stimulus is a single prompt expected to invoke one MCP tool. Used to catch tool-rename / description-drift regressions.

apiview, config, engsys, github, package, pipeline, releaseplan, typespec, verify — covering the bulk of the azsdk_* tool surface.

scripts/Validate-EvalTools.ps1 (ported from #15183)

Drift detector. Runs azsdk list --output json and cross-checks:

  • every tool referenced in evals/triggers/ exists on the running MCP server (catches renames)
  • every server tool has at least one trigger eval (catches new tools landing without coverage)
  • known-excluded tools (examples, hello_world, upgrade, codeowner helpers) are filtered out
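
The cross-check above amounts to two set differences. A minimal Python sketch, assuming the tool names have already been collected into plain sets; `find_drift` and the placeholder names `azsdk_old_name`, `azsdk_new_name`, and `azsdk_hello_world` are illustrative and are not taken from Validate-EvalTools.ps1:

```python
def find_drift(eval_tools, server_tools, excluded=frozenset()):
    """Compare tools referenced by trigger evals with tools the server reports.

    Returns (missing_on_server, missing_coverage) as sorted lists.
    """
    eval_tools = set(eval_tools)
    server_tools = set(server_tools)

    # Referenced by an eval but absent on the server: likely a rename.
    missing_on_server = sorted(eval_tools - server_tools)

    # Present on the server, not excluded, and never referenced by an eval:
    # a tool that landed without trigger coverage.
    missing_coverage = sorted(server_tools - set(excluded) - eval_tools)
    return missing_on_server, missing_coverage


if __name__ == "__main__":
    # Placeholder names, used only to show the two failure buckets.
    evals = {"azsdk_get_release_plan", "azsdk_old_name"}
    server = {"azsdk_get_release_plan", "azsdk_new_name", "azsdk_hello_world"}
    stale, uncovered = find_drift(evals, server, excluded={"azsdk_hello_world"})
    print("referenced but not on server:", stale)
    print("on server without a trigger eval:", uncovered)
```

Keeping the exclusion list out of the first difference means an eval that references an excluded tool is still checked against the real server catalog.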

What's not in this PR (deliberate)

  • AZSDKTOOLS_AGENT_TESTING toggle — currently false. There's a real-e2e vs. safe-replay tradeoff worth a separate discussion; flipping it would need either a second azsdk-mcp-live environment or a CI policy. Left for a follow-up.
  • rename-client-property grader — still a stub awaiting a Vally expected-diff grader.
  • CI wiring — the project builds and runs locally; a ci.yml under eng/pipelines/templates is a follow-up.

Acknowledgements

Trigger evals + Validate-EvalTools.ps1 ported from @jeo02's #15183 (migrate-evaluations-to-vally); the azure-sdk-mcp- prefix was stripped from tool names (azure-sdk-mcp-azsdk_* → azsdk_*) to match the bare MCP tool names emitted in trajectories. Once this PR merges, #15183 is superseded.
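
For illustration, the prefix strip can be sketched as a one-line normalisation. This is a hedged Python sketch, not the actual change (which edits the YAML graders directly); `to_bare_tool_name` is a hypothetical helper:

```python
PREFIX = "azure-sdk-mcp-"


def to_bare_tool_name(name: str) -> str:
    """Drop the server prefix so a name matches the bare azsdk_* form."""
    # Names that are already bare are returned unchanged.
    return name[len(PREFIX):] if name.startswith(PREFIX) else name
```

With this helper, `azure-sdk-mcp-azsdk_get_release_plan` and `azsdk_get_release_plan` normalise to the same string, which is the form the trajectories are described as emitting.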

Verification

  • dotnet build on Azure.Sdk.Tools.Vally — green.
  • vally eval --eval-spec evals/scenarios/check-public-repo-then-validate.eval.yaml — runs end-to-end against the MCP server and grades against tool-calls (trajectory captured under vally-results/).
  • scripts/Validate-EvalTools.ps1 — runs against a live MCP server and produces the expected coverage report.

helen229 added 3 commits June 1, 2026 10:22
Adds a new Vally eval suite under tools/azsdk-cli/Azure.Sdk.Tools.Vally/ for MCP tool / scenario evaluations, replacing the deleted Azure.Sdk.Tools.Cli.Benchmarks project (#15697).

- README documents project intent, layout, local run instructions, and how to add a new scenario.
- .vally.yaml wires the azsdk-mcp environment (stdio dotnet run against Azure.Sdk.Tools.Cli) and defines 'typespec' and 'all' suites.
- evals/check-public-repo.eval.yaml is the first ported scenario (from the deleted CheckPublicRepoScenario): verifies the agent invokes azsdk_typespec_check_project_in_public_repo for a public-repo check prompt. Lints clean via 'vally lint --eval-spec'.
- fixtures/.gitkeep reserves the per-scenario fixtures layout.

Remaining scenarios from the deleted benchmark are tracked as a checklist in the project README and in #15124.

Adds eval YAMLs for every scenario that was deleted from Azure.Sdk.Tools.Cli.Benchmarks in #15697:

- check-public-repo-then-validate
- validate-typespec
- typespec-generation-step02
- get-modified-typespec-projects (stub; needs git-repo fixture / setup hook)
- add-arm-resource (stub; needs fixtures + npx tsp compile post-check)
- create-release-plan
- link-namespace-approval-issue
- get-pr-link-current-branch
- check-sdk-generation-status

Each eval uses the built-in tool-calls grader for presence checks; the original benchmark's argument/order/forbidden/optional assertions are captured in prompt text + inline TODOs (require custom graders or upstream Vally support, documented in README). Also adds release-plan/github/pipeline suites to .vally.yaml. All 10 evals pass 'vally lint --eval-spec'.

Ports the deleted RenameClientPropertyScenario as a tool-calls-only stub. Full expected-diff grading + sparse-clone setup hook are tracked as follow-ups in the README.
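
To make the gap between presence checks and the deferred order/forbidden assertions concrete, here is a small Python sketch. It assumes a trajectory can be reduced to an ordered list of tool names; the function names are hypothetical and this is not Vally's grader API:

```python
def all_present(trajectory, required):
    """Presence check: every required tool appears somewhere in the trajectory."""
    return set(required) <= set(trajectory)


def in_order(trajectory, expected):
    """Order check: expected tools appear as a subsequence of the trajectory."""
    calls = iter(trajectory)
    # `tool in calls` consumes the iterator up to the first match, so each
    # expected tool has to be found after the previous one.
    return all(tool in calls for tool in expected)


def none_forbidden(trajectory, forbidden):
    """Forbidden check: no forbidden tool was called at all."""
    return not (set(trajectory) & set(forbidden))


if __name__ == "__main__":
    trajectory = [
        "azsdk_get_release_plan",
        "azsdk_update_sdk_details_in_release_plan",
    ]
    swapped = list(reversed(trajectory))
    print(all_present(trajectory, swapped))  # presence ignores ordering
    print(in_order(trajectory, swapped))     # order check rejects the swap
```

A presence-only grader accepts both orderings, which is why the ordered scenarios note their sequencing expectations in prompt text and TODOs for now.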
helen229 added 10 commits June 2, 2026 11:48
#15183

- Move 11 multi-step scenario evals to evals/scenarios/
- Port 9 per-tool trigger evals from jeo02/migrate-evaluations-to-vally (PR #15183) to evals/triggers/, stripped azure-sdk-mcp- prefix from graders to match bare MCP tool names
- Port Validate-EvalTools.ps1 to scripts/, retargeted at evals/triggers/ with bare-name regex
- Update .vally.yaml suites for new layout (scenarios, triggers, all)
- Update README to document the split and per-trigger-file tool coverage
- Add .gitignore for vally-results/ and results/

Replace per-area folders (scenarios/, triggers/) with tier-based folders. Feature area moves to a YAML tag, enabling tag-filtered suites. Add composite suites (pr-gate, nightly) and area-filtered suites in .vally.yaml. Update Validate-EvalTools.ps1 to scan evals/unit for triggers-*.eval.yaml. Refresh README and Run-LiveEvals.ps1 paths.

Drop the local-only convenience wrapper and refer directly to evals/setup/ensure-specs-clone.ps1 in docs and YAML comments. Users prime the spec clone manually and invoke 'vally eval --suite e2e'.
helen229 added a commit that referenced this pull request Jun 4, 2026
…list (#15852) (#15854)

* Add MCP tool-coverage drift check for Azure.Sdk.Tools.Mock (#15852)

- New eng/scripts/Get-McpToolInventory.ps1 boots the live Azure.Sdk.Tools.Cli MCP server (via 'azsdk list -o json'), enumerates the IMockToolHandler implementations under Azure.Sdk.Tools.Mock, and reports the diff in three buckets: both / live-only / mock-only.

- Cross-references mock-tier eval YAMLs under tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/ when present; gracefully no-ops when that folder hasn't landed yet (PR #15811).

- '-CheckOnly' exits non-zero on (a) any stale handler that no longer maps to a live tool, or (b) any tool referenced by a mock-tier eval without a handler -- intended for the CI job tracked in #15829.

- Documents the drift workflow in Azure.Sdk.Tools.Mock/README.md so a contributor flagged by the script knows how to add a handler. No stale handlers detected against the current live tool set.

* Add mock handlers for remaining live MCP tools; drop eval scanning from inventory script (#15852)

- 13 new handler files covering 63 live tools that previously fell back to the default response (APIView, Codeowners, EngSys, GitHub, Package, Pipeline, ReleasePlan, TypeSpec, Verify, Core, Example).
- Get-McpToolInventory.ps1: pure live-vs-mock parity (removes Vally eval cross-reference); -CheckOnly fails if either bucket is non-empty.
- README: updated sync workflow to reflect parity-only check.

* Simplify Get-McpToolInventory.ps1: no parameters, always exits non-zero on drift (#15852)

* Fix 3 release-plan handler response types to match live tools (#15852)

Addresses Copilot review on PR #15854:
- azsdk_get_kpi_attestation_status: ReleaseWorkflowResponse -> ReleasePlanListResponse
- azsdk_get_service_details_by_typespec_path: ReleaseWorkflowResponse -> ProductInfoResponse
- azsdk_update_language_exclusion_justification: ReleaseWorkflowResponse -> DefaultCommandResponse

* Drop Get-McpToolInventory.ps1 (#15852)

Per review discussion: the script only checked that an IMockToolHandler exists with the right ToolName; it could not detect handlers that exist but just return the placeholder DefaultCommandResponse. That blind spot makes the script of limited value. A unit test in Cli.Tests is a better fit for actual drift enforcement and is tracked as a follow-up. README updated to drop the script reference.

* Update Mock README: drop reference to removed inventory script (#15852)
helen229 added 6 commits June 4, 2026 07:33
All 5 release-planner mock stimuli now use environment.git worktree pointing at the per-user azure-rest-api-specs cache (matching the live e2e fixture), plus a structured e2e-style prompt that supplies the Contoso fixture IDs the mock handlers expect (TypeSpec project, service/product tree IDs, work-item ID 29262). Also document the --skill-dir requirement and worker-cap caveat in README, and fix one stale path in .vally.yaml comment.

helen229 commented Jun 5, 2026

Tracking the underlying MCP boot-race root cause + fix rationale in #15948.

- Launch pre-built DLLs via 'dotnet <dll>' in both .vally.yaml files instead of 'dotnet run', so N parallel workers no longer race on Roslyn's exclusive write lock for the output DLL.

- Add 'Build MCP servers' step to eng/pipelines/skill-eval.yml so the CI runner has the DLLs ready before vally starts.

- Drop the skill-invocation grader from generate-sdk-for-existing-release-plan (no preflight reasoning step required; tools-only).

- Strip 'I'm in a checkout of azure-rest-api-specs.' preamble from prompts; the worktree already provides that context.

- Remove stray '// tools skills response' artifact in live release-planner.eval.yaml.

- README: document 'dotnet build' as a prereq; rewrite workers warning.

Validated: scenarios-mock at --workers 6 -> 5/5 stimuli pass, 0 race hits, ~4 min.
helen229 marked this pull request as ready for review June 5, 2026 20:54
helen229 requested a review from a team as a code owner June 5, 2026 20:54
Copilot AI review requested due to automatic review settings June 5, 2026 20:54
helen229 requested a review from a team as a code owner June 5, 2026 20:54

Copilot AI left a comment

Pull request overview

This PR establishes tools/azsdk-cli/Azure.Sdk.Tools.Vally as the unified home for Azure SDK MCP tool invocation evals and multi-step workflow scenarios using @microsoft/vally-cli, porting prior benchmark coverage and consolidating per-tool trigger evals into a single surface area. It also updates existing skill-eval infrastructure to launch pre-built MCP server DLLs (avoiding dotnet run/MSBuild races under parallel workers).

Changes:

  • Added the new Azure.Sdk.Tools.Vally project structure, including Vally config, eval suites (tool triggers + workflow scenarios), and local helper scripts.
  • Ported and organized trigger eval YAMLs and scenario eval YAMLs to cover tool invocation drift and multi-tool workflows.
  • Updated skill-eval pipeline/config to run MCP servers via pre-built DLLs (dotnet <dll>) rather than dotnet run.

Reviewed changes

Copilot reviewed 31 out of 31 changed files in this pull request and generated 13 comments.

| File | Description |
|---|---|
| `tools/azsdk-cli/docs/specs/8-operations-agent-eval-strategy.spec.md` | Adds a design/spec document describing the eval strategy and intended suite structure. |
| `tools/azsdk-cli/Azure.Sdk.Tools.Vally/.vally.yaml` | Defines Vally environments (mock/live MCP) and suites for running the new evals. |
| `tools/azsdk-cli/Azure.Sdk.Tools.Vally/.gitignore` | Ignores local Vally output folders. |
| `tools/azsdk-cli/Azure.Sdk.Tools.Vally/README.md` | Documents purpose, layout, and how to run Vally evals for tool scenarios and workflows. |
| `tools/azsdk-cli/Azure.Sdk.Tools.Vally/fixtures/.gitkeep` | Establishes fixture folder conventions for eval inputs. |
| `tools/azsdk-cli/Azure.Sdk.Tools.Vally/scripts/Validate-EvalTools.ps1` | Adds a drift/coverage validator to cross-check trigger eval tool references vs server tool catalog. |
| `tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/setup/ensure-specs-clone.ps1` | Adds helper to maintain a cached sparse clone of azure-rest-api-specs for scenarios needing a repo on disk. |
| `tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/check-public-repo.eval.yaml` | Adds a unit-tier tool-call eval for public-repo presence checks. |
| `tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/validate-typespec.eval.yaml` | Adds a unit-tier tool-call eval for TypeSpec validation. |
| `tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/get-modified-typespec-projects.eval.yaml` | Adds a unit-tier tool-call eval for listing modified TypeSpec projects. |
| `tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/add-arm-resource.eval.yaml` | Adds a (currently stub-like) authoring scenario expecting plan generation + edits. |
| `tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/create-release-plan.eval.yaml` | Adds a unit-tier tool-call eval for creating a release plan. |
| `tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/link-namespace-approval-issue.eval.yaml` | Adds a unit-tier tool-call eval for linking namespace approval to a release plan. |
| `tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/get-pr-link-current-branch.eval.yaml` | Adds a unit-tier tool-call eval for resolving PR link for current branch. |
| `tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/check-sdk-generation-status.eval.yaml` | Adds a unit-tier tool-call eval for pipeline status checks. |
| `tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-apiview.eval.yaml` | Adds trigger stimuli covering APIView-related MCP tools. |
| `tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-config.eval.yaml` | Adds trigger stimuli covering config/label MCP tools. |
| `tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-engsys.eval.yaml` | Adds trigger stimuli covering engineering-system MCP tools (logs/tests/etc.). |
| `tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-github.eval.yaml` | Adds trigger stimuli covering GitHub MCP tools. |
| `tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-package.eval.yaml` | Adds trigger stimuli covering package generation/build/test/release MCP tools. |
| `tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-pipeline.eval.yaml` | Adds trigger stimuli covering pipeline MCP tools. |
| `tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-releaseplan.eval.yaml` | Adds trigger stimuli covering release-plan MCP tools. |
| `tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-typespec.eval.yaml` | Adds trigger stimuli covering TypeSpec MCP tools. |
| `tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-verify.eval.yaml` | Adds trigger stimuli covering setup verification MCP tool. |
| `tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/workflow-scenarios/mock/check-public-repo-then-validate.eval.yaml` | Adds a mock multi-tool workflow scenario (validate then public-repo check). |
| `tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/workflow-scenarios/mock/typespec-generation-step02.eval.yaml` | Adds a mock workflow scenario for TypeSpec generation step 2 behavior. |
| `tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/workflow-scenarios/mock/rename-client-property.eval.yaml` | Adds a stub workflow scenario intended for a future expected-diff grader. |
| `tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/workflow-scenarios/mock/release-planner-workflows.eval.yaml` | Adds mock workflow stimuli for key release-planner scenarios. |
| `tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/workflow-scenarios/live/release-planner.eval.yaml` | Adds a live end-to-end scenario that creates a plan, generates SDK, and links PR back. |
| `eng/pipelines/skill-eval.yml` | Pre-builds MCP servers so Vally can launch pre-built DLLs (reducing parallel-run flakiness). |
| `.github/skills/.vally.yaml` | Updates skill eval environment config to launch MCP servers via pre-built DLLs. |


**Tool-scenario evals (this project)** — organised by the standard test pyramid under [`evals/`](evals/). The folder is the **cost tier** (and CI cadence); the feature **area** is a tag inside each YAML so cross-cuts work via `.vally.yaml` suite filters.

#### `evals/unit/` — hermetic single-tool evals (18)
Comment on lines +71 to +82
#### `evals/scenarios/` — multi-tool scenarios (4)

Multi-step prompts that exercise 2+ MCP tools end-to-end. Split into
`mock/` (hermetic, runs on PR gate) and `live/` (real DevOps / GitHub /
pipelines, runs nightly).

| Scenario | Area | Mode | Shape |
|---|---|---|---|
| [`check-public-repo-then-validate`](evals/scenarios/mock/check-public-repo-then-validate.eval.yaml) | typespec | mock | Validate, then check public-repo presence |
| [`typespec-generation-step02`](evals/scenarios/mock/typespec-generation-step02.eval.yaml) | typespec | mock | Step in the spec-PR generation flow |
| [`rename-client-property`](evals/scenarios/mock/rename-client-property.eval.yaml) | typespec | mock | Stub — needs `expected-diff` grader + sparse clone |
| [`release-planner`](evals/scenarios/live/release-planner.eval.yaml) | release-plan | **live** | Create + re-fetch a release plan, kick off SDK gen, link PR back — real DevOps test-area writes |
Comment on lines +6 to +10
This script:
1. Runs `azsdk list` to get all registered MCP tool names from the server.
2. Parses all `triggers-*.eval.yaml` files under the unit/ directory.
3. Reports any eval tool references that don't exist on the server,
   and any server tools that are missing eval coverage.
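
Step 2 of the script description can be approximated with a regular expression over the raw YAML text. The Python sketch below is an assumption about how a bare-name match might look, not a port of the PowerShell script; the pattern and `extract_tool_names` are hypothetical:

```python
import re

# Match bare tool names such as azsdk_get_release_plan, but skip names that
# still carry a prefix (for example azure-sdk-mcp-azsdk_...), by requiring
# that the match is not preceded by a word character or a hyphen.
BARE_TOOL_NAME = re.compile(r"(?<![\w-])azsdk_[a-z0-9_]+")


def extract_tool_names(yaml_text: str):
    """Return the sorted, de-duplicated bare tool names found in the text."""
    return sorted(set(BARE_TOOL_NAME.findall(yaml_text)))
```

Whether prefixed names should be ignored or reported as errors is a policy choice; this sketch simply ignores them.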
Comment on lines +15 to +18
.PARAMETER EvalPath
    Path to the directory containing `triggers-*.eval.yaml` files.
    Defaults to ../evals/unit relative to this script.

Comment on lines +39 to +41
if (-not $EvalPath) {
    $EvalPath = Join-Path $vallyRoot "evals/unit"
}
Comment on lines +11 to +15
# - Not an end-to-end flow (see release-planner-e2e.eval.yaml for that).
# - Does not validate argument values yet — see TODO below + #15833.
# - Does not need azure-rest-api-specs cloned; runs against the live MCP
# server in agent-testing mode (AZSDKTOOLS_AGENT_TESTING=true, set in
# ../../.vally.yaml).
Comment on lines +17 to +20
# How to run locally:
# cd tools/azsdk-cli/Azure.Sdk.Tools.Vally
# ../../../eng/skill-eval/node_modules/.bin/vally.cmd eval \
# --eval-spec evals/unit/create-release-plan.eval.yaml --verbose
Comment on lines +16 to +19
Bound to the mock MCP — these graders only inspect skill routing and tool
selection, not real DevOps writes. The full live e2e flow lives in
evals/scenarios/live/release-planner.eval.yaml.

Comment on lines +42 to +45
git:
  type: worktree
  source: C:/Users/gaoh/.vally-cache/azure-rest-api-specs
  ref: main
Comment on lines +55 to +64
# Source is the per-user cache populated by evals/setup/ensure-specs-clone.ps1
# (idempotent shallow+sparse clone, auto-refresh every 24h).
# NOTE: hardcoded absolute path — Vally does not currently expand
# ${USERPROFILE} / env vars in env.git.source. Adjust per machine
# or replace with a CI-provided path. See upstream issue:
# https://github.com/microsoft/vally/issues (TODO: file env-var expansion)
git:
  type: worktree
  source: C:/Users/gaoh/.vally-cache/azure-rest-api-specs
  ref: main
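
Since the note above says environment variables are not expanded in the `source` field, one possible workaround is to compute the literal path per machine and write it into the config before running. A minimal Python sketch, assuming the cache lives at `.vally-cache/azure-rest-api-specs` under the user's home directory (inferred from the path shown above, not confirmed); `specs_cache_source` is a hypothetical helper:

```python
from pathlib import Path


def specs_cache_source(home=None):
    """Build the per-user cache path with forward slashes, as in the YAML above."""
    # Fall back to the current user's home directory when none is given.
    base = Path(home) if home is not None else Path.home()
    return (base / ".vally-cache" / "azure-rest-api-specs").as_posix()


if __name__ == "__main__":
    # Prints the literal value to paste or template into the source field.
    print(specs_cache_source())
```

A CI job could call the same helper with its own workspace directory instead of a home directory.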
helen229 added 2 commits June 5, 2026 22:44
…ot found' lookup

The create-release-plan-and-generate-sdk mock stimulus required the agent to call
azsdk_update_sdk_details_in_release_plan, but neither the prompt nor the
azsdk-common-prepare-release-plan skill's create flow asks for it. The agent
correctly skipped the tool, and the grader flapped. The dedicated
update-sdk-details-in-release-plan stimulus already covers that tool with an
explicit prompt. Drop it from the create+generate grader so mock matches the
live release-planner-e2e contract (create / get / generate / link).

Also patch GetReleasePlanForSpecPrHandler to return a deterministic
'not found' response (ReleasePlanDetails = null). The mock previously
returned a 'plan exists' result for any spec PR, pushing the agent down
the update path instead of the create path that the stimulus exercises.
Stimuli that target an existing plan pass the work-item ID directly and
call azsdk_get_release_plan, so this is safe.
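
The effect of the deterministic 'not found' response can be sketched as follows. This is a Python illustration under stated assumptions, not the C# handler: the class, field, and function names are hypothetical stand-ins for GetReleasePlanForSpecPrHandler and its response type, and `next_step` models only the branching the commit message describes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReleasePlanLookupResponse:
    # None stands in for "ReleasePlanDetails = null", meaning no plan exists.
    release_plan_details: Optional[dict] = None


def get_release_plan_for_spec_pr(spec_pr_url: str) -> ReleasePlanLookupResponse:
    """Mock lookup: always reports that no plan exists for the spec PR."""
    return ReleasePlanLookupResponse(release_plan_details=None)


def next_step(response: ReleasePlanLookupResponse) -> str:
    """Choose the path a caller would take after the lookup."""
    return "update" if response.release_plan_details is not None else "create"
```

Because the mock never returns plan details, every spec-PR lookup leads to the create path, while stimuli that already hold a work-item ID bypass this lookup entirely.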
Labels

azsdk-cli Issues related to Azure/azure-sdk-tools::tools/azsdk-cli

Development

Successfully merging this pull request may close these issues.

Migrate Benchmarks + Tool invocation from evaluate

2 participants