Port tool-scenario benchmarks from Azure.Sdk.Tools.Cli.Evaluations to Vally (#15124) by helen229 · Pull Request #15811 · Azure/azure-sdk-tools

helen229 · 2026-06-01T17:22:50Z

Stands up Azure.Sdk.Tools.Vally as the home for MCP-tool scenario and trigger evals, ports the legacy Azure.Sdk.Tools.Cli.Evaluations benchmarks, and folds in the per-tool trigger evals from #15183 so we have a single eval surface.

What's in the PR

New project: `tools/azsdk-cli/Azure.Sdk.Tools.Vally/`

.vally.yaml — single azsdk-mcp environment that spawns Azure.Sdk.Tools.Cli via dotnet run; named suites for selective execution (typespec, release-plan, github, pipeline, scenarios, triggers, all).
.gitignore — excludes local vally-results/ and results/.
README.md — explains how Vally evals relate to the per-skill evals under .github/skills/, lists scenario + trigger coverage, documents the run loop.

`evals/scenarios/` — 11 multi-step workflow evals (the #15124 port)

Ported from Azure.Sdk.Tools.Cli.Evaluations and reshaped for Vally's tool-calls grader:

Scenario	Shape
`check-public-repo`	Single-tool: is a TypeSpec project published in `azure-rest-api-specs`?
`check-public-repo-then-validate`	Multi-tool, ordered: validate then check
`validate-typespec`	Single-tool: `tsp` linter/validation
`typespec-generation-step02`	Step in the spec-PR generation flow
`get-modified-typespec-projects`	Git-aware tool against current branch
`add-arm-resource`	Calls `azsdk_typespec_generate_authoring_plan` for an ARM resource
`create-release-plan`	Single-tool: create a release-plan work item
`link-namespace-approval-issue`	Link an existing approval issue to a release plan
`get-pr-link-current-branch`	Resolve the PR for the active git branch
`check-sdk-generation-status`	Pipeline status lookup
`rename-client-property`	Stub — needs `expected-diff` grader (follow-up)

`evals/triggers/` — 9 per-tool trigger evals (ported from #15183)

One YAML per tool category; each stimulus is a single prompt expected to invoke one MCP tool. Used to catch tool-rename / description-drift regressions.

apiview, config, engsys, github, package, pipeline, releaseplan, typespec, verify — covering the bulk of the azsdk_* tool surface.

`scripts/Validate-EvalTools.ps1` (ported from #15183)

Drift detector. Runs azsdk list --output json and cross-checks:

every tool referenced in evals/triggers/ exists on the running MCP server (catches renames)
every server tool has at least one trigger eval (catches new tools landing without coverage)
known-excluded tools (examples, hello_world, upgrade, codeowner helpers) are filtered out

What's not in this PR (deliberate)

AZSDKTOOLS_AGENT_TESTING toggle — currently false. There's a real-e2e vs. safe-replay tradeoff worth a separate discussion; flipping it would need either a second azsdk-mcp-live environment or a CI policy. Left for a follow-up.
rename-client-property grader — still a stub awaiting a Vally expected-diff grader.
CI wiring — the project builds and runs locally; a ci.yml under eng/pipelines/templates is a follow-up.

Acknowledgements

Trigger evals + Validate-EvalTools.ps1 ported from @jeo02's #15183 (migrate-evaluations-to-vally); prefixes stripped from azure-sdk-mcp-azsdk_* → azsdk_* to match the bare MCP tool names emitted in trajectories. Once this PR merges, #15183 is superseded.

Verification

dotnet build on Azure.Sdk.Tools.Vally — green.
vally eval --eval-spec evals/scenarios/check-public-repo-then-validate.eval.yaml — runs end-to-end against the MCP server and grades against tool-calls (trajectory captured under vally-results/).
scripts/Validate-EvalTools.ps1 — runs against a live MCP server and produces the expected coverage report.

Adds a new Vally eval suite under tools/azsdk-cli/Azure.Sdk.Tools.Vally/ for MCP tool / scenario evaluations, replacing the deleted Azure.Sdk.Tools.Cli.Benchmarks project (#15697). - README documents project intent, layout, local run instructions, and how to add a new scenario. - .vally.yaml wires the azsdk-mcp environment (stdio dotnet run against Azure.Sdk.Tools.Cli) and defines 'typespec' and 'all' suites. - evals/check-public-repo.eval.yaml is the first ported scenario (from the deleted CheckPublicRepoScenario): verifies the agent invokes azsdk_typespec_check_project_in_public_repo for a public-repo check prompt. Lints clean via 'vally lint --eval-spec'. - fixtures/.gitkeep reserves the per-scenario fixtures layout. Remaining scenarios from the deleted benchmark are tracked as a checklist in the project README and in #15124.

Adds eval YAMLs for every scenario that was deleted from Azure.Sdk.Tools.Cli.Benchmarks in #15697: - check-public-repo-then-validate - validate-typespec - typespec-generation-step02 - get-modified-typespec-projects (stub — needs git-repo fixture / setup hook) - add-arm-resource (stub — needs fixtures + npx tsp compile post-check) - create-release-plan - link-namespace-approval-issue - get-pr-link-current-branch - check-sdk-generation-status Each eval uses the built-in tool-calls grader for presence checks; the original benchmark's argument/order/forbidden/optional assertions are captured in prompt text + inline TODOs (require custom graders or upstream Vally support, documented in README). Also adds release-plan/github/pipeline suites to .vally.yaml. All 10 evals pass 'vally lint --eval-spec'.

Ports the deleted RenameClientPropertyScenario as a tool-calls-only stub. Full expected-diff grading + sparse-clone setup hook are tracked as follow-ups in the README.

#15183 - Move 11 multi-step scenario evals to evals/scenarios/ - Port 9 per-tool trigger evals from jeo02/migrate-evaluations-to-vally (PR #15183) to evals/triggers/, stripped azure-sdk-mcp- prefix from graders to match bare MCP tool names - Port Validate-EvalTools.ps1 to scripts/, retargeted at evals/triggers/ with bare-name regex - Update .vally.yaml suites for new layout (scenarios, triggers, all) - Update README to document the split and per-trigger-file tool coverage - Add .gitignore for vally-results/ and results/

Replace per-area folders (scenarios/, triggers/) with tier-based folders. Feature area moves to a YAML tag, enabling tag-filtered suites. Add composite suites (pr-gate, nightly) and area-filtered suites in .vally.yaml. Update Validate-EvalTools.ps1 to scan evals/unit for triggers-*.eval.yaml. Refresh README and Run-LiveEvals.ps1 paths.

…Azure/azure-sdk-tools into feat/vally-tool-scenarios-15124

Drop the local-only convenience wrapper and refer directly to evals/setup/ensure-specs-clone.ps1 in docs and YAML comments. Users prime the spec clone manually and invoke 'vally eval --suite e2e'.

…list (#15852) (#15854) * Add MCP tool-coverage drift check for Azure.Sdk.Tools.Mock (#15852) - New eng/scripts/Get-McpToolInventory.ps1 boots the live Azure.Sdk.Tools.Cli MCP server (via 'azsdk list -o json'), enumerates the IMockToolHandler implementations under Azure.Sdk.Tools.Mock, and reports the diff in three buckets: both / live-only / mock-only. - Cross-references mock-tier eval YAMLs under tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/ when present; gracefully no-ops when that folder hasn't landed yet (PR #15811). - '-CheckOnly' exits non-zero on (a) any stale handler that no longer maps to a live tool, or (b) any tool referenced by a mock-tier eval without a handler -- intended for the CI job tracked in #15829. - Documents the drift workflow in Azure.Sdk.Tools.Mock/README.md so a contributor flagged by the script knows how to add a handler. No stale handlers detected against the current live tool set. * Add mock handlers for remaining live MCP tools; drop eval scanning from inventory script (#15852) - 13 new handler files covering 63 live tools that previously fell back to the default response (APIView, Codeowners, EngSys, GitHub, Package, Pipeline, ReleasePlan, TypeSpec, Verify, Core, Example). - Get-McpToolInventory.ps1: pure live-vs-mock parity (removes Vally eval cross-reference); -CheckOnly fails if either bucket is non-empty. - README: updated sync workflow to reflect parity-only check. * Simplify Get-McpToolInventory.ps1: no parameters, always exits non-zero on drift (#15852) * Fix 3 release-plan handler response types to match live tools (#15852) Addresses Copilot review on PR #15854: - azsdk_get_kpi_attestation_status: ReleaseWorkflowResponse -> ReleasePlanListResponse - azsdk_get_service_details_by_typespec_path: ReleaseWorkflowResponse -> ProductInfoResponse - azsdk_update_language_exclusion_justification: ReleaseWorkflowResponse -> DefaultCommandResponse * Drop Get-McpToolInventory.ps1 (#15852) Per review discussion: the script only checked that an IMockToolHandler exists with the right ToolName; it could not detect handlers that exist but just return the placeholder DefaultCommandResponse. That blind spot makes the script of limited value. A unit test in Cli.Tests is a better fit for actual drift enforcement and is tracked as a follow-up. README updated to drop the script reference. * Update Mock README: drop reference to removed inventory script (#15852)

…rios-15124

All 5 release-planner mock stimuli now use environment.git worktree pointing at the per-user azure-rest-api-specs cache (matching the live e2e fixture), plus a structured e2e-style prompt that supplies the Contoso fixture IDs the mock handlers expect (TypeSpec project, service/product tree IDs, work-item ID 29262). Also document the --skill-dir requirement and worker-cap caveat in README, and fix one stale path in .vally.yaml comment.

helen229 · 2026-06-05T20:23:09Z

Tracking the underlying MCP boot-race root cause + fix rationale in #15948.

- Launch pre-built DLLs via 'dotnet <dll>' in both .vally.yaml files instead of 'dotnet run', so N parallel workers no longer race on Roslyn's exclusive write lock for the output DLL. - Add 'Build MCP servers' step to eng/pipelines/skill-eval.yml so the CI runner has the DLLs ready before vally starts. - Drop the skill-invocation grader from generate-sdk-for-existing-release-plan (no preflight reasoning step required; tools-only). - Strip 'I'm in a checkout of azure-rest-api-specs.' preamble from prompts; the worktree already provides that context. - Remove stray '// tools skills response' artifact in live release-planner.eval.yaml. - README: document 'dotnet build' as a prereq; rewrite workers warning. Validated: scenarios-mock at --workers 6 -> 5/5 stimuli pass, 0 race hits, ~4 min.

Copilot

Pull request overview

This PR establishes tools/azsdk-cli/Azure.Sdk.Tools.Vally as the unified home for Azure SDK MCP tool invocation evals and multi-step workflow scenarios using @microsoft/vally-cli, porting prior benchmark coverage and consolidating per-tool trigger evals into a single surface area. It also updates existing skill-eval infrastructure to launch pre-built MCP server DLLs (avoiding dotnet run/MSBuild races under parallel workers).

Changes:

Added the new Azure.Sdk.Tools.Vally project structure, including Vally config, eval suites (tool triggers + workflow scenarios), and local helper scripts.
Ported and organized trigger eval YAMLs and scenario eval YAMLs to cover tool invocation drift and multi-tool workflows.
Updated skill-eval pipeline/config to run MCP servers via pre-built DLLs (dotnet <dll>) rather than dotnet run.

Reviewed changes

Copilot reviewed 31 out of 31 changed files in this pull request and generated 13 comments.

Show a summary per file

File	Description
tools/azsdk-cli/docs/specs/8-operations-agent-eval-strategy.spec.md	Adds a design/spec document describing the eval strategy and intended suite structure.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/.vally.yaml	Defines Vally environments (mock/live MCP) and suites for running the new evals.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/.gitignore	Ignores local Vally output folders.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/README.md	Documents purpose, layout, and how to run Vally evals for tool scenarios and workflows.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/fixtures/.gitkeep	Establishes fixture folder conventions for eval inputs.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/scripts/Validate-EvalTools.ps1	Adds a drift/coverage validator to cross-check trigger eval tool references vs server tool catalog.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/setup/ensure-specs-clone.ps1	Adds helper to maintain a cached sparse clone of `azure-rest-api-specs` for scenarios needing a repo on disk.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/check-public-repo.eval.yaml	Adds a unit-tier tool-call eval for public-repo presence checks.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/validate-typespec.eval.yaml	Adds a unit-tier tool-call eval for TypeSpec validation.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/get-modified-typespec-projects.eval.yaml	Adds a unit-tier tool-call eval for listing modified TypeSpec projects.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/add-arm-resource.eval.yaml	Adds a (currently stub-like) authoring scenario expecting plan generation + edits.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/create-release-plan.eval.yaml	Adds a unit-tier tool-call eval for creating a release plan.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/link-namespace-approval-issue.eval.yaml	Adds a unit-tier tool-call eval for linking namespace approval to a release plan.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/get-pr-link-current-branch.eval.yaml	Adds a unit-tier tool-call eval for resolving PR link for current branch.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/check-sdk-generation-status.eval.yaml	Adds a unit-tier tool-call eval for pipeline status checks.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-apiview.eval.yaml	Adds trigger stimuli covering APIView-related MCP tools.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-config.eval.yaml	Adds trigger stimuli covering config/label MCP tools.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-engsys.eval.yaml	Adds trigger stimuli covering engineering-system MCP tools (logs/tests/etc.).
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-github.eval.yaml	Adds trigger stimuli covering GitHub MCP tools.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-package.eval.yaml	Adds trigger stimuli covering package generation/build/test/release MCP tools.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-pipeline.eval.yaml	Adds trigger stimuli covering pipeline MCP tools.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-releaseplan.eval.yaml	Adds trigger stimuli covering release-plan MCP tools.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-typespec.eval.yaml	Adds trigger stimuli covering TypeSpec MCP tools.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/tools/triggers-verify.eval.yaml	Adds trigger stimuli covering setup verification MCP tool.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/workflow-scenarios/mock/check-public-repo-then-validate.eval.yaml	Adds a mock multi-tool workflow scenario (validate then public-repo check).
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/workflow-scenarios/mock/typespec-generation-step02.eval.yaml	Adds a mock workflow scenario for TypeSpec generation step 2 behavior.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/workflow-scenarios/mock/rename-client-property.eval.yaml	Adds a stub workflow scenario intended for a future expected-diff grader.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/workflow-scenarios/mock/release-planner-workflows.eval.yaml	Adds mock workflow stimuli for key release-planner scenarios.
tools/azsdk-cli/Azure.Sdk.Tools.Vally/evals/workflow-scenarios/live/release-planner.eval.yaml	Adds a live end-to-end scenario that creates a plan, generates SDK, and links PR back.
eng/pipelines/skill-eval.yml	Pre-builds MCP servers so Vally can launch pre-built DLLs (reducing parallel-run flakiness).
.github/skills/.vally.yaml	Updates skill eval environment config to launch MCP servers via pre-built DLLs.

+
+**Tool-scenario evals (this project)** — organised by the standard test pyramid under [`evals/`](evals/). The folder is the **cost tier** (and CI cadence); the feature **area** is a tag inside each YAML so cross-cuts work via `.vally.yaml` suite filters.
+
+#### `evals/unit/` — hermetic single-tool evals (18)


+#### `evals/scenarios/` — multi-tool scenarios (4)
+
+Multi-step prompts that exercise 2+ MCP tools end-to-end. Split into
+`mock/` (hermetic, runs on PR gate) and `live/` (real DevOps / GitHub /
+pipelines, runs nightly).
+
+| Scenario | Area | Mode | Shape |
+|---|---|---|---|
+| [`check-public-repo-then-validate`](evals/scenarios/mock/check-public-repo-then-validate.eval.yaml) | typespec | mock | Validate, then check public-repo presence |
+| [`typespec-generation-step02`](evals/scenarios/mock/typespec-generation-step02.eval.yaml) | typespec | mock | Step in the spec-PR generation flow |
+| [`rename-client-property`](evals/scenarios/mock/rename-client-property.eval.yaml) | typespec | mock | Stub — needs `expected-diff` grader + sparse clone |
+| [`release-planner`](evals/scenarios/live/release-planner.eval.yaml) | release-plan | **live** | Create + re-fetch a release plan, kick off SDK gen, link PR back — real DevOps test-area writes |


+    This script:
+    1. Runs `azsdk list` to get all registered MCP tool names from the server.
+    2. Parses all `triggers-*.eval.yaml` files under the unit/ directory.
+    3. Reports any eval tool references that don't exist on the server,
+       and any server tools that are missing eval coverage.


+.PARAMETER EvalPath
+    Path to the directory containing `triggers-*.eval.yaml` files.
+    Defaults to ../evals/unit relative to this script.
+


+if (-not $EvalPath) {
+    $EvalPath = Join-Path $vallyRoot "evals/unit"
+}


+#   - Not an end-to-end flow (see release-planner-e2e.eval.yaml for that).
+#   - Does not validate argument values yet — see TODO below + #15833.
+#   - Does not need azure-rest-api-specs cloned; runs against the live MCP
+#     server in agent-testing mode (AZSDKTOOLS_AGENT_TESTING=true, set in
+#     ../../.vally.yaml).


+# How to run locally:
+#   cd tools/azsdk-cli/Azure.Sdk.Tools.Vally
+#   ../../../eng/skill-eval/node_modules/.bin/vally.cmd eval \
+#     --eval-spec evals/unit/create-release-plan.eval.yaml --verbose


+  Bound to the mock MCP — these graders only inspect skill routing and tool
+  selection, not real DevOps writes. The full live e2e flow lives in
+  evals/scenarios/live/release-planner.eval.yaml.
+


+      git:
+        type: worktree
+        source: C:/Users/gaoh/.vally-cache/azure-rest-api-specs
+        ref: main


+      # Source is the per-user cache populated by evals/setup/ensure-specs-clone.ps1
+      # (idempotent shallow+sparse clone, auto-refresh every 24h).
+      # NOTE: hardcoded absolute path — Vally does not currently expand
+      # ${USERPROFILE} / env vars in env.git.source. Adjust per machine
+      # or replace with a CI-provided path. See upstream issue:
+      # https://github.com/microsoft/vally/issues (TODO: file env-var expansion)
+      git:
+        type: worktree
+        source: C:/Users/gaoh/.vally-cache/azure-rest-api-specs
+        ref: main


…ot found' lookup The create-release-plan-and-generate-sdk mock stimulus required the agent to call azsdk_update_sdk_details_in_release_plan, but neither the prompt nor the azsdk-common-prepare-release-plan skill's create flow asks for it. The agent correctly skipped the tool, and the grader flapped. The dedicated update-sdk-details-in-release-plan stimulus already covers that tool with an explicit prompt. Drop it from the create+generate grader so mock matches the live release-planner-e2e contract (create / get / generate / link). Also patch GetReleasePlanForSpecPrHandler to return a deterministic 'not found' response (ReleasePlanDetails = null). The mock previously returned a 'plan exists' result for any spec PR, pushing the agent down the update path instead of the create path that the stimulus exercises. Stimuli that target an existing plan pass the work-item ID directly and call azsdk_get_release_plan, so this is safe.

helen229 added 3 commits June 1, 2026 10:22

Add rename-client-property stub eval to Vally suite (#15124)

26cc6ef

Ports the deleted RenameClientPropertyScenario as a tool-calls-only stub. Full expected-diff grading + sparse-clone setup hook are tracked as follow-ups in the README.

github-actions Bot added the azsdk-cli Issues related to Azure/azure-sdk-tools::tools/azsdk-cli label Jun 1, 2026

This was referenced Jun 2, 2026

Wire vally eval CI job for Azure.Sdk.Tools.Vally tool-scenario evals #15829

Open

Wire workspace setup hooks + mock MCP environment for Vally tool-scenario evals #15831

Open

helen229 added 10 commits June 2, 2026 11:48

Fix tool name prefix in graders, timeout format, expand README

8e4f524

Merge branch 'main' into feat/vally-tool-scenarios-15124

c10063b

update the config and use gpt-5.4 model

02aee34

add disallowed

d1f212f

Merge branch 'feat/vally-tool-scenarios-15124' of https://github.com/…

66216b0

…Azure/azure-sdk-tools into feat/vally-tool-scenarios-15124

Vally: remove Run-LiveEvals.ps1 (local-only test wrapper)

a88ae11

Drop the local-only convenience wrapper and refer directly to evals/setup/ensure-specs-clone.ps1 in docs and YAML comments. Users prime the spec clone manually and invoke 'vally eval --suite e2e'.

some docs and test e2e one

bb47139

update docs

4d89bac

This was referenced Jun 3, 2026

Refresh Azure.Sdk.Tools.Mock + add MCP tool-coverage drift check #15852

Closed

Refresh Azure.Sdk.Tools.Mock handler coverage to match live MCP tool list (#15852) #15854

Merged

Vally results UX: trajectory HTML + history CSV + artifact upload #15861

Open

helen229 added 4 commits June 3, 2026 13:35

udpate design

f6f5c80

update with skill evals

3a8d609

reorg based on the design

b7005b2

remove the duplicates

6db7c5f

helen229 added 6 commits June 4, 2026 07:33

add new scenarios

b77dccb

update the doc

1264e9a

update doc

aa714ab

Merge remote-tracking branch 'origin/main' into feat/vally-tool-scena…

f26cf1f

…rios-15124

update names

fda9ef9

update doc

af3db0c

helen229 mentioned this pull request Jun 4, 2026

docs: add agent-eval-strategy spec for azsdk-cli operations #15918

Draft

helen229 marked this pull request as ready for review June 5, 2026 20:54

Merge branch 'main' into feat/vally-tool-scenarios-15124

12714ae

helen229 requested a review from a team as a code owner June 5, 2026 20:54

Copilot AI review requested due to automatic review settings June 5, 2026 20:54

helen229 requested a review from a team as a code owner June 5, 2026 20:54

Copilot started reviewing on behalf of helen229 June 5, 2026 20:54 View session

Copilot AI reviewed Jun 5, 2026

View reviewed changes

helen229 added 2 commits June 5, 2026 22:44

update readme for runing steps

2ce5e7b

helen229 mentioned this pull request Jun 6, 2026

Skill: orphan tool azsdk_run_generate_sdk mis-routes to generate-sdk-locally #15950

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Port tool-scenario benchmarks from Azure.Sdk.Tools.Cli.Evaluations to Vally (#15124)#15811

Port tool-scenario benchmarks from Azure.Sdk.Tools.Cli.Evaluations to Vally (#15124)#15811
helen229 wants to merge 28 commits into
mainfrom
feat/vally-tool-scenarios-15124

helen229 commented Jun 1, 2026 •

edited

Loading

Uh oh!

helen229 commented Jun 5, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants


		Tool-scenario evals (this project) — organised by the standard test pyramid under [`evals/`](evals/). The folder is the cost tier (and CI cadence); the feature area is a tag inside each YAML so cross-cuts work via `.vally.yaml` suite filters.

		#### `evals/unit/` — hermetic single-tool evals (18)

Conversation

helen229 commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What's in the PR

New project: tools/azsdk-cli/Azure.Sdk.Tools.Vally/

evals/scenarios/ — 11 multi-step workflow evals (the #15124 port)

evals/triggers/ — 9 per-tool trigger evals (ported from #15183)

scripts/Validate-EvalTools.ps1 (ported from #15183)

What's not in this PR (deliberate)

Acknowledgements

Verification

Uh oh!

helen229 commented Jun 5, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

helen229 commented Jun 1, 2026 •

edited

Loading

New project: `tools/azsdk-cli/Azure.Sdk.Tools.Vally/`

`evals/scenarios/` — 11 multi-step workflow evals (the #15124 port)

`evals/triggers/` — 9 per-tool trigger evals (ported from #15183)

`scripts/Validate-EvalTools.ps1` (ported from #15183)