feat: add azd CLI evaluation and testing framework #7202
Conversation
Pull request overview
Adds a new cli/azd/test/eval/ evaluation framework intended to measure how well GitHub Copilot CLI (and humans) can discover and use azd commands, plus scheduled GitHub Actions workflows to run these evals and publish artifacts/reports.
Changes:
- Introduces a Node/TypeScript Jest test harness for unit-style CLI surface validation (help text, flags, sequencing).
- Adds Waza task YAMLs (deploy/troubleshoot/environment/lifecycle/negative scenarios) and Python grader scripts for infra/app validation.
- Adds GitHub Actions workflows to run unit tests on PRs and scheduled Waza/E2E/report jobs.
Reviewed changes
Copilot reviewed 36 out of 38 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| cli/azd/test/eval/tsconfig.json | TypeScript build configuration for eval tests/tools |
| cli/azd/test/eval/package.json | Node package + scripts for running Jest/Waza/reporting |
| cli/azd/test/eval/package-lock.json | Locked dependency tree for reproducible installs |
| cli/azd/test/eval/jest.config.ts | Jest configuration (ts-jest + junit in CI) |
| cli/azd/test/eval/.gitignore | Ignores build outputs and generated reports |
| cli/azd/test/eval/reports/.gitkeep | Keeps reports/ directory in git |
| cli/azd/test/eval/eval.yaml | Waza eval configuration (executor/model/metrics/task globs) |
| cli/azd/test/eval/README.md | Documentation for running/extending eval framework |
| cli/azd/test/eval/tests/unit/command-registry.test.ts | Verifies core commands exist and respond to --help |
| cli/azd/test/eval/tests/unit/help-text-quality.test.ts | Checks help output contains expected sections/descriptions |
| cli/azd/test/eval/tests/unit/flag-validation.test.ts | Validates key flags appear/behave as expected |
| cli/azd/test/eval/tests/unit/command-sequencing.test.ts | Ensures commands fail with guidance in empty dirs |
| cli/azd/test/eval/tests/human/cli-workflow.test.ts | “Human baseline” tests for responsiveness/basic UX expectations |
| cli/azd/test/eval/tests/human/command-discovery.test.ts | “Human baseline” tests focused on discovering commands/flags |
| cli/azd/test/eval/tests/human/error-recovery.test.ts | “Human baseline” tests for actionable errors and recovery hints |
| cli/azd/test/eval/tasks/deploy/deploy-python-webapp.yaml | Waza task: deploy Python app guidance |
| cli/azd/test/eval/tasks/deploy/deploy-node-api.yaml | Waza task: deploy Node API guidance |
| cli/azd/test/eval/tasks/deploy/deploy-existing-project.yaml | Waza task: deploy existing azd project (avoid init) |
| cli/azd/test/eval/tasks/environment/create-staging.yaml | Waza task: create staging environment workflow |
| cli/azd/test/eval/tasks/environment/switch-env.yaml | Waza task: switch environments |
| cli/azd/test/eval/tasks/environment/delete-env.yaml | Waza task: teardown + delete environment workflow |
| cli/azd/test/eval/tasks/lifecycle/full-lifecycle.yaml | Waza task: init→provision→deploy→down sequence |
| cli/azd/test/eval/tasks/lifecycle/teardown-only.yaml | Waza task: down/cleanup guidance |
| cli/azd/test/eval/tasks/troubleshoot/auth-error.yaml | Waza task: troubleshoot auth error guidance |
| cli/azd/test/eval/tasks/troubleshoot/config-error.yaml | Waza task: troubleshoot malformed azure.yaml |
| cli/azd/test/eval/tasks/troubleshoot/quota-error.yaml | Waza task: troubleshoot quota error |
| cli/azd/test/eval/tasks/troubleshoot/provision-role-conflict.yaml | Waza task: troubleshoot RBAC role assignment conflict |
| cli/azd/test/eval/tasks/negative/raw-azure-cli.yaml | Waza negative task: use az not azd |
| cli/azd/test/eval/tasks/negative/not-azure.yaml | Waza negative task: non-Azure question should avoid azd |
| cli/azd/test/eval/tasks/negative/general-coding.yaml | Waza negative task: general coding response without azd |
| cli/azd/test/eval/graders/infra_validator.py | Python grader stub for ARM resource existence validation |
| cli/azd/test/eval/graders/cleanup_validator.py | Python grader stub for post-azd down cleanup validation |
| cli/azd/test/eval/graders/app_health.py | Python grader stub for HTTP endpoint health validation |
| cli/azd/.vscode/cspell.yaml | Adds spelling dictionary overrides for eval docs |
| .github/workflows/eval-unit.yml | PR workflow to build azd + run Jest unit suite |
| .github/workflows/eval-waza.yml | Scheduled workflow to run Waza evaluations |
| .github/workflows/eval-e2e.yml | Scheduled workflow intended for E2E lifecycle evals with Azure login |
| .github/workflows/eval-report.yml | Scheduled workflow intended to generate weekly comparison/regression issues |
Good initiative - adding eval coverage for Copilot CLI interactions with azd fills a real gap. The Waza task definitions are well-structured, grader weights are mathematically correct across all 14 tasks, and the CI workflow design (unit on PR, Waza 3x/day, E2E weekly) is sensible.
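As a sanity check, the weight math mentioned above can be spot-verified with a few lines of Python. The `task` dict and its `graders`/`weight` field names below are assumptions for illustration, not the actual Waza task schema:

```python
import math

def weights_ok(task: dict) -> bool:
    # Hypothetical schema: a "graders" list whose "weight" fields should
    # sum to 1.0. Field names are illustrative, not taken from this PR.
    total = sum(g["weight"] for g in task.get("graders", []))
    return math.isclose(total, 1.0, abs_tol=1e-9)

# Example task shaped like a parsed YAML file (grader names invented):
task = {
    "graders": [
        {"name": "mentions_azd_deploy", "weight": 0.5},
        {"name": "mentions_provision", "weight": 0.3},
        {"name": "no_raw_az", "weight": 0.2},
    ],
}
```

A check like this could also run in the pytest grader suite so a future task edit cannot silently break the weight sum.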
However, there are structural and reliability issues that should be addressed before merge:
- The `azd()` test helper is copy-pasted across 7 files with subtle inconsistencies (`NO_COLOR` vs `AZD_DEBUG_FORCE_NO_TTY`, `e: any` vs `e: unknown`) - this is already causing bugs and will make maintenance painful
- Human test files don't set `NO_COLOR: "1"`, so regex assertions against help text will be flaky when ANSI escape codes are present
- The `eval.yaml` system prompt omits `azd env delete`, but `delete-env.yaml` expects the LLM to suggest it - this task will score poorly by design
- `app_health.py` has inconsistent retry logic: status mismatches retry, but body-content mismatches return failure immediately
- Two npm devDependencies (`@azure/arm-resources`, `@azure/identity`) are never imported anywhere
I've excluded items already covered by the existing review.
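The retry inconsistency flagged above can be sketched as a single loop that treats both mismatch kinds the same way. This is a minimal sketch assuming a hypothetical `probe` callable; `wait_healthy` is not the repo's actual `app_health.py` API:

```python
import time

def wait_healthy(probe, expected_status, expected_body, attempts=3, delay=0):
    """Retry until the endpoint returns the expected status AND body.

    `probe` is a hypothetical callable returning (status, body). Both a
    wrong status and a wrong body fall through to the same retry path,
    instead of failing fast on a body mismatch only.
    """
    for _ in range(attempts):
        status, body = probe()
        if status == expected_status and expected_body in body:
            return True
        time.sleep(delay)  # back off before the next probe
    return False
```

The key point is that there is exactly one failure path, so the two mismatch kinds cannot drift apart again.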
Resolves all issues raised in PR #7202 review:
- Extract shared azd() test helper to tests/test-utils.ts (eliminates duplication across 7 files, consistent NO_COLOR + AZD_FORCE_TTY)
- Fix AZD_DEBUG_FORCE_NO_TTY → AZD_FORCE_TTY=false in all test files
- Add NO_COLOR=1 to human tests (prevents ANSI flakiness)
- Use catch(e: unknown) with proper type narrowing everywhere
- Add azd env delete to eval.yaml system prompt (fixes delete-env task)
- Fix app_health.py retry logic for body-content mismatches
- Remove unused @azure/arm-resources and @azure/identity deps
- Remove missing scripts/ references from package.json and tsconfig
- Reduce jest timeout from 5min to 30s
- eval-unit.yml: add permissions block and waza:validate step
- eval-waza.yml: fix PATH via GITHUB_PATH instead of env.PATH
- eval-e2e.yml: align waza install, fix cleanup step working directory
- eval-report.yml: use gh CLI for cross-run artifact download
- Remove non-existent eval-human.yml from README CI table
- Add cspell overrides for grader/task/test files

All 7 suites pass (125 tests + 4 skipped E2E).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Force-pushed 59c2cb2 to ef33cf4
@jongio - thanks for the feedback. Everything has been addressed here. Ready for another review. @rajeshkamal5050 / @wbreza - we're going to need someone here with admin access to set up the CI portions, tokens, and subscription. See the PR description for how.
@jongio All review feedback addressed and CI is fully green. Changes: shared test-utils.ts helper, 15 pytest grader tests, app_health.py retry fix, azd env delete in eval.yaml, removed unused deps, jest timeout reduction, workflow fixes (permissions, PATH, artifacts, cleanup), cspell overrides. PR body updated with full setup instructions. Ready for re-review!
jongio left a comment
Solid foundation for measuring Copilot CLI + azd interactions. The task YAML structure is well-designed, grader weight math is correct, and the CI pipeline layout (unit on PR, Waza scheduled, E2E weekly) makes sense. I've skipped items already raised in the existing reviews and focused on issues I haven't seen mentioned.
The graders have a logic gap in how urlopen handles non-2xx responses - it throws before your status comparison runs, so expected_status only works for 2xx codes. The get_access_token() function is copy-pasted across two grader files. The report workflow is entirely non-functional (placeholder echo + missing dependency file). A couple of the task YAML graders are either redundant or too strict in what they require from the LLM response.
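The urlopen gap comes from urllib raising `HTTPError` for any non-2xx response, so a status comparison placed after the call never runs for 4xx/5xx codes. A minimal sketch of the fix: compare `expected_status` against `HTTPError.code` in the except branch. The function name and injectable `opener` are illustrative, not the repo's code:

```python
from urllib.error import HTTPError
from urllib.request import urlopen

def status_matches(url, expected_status, opener=urlopen):
    """Return True if the URL responds with expected_status.

    urlopen raises HTTPError for non-2xx codes, so checking for e.g. an
    expected 404 must compare against e.code in the except branch rather
    than against a response object that is never returned. `opener` is
    injectable only to keep this sketch testable offline.
    """
    try:
        with opener(url) as resp:
            return resp.status == expected_status
    except HTTPError as e:
        return e.code == expected_status
```

With this shape, an `expected_status` of 404 or 503 is matched the same way as a 200, which is what the review is asking for.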
Add a comprehensive eval/test framework for measuring how GitHub Copilot CLI interacts with azd. Includes:
- 75 unit tests across 4 suites: command registry, help text quality, command sequencing, and flag validation
- Human-scenario test stubs for CLI workflow, command discovery, and error recovery evaluation
- Waza-compatible task YAML definitions for LLM eval (deploy, lifecycle, environment, troubleshoot, negative scenarios)
- Custom graders for infrastructure validation, app health, and cleanup
- CI workflows for unit tests, E2E tests, Waza runs, and report generation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…d tests

Covers:
- Step-by-step Waza LLM eval task creation with full YAML reference
- Grader reference table (text, action_sequence, behavior, code)
- Custom Python grader authoring with Azure ARM API example
- Jest unit test and human scenario test templates
- Directory structure reference
- Regex tips and common patterns

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- command-registry: exclude beta-gated commands (build) from root --help assertion
- command-sequencing: accept auth-related errors in CI where no Azure login exists
- cspell: add waza/urlopen to eval README overrides

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Removing the AZD_CONFIG_DIR override prevents tests from wiping auth state and triggering browser-based Azure login. Tests only need an empty cwd to verify azd fails gracefully without a project.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ntee

Adds a test tier table showing which tests need auth, step-by-step service principal setup for CI, and local subscription configuration. Clarifies that no test should ever open a browser.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- app_health.py: handle non-2xx expected_status via HTTPError.code comparison
- Extract shared get_access_token() to graders/azure_auth.py (was duplicated in cleanup_validator.py and infra_validator.py)
- eval-report.yml: remove non-functional regression issue step, drop issues:write permission, add TODO for future report generation script
- teardown-only.yaml: relax --purge from must_match to must_match_any (--force without --purge is a valid response)
- deploy-existing-project.yaml: replace duplicate grader with check for --no-prompt, azure.yaml, service, or --all
- test-utils.ts: add .exe extension on Windows for cross-platform support
- Add 2 new pytest tests for HTTPError expected_status matching

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Force-pushed ef33cf4 to 24d2af3
@jongio Round 2 feedback addressed, rebased on main, and all CI passing locally. Changes: HTTPError non-2xx handling in app_health.py, shared azure_auth.py module, cleaned-up eval-report.yml, relaxed teardown-only.yaml grader, fixed duplicate grader in deploy-existing-project.yaml, Windows .exe support in test-utils.ts. Replied to each comment individually. Ready for re-review!
Problem
We have no visibility into how GitHub Copilot CLI interacts with azd. Unlike the microsoft/github-copilot-for-azure skills repo — which has a comprehensive test, eval, and CI setup — the azd CLI has zero coverage for measuring LLM interactions, command discoverability, or human usability patterns. We have no idea whether Copilot CLI can successfully discover commands, interpret help text, handle errors gracefully, or guide users through common workflows.

Solution
This PR adds a comprehensive evaluation and testing framework at cli/azd/test/eval/, inspired by the GHCP4A setup. It covers both LLM eval (how well an AI agent uses azd) and non-LLM unit tests (how well azd surfaces information for human and AI consumption).

What's included
125 passing tests across 7 suites:
- command registry: core commands respond to --help
- flag validation: --output json, --no-prompt, and -e/--environment flags

15 Python grader unit tests (pytest):
- app_health.py — HTTP health checks with retry logic
- cleanup_validator.py — ARM API validation for post-azd down cleanup
- infra_validator.py — ARM API validation for post-azd provision resources

14 Waza LLM eval task definitions (YAML)

4 CI workflows:
- eval-unit.yml — runs unit tests + waza validate on PR
- eval-waza.yml — Waza LLM evals 3x/day (Tue-Sat)
- eval-e2e.yml — weekly E2E with Azure resource validation
- eval-report.yml — weekly report generation + auto-issue creation

Setup Required Before Going Live
Secrets to configure (Settings → Secrets and variables → Actions)

| Secret | Used by |
|---|---|
| AZURE_CLIENT_ID | eval-e2e.yml |
| AZURE_TENANT_ID | eval-e2e.yml |
| AZURE_SUBSCRIPTION_ID | eval-e2e.yml + graders |
| COPILOT_CLI_TOKEN | eval-waza.yml, eval-e2e.yml |
| GITHUB_TOKEN | eval-report.yml |

Service principal setup
What works without any setup
- npm run test:unit
- npm run test:human
- npm run waza:run:mock
- eval-unit.yml CI (triggered on cli/azd/test/eval/** changes)

What needs secrets
| Workflow | Secrets needed |
|---|---|
| eval-waza.yml | COPILOT_CLI_TOKEN |
| eval-e2e.yml | AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_SUBSCRIPTION_ID, COPILOT_CLI_TOKEN |
| eval-report.yml | GITHUB_TOKEN (auto) |

Testing
All tests pass locally: