ci(notify): file/update ci-broken GitHub issue when internal AzDO build breaks on main/release/* by radical · Pull Request #17920 · microsoft/aspire

radical · 2026-06-04T21:57:21Z

Internal AzDO build failures on main and release/* had no automated visibility on the public tracker — a broken build could sit unnoticed until someone happened to look at the AzDO build list.

What it does

On every non-PR build of main or release/*:

One or more stages Failed → file (or update) a ci-broken issue on microsoft/aspire, titled Internal build broken on <branch>, assigned to @joperezr + @radical. The body carries a managed failures table that grows a row per subsequent failure on the same branch (build link, commit, failed stages). A follow-up comment fires on each failure so @-mentions notify.
All stages Succeeded → close the open ci-broken issue for that branch with a green-build comment.

One open issue per branch, deduplicated by a hidden HTML-comment marker in the body. Full contract (labels, marker syntax, dedupe behavior, dry-run, manual cleanup) in docs/ci/internal-build-failure-notifications.md.

How

Notify-GitHubOnBuildResult.ps1 invoked from two new pipeline stages (notify_failure, notify_success), authenticated via an aspire-repo-bot GitHub App installation token, driving the gh CLI. Always exits 0 — a flaky notification path must never red an otherwise-correct build.

A notifyOnFailureDryRun queue-time parameter logs the would-be gh calls without mutating GitHub.

Validated on AzDO

Dry-run: build 2992742 — pwsh + script + -FailedStages on Linux, exits 0 cleanly.
Live-mode: build 2992761 — filed #17938 on a throwaway release/aspire-internal-notify-validation marker, then closed it in the same build via gh issue close.

github-actions · 2026-06-04T21:57:50Z

🚀 Dogfood this PR with:

⚠️ WARNING: Do not do this without first carefully reviewing the code of this PR to satisfy yourself it is safe.

curl -fsSL https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.sh | bash -s -- 17920

Or

Run remotely in PowerShell:

iex "& { $(irm https://raw.githubusercontent.com/microsoft/aspire/main/eng/scripts/get-aspire-cli-pr.ps1) } 17920"

PowerShell helper that files / updates / closes a ci-broken GitHub issue on microsoft/aspire for a given branch. Two modes: - Failure: GETs open ci-broken issues, filters by a hidden HTML-comment marker (), and either creates a new issue (with a managed failures-table region in the body) or appends a row to the existing one and posts a follow-up @-mention comment. - Success: closes the matching open issue with a green-build comment. Drives the gh CLI throughout, authenticated via $env:GH_TOKEN that the caller sets after minting an aspire-repo-bot installation token. List uses the strongly-consistent /issues endpoint, not /search/issues, so near-simultaneous failures don't each file a duplicate; a post-create re-list catches the rare race past that window and closes ours as a duplicate of the older. Always exits 0. Warnings surface via task.logissue + task.complete result=SucceededWithIssues so a silent regression (bot loses permission, label deleted, gh API shape changes) goes yellow rather than green. -DryRun logs the would-be gh calls without mutating GitHub. In dry-run the script skips token mint entirely so the wrapper can validate pipeline plumbing without resolving aspire-repo-bot credentials. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ipeline Adds two stages to azure-pipelines.yml that gate on the upstream build_sign_native / build / prepare_installers stage results and run Notify-GitHubOnBuildResult.ps1 on 1es-ubuntu-2204: - notify_failure: fires when at least one upstream stage Failed. Composes a comma-separated -FailedStages list from dependencies.<stage>.result so the filed / updated issue body identifies which stage broke. - notify_success: fires when all three upstream stages Succeeded / SucceededWithIssues (prepare_installers may legitimately Skip on stable GA release builds). Closes the open ci-broken issue for the branch. Both stages mint the aspire-repo-bot installation token via Get-AspireBotInstallationToken.ps1 and export it as GH_TOKEN so the gh CLI invocations in the script authenticate as the bot. Adds _IsNotificationBranch in common-variables.yml — exact match on refs/heads/main (NOT startsWith, the pipeline trigger's `main*` wildcard would otherwise sweep in branches like main-something) plus startsWith refs/heads/release/. Excludes internal/release/* so internal branch names don't leak into the public tracker. Aspire-Release-Secrets variable group is imported at pipeline scope with the same non-PR + main/release/* gate, so manual feature-branch and PR runs don't pay the variable-group auth check at queue time. A notifyOnFailureDryRun queue-time parameter logs would-be gh calls without mutating GitHub; applies to both stages so a green-build dry-run can't accidentally close real open issues. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Adds docs/ci/internal-build-failure-notifications.md describing the contract (labels, marker syntax, failures-table region, dedupe behavior, dry-run, manual cleanup) and pointer from eng/pipelines/README.md so anyone reading the pipeline docs lands on the notification system. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Adds automated public GitHub issue visibility for internal AzDO build breaks on main and release/*, so internal failures don’t remain unnoticed without someone manually checking AzDO.

Changes:

Introduces a PowerShell notifier that files/updates a branch-deduped ci-broken issue on failures and closes it on the next green build.
Adds notify_failure / notify_success stages to the internal pipeline, plus a queue-time dry-run parameter and branch gating.
Documents the notification contract and links it from pipeline docs.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
eng/pipelines/scripts/Notify-GitHubOnBuildResult.ps1	New notifier script that uses `gh` + an installation token to create/update/close `ci-broken` issues per branch.
eng/pipelines/README.md	Adds a short pointer describing the internal build-result notification behavior and links to full docs.
eng/pipelines/common-variables.yml	Adds `_IsNotificationBranch` to gate notifications to `main`/`release/` (excluding `internal/release/`).
eng/pipelines/azure-pipelines.yml	Adds dry-run parameter, imports secrets conditionally, and adds `notify_failure`/`notify_success` stages.
docs/ci/internal-build-failure-notifications.md	New contract documentation for behavior, dedupe strategy, and operational expectations.

- Notify-GitHubOnBuildResult.ps1: register the GitHub App installation token via `##vso[task.setsecret]` instead of `##vso[task.setvariable ...;issecret=true]`. The only purpose was log masking, but task.setvariable also persists the value as a job-scoped variable that other tasks could accidentally reference via $(__notifyGhToken). task.setsecret gives us the masking without the persistence. - azure-pipelines.yml + docs: dry-run mode logs the `gh` CLI commands it would run, not GitHub REST calls. Fix the parameter displayName, the parameter comment, and the corresponding docs paragraph. - docs: Azure Pipelines stage results use `Skipped`, not `Skip` (the YAML stage condition correctly checks for 'Skipped'). Fix the prose. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The internal-build-failure notification script had no automated coverage, despite the established tests/Infrastructure.Tests/PowerShellScripts/ pattern and the script's highly testable pure helpers. Add NotifyGitHubOnBuildResultTests covering: - Test-NotifiableBranch: main (exact), release/*, and the negatives that matter (main-something, mainline, internal/release/*, feature/*) - New-FailureTableRow: pipe-escaping, short-SHA shortening with full-SHA link, sub-7-char SHA fallback, em-dash for empty FailedStages - Update-FailuresTableInBody: append + surrounding-prose preservation, max-rows rollover into the "earlier failures omitted" summary, omitted-count carry into the next index, and the missing-markers warning path (body left unchanged) - -DryRun (Failure + Success): exits 0 and launches zero gh processes, verified with a recording fake gh placed ahead of the real one on PATH The helpers are exercised by dot-sourcing the shipped script unmodified with a non-notifiable branch, so its main routine bails before any token mint or gh call and leaves the functions in scope. 'exit' inside a dot-sourced script only unwinds that script, not the test harness, so no testability hook in the production script is needed. Repo-root lookup reuses the shared worktree-aware TestUtils.FindRepoRoot() rather than adding another private Aspire.slnx walk. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…al-build-failure

radical · 2026-06-09T18:18:25Z

I'll address the feedback, and also update this to account for the recent azdo pipeline changes.

The internal pipeline restructure (microsoft#17760) added build_extension, template_tests, and assemble (publish) stages. The notify stages only watched build / build_sign_native / prepare_installers, so: - a failure isolated to a new stage (e.g. assemble) filed NO issue, and - notify_success could falsely CLOSE an open ci-broken issue, because a dependency failure surfaces downstream as Skipped, not Failed. Coverage fix: - both notify stages now dependsOn and check all six build stages - notify_success allows Skipped only for prepare_installers (defensive; on notifiable branches it runs whenever build_sign_native succeeded) - FailedStages list and docs extended to the full stage set Also addresses review feedback on the notifier script: - probe for gh up front (Test-GhAvailable) so a missing CLI warns instead of silently swallowing every API call - anchor the dedupe-marker match to start-of-line (Test-IssueBodyMatchesMarker) so it can't match the marker pasted mid-prose into an unrelated issue - drop the in-body failures table; per-failure comments are the history. Re-home the failed-stages list into the issue body and the follow-up comment - cross-reference the three places the notifiable-branch policy lives - correct the task.setsecret wording in the docs - tests updated to cover the new helpers The YAML coverage change cannot run on GitHub PRs; validate with an AzDO dry-run (notifyOnFailureDryRun=true) before merge. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

notify_failure and notify_success were ~50 lines of duplicated job scaffolding (the jobs.yml@self wrapper, pool, checkout, and the pwsh notifier call) that could silently drift apart. Extract that job into eng/pipelines/templates/notify-build-result.yml, parameterized by {mode, jobName, jobDisplayName, dryRun}. Each stage keeps only what must differ and live at stage scope: dependsOn, the condition, and the six *StageResult variables. Also drops the per-stage NotifyDryRunFlag variable indirection in favor of parsing the queue-time parameter directly in the template ($dryRun = [bool]::Parse('${{ parameters.dryRun }}')), keeping the typed -DryRun:$dryRun call. The template follows the existing double-nest convention (stage jobs: -> repo template -> jobs.yml@self) already used by the build_extension stage. This is internal-pipeline YAML that does not run on GitHub PRs; validate with an AzDO dry-run (notifyOnFailureDryRun=true) before merge. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The first-failure path created the issue, then re-listed and self-closed if it wasn't the oldest carrying the marker — guarding against two builds of the same branch filing duplicates within the gh-issue-list consistency window. Builds are rolling, so that race is vanishingly rare, and the guard cost an extra `gh issue list` round-trip on every first failure. Drop it: on the rare double-file, the existing "found N open issues, update the oldest, leave the rest for human cleanup" path already degrades gracefully and a human closes the duplicate. Addresses review feedback on PR microsoft#17920. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

github-actions · 2026-06-09T20:44:38Z

Re-running the failed jobs in the CI workflow for this pull request because 1 job was identified as retry-safe transient failures in the CI run attempt.
GitHub was asked to rerun all failed jobs for that attempt, and the rerun is being tracked in the rerun attempt.
The job links below point to the failed attempt jobs that matched the retry-safe transient failure rules.

Tests / Templates-BuildAndRunTemplateTests / Templates-BuildAndRunTemplateTests (windows-latest) - Job-level runner or infrastructure failure matched the transient allowlist.

joperezr · 2026-06-09T22:08:58Z

+    return $FailedStages
+}
+
+function New-IssueBody {


Idea: would it be worth dropping a one-line first error excerpt from the failing job's log into the body? Today triage requires clicking through to AzDO just to see what broke. Worth noting if you go this route: the destination repo is public and the pipeline is internal, so anything embedded from logs needs scrubbing for internal agent names, D:\a\_work\... / /mnt/vss/_work/... paths, subscription/tenant GUIDs, portal.microsofticm.com links, and anything token-shaped.

I didn't do this to avoid exposing internal stuff inadvertently. Is there a recommended way to scrub these which we can depend on?

joperezr · 2026-06-09T22:08:58Z

+        # `gh issue create` accepts --label and --assignee multiple times.
+        # Build flag pairs rather than relying on comma-separation, which is
+        # more brittle if a label/assignee ever contains a comma.
+        $labelFlags = @($Script:IssueLabels | ForEach-Object { '--label'; $_ })


Tiny nit: assuming the ci-broken label exists. A gh label create ci-broken --force no-op at the top of the failure path would make the script self-bootstrapping if it's ever renamed/deleted. Not blocking, just defensive.

You're right to doubt the "long-standing" framing — I checked: ci-broken exists (color B60205, description "Internal ADO pipeline is failing") but it's recent, first used 2026-05-14 (#17107 / #17108), not long-standing. So scratch that part of my reasoning.

I also had the failure mode backwards: gh issue list --label ci-broken returns [] exit 0 when the label is missing, not an error. So a deleted label would silently no-op the success/close path (an open issue never gets found or closed), while the failure/create path is loud (gh issue create --label ci-broken fails → SucceededWithIssues).

The catch with gh label create --force: deleting a label also strips it from every issue, and our lookup filters server-side on --label ci-broken, so re-creating the definition doesn't recover the lost label→issue links — pre-existing issues still won't be found. It only unbreaks future creates, and --force would reset the color/description on every build.

Given the label is new/purpose-built (so more churn-prone than I implied), the right shape if we want it is a minimal missing-only create — explicit color+description, no --force, on the failure path. Want me to add that, or leave it?

joperezr · 2026-06-09T22:08:58Z

+          in(dependencies.build.result, 'Succeeded', 'SucceededWithIssues'),
+          in(dependencies.template_tests.result, 'Succeeded', 'SucceededWithIssues'),
+          in(dependencies.assemble.result, 'Succeeded', 'SucceededWithIssues'),
+          in(dependencies.prepare_installers.result, 'Succeeded', 'SucceededWithIssues', 'Skipped')


What's the intended behavior for mixed stage results — e.g., one stage Canceled (1ES timeout) and others Succeeded? Today neither stage fires, which is fine on the failure side, but if there's a pre-existing open ci-broken issue it'll stay open across subsequent mixed-result builds until a fully-clean build happens. Worth either widening the success condition to tolerate Canceled/Skipped on non-critical stages, or documenting that operators may need to manually close stuck issues after intermittent 1ES failures.

We also have Partially Succeeded which I'm not sure I follow how is that counted now.

Documented this in 835b033 rather than changing the conditions, because the limbo behavior is actually the conservative-correct outcome.

On PartiallySucceeded: at the YAML stage-dependency level dependencies.<stage>.result is one of Succeeded / SucceededWithIssues / Failed / Canceled / Skipped. Partial/warning surfaces as SucceededWithIssues (which the success condition already tolerates); PartiallySucceeded is a run-level / Classic-release label, not a stage-dependency result. So there's no separate bucket to handle.

On Canceled (the real limbo case — all green except one stage canceled by a 1ES timeout): neither stage fires by design. We don't file (a cancellation is infra noise, not a code break) and we don't close (a canceled stage never verified its work, so there's no fully-green signal). An existing ci-broken issue stays open until a genuinely green build; operators close it manually if they confirm the cancellation was spurious.

I rejected widening the success condition to tolerate Canceled because that would auto-close on an incomplete build — a false all-clear. Now spelled out in the stage comment + docs/ci/internal-build-failure-notifications.md.

…vior A build that finishes with no stage Failed but a watched stage Canceled (1ES timeout) or Skipped matches neither the notify_failure (any Failed) nor notify_success (all Succeeded/SucceededWithIssues) condition, so neither stage fires. This "limbo" outcome is intentional, but it was undocumented and mildly surprising: an existing open ci-broken issue is left open until a fully green build closes it, so operators may need to close it manually after a spurious cancellation. Document the behavior and its rationale (no green signal => no auto-close; widening success to tolerate Canceled would risk a false all-clear) in docs/ci/internal-build-failure-notifications.md and the notify_success stage comment. No condition change — the conservative behavior is correct. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

adamint

I do not think we should import the full Aspire-Release-Secrets group into the internal build just for this notifier. The template only needs aspire-bot-app-id and aspire-bot-private-key, but that group also carries release-only secrets like VscePublishToken; dry-run builds import the group too before the template skips the env mapping. Could we move these two bot credentials to a narrower variable group, or otherwise avoid loading release secrets for notify_failure/notify_success?

adamint

I think this one still has a security issue: azure-pipelines.yml imports the full Aspire-Release-Secrets variable group for every run of the notifier, but the notifier template only needs the bot app id/private key. That group also carries broader release credentials like the VSCE publish token, and the dry-run path still loads the group before the template decides not to pass secrets into the script.

Could we either split this into a narrow bot-only variable group or move the notifier to a path that does not load the release secrets at all? I would avoid giving a status-file/update job access to the full release secret set.

radical force-pushed the radical/notify-internal-build-failure branch from 0e6133b to fe60937 Compare June 4, 2026 22:59

radical closed this Jun 4, 2026

radical reopened this Jun 4, 2026

microsoft-github-policy-service Bot added this to the 13.5 milestone Jun 4, 2026

radical closed this Jun 4, 2026

radical reopened this Jun 4, 2026

radical changed the title ~~ci: file GitHub issue when internal AzDO build breaks on main/release~~ ci(notify): file/update ci-broken GitHub issue when internal AzDO build breaks on main/release/* Jun 5, 2026

radical marked this pull request as ready for review June 5, 2026 05:23

Copilot AI review requested due to automatic review settings June 5, 2026 05:23

Copilot started reviewing on behalf of radical June 5, 2026 05:23 View session

radical requested review from davidfowl, joperezr and mitchdenny June 5, 2026 05:23

radical and others added 3 commits June 5, 2026 01:26

radical force-pushed the radical/notify-internal-build-failure branch from b3883d4 to 7870f74 Compare June 5, 2026 05:26

radical marked this pull request as draft June 5, 2026 05:27

radical marked this pull request as ready for review June 5, 2026 05:29

radical added the area-engineering-systems infrastructure helix infra engineering repo stuff label Jun 5, 2026

Copilot AI reviewed Jun 5, 2026

View reviewed changes

radical requested review from adamint and sebastienros June 8, 2026 17:12

radical mentioned this pull request Jun 9, 2026

Tracking: CI improvements #18036

Open

13 tasks

joperezr reviewed Jun 9, 2026

View reviewed changes

Merge remote-tracking branch 'origin/main' into radical/notify-intern…

dec041c

…al-build-failure

radical and others added 3 commits June 9, 2026 14:45

radical requested review from Copilot and joperezr June 9, 2026 19:39

Copilot started reviewing on behalf of radical June 9, 2026 19:39 View session

Copilot AI reviewed Jun 9, 2026

joperezr reviewed Jun 9, 2026

View reviewed changes

radical requested a review from joperezr June 10, 2026 18:44

adamint reviewed Jun 12, 2026

View reviewed changes

radical marked this pull request as draft June 12, 2026 17:04

Conversation

radical commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What it does

How

Validated on AzDO

Uh oh!

github-actions Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

radical commented Jun 9, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Jun 9, 2026

Uh oh!

joperezr Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

radical Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

joperezr Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

radical Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

joperezr Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

joperezr Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

radical Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

adamint left a comment

Choose a reason for hiding this comment

Uh oh!

adamint left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

radical commented Jun 4, 2026 •

edited

Loading

github-actions Bot commented Jun 4, 2026 •

edited

Loading