ci(notify): file/update ci-broken GitHub issue when internal AzDO build breaks on main/release/* #17920
Changes from 9 commits
ea8b409
ededbfc
7870f74
d62bd59
089eef1
dec041c
719f843
6613656
ccda4d4
835b033
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,169 @@ | ||
| # Internal build failure notifications | ||
|
|
||
| The internal Azure DevOps pipeline (`microsoft-aspire`, definition 1602, | ||
| defined in [`eng/pipelines/azure-pipelines.yml`](../../eng/pipelines/azure-pipelines.yml)) | ||
| files a GitHub issue on [microsoft/aspire](https://github.com/microsoft/aspire/issues) | ||
| when it breaks on a publishing branch, and closes that issue when the next | ||
| build of the same branch goes green. | ||
|
|
||
| This document describes the contract so future maintainers can reason about | ||
| the behavior without re-reading the pipeline YAML. | ||
|
|
||
| ## What gets notified | ||
|
|
||
| Two stages run at the end of every non-PR internal build, after every | ||
| other stage: | ||
|
|
||
| - `notify_failure` — files or updates a GitHub issue when at least one | ||
| build stage ends with `Failed`. It depends on **every** non-notify | ||
| stage — `build_sign_native`, `build_extension`, `build`, | ||
| `template_tests`, `assemble`, and `prepare_installers` — so a break | ||
| anywhere (including the publish stage `assemble`) files an issue. | ||
| - `notify_success` — closes any open `ci-broken` issue for the branch | ||
| when all of those stages end with `Succeeded` or `SucceededWithIssues` | ||
| (`prepare_installers` may also end with `Skipped`; see below). | ||
|
|
||
| A stage whose dependency failed is reported as `Skipped` (not `Failed`) | ||
| by Azure Pipelines, so each stage must be watched directly — a single | ||
| downstream stage cannot be relied on to roll failures up. This is why | ||
| both conditions enumerate the full stage set rather than gating on one | ||
| terminal stage. | ||
|
|
||
| `prepare_installers` is allowed to end with `Skipped` in the success | ||
| condition only as a defensive measure. On a notifiable branch | ||
| (`main` / `release/*`) it runs whenever `build_sign_native` succeeded, so | ||
| a `Skipped` there implies `build_sign_native` did not succeed — a failure | ||
| path already covered by `notify_failure`. (Its own condition only skips on | ||
| the *stable* channel for **non**-notifiable branches.) | ||
|
|
||
| Both stages gate on the branch being either: | ||
|
|
||
| - `refs/heads/main` (exact — the trigger uses the wildcard `main*` so an | ||
| exact match here is load-bearing to avoid sweeping in branches like | ||
| `main-something`), or | ||
| - `refs/heads/release/*`. | ||
|
|
||
| `internal/release/*` is deliberately excluded so internal branch names | ||
| don't leak into the public issue tracker. Pull-request builds are also | ||
| excluded. | ||
|
|
||
| The two stages must be at the stage level (not as two jobs in a single | ||
| stage) because cross-stage dependency results can only be referenced | ||
| from a stage condition via `dependencies.<stage>.result`; from a job | ||
| condition the only available form is | ||
| `stageDependencies.<stage>.<job>.result`, which has no stage-aggregate | ||
| equivalent. | ||
|
|
||
| ## What gets filed | ||
|
|
||
| When the `notify_failure` stage fires, it creates (or appends a comment to) | ||
| a single GitHub issue per affected branch: | ||
|
|
||
| - **Title:** `Internal build broken on <branch>` | ||
| - **Labels:** `area-engineering-systems`, `ci-broken`, `blocking-clean-ci` | ||
| - **Assignees:** `joperezr`, `radical` | ||
| - **Body marker:** the first line is a hidden HTML comment | ||
| `<!-- aspire-internal-build-broken:<branch> -->` used for dedup. | ||
|
|
||
| Only one open issue per branch exists at a time. | ||
|
|
||
| The body records the first failure's build link, commit SHA, and the | ||
| comma-separated list of failed stages (any of `build_sign_native`, | ||
| `build_extension`, `build`, `template_tests`, `assemble`, | ||
| `prepare_installers`). It is written once at creation and is **not** | ||
| rewritten afterwards. | ||
|
|
||
| On each subsequent failure the script **posts a follow-up comment** with | ||
| that build's link, commit SHA, failed stages, and `cc @joperezr @radical`. | ||
| The comments are the per-failure history — and the comment is what fires | ||
| notifications, since editing the issue body would not. | ||
|
|
||
| `Canceled` stage results (operator cancellation, 1ES timeouts) intentionally | ||
| do not file an issue — the stage condition uses explicit `in(..., 'Failed')` | ||
| checks which exclude `Canceled`. | ||
|
|
||
| ## What gets closed | ||
|
|
||
| The `notify_success` stage lists open `ci-broken` issues, filters by the | ||
| branch marker, and for each match posts a "build is green again" comment | ||
| and closes the issue with `state_reason: completed`. | ||
|
|
||
| ## Dedup and race handling | ||
|
|
||
| Issue lookup uses `GET /repos/microsoft/aspire/issues?labels=ci-broken&state=open` | ||
| (strongly consistent) plus a local body-marker filter. The Search API is | ||
| intentionally avoided because its 1–2 minute eventual-consistency window | ||
| would cause near-simultaneous failed builds to each see "0 hits" and file | ||
| duplicate issues. | ||
|
|
||
| Two builds of the same branch failing at nearly the same time can still | ||
| each see no open issue and briefly create two issues. Because builds are | ||
| rolling, this is rare, and the cost of auto-deduping it (an extra | ||
| `gh issue list` round-trip on every first failure) isn't worth it; the | ||
| duplicate is left for a human to close. | ||
|
|
||
| ## Auth | ||
|
|
||
| The script mints an installation access token for the **aspire-repo-bot** | ||
| GitHub App via [`Get-AspireBotInstallationToken.ps1`](../../eng/pipelines/scripts/Get-AspireBotInstallationToken.ps1) | ||
| (the same helper used by the release pipeline's | ||
| `dispatch-release-github-tasks.ps1`). The token is immediately registered | ||
| as a secret with the agent via `##vso[task.setsecret]` so any incidental | ||
| log echo is redacted; it is consumed by `gh` through the `GH_TOKEN` process | ||
| environment variable and is not persisted as a pipeline variable. | ||
|
|
||
| The App's `aspire-bot-app-id` and `aspire-bot-private-key` secrets come | ||
| from the `Aspire-Release-Secrets` variable group, imported at pipeline | ||
| scope in `eng/pipelines/azure-pipelines.yml` and gated on non-PR builds | ||
| of `refs/heads/main` or `refs/heads/release/*` — the same condition the | ||
| notify stages use. Manual runs on feature branches and PR builds skip | ||
| the import entirely. | ||
|
|
||
| **Prerequisite**: the aspire-repo-bot install on microsoft/aspire must have | ||
| `issues:write` permission. If it is missing, every GitHub call the script | ||
| makes fails with 403 (but never breaks the build; see below). | ||
|
|
||
| ## Disabling for a single run | ||
|
|
||
| Queue the pipeline manually and set `Notify on failure: dry-run` to true. | ||
| In dry-run mode, both stages log the `gh` CLI commands they *would* run | ||
| without mutating anything on GitHub. This applies to both the failure | ||
| and success paths — a green-build dry-run will not accidentally close | ||
| real open issues. | ||
|
|
||
| Dry-run mode is fully decoupled from the aspire-repo-bot credentials: | ||
| the wrapper omits the `ASPIRE_BOT_APP_ID` / `ASPIRE_BOT_PRIVATE_KEY` env | ||
| block and the script's `-AppId` / `-PrivateKeyPem` parameters are | ||
| non-mandatory, so a dry-run validation works without Aspire-Release-Secrets | ||
| variable group access and never mints a token. | ||
|
|
||
| ## Why this never breaks the build | ||
|
|
||
| [`Notify-GitHubOnBuildResult.ps1`](../../eng/pipelines/scripts/Notify-GitHubOnBuildResult.ps1) | ||
| wraps the entire body in `try`/`catch` and always exits 0. Any GitHub API | ||
| error, network blip, or 401/403 from a missing App permission produces a | ||
| `Write-Warning` in the job log but leaves the build result unchanged. A | ||
| flaky notification path must never turn an otherwise-correct build red. | ||
|
|
||
| However, a silently-skipped notification is its own failure mode — operators | ||
| need to see when the notification path itself broke (e.g., revoked App | ||
| permission, GitHub API shape change, deleted label). The catch block emits | ||
| AzDO logging commands so failures are visible without breaking the build: | ||
|
|
||
| - `##vso[task.logissue type=warning]` surfaces the warning in the build | ||
| summary, in 1ES dashboards, and on the badge. | ||
| - `##vso[task.complete result=SucceededWithIssues;]` bumps the job result | ||
| to `SucceededWithIssues`, which renders as a yellow badge instead of | ||
| green. Notifications and dashboards can filter on this. | ||
|
|
||
| A build that finishes "green-but-yellow" indicates that the upstream build | ||
| itself succeeded but the notify stage's call to GitHub failed for some | ||
| reason. This is worth investigating, but it does not block anything that | ||
| depends on the build. | ||
|
|
||
| ## Manually filing or closing | ||
|
|
||
| If you need to file or close a `ci-broken` issue by hand (e.g. during | ||
| recovery), use the existing label and add the marker `<!-- aspire-internal-build-broken:<branch> -->` | ||
| as the first line of the body. The marker must start a line — the script | ||
| matches it anchored to the start of a line so the text can't accidentally | ||
| match if pasted mid-prose into an unrelated issue. The script's next run | ||
| will treat the issue as the canonical open one and append/close accordingly. |
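
The marker rule and the create-or-comment decision described in the document above can be sketched as follows. This is a minimal illustration in Python, not the PowerShell in `Notify-GitHubOnBuildResult.ps1`. The function names `marker_for`, `has_marker`, and `plan_failure_action` are hypothetical, the issue dictionaries assume only the `number` and `body` fields returned by the issues list endpoint, and picking the lowest-numbered match when several exist is an assumption rather than documented behavior.

```python
import re

MARKER_PREFIX = "aspire-internal-build-broken"


def marker_for(branch):
    """Build the hidden HTML comment that tags an issue with its branch."""
    return f"<!-- {MARKER_PREFIX}:{branch} -->"


def has_marker(body, branch):
    """Return True only if the marker starts a line of the issue body.

    Anchoring with ^ under re.MULTILINE means a marker pasted mid-prose
    into an unrelated issue does not match.
    """
    pattern = r"^" + re.escape(marker_for(branch))
    return re.search(pattern, body or "", flags=re.MULTILINE) is not None


def plan_failure_action(open_issues, branch):
    """Decide what a failed build should do, given the open ci-broken issues.

    Returns ("comment", issue_number) when a marked issue already exists,
    otherwise ("create", None). Choosing the lowest-numbered match is an
    assumption made for this sketch.
    """
    matches = [i for i in open_issues if has_marker(i.get("body"), branch)]
    if not matches:
        return ("create", None)
    return ("comment", min(i["number"] for i in matches))
```

Because the full marker string, including the closing ` -->`, is matched literally, a marker for `main-something` is not mistaken for the `main` marker.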
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -21,6 +21,15 @@ parameters: | |
| - stable | ||
| - staging | ||
| - daily | ||
| # When true, the notify_failure and notify_success stages run in dry-run | ||
| # mode: they log the `gh` CLI commands they would run to file/update/close | ||
| # the ci-broken issue, but do not actually mutate anything on | ||
| # microsoft/aspire. Use this when validating pipeline plumbing changes | ||
| # without spamming issues. | ||
| - name: notifyOnFailureDryRun | ||
| displayName: 'Notify on failure: dry-run (log gh commands, do not mutate)' | ||
| type: boolean | ||
| default: false | ||
|
|
||
| trigger: | ||
| batch: true | ||
|
|
@@ -64,6 +73,13 @@ variables: | |
| - template: /eng/pipelines/common-variables.yml@self | ||
| - template: /eng/common/templates-official/variables/pool-providers.yml@self | ||
|
|
||
| # aspire-bot-app-id + aspire-bot-private-key for notify_failure / notify_success. | ||
| # Gate mirrors the notify stage conditions: non-PR + main/release/* only. | ||
| # A manual run on a feature branch never needs these, so don't pay the | ||
| # variable-group auth check at queue time. | ||
| - ${{ if and(notin(variables['Build.Reason'], 'PullRequest'), or(eq(variables['Build.SourceBranch'], 'refs/heads/main'), startsWith(variables['Build.SourceBranch'], 'refs/heads/release/'))) }}: | ||
| - group: Aspire-Release-Secrets | ||
|
|
||
| - name: _BuildConfig | ||
| value: Release | ||
| - name: Build.Arcade.ArtifactsPath | ||
|
|
@@ -876,3 +892,130 @@ extends: | |
| Write-Host "Resolved macOS npm validation artifact: $artifactName" | ||
| Write-Host "##vso[task.setvariable variable=NpmValidationArtifactName]$artifactName" | ||
| displayName: 🟣Resolve native macOS RID | ||
|
|
||
| # ---------------------------------------------------------------- | ||
| # On every non-PR build of main / release/*, file or update a GitHub | ||
| # issue on microsoft/aspire when the build fails, and close any open | ||
| # ci-broken issue for the branch when the build succeeds. | ||
| # | ||
| # Split into TWO stages so we can gate at the STAGE level — only | ||
| # stage conditions can reference dependencies.<stage>.result; from a | ||
| # JOB condition in a different stage, the only available form is | ||
| # stageDependencies.<stage>.<job>.result (no stage-aggregate form), | ||
| # which would force us to enumerate specific job names from upstream | ||
| # stages. | ||
| # | ||
| # See docs/ci/internal-build-failure-notifications.md for the full | ||
| # contract (labels, marker, dedupe behavior, dry-run). | ||
| # ---------------------------------------------------------------- | ||
| - stage: notify_failure | ||
| displayName: Notify Build Failure | ||
| dependsOn: | ||
| - build_sign_native | ||
| - build_extension | ||
| - build | ||
| - template_tests | ||
| - assemble | ||
| - prepare_installers | ||
|
|
||
| # Files an issue when at least one watched stage Failed. We depend on | ||
| # EVERY non-notify stage so a break anywhere — including the publish | ||
| # stage `assemble`, the VSIX build `build_extension`, or `template_tests` | ||
| # — files an issue. A stage whose dependency failed is reported as | ||
| # 'Skipped' (not 'Failed'), so each stage must be watched directly | ||
| # rather than relying on one downstream stage to roll failures up. | ||
| # Canceled stage results match neither 'Failed' nor 'Succeeded', so | ||
| # operator/timeout cancellations produce no notification. | ||
| condition: | | ||
| and( | ||
| ne(variables['Build.Reason'], 'PullRequest'), | ||
| eq(variables['_IsNotificationBranch'], 'true'), | ||
| or( | ||
| in(dependencies.build_sign_native.result, 'Failed'), | ||
| in(dependencies.build_extension.result, 'Failed'), | ||
| in(dependencies.build.result, 'Failed'), | ||
| in(dependencies.template_tests.result, 'Failed'), | ||
| in(dependencies.assemble.result, 'Failed'), | ||
| in(dependencies.prepare_installers.result, 'Failed') | ||
| ) | ||
| ) | ||
| variables: | ||
| # Stage results exposed as runtime variables so the shared notify | ||
| # template can compose a "failed stages" list for the notification. | ||
| - name: BuildSignNativeStageResult | ||
| value: $[ dependencies.build_sign_native.result ] | ||
| - name: BuildExtensionStageResult | ||
| value: $[ dependencies.build_extension.result ] | ||
| - name: BuildStageResult | ||
| value: $[ dependencies.build.result ] | ||
| - name: TemplateTestsStageResult | ||
| value: $[ dependencies.template_tests.result ] | ||
| - name: AssembleStageResult | ||
| value: $[ dependencies.assemble.result ] | ||
| - name: PrepareInstallersStageResult | ||
| value: $[ dependencies.prepare_installers.result ] | ||
| jobs: | ||
| - template: /eng/pipelines/templates/notify-build-result.yml@self | ||
| parameters: | ||
| mode: Failure | ||
| jobName: NotifyOnFailure | ||
| jobDisplayName: 'File / update ci-broken issue' | ||
| dryRun: ${{ parameters.notifyOnFailureDryRun }} | ||
|
|
||
| - stage: notify_success | ||
|
|
||
| displayName: Notify Build Success | ||
| dependsOn: | ||
| - build_sign_native | ||
| - build_extension | ||
| - build | ||
| - template_tests | ||
| - assemble | ||
| - prepare_installers | ||
| # Only closes the issue when EVERY watched stage is Succeeded / | ||
| # SucceededWithIssues (the same stage set as notify_failure). Requiring | ||
| # the full set means a build that fixes one stage but breaks another | ||
| # (e.g. green build/build_sign_native but a failed `assemble`) does NOT | ||
| # falsely close the open ci-broken issue. | ||
| # | ||
| # prepare_installers also permits 'Skipped': on a notifiable branch | ||
| # (main / release/*) it runs whenever build_sign_native succeeded, so a | ||
| # Skip there implies build_sign_native did not succeed — a failure path | ||
| # already covered by notify_failure. (The stable-channel skip in its own | ||
| # condition only takes effect on non-notifiable branches.) Allowing | ||
| # 'Skipped' is therefore defensive and cannot cause a false close. | ||
| # | ||
| # Canceled stage results don't match either branch and produce no | ||
| # notification. | ||
| condition: | | ||
| and( | ||
| ne(variables['Build.Reason'], 'PullRequest'), | ||
| eq(variables['_IsNotificationBranch'], 'true'), | ||
| in(dependencies.build_sign_native.result, 'Succeeded', 'SucceededWithIssues'), | ||
| in(dependencies.build_extension.result, 'Succeeded', 'SucceededWithIssues'), | ||
| in(dependencies.build.result, 'Succeeded', 'SucceededWithIssues'), | ||
| in(dependencies.template_tests.result, 'Succeeded', 'SucceededWithIssues'), | ||
| in(dependencies.assemble.result, 'Succeeded', 'SucceededWithIssues'), | ||
| in(dependencies.prepare_installers.result, 'Succeeded', 'SucceededWithIssues', 'Skipped') | ||
|
Member
What's the intended behavior for mixed stage results, e.g., one stage
Member
We also have Partially Succeeded which I'm not sure I follow how is that counted now.
Member
Author
Documented this in 835b033 rather than changing the conditions, because the limbo behavior is actually the conservative-correct outcome. On On I rejected widening the success condition to tolerate |
||
| ) | ||
| variables: | ||
| # Stage results exposed as runtime variables so the shared notify | ||
| # template can compose a "failed stages" list. In Success mode none | ||
| # are 'Failed', so the list is empty and the notifier ignores it. | ||
| - name: BuildSignNativeStageResult | ||
| value: $[ dependencies.build_sign_native.result ] | ||
| - name: BuildExtensionStageResult | ||
| value: $[ dependencies.build_extension.result ] | ||
| - name: BuildStageResult | ||
| value: $[ dependencies.build.result ] | ||
| - name: TemplateTestsStageResult | ||
| value: $[ dependencies.template_tests.result ] | ||
| - name: AssembleStageResult | ||
| value: $[ dependencies.assemble.result ] | ||
| - name: PrepareInstallersStageResult | ||
| value: $[ dependencies.prepare_installers.result ] | ||
| jobs: | ||
| - template: /eng/pipelines/templates/notify-build-result.yml@self | ||
| parameters: | ||
| mode: Success | ||
| jobName: CloseOnSuccess | ||
| jobDisplayName: 'Close ci-broken issue' | ||
| dryRun: ${{ parameters.notifyOnFailureDryRun }} | ||
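
To see how the two stage conditions treat mixed results, the sketch below models them in Python. It is an illustration under assumptions, not pipeline code: it covers only the stage-result part of each condition (the `Build.Reason` and `_IsNotificationBranch` gates are left out), and the function names are hypothetical.

```python
WATCHED_STAGES = [
    "build_sign_native",
    "build_extension",
    "build",
    "template_tests",
    "assemble",
    "prepare_installers",
]

GREEN = {"Succeeded", "SucceededWithIssues"}


def should_notify_failure(results):
    """Mirror of the or(in(..., 'Failed'), ...) part of notify_failure."""
    return any(results[stage] == "Failed" for stage in WATCHED_STAGES)


def should_notify_success(results):
    """Mirror of the in(..., 'Succeeded', 'SucceededWithIssues') checks.

    prepare_installers is additionally allowed to be 'Skipped'.
    """
    for stage in WATCHED_STAGES:
        allowed = GREEN | {"Skipped"} if stage == "prepare_installers" else GREEN
        if results[stage] not in allowed:
            return False
    return True


def failed_stages(results):
    """Comma-separated list of failed stages, as recorded in the issue body."""
    return ", ".join(s for s in WATCHED_STAGES if results[s] == "Failed")


# A failed `build` leaves its dependents Skipped, not Failed.
cascade = dict.fromkeys(WATCHED_STAGES, "Skipped")
cascade["build_sign_native"] = "Succeeded"
cascade["build"] = "Failed"

# A cancellation matches neither condition, so nothing is filed or closed.
canceled = dict.fromkeys(WATCHED_STAGES, "Succeeded")
canceled["assemble"] = "Canceled"
```

With `cascade`, only the failure check fires and `failed_stages` names just `build`; with `canceled`, both checks return `False`, which is the no-notification outcome the YAML comments describe.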