169 changes: 169 additions & 0 deletions docs/ci/internal-build-failure-notifications.md
@@ -0,0 +1,169 @@
# Internal build failure notifications

The internal Azure DevOps pipeline (`microsoft-aspire`, definition 1602,
defined in [`eng/pipelines/azure-pipelines.yml`](../../eng/pipelines/azure-pipelines.yml))
files a GitHub issue on [microsoft/aspire](https://github.com/microsoft/aspire/issues)
when it breaks on a publishing branch, and closes that issue when the next
build of the same branch goes green.

This document describes the contract so future maintainers can reason about
the behavior without re-reading the pipeline YAML.

## What gets notified

Two stages run at the end of every non-PR internal build, after every
other stage:

- `notify_failure` — files or updates a GitHub issue when at least one
build stage ends with `Failed`. It depends on **every** non-notify
stage — `build_sign_native`, `build_extension`, `build`,
`template_tests`, `assemble`, and `prepare_installers` — so a break
anywhere (including the publish stage `assemble`) files an issue.
- `notify_success` — closes any open `ci-broken` issue for the branch
when all of those stages end with `Succeeded` or `SucceededWithIssues`
(`prepare_installers` may also end with `Skipped`; see below).

A stage whose dependency failed is reported as `Skipped` (not `Failed`)
by Azure Pipelines, so each stage must be watched directly — a single
downstream stage cannot be relied on to roll failures up. This is why
both conditions enumerate the full stage set rather than gating on one
terminal stage.

`prepare_installers` is allowed to end with `Skipped` in the success
condition only as a defensive measure. On a notifiable branch
(`main` / `release/*`) it runs whenever `build_sign_native` succeeded, so
a `Skipped` there implies `build_sign_native` did not succeed — a failure
path already covered by `notify_failure`. (Its own condition only skips on
the *stable* channel for **non**-notifiable branches.)
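
The two conditions described above can be modeled as plain predicates over the per-stage results. This is only an illustrative sketch: the real logic lives in the pipeline's stage `condition:` expressions, and the function names here are hypothetical.

```python
# Stage names are taken from the pipeline; the helper names are illustrative.
STAGES = (
    "build_sign_native",
    "build_extension",
    "build",
    "template_tests",
    "assemble",
    "prepare_installers",
)
GREEN = {"Succeeded", "SucceededWithIssues"}


def should_notify_failure(results):
    """True when at least one watched stage ended with 'Failed'."""
    return any(results[stage] == "Failed" for stage in STAGES)


def should_notify_success(results):
    """True when every watched stage is green.

    prepare_installers is additionally allowed to be 'Skipped'.
    """
    for stage in STAGES:
        allowed = (GREEN | {"Skipped"}) if stage == "prepare_installers" else GREEN
        if results[stage] not in allowed:
            return False
    return True


# A failed `build` makes downstream stages 'Skipped', not 'Failed',
# which is why every stage is inspected directly.
example = {stage: "Succeeded" for stage in STAGES}
example.update(build="Failed", assemble="Skipped")
print(should_notify_failure(example), should_notify_success(example))
```

Note that a run with one `Canceled` stage and everything else green satisfies neither predicate, so neither stage fires.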

Both stages gate on the branch being either:

- `refs/heads/main` (exact — the trigger uses the wildcard `main*` so an
exact match here is load-bearing to avoid sweeping in branches like
`main-something`), or
- `refs/heads/release/*`.

`internal/release/*` is deliberately excluded so internal branch names
don't leak into the public issue tracker. Pull-request builds are also
excluded.
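
As a rough sketch of the branch gate (the pipeline expresses this through `_IsNotificationBranch` and a `Build.Reason` check; the function below is a hypothetical stand-in, not code from the repository):

```python
def is_notification_branch(source_branch: str, build_reason: str) -> bool:
    """Mirror of the gate: non-PR builds of main or release/* only."""
    if build_reason == "PullRequest":
        return False
    return (
        source_branch == "refs/heads/main"  # exact match, not a prefix match
        or source_branch.startswith("refs/heads/release/")
    )


for branch in (
    "refs/heads/main",
    "refs/heads/main-something",
    "refs/heads/release/9.0",
    "refs/heads/internal/release/9.0",
):
    print(branch, is_notification_branch(branch, "IndividualCI"))
```

The exact comparison on `refs/heads/main` is what keeps `main-something` out, and `internal/release/*` fails the prefix test because the prefix starts at `refs/heads/release/`.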

The two stages must be at the stage level (not as two jobs in a single
stage) because cross-stage dependency results can only be referenced
from a stage condition via `dependencies.<stage>.result`; from a job
condition the only available form is
`stageDependencies.<stage>.<job>.result`, which has no stage-aggregate
equivalent.

## What gets filed

When the `notify_failure` stage fires, it creates (or appends a comment to)
a single GitHub issue per affected branch:

- **Title:** `Internal build broken on <branch>`
- **Labels:** `area-engineering-systems`, `ci-broken`, `blocking-clean-ci`
- **Assignees:** `joperezr`, `radical`
- **Body marker:** the first line is a hidden HTML comment
`<!-- aspire-internal-build-broken:<branch> -->` used for dedup.

By design, only one open issue per branch exists at a time; the one
exception is the race described under "Dedup and race handling".

The body records the first failure's build link, commit SHA, and the
comma-separated list of failed stages (any of `build_sign_native`,
`build_extension`, `build`, `template_tests`, `assemble`,
`prepare_installers`). It is written once at creation and is **not**
rewritten afterwards.

On each subsequent failure the script **posts a follow-up comment** with
that build's link, commit SHA, failed stages, and `cc @joperezr @radical`.
The comments form the per-failure history. They are also what triggers
GitHub notifications, because editing the issue body would not.

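`Canceled` stage results (operator cancellation, 1ES timeouts) intentionally
do not file an issue: the stage condition uses explicit `in(..., 'Failed')`
checks, which exclude `Canceled`. They also do not close one, because a
canceled stage never verified its work. An existing `ci-broken` issue
therefore stays open until a fully green build, or until an operator closes
it manually after confirming the cancellation was spurious.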
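
To make the follow-up comment concrete, here is a minimal sketch of how the failed-stage list and comment text could be composed from the per-stage results. The exact wording and layout of the real comment are not shown in this document, so the format below is an assumption, and the URL and SHA are placeholders.

```python
def failed_stages(results):
    """Return the names of stages whose result is exactly 'Failed'."""
    return [stage for stage, result in results.items() if result == "Failed"]


def follow_up_comment(build_url, commit_sha, results):
    """Compose a hypothetical follow-up comment body (layout is illustrative)."""
    stages = ", ".join(failed_stages(results))
    return "\n".join(
        [
            f"Build: {build_url}",
            f"Commit: {commit_sha}",
            f"Failed stages: {stages}",
            "cc @joperezr @radical",
        ]
    )


results = {
    "build": "Failed",
    "template_tests": "Skipped",   # skipped because `build` failed
    "assemble": "Canceled",        # canceled stages are not listed
    "prepare_installers": "Failed",
}
print(follow_up_comment("https://example.invalid/build/1", "0123abc", results))
```

Only `Failed` results are listed; `Skipped` and `Canceled` stages are left out, matching the `in(..., 'Failed')` checks in the stage condition.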

## What gets closed

The `notify_success` stage lists open `ci-broken` issues, filters by the
branch marker, and for each match posts a "build is green again" comment
and closes the issue with `state_reason: completed`.

## Dedup and race handling

Issue lookup uses `GET /repos/microsoft/aspire/issues?labels=ci-broken&state=open`
(strongly consistent) plus a local body-marker filter. The Search API is
intentionally avoided because its 1–2 minute eventual-consistency window
would cause near-simultaneous failed builds to each see "0 hits" and file
duplicate issues.

Two builds of the same branch that fail almost simultaneously can both list
the open issues before either has created one, and so can still briefly
create two issues. Because builds are rolling this is rare, and the cost of
auto-deduping it (an extra `gh issue list` round-trip on every first failure)
isn't worth it, so the duplicate is left for a human to close.
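
The lookup described above amounts to "list by label, then filter locally by marker". A minimal sketch of that decision follows, assuming the issue list has already been fetched and parsed into dictionaries with `number` and `body` keys (as the GitHub REST API returns them). The function names are hypothetical, and `<branch>` is treated as an opaque string.

```python
def marker_for(branch):
    """Hidden body marker used to tie an issue to a branch."""
    return f"<!-- aspire-internal-build-broken:{branch} -->"


def plan_action(open_issues, branch):
    """Decide whether to comment on an existing issue or create a new one.

    open_issues: open issues carrying the ci-broken label.
    Returns ("comment", issue_number) or ("create", None).
    """
    marker = marker_for(branch)
    for issue in open_issues:
        body = issue.get("body") or ""  # the API may return a null body
        if any(line.startswith(marker) for line in body.splitlines()):
            return ("comment", issue["number"])
    return ("create", None)


issues = [
    {"number": 10, "body": marker_for("release/9.0") + "\nfirst failure details"},
    {"number": 11, "body": None},
]
print(plan_action(issues, "release/9.0"))
print(plan_action(issues, "main"))
```

Two builds that both reach `plan_action` before either has created an issue will both get `("create", None)`, which is the residual race the section accepts.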

## Auth

The script mints an installation access token for the **aspire-repo-bot**
GitHub App via [`Get-AspireBotInstallationToken.ps1`](../../eng/pipelines/scripts/Get-AspireBotInstallationToken.ps1)
(the same helper used by the release pipeline's
`dispatch-release-github-tasks.ps1`). The token is immediately registered
as a secret with the agent via `##vso[task.setsecret]` so any incidental
log echo is redacted; it is consumed by `gh` through the `GH_TOKEN` process
environment variable and is not persisted as a pipeline variable.

The App's `aspire-bot-app-id` and `aspire-bot-private-key` secrets come
from the `Aspire-Release-Secrets` variable group, imported at pipeline
scope in `eng/pipelines/azure-pipelines.yml` and gated on non-PR builds
of `refs/heads/main` or `refs/heads/release/*` — the same condition the
notify stages use. Manual runs on feature branches and PR builds skip
the import entirely.

**Prerequisite**: the aspire-repo-bot installation on microsoft/aspire must
have `issues:write` permission. If it is missing, every GitHub call made by
the script returns 403 (but the build never breaks; see below).

## Disabling for a single run

Queue the pipeline manually and set the `Notify on failure: dry-run`
parameter (`notifyOnFailureDryRun`) to true. In dry-run mode, both stages
log the `gh` CLI commands they *would* run without mutating anything on
GitHub. This applies to both the failure and success paths, so a
green-build dry-run will not accidentally close real open issues.

Dry-run mode is fully decoupled from the aspire-repo-bot credentials:
the wrapper omits the `ASPIRE_BOT_APP_ID` / `ASPIRE_BOT_PRIVATE_KEY` env
block and the script's `-AppId` / `-PrivateKeyPem` parameters are
non-mandatory, so a dry-run validation works without Aspire-Release-Secrets
variable group access and never mints a token.
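
The dry-run behavior can be pictured as a single choke point through which every `gh` invocation passes. The sketch below is an illustration only: the real script is PowerShell, and `run_gh` is a hypothetical helper name.

```python
import shlex
import subprocess


def run_gh(args, dry_run):
    """Run `gh` with the given arguments, or only log the command in dry-run."""
    command = ["gh", *args]
    if dry_run:
        # Nothing is executed, so no token is needed on this path.
        print("[dry-run] would run: " + shlex.join(command))
        return None
    completed = subprocess.run(command, check=True, capture_output=True, text=True)
    return completed.stdout


run_gh(
    ["issue", "close", "123", "--repo", "microsoft/aspire", "--reason", "completed"],
    dry_run=True,
)
```

Because the dry-run branch returns before any process is started, it needs neither the `gh` binary nor credentials, which is the property the paragraph above relies on.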

## Why this never breaks the build

[`Notify-GitHubOnBuildResult.ps1`](../../eng/pipelines/scripts/Notify-GitHubOnBuildResult.ps1)
wraps the entire body in `try`/`catch` and always exits 0. Any GitHub API
error, network blip, or 401/403 from a missing App permission produces a
`Write-Warning` in the job log but leaves the build result unchanged. A
flaky notification path must never turn an otherwise-correct build red.

However, a silently-skipped notification is its own failure mode — operators
need to see when the notification path itself broke (e.g., revoked App
permission, GitHub API shape change, deleted label). The catch block emits
AzDO logging commands so failures are visible without breaking the build:

- `##vso[task.logissue type=warning]` surfaces the warning in the build
summary, in 1ES dashboards, and on the badge.
- `##vso[task.complete result=SucceededWithIssues;]` bumps the job result
to `SucceededWithIssues`, which renders as a yellow badge instead of
green. Notifications and dashboards can filter on this.

A build that finishes "green-but-yellow" means the upstream build itself
succeeded, but the notify stage's call to GitHub failed for some reason.
This is worth investigating, but it does not block anything that depends on
the build.
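
The "never fail, but stay visible" pattern can be sketched as follows. The two `##vso[...]` logging commands are the ones named above; the wrapper function and the message text are hypothetical, and the real implementation is the PowerShell script's `try`/`catch`.

```python
def run_notifier(action):
    """Run the notification action; always return exit code 0.

    On any error, emit AzDO logging commands so the job is marked
    SucceededWithIssues instead of failing.
    """
    try:
        action()
    except Exception as error:
        # Logging commands are line-based, so flatten the message.
        message = str(error).replace("\n", " ")
        print(f"##vso[task.logissue type=warning]Notification failed: {message}")
        print("##vso[task.complete result=SucceededWithIssues;]")
    return 0


def failing_action():
    raise RuntimeError("HTTP 403: Resource not accessible by integration")


exit_code = run_notifier(failing_action)
print("exit code:", exit_code)
```

The exit code is 0 on both paths; only the logging commands distinguish a broken notification path from a healthy one.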

## Manually filing or closing

If you need to file or close a `ci-broken` issue by hand (e.g. during
recovery), use the existing label and add the marker `<!-- aspire-internal-build-broken:<branch> -->`
as the first line of the body. The marker must start a line — the script
matches it anchored to the start of a line so the text can't accidentally
match if pasted mid-prose into an unrelated issue. The script's next run
will treat the issue as the canonical open one and append/close accordingly.
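
The line-anchored match can be sketched with a multiline regular expression. This illustrates the matching rule only; the actual pattern used by the script is not shown in this document, and `has_branch_marker` is a hypothetical name.

```python
import re


def has_branch_marker(body, branch):
    """True when the branch marker begins some line of the issue body."""
    marker = f"<!-- aspire-internal-build-broken:{branch} -->"
    # re.MULTILINE makes ^ match at the start of every line, not just the body.
    pattern = "^" + re.escape(marker)
    return re.search(pattern, body, flags=re.MULTILINE) is not None


filed = "<!-- aspire-internal-build-broken:main -->\nInternal build broken on main"
prose = "Someone pasted <!-- aspire-internal-build-broken:main --> mid-sentence."
print(has_branch_marker(filed, "main"))
print(has_branch_marker(prose, "main"))
```

Because the marker includes the closing ` -->`, a marker for one branch cannot match as a prefix of a longer branch name.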
6 changes: 6 additions & 0 deletions eng/pipelines/README.md
@@ -40,3 +40,9 @@ This pipeline:
## Template Structure

The public pipelines (`azure-pipelines-public.yml` and `azdo-tests.yml`) use a shared template (`templates/public-pipeline-template.yml`) to avoid code duplication while maintaining the same functionality.

## Build-result notifications

`azure-pipelines.yml` files a GitHub issue on microsoft/aspire when the
internal build breaks on `main` or `release/*`, and closes it when the
next build is green. See [docs/ci/internal-build-failure-notifications.md](../../docs/ci/internal-build-failure-notifications.md).
143 changes: 143 additions & 0 deletions eng/pipelines/azure-pipelines.yml
@@ -21,6 +21,15 @@ parameters:
- stable
- staging
- daily
# When true, the notify_failure and notify_success stages run in dry-run
# mode: they log the `gh` CLI commands they would run to file/update/close
# the ci-broken issue, but do not actually mutate anything on
# microsoft/aspire. Use this when validating pipeline plumbing changes
# without spamming issues.
- name: notifyOnFailureDryRun
displayName: 'Notify on failure: dry-run (log gh commands, do not mutate)'
type: boolean
default: false

trigger:
batch: true
@@ -64,6 +73,13 @@ variables:
- template: /eng/pipelines/common-variables.yml@self
- template: /eng/common/templates-official/variables/pool-providers.yml@self

# aspire-bot-app-id + aspire-bot-private-key for notify_failure / notify_success.
# Gate mirrors the notify stage conditions: non-PR + main/release/* only.
# A manual run on a feature branch never needs these, so don't pay the
# variable-group auth check at queue time.
- ${{ if and(notin(variables['Build.Reason'], 'PullRequest'), or(eq(variables['Build.SourceBranch'], 'refs/heads/main'), startsWith(variables['Build.SourceBranch'], 'refs/heads/release/'))) }}:
- group: Aspire-Release-Secrets

- name: _BuildConfig
value: Release
- name: Build.Arcade.ArtifactsPath
@@ -876,3 +892,130 @@ extends:
Write-Host "Resolved macOS npm validation artifact: $artifactName"
Write-Host "##vso[task.setvariable variable=NpmValidationArtifactName]$artifactName"
displayName: 🟣Resolve native macOS RID

# ----------------------------------------------------------------
# On every non-PR build of main / release/*, file or update a GitHub
# issue on microsoft/aspire when the build fails, and close any open
# ci-broken issue for the branch when the build succeeds.
#
# Split into TWO stages so we can gate at the STAGE level — only
# stage conditions can reference dependencies.<stage>.result; from a
# JOB condition in a different stage, the only available form is
# stageDependencies.<stage>.<job>.result (no stage-aggregate form),
# which would force us to enumerate specific job names from upstream
# stages.
#
# See docs/ci/internal-build-failure-notifications.md for the full
# contract (labels, marker, dedupe behavior, dry-run).
# ----------------------------------------------------------------
- stage: notify_failure
displayName: Notify Build Failure
dependsOn:
- build_sign_native
- build_extension
- build
- template_tests
- assemble
- prepare_installers
# Files an issue when at least one watched stage Failed. We depend on
# EVERY non-notify stage so a break anywhere — including the publish
# stage `assemble`, the VSIX build `build_extension`, or `template_tests`
# — files an issue. A stage whose dependency failed is reported as
# 'Skipped' (not 'Failed'), so each stage must be watched directly
# rather than relying on one downstream stage to roll failures up.
# Canceled stage results match neither 'Failed' nor 'Succeeded', so
# operator/timeout cancellations produce no notification.
condition: |
and(
ne(variables['Build.Reason'], 'PullRequest'),
eq(variables['_IsNotificationBranch'], 'true'),
or(
in(dependencies.build_sign_native.result, 'Failed'),
in(dependencies.build_extension.result, 'Failed'),
in(dependencies.build.result, 'Failed'),
in(dependencies.template_tests.result, 'Failed'),
in(dependencies.assemble.result, 'Failed'),
in(dependencies.prepare_installers.result, 'Failed')
)
)
variables:
# Stage results exposed as runtime variables so the shared notify
# template can compose a "failed stages" list for the notification.
- name: BuildSignNativeStageResult
value: $[ dependencies.build_sign_native.result ]
- name: BuildExtensionStageResult
value: $[ dependencies.build_extension.result ]
- name: BuildStageResult
value: $[ dependencies.build.result ]
- name: TemplateTestsStageResult
value: $[ dependencies.template_tests.result ]
- name: AssembleStageResult
value: $[ dependencies.assemble.result ]
- name: PrepareInstallersStageResult
value: $[ dependencies.prepare_installers.result ]
jobs:
- template: /eng/pipelines/templates/notify-build-result.yml@self
parameters:
mode: Failure
jobName: NotifyOnFailure
jobDisplayName: 'File / update ci-broken issue'
dryRun: ${{ parameters.notifyOnFailureDryRun }}

- stage: notify_success
displayName: Notify Build Success
dependsOn:
- build_sign_native
- build_extension
- build
- template_tests
- assemble
- prepare_installers
# Only closes the issue when EVERY watched stage is Succeeded /
# SucceededWithIssues (the same stage set as notify_failure). Requiring
# the full set means a build that fixes one stage but breaks another
# (e.g. green build/build_sign_native but a failed `assemble`) does NOT
# falsely close the open ci-broken issue.
#
# prepare_installers also permits 'Skipped': on a notifiable branch
# (main / release/*) it runs whenever build_sign_native succeeded, so a
# Skip there implies build_sign_native did not succeed — a failure path
# already covered by notify_failure. (The stable-channel skip in its own
# condition only takes effect on non-notifiable branches.) Allowing
# 'Skipped' is therefore defensive and cannot cause a false close.
#
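# Canceled stage results match neither this condition nor notify_failure's,
# so they produce no notification. An existing ci-broken issue then stays
# open until a fully green build or until an operator closes it manually.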
condition: |
and(
ne(variables['Build.Reason'], 'PullRequest'),
eq(variables['_IsNotificationBranch'], 'true'),
in(dependencies.build_sign_native.result, 'Succeeded', 'SucceededWithIssues'),
in(dependencies.build_extension.result, 'Succeeded', 'SucceededWithIssues'),
in(dependencies.build.result, 'Succeeded', 'SucceededWithIssues'),
in(dependencies.template_tests.result, 'Succeeded', 'SucceededWithIssues'),
in(dependencies.assemble.result, 'Succeeded', 'SucceededWithIssues'),
in(dependencies.prepare_installers.result, 'Succeeded', 'SucceededWithIssues', 'Skipped')

Member

What's the intended behavior for mixed stage results — e.g., one stage Canceled (1ES timeout) and others Succeeded? Today neither stage fires, which is fine on the failure side, but if there's a pre-existing open ci-broken issue it'll stay open across subsequent mixed-result builds until a fully-clean build happens. Worth either widening the success condition to tolerate Canceled/Skipped on non-critical stages, or documenting that operators may need to manually close stuck issues after intermittent 1ES failures.

Member

We also have Partially Succeeded; I'm not sure I follow how that is counted now.

Member Author

Documented this in 835b033 rather than changing the conditions, because the limbo behavior is actually the conservative-correct outcome.

On PartiallySucceeded: at the YAML stage-dependency level dependencies.<stage>.result is one of Succeeded / SucceededWithIssues / Failed / Canceled / Skipped. Partial/warning surfaces as SucceededWithIssues (which the success condition already tolerates); PartiallySucceeded is a run-level / Classic-release label, not a stage-dependency result. So there's no separate bucket to handle.

On Canceled (the real limbo case — all green except one stage canceled by a 1ES timeout): neither stage fires by design. We don't file (a cancellation is infra noise, not a code break) and we don't close (a canceled stage never verified its work, so there's no fully-green signal). An existing ci-broken issue stays open until a genuinely green build; operators close it manually if they confirm the cancellation was spurious.

I rejected widening the success condition to tolerate Canceled because that would auto-close on an incomplete build — a false all-clear. Now spelled out in the stage comment + docs/ci/internal-build-failure-notifications.md.

)
variables:
# Stage results exposed as runtime variables so the shared notify
# template can compose a "failed stages" list. In Success mode none
# are 'Failed', so the list is empty and the notifier ignores it.
- name: BuildSignNativeStageResult
value: $[ dependencies.build_sign_native.result ]
- name: BuildExtensionStageResult
value: $[ dependencies.build_extension.result ]
- name: BuildStageResult
value: $[ dependencies.build.result ]
- name: TemplateTestsStageResult
value: $[ dependencies.template_tests.result ]
- name: AssembleStageResult
value: $[ dependencies.assemble.result ]
- name: PrepareInstallersStageResult
value: $[ dependencies.prepare_installers.result ]
jobs:
- template: /eng/pipelines/templates/notify-build-result.yml@self
parameters:
mode: Success
jobName: CloseOnSuccess
jobDisplayName: 'Close ci-broken issue'
dryRun: ${{ parameters.notifyOnFailureDryRun }}
10 changes: 10 additions & 0 deletions eng/pipelines/common-variables.yml
@@ -71,3 +71,13 @@ variables:
# repositories." Consumed by publish-winget.yml to guard upstream submission.
- name: _IsProductionBranch
value: ${{ or(eq(variables['Build.SourceBranch'], 'refs/heads/main'), startsWith(variables['Build.SourceBranch'], 'refs/heads/release/'), startsWith(variables['Build.SourceBranch'], 'refs/heads/internal/release/')) }}

# Branches where notify_failure / notify_success file or close GitHub
# issues. Excludes internal/release/* so internal branch names don't
# leak into the public microsoft/aspire tracker.
#
# IMPORTANT: exact match on refs/heads/main, not startsWith — the
# pipeline trigger's `main*` wildcard would otherwise sweep in
# branches like main-something.
- name: _IsNotificationBranch
value: ${{ or(eq(variables['Build.SourceBranch'], 'refs/heads/main'), startsWith(variables['Build.SourceBranch'], 'refs/heads/release/')) }}