
bug(#1099): fix deploy-azd startup_failure + permission-cap CI gate #1099

Merged
Cataldir merged 3 commits into main from bug/1099-fix-deploy-azd-startup-failure
May 13, 2026

Conversation

@Cataldir

Contributor

Closes #1099

Problem

All 27 per-service deploy entrypoints (deploy-azd-*.yml) have failed with
startup_failure on every dispatch since PR #1097 merged. Concretely:
17 startup_failure + 29 cancelled + 0 successful runs across the last
1,000 entries in the workflow history. Every run died in ~7 seconds with no
logs.

Root cause

PR #1097 added an open-image-tag-bump-pr job to the reusable callee
.github/workflows/deploy-azd.yml that requested
permissions.pull-requests: write. The 27 per-service entrypoints grant
only id-token | contents | issues: write on their uses: job:

# in deploy-azd-inventory-health-check.yml (caller)
deploy:
  permissions:
    id-token: write
    contents: write
    issues: write
  uses: ./.github/workflows/deploy-azd.yml  # callee tries to elevate to pull-requests: write — REJECTED

GitHub's nested-workflow rule:
"Permissions can only be maintained or reduced — not elevated — throughout
the chain."
GitHub rejects the callee at the orchestrator before any runner is
allocated, which is why no logs exist for the failed runs.
actionlint cannot catch this because it is a cross-file semantic rule.

Bisection confirmed: a branch reverting deploy-azd.yml to PR #1097's
parent commit ran the full 18-job pipeline successfully (only failed on
expected environment protection on the deploy job).

Fix

  1. Remove the open-image-tag-bump-pr job from deploy-azd.yml.
    The 27 HelmRelease YAML re-pins from PR #1097 are kept; they are a
    legitimate ACR-path fix. The proper PR-bridge implementation
    (Phase 2b) moves to Flux's own components and lives outside the GHA
    pipeline. See the ADR-017 amendment below for the post-mortem and
    the next-attempt design.
  2. Grant the missing permissions on deploy-azd-prod.yml's
    deploy job. The watchdog-apim-agc-swa-drift job in the
    callee comments on issue #298 when drift is detected, which requires
    issues: write. Prod tag pushes (v*.*.*) would have hit
    the same startup_failure. This is a latent bug, fixed in this PR.
  3. Add scripts/ci/lint_workflow_permissions.py, a static
    permission-cap linter that diffs each caller's per-job
    permissions: map (with workflow-level fallback) against the
    union of per-job permissions: declared in the callee. It emits
    ::error file=...:: markers on mismatch. 4 unit tests cover the
    pass case, the #1099 regression, the read-only caller case,
    and the absent-per-job-permissions case.
  4. Add .github/workflows/lint-actions.yml, which runs actionlint
    plus the new linter on every workflow PR/push. This prevents this
    exact class of regression from shipping silently again.
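The linter described in item 3 might look roughly like the sketch below. This is not the contents of scripts/ci/lint_workflow_permissions.py. It assumes the workflows are already parsed into dicts, handles only map-style permissions: blocks (not shorthand such as read-all), and treats a caller with no permissions anywhere as granting nothing. All function names and the callee job layout are hypothetical.

```python
LEVELS = {"none": 0, "read": 1, "write": 2}


def caller_grant(workflow, job_name):
    """Per-job permissions if the job declares them, else the workflow-level map."""
    job = workflow["jobs"][job_name]
    if "permissions" in job:
        return job["permissions"]
    # Simplifying assumption: no map at either level means no grant at all.
    return workflow.get("permissions", {})


def callee_required(workflow):
    """Union of per-job permissions in the callee, keeping the highest level per scope."""
    required = {}
    for job in workflow.get("jobs", {}).values():
        for scope, level in job.get("permissions", {}).items():
            if LEVELS[level] > LEVELS[required.get(scope, "none")]:
                required[scope] = level
    return required


def lint(caller_path, caller, job_name, callee_path, callee):
    """Return one GitHub Actions error annotation per scope the callee elevates."""
    grant = caller_grant(caller, job_name)
    errors = []
    for scope, level in sorted(callee_required(callee).items()):
        have = grant.get(scope, "none")
        if LEVELS[level] > LEVELS[have]:
            errors.append(
                f"::error file={caller_path}::job '{job_name}' grants "
                f"{scope}: {have} but {callee_path} requires {scope}: {level}"
            )
    return errors


CALLER_PATH = ".github/workflows/deploy-azd-inventory-health-check.yml"
CALLEE_PATH = ".github/workflows/deploy-azd.yml"
caller = {"jobs": {"deploy": {
    "permissions": {"id-token": "write", "contents": "write", "issues": "write"},
    "uses": "./.github/workflows/deploy-azd.yml",
}}}
# Callee layout is illustrative; only the pull-requests: write request on
# open-image-tag-bump-pr is confirmed by this PR.
callee = {"jobs": {
    "deploy": {"permissions": {"id-token": "write", "contents": "read"}},
    "open-image-tag-bump-pr": {"permissions": {"contents": "write", "pull-requests": "write"}},
}}

errors = lint(CALLER_PATH, caller, "deploy", CALLEE_PATH, callee)
for line in errors:
    print(line)
```

For the #1099 regression input this yields a single annotation naming pull-requests. A real gate would exit non-zero whenever the list is non-empty.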

ADR-017 amendment

docs/architecture/adrs/adr-017-deployment-strategy.md § "Phase 2b"
now documents the failed first attempt (PR #1097), the cross-file
permission-cap defect, the scoped revert, and the next-attempt design
(Flux ImageRepository + ImagePolicy + ImageUpdateAutomation
+ Notification Controller GitHub provider for the PR-bridge).

Validation

| Check | Result |
| --- | --- |
| python scripts/ci/lint_workflow_permissions.py | OK (0 violations) |
| actionlint on all 4 modified workflow files | 0 issues |
| pytest scripts/ci/tests/test_lint_workflow_permissions.py | 4 passed |
| Pre-push gate (705 tests + full lint suite) | passed |
| Bisection (revert deploy-azd.yml to PR #1097's parent) | 18 jobs ran cleanly |

Risk

Low. The removed job has been a no-op since merge (never produced
output). The added permissions: block on deploy-azd-prod.yml only
takes effect on v*.*.* tag pushes, which have not happened during the
broken window. The linter is additive — fails closed on workflow changes
only.

Follow-ups (separate issues)

…cap lint

Root cause
----------
PR #1097 added an `open-image-tag-bump-pr` job to the reusable
`.github/workflows/deploy-azd.yml` that declared
`permissions.pull-requests: write`.  The 27 per-service entrypoints
(`deploy-azd-*.yml`) grant only `id-token | contents | issues: write`
on their `uses:` job.  GitHub Actions rejects nested-workflow callees
whose `permissions:` map elevates beyond the caller grant, and the
rejection happens at the orchestrator before any runner is allocated.
`actionlint` cannot see this because it is a cross-file semantic rule.

Impact: 17 `startup_failure`-s + 29 cancellations + 0 successes across
the last 1,000 runs.  Every dispatched deploy short-circuited in ~7s with
no logs.  The dev deploy chain has been silently broken for ~2 days.

Fix
---
1. Remove the `open-image-tag-bump-pr` job from `deploy-azd.yml`.
   The 27 HelmRelease YAML re-pins from PR #1097 are kept (legitimate
   ACR-path fix).  The proper Phase 2b implementation moves to Flux's
   `ImageRepository` + `ImagePolicy` + `ImageUpdateAutomation` +
   Notification Controller (tracked separately in ADR-017 amendment).
2. Add `permissions: { id-token | contents | issues }: write` to
   `deploy-azd-prod.yml` (latent bug — prod tag pushes would have
   hit the same failure on the `watchdog-apim-agc-swa-drift` job
   which posts comments to issue #298).
3. Add `scripts/ci/lint_workflow_permissions.py` — static linter
   that diffs caller/callee per-job permission maps with workflow-level
   fallback, catches this exact regression class, and emits
   `::error file=...::` markers.  Backed by 4 unit tests.
4. Add `.github/workflows/lint-actions.yml` — CI gate running
   actionlint + the permission-cap linter on every workflow PR/push.

Validation
----------
- `python scripts/ci/lint_workflow_permissions.py` → OK
- `actionlint` on deploy-azd.yml / deploy-azd-prod.yml / deploy-azd-dev.yml
  / lint-actions.yml → 0 issues
- `pytest scripts/ci/tests/test_lint_workflow_permissions.py` → 4 passed
- Bisection branch `bug/1099-bisect-deploy-azd-pre-1097` reverted to the
  parent of PR #1097 and confirmed 18 jobs ran successfully (only failed
  on expected environment protection).
- ADR-017 §Phase 2b updated with attempt-1 post-mortem, decision, and
  lessons learned.

Refs
----
- Closes #1099
- Amends ADR-017 (Deployment Strategy)
- Follow-up issue to be filed: proper Phase 2b via Flux
  `ImageUpdateAutomation` + Notification Controller PR-bridge.
Comment thread scripts/ci/tests/test_lint_workflow_permissions.py Fixed
Cataldir added 2 commits May 13, 2026 01:09
The new lint-actions workflow's purpose is enforcing GitHub Actions schema + the cross-file permission-cap rule (via scripts/ci/lint_workflow_permissions.py). The embedded shellcheck pass surfaced ~30 pre-existing style warnings (SC2034/SC2129/SC2153) in deploy-azd.yml and ci.yml that this PR does not touch. Bundling shellcheck cleanup with the deploy fix would force unrelated refactors as the price of merging. Re-enable in a focused follow-up if/when the team commits to a shell-quality cleanup pass.
@Cataldir Cataldir merged commit 88409ec into main May 13, 2026
15 checks passed
@Cataldir Cataldir deleted the bug/1099-fix-deploy-azd-startup-failure branch May 13, 2026 04:34
Cataldir added a commit that referenced this pull request May 13, 2026
…workflow-level baseline (#1100)

Two issues remained after PR #1099:

1. deploy-azd-truth.yml (scoped multi-truth-agent entrypoint) granted only contents: read at both workflow- and job-level. The reusable callee deploy-azd.yml declares workflow-level contents: write (jobs inherit it). GitHub Actions cap check rejected the run with startup_failure before runner allocation. Fix: grant contents: write at both levels in the truth entrypoint, mirroring every other per-service entrypoint.

2. The permission-cap linter introduced in PR #1099 missed this case because it only computed the union of CALLEE per-job permissions: maps, ignoring callee workflow-level fallback. GitHub treats workflow-level permissions as the effective permissions for any job that omits its own map. The linter now mirrors that semantics: when collecting the callee required-set, workflow-level permissions are seeded first and per-job maps override per-key when present.
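The revised required-set computation from item 2 can be sketched as follows. It follows the per-key override described above (workflow-level permissions seeded first, per-job maps layered on top) and assumes workflows are already parsed into dicts. The function name and job names are hypothetical; this is not the linter's actual code.

```python
LEVELS = {"none": 0, "read": 1, "write": 2}


def callee_required_set(workflow):
    """Highest level needed per scope across all callee jobs.

    Each job's effective map starts from the workflow-level permissions and is
    then overridden per key by the job's own permissions: map, if it has one.
    """
    base = workflow.get("permissions", {})
    required = {}
    for job in workflow.get("jobs", {}).values():
        effective = {**base, **job.get("permissions", {})}
        for scope, level in effective.items():
            if LEVELS[level] > LEVELS[required.get(scope, "none")]:
                required[scope] = level
    return required


# Callee with a workflow-level baseline, as described for deploy-azd.yml.
# Job names and the id-token entry are illustrative assumptions.
callee = {
    "permissions": {"contents": "write"},
    "jobs": {
        "build": {},  # no map of its own: inherits contents: write
        "deploy": {"permissions": {"id-token": "write"}},
    },
}
# Grant of the truth entrypoint before the fix.
truth_caller_grant = {"contents": "read"}

required = callee_required_set(callee)
violations = {
    scope: level
    for scope, level in required.items()
    if LEVELS[level] > LEVELS[truth_caller_grant.get(scope, "none")]
}
print(sorted(violations))
# ['contents', 'id-token']
```

A union over per-job maps alone would return only id-token here and miss the contents: write baseline, which is the gap this follow-up closes.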

Tests: added regression test_linter_includes_callee_workflow_level_permissions_in_required_set (5/5 pass).

Validation: python scripts/ci/lint_workflow_permissions.py -> 1 callee checked, zero violations.

Closes-related #1099.