bug(#1099): fix deploy-azd startup_failure + permission-cap CI gate by Cataldir · Pull Request #1099 · Azure-Samples/holiday-peak-hub

Cataldir · 2026-05-13T04:05:30Z

Closes #1099

Problem

All 27 per-service deploy entrypoints (deploy-azd-*.yml) have failed
with startup_failure on every dispatch since PR #1097 merged. Concretely:
17 startup_failure + 29 cancelled + 0 successful runs across the last
1,000 entries in the workflow history. Every run died in ~7 seconds with no
logs.

Root cause

PR #1097 added an open-image-tag-bump-pr job to the reusable callee
.github/workflows/deploy-azd.yml that requested
permissions.pull-requests: write. The 27 per-service entrypoints grant
only id-token | contents | issues: write on their uses: job:

# in deploy-azd-inventory-health-check.yml (caller)
deploy:
  permissions:
    id-token: write
    contents: write
    issues: write
  uses: ./.github/workflows/deploy-azd.yml  # callee tries to elevate to pull-requests: write — REJECTED

GitHub's nested-workflow rule:
"Permissions can only be maintained or reduced — not elevated — throughout
the chain." GitHub rejects the callee at the orchestrator before any
runner is allocated, which is why no logs exist for the failed runs.
actionlint cannot catch this — it is a cross-file semantic rule.
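
The cap rule above can be illustrated with a minimal Python sketch. It is not the code GitHub runs; the function name `elevated_scopes` and the numeric ordering of levels are assumptions made for illustration. A scope missing from the caller's map is treated as `none`, which matches how GitHub handles scopes left out of an explicit `permissions:` map.

```python
# Access levels ordered from least to most privileged (illustrative ordering).
LEVELS = {"none": 0, "read": 1, "write": 2}


def elevated_scopes(caller: dict, callee: dict) -> dict:
    """Return the scopes where the callee asks for more than the caller grants.

    Hypothetical helper: each map is scope -> level, and a scope absent
    from the caller map counts as "none".
    """
    problems = {}
    for scope, wanted in callee.items():
        granted = caller.get(scope, "none")
        if LEVELS[wanted] > LEVELS[granted]:
            problems[scope] = (granted, wanted)
    return problems


# Caller grant taken from the per-service entrypoint shown above.
caller_job = {"id-token": "write", "contents": "write", "issues": "write"}
# Callee job map: pull-requests: write is from the PR text; contents is assumed.
callee_job = {"contents": "write", "pull-requests": "write"}

print(elevated_scopes(caller_job, callee_job))
# → {'pull-requests': ('none', 'write')}
```

Any non-empty result corresponds to a chain that GitHub would reject before allocating a runner.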

Bisection confirmed: a branch reverting deploy-azd.yml to PR #1097's
parent commit ran the full 18-job pipeline successfully (only failed on
expected environment protection on the deploy job).

Fix

1. Remove the open-image-tag-bump-pr job from deploy-azd.yml.
   The 27 HelmRelease YAML re-pins from PR #1097 are kept; they are a
   legitimate ACR-path fix. The proper PR-bridge implementation
   (Phase 2b) moves to Flux's own components and lives outside the GHA
   pipeline. See the ADR-017 amendment below for the post-mortem and
   the next-attempt design.
2. Grant the missing permissions on deploy-azd-prod.yml's
   deploy job. The watchdog-apim-agc-swa-drift job in the
   callee comments on issue #298 when drift is detected, requiring
   issues: write. Prod tag pushes (v*.*.*) would have hit
   the same startup_failure. This was a latent bug, fixed in this PR.
3. Add scripts/ci/lint_workflow_permissions.py, a static
   permission-cap linter that diffs each caller's per-job
   permissions: map (with workflow-level fallback) against the
   union of per-job permissions: declared in the callee. Emits
   ::error file=...:: markers on mismatch. 4 unit tests cover the
   pass case, the #1099 regression, the read-only caller case,
   and the absent-per-job-permissions case.
4. Add .github/workflows/lint-actions.yml, which runs actionlint
   plus the new linter on every workflow PR/push. This prevents this
   exact class of regression from shipping silently again.
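
The linter's comparison can be sketched roughly as follows. This is not the contents of scripts/ci/lint_workflow_permissions.py, which is not shown in this PR text. The function names, the error wording after the `::error file=...::` prefix, and the use of pre-parsed dicts instead of YAML files are all assumptions. The sketch handles only the map form of `permissions:` and treats a caller with no `permissions:` at either level as granting nothing, which is a simplification.

```python
LEVELS = {"none": 0, "read": 1, "write": 2}


def caller_grant(workflow: dict, job_name: str) -> dict:
    """Per-job permissions map, falling back to the workflow-level map."""
    job = workflow["jobs"][job_name]
    return job.get("permissions", workflow.get("permissions", {}))


def callee_required(workflow: dict) -> dict:
    """Union of per-job permissions maps, keeping the highest level per scope."""
    required = {}
    for job in workflow["jobs"].values():
        for scope, level in job.get("permissions", {}).items():
            if LEVELS[level] > LEVELS[required.get(scope, "none")]:
                required[scope] = level
    return required


def lint(caller_path: str, caller: dict, job_name: str, callee: dict) -> list:
    """Return one ::error marker per scope the callee would elevate."""
    granted = caller_grant(caller, job_name)
    errors = []
    for scope, level in sorted(callee_required(callee).items()):
        have = granted.get(scope, "none")
        if LEVELS[level] > LEVELS[have]:
            errors.append(
                f"::error file={caller_path}::job '{job_name}' grants "
                f"{scope}: {have} but the callee requires {scope}: {level}"
            )
    return errors


# The #1099 regression, expressed as already-parsed workflow dicts.
caller = {"jobs": {"deploy": {
    "permissions": {"id-token": "write", "contents": "write", "issues": "write"},
    "uses": "./.github/workflows/deploy-azd.yml",
}}}
callee = {"jobs": {
    "open-image-tag-bump-pr": {"permissions": {"pull-requests": "write"}},
}}

for line in lint(".github/workflows/deploy-azd-inventory-health-check.yml",
                 caller, "deploy", callee):
    print(line)
```

Emitting `::error file=...::` lines lets GitHub Actions annotate the offending caller file directly in the PR checks.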

ADR-017 amendment

docs/architecture/adrs/adr-017-deployment-strategy.md § "Phase 2b"
now documents the failed first attempt (PR #1097), the cross-file
permission-cap defect, the scoped revert, and the next-attempt design
(Flux ImageRepository + ImagePolicy + ImageUpdateAutomation +
Notification Controller GitHub provider for the PR-bridge).

Validation

Check	Result
`python scripts/ci/lint_workflow_permissions.py`	OK (0 violations)
`actionlint` on all 4 modified workflow files	0 issues
`pytest scripts/ci/tests/test_lint_workflow_permissions.py`	4 passed
Pre-push gate (705 tests + full lint suite)	passed
Bisection (revert `deploy-azd.yml` to PR #1097's parent)	18 jobs ran cleanly

Risk

Low. The removed job has been a no-op since merge (never produced
output). The added permissions: block on deploy-azd-prod.yml only
takes effect on v*.*.* tag pushes, which have not happened during the
broken window. The linter is additive — fails closed on workflow changes
only.

Follow-ups (separate issues)

- Phase 2b proper implementation: Flux ImageUpdateAutomation +
  Notification Controller PR-bridge per the ADR-017 amendment.
- Foundry hosted-agent flip: apps/inventory-health-check/agent.yaml
  needs template.kind: direct-model → hosted-agent once the dev
  deploy chain confirms PR #1098's hosted mount path works in AKS.

…cap lint

Root cause
----------
PR #1097 added an `open-image-tag-bump-pr` job to the reusable `.github/workflows/deploy-azd.yml` that declared `permissions.pull-requests: write`. The 27 per-service entrypoints (`deploy-azd-*.yml`) grant only `id-token | contents | issues: write` on their `uses:` job. GitHub Actions rejects nested-workflow callees whose `permissions:` map elevates beyond the caller grant, and the rejection happens at the orchestrator before any runner is allocated. `actionlint` cannot see this because it is a cross-file semantic rule.

Impact: 17 `startup_failure`-s + 29 cancellations + 0 successes across the last 1,000 runs. Every dispatched deploy short-circuited in ~7s with no logs. The dev deploy chain has been silently broken for ~2 days.

Fix
---
1. Remove the `open-image-tag-bump-pr` job from `deploy-azd.yml`. The 27 HelmRelease YAML re-pins from PR #1097 are kept (legitimate ACR-path fix). The proper Phase 2b implementation moves to Flux's `ImageRepository` + `ImagePolicy` + `ImageUpdateAutomation` + Notification Controller (tracked separately in ADR-017 amendment).
2. Add `permissions: { id-token | contents | issues }: write` to `deploy-azd-prod.yml` (latent bug: prod tag pushes would have hit the same failure on the `watchdog-apim-agc-swa-drift` job, which posts comments to issue #298).
3. Add `scripts/ci/lint_workflow_permissions.py`: static linter that diffs caller/callee per-job permission maps with workflow-level fallback, catches this exact regression class, and emits `::error file=...::` markers. Backed by 4 unit tests.
4. Add `.github/workflows/lint-actions.yml`: CI gate running actionlint + the permission-cap linter on every workflow PR/push.

Validation
----------
- `python scripts/ci/lint_workflow_permissions.py` → OK
- `actionlint` on deploy-azd.yml / deploy-azd-prod.yml / deploy-azd-dev.yml / lint-actions.yml → 0 issues
- `pytest scripts/ci/tests/test_lint_workflow_permissions.py` → 4 passed
- Bisection branch `bug/1099-bisect-deploy-azd-pre-1097` reverted to the parent of PR #1097 and confirmed 18 jobs ran successfully (only failed on expected environment protection).
- ADR-017 §Phase 2b updated with attempt-1 post-mortem, decision, and lessons learned.

Refs
----
- Closes #1099
- Amends ADR-017 (Deployment Strategy)
- Follow-up issue to be filed: proper Phase 2b via Flux `ImageUpdateAutomation` + Notification Controller PR-bridge.

The new lint-actions workflow's purpose is enforcing GitHub Actions schema + the cross-file permission-cap rule (via scripts/ci/lint_workflow_permissions.py). The embedded shellcheck pass surfaced ~30 pre-existing style warnings (SC2034/SC2129/SC2153) in deploy-azd.yml and ci.yml that this PR does not touch. Bundling shellcheck cleanup with the deploy fix would force unrelated refactors as the price of merging. Re-enable in a focused follow-up if/when the team commits to a shell-quality cleanup pass.

…workflow-level baseline (#1100)

Two issues remained after PR #1099:

1. deploy-azd-truth.yml (scoped multi-truth-agent entrypoint) granted only contents: read at both workflow- and job-level. The reusable callee deploy-azd.yml declares workflow-level contents: write (jobs inherit it). The GitHub Actions cap check rejected the run with startup_failure before runner allocation. Fix: grant contents: write at both levels in the truth entrypoint, mirroring every other per-service entrypoint.
2. The permission-cap linter introduced in PR #1099 missed this case because it only computed the union of CALLEE per-job permissions: maps, ignoring the callee's workflow-level fallback. GitHub treats workflow-level permissions as the effective permissions for any job that omits its own map. The linter now mirrors those semantics: when collecting the callee required-set, workflow-level permissions are seeded first and per-job maps override per-key when present.

Tests: added regression test_linter_includes_callee_workflow_level_permissions_in_required_set (5/5 pass).

Validation: python scripts/ci/lint_workflow_permissions.py -> 1 callee checked, zero violations.

Closes-related #1099.
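
One possible reading of the "seed workflow-level first, then override per key" rule is sketched below. It is an interpretation of the commit message, not the linter's actual code. The function name and the hypothetical `build` job are assumptions. Each job's effective map is taken as the workflow-level map overlaid by the job's own map, and the required set is the highest level per scope across all jobs.

```python
LEVELS = {"none": 0, "read": 1, "write": 2}


def callee_required(workflow: dict) -> dict:
    """Required set for a callee, including its workflow-level baseline.

    Assumed semantics: workflow-level permissions are the starting point
    for every job, and a job's own map overrides them key by key.
    """
    baseline = workflow.get("permissions", {})
    required = {}
    for job in workflow["jobs"].values():
        effective = {**baseline, **job.get("permissions", {})}
        for scope, level in effective.items():
            if LEVELS[level] > LEVELS[required.get(scope, "none")]:
                required[scope] = level
    return required


# Callee with a workflow-level contents: write, as described for deploy-azd.yml.
callee = {
    "permissions": {"contents": "write"},
    "jobs": {
        "build": {},  # hypothetical job that inherits the workflow-level map
        "watchdog-apim-agc-swa-drift": {"permissions": {"issues": "write"}},
    },
}
# Caller grant described for deploy-azd-truth.yml before the fix.
truth_caller = {"contents": "read"}

required = callee_required(callee)
violations = {
    scope: level
    for scope, level in required.items()
    if LEVELS[level] > LEVELS[truth_caller.get(scope, "none")]
}
print(required)
print(violations)
```

With the per-job-only union from PR #1099, the `contents: write` requirement would not appear in the required set for a callee like this one, which is the gap the follow-up describes.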

github-code-quality Bot found potential problems May 13, 2026

Comment thread scripts/ci/tests/test_lint_workflow_permissions.py Fixed

Cataldir added 2 commits May 13, 2026 01:09

ci(#1099): drop unused pytest import flagged by Copilot reviewer

f470eaf

Cataldir merged commit 88409ec into main May 13, 2026
15 checks passed

Cataldir deleted the bug/1099-fix-deploy-azd-startup-failure branch May 13, 2026 04:34

Cataldir mentioned this pull request May 13, 2026

bug(#1099-followup): truth scoped entrypoint contents:write + linter workflow-level baseline #1100

Merged

Cataldir merged 3 commits into main from bug/1099-fix-deploy-azd-startup-failure
