bug(#1099): fix deploy-azd startup_failure + permission-cap CI gate #1099
Merged
…cap lint

Root cause
----------
PR #1097 added an `open-image-tag-bump-pr` job to the reusable `.github/workflows/deploy-azd.yml` that declared `permissions.pull-requests: write`. The 27 per-service entrypoints (`deploy-azd-*.yml`) grant only `id-token | contents | issues: write` on their `uses:` job. GitHub Actions rejects nested-workflow callees whose `permissions:` map elevates beyond the caller grant, and the rejection happens at the orchestrator before any runner is allocated. `actionlint` cannot see this because it is a cross-file semantic rule.

Impact: 17 `startup_failure`-s + 29 cancellations + 0 successes across the last 1,000 runs. Every dispatched deploy short-circuited in ~7s with no logs. The dev deploy chain has been silently broken for ~2 days.

Fix
---
1. Remove the `open-image-tag-bump-pr` job from `deploy-azd.yml`. The 27 HelmRelease YAML re-pins from PR #1097 are kept (legitimate ACR-path fix). The proper Phase 2b implementation moves to Flux's `ImageRepository` + `ImagePolicy` + `ImageUpdateAutomation` + Notification Controller (tracked separately in ADR-017 amendment).
2. Add `permissions: { id-token | contents | issues }: write` to `deploy-azd-prod.yml` (latent bug: prod tag pushes would have hit the same failure on the `watchdog-apim-agc-swa-drift` job, which posts comments to issue #298).
3. Add `scripts/ci/lint_workflow_permissions.py`: a static linter that diffs caller/callee per-job permission maps with workflow-level fallback, catches this exact regression class, and emits `::error file=...::` markers. Backed by 4 unit tests.
4. Add `.github/workflows/lint-actions.yml`: a CI gate running actionlint + the permission-cap linter on every workflow PR/push.

Validation
----------
- `python scripts/ci/lint_workflow_permissions.py` → OK
- `actionlint` on deploy-azd.yml / deploy-azd-prod.yml / deploy-azd-dev.yml / lint-actions.yml → 0 issues
- `pytest scripts/ci/tests/test_lint_workflow_permissions.py` → 4 passed
- Bisection branch `bug/1099-bisect-deploy-azd-pre-1097` reverted to the parent of PR #1097 and confirmed 18 jobs ran successfully (only failed on expected environment protection).
- ADR-017 §Phase 2b updated with attempt-1 post-mortem, decision, and lessons learned.

Refs
----
- Closes #1099
- Amends ADR-017 (Deployment Strategy)
- Follow-up issue to be filed: proper Phase 2b via Flux `ImageUpdateAutomation` + Notification Controller PR-bridge.
The new lint-actions workflow's purpose is enforcing GitHub Actions schema + the cross-file permission-cap rule (via scripts/ci/lint_workflow_permissions.py). The embedded shellcheck pass surfaced ~30 pre-existing style warnings (SC2034/SC2129/SC2153) in deploy-azd.yml and ci.yml that this PR does not touch. Bundling shellcheck cleanup with the deploy fix would force unrelated refactors as the price of merging. Re-enable in a focused follow-up if/when the team commits to a shell-quality cleanup pass.
Cataldir added a commit that referenced this pull request May 13, 2026
…workflow-level baseline (#1100)

Two issues remained after PR #1099:

1. `deploy-azd-truth.yml` (scoped multi-truth-agent entrypoint) granted only `contents: read` at both workflow and job level. The reusable callee `deploy-azd.yml` declares workflow-level `contents: write` (jobs inherit it). The GitHub Actions cap check rejected the run with `startup_failure` before runner allocation. Fix: grant `contents: write` at both levels in the truth entrypoint, mirroring every other per-service entrypoint.
2. The permission-cap linter introduced in PR #1099 missed this case because it only computed the union of CALLEE per-job `permissions:` maps, ignoring the callee's workflow-level fallback. GitHub treats workflow-level permissions as the effective permissions for any job that omits its own map. The linter now mirrors those semantics: when collecting the callee required-set, workflow-level permissions are seeded first and per-job maps override per-key when present.

Tests: added regression `test_linter_includes_callee_workflow_level_permissions_in_required_set` (5/5 pass).

Validation: `python scripts/ci/lint_workflow_permissions.py` -> 1 callee checked, zero violations.

Closes-related #1099.
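The seeding order described in this commit message can be sketched as follows. This is a minimal illustration under stated assumptions, not the linter's actual code: the function name `callee_required_set` and the plain-dict workflow shape are hypothetical, and taking the per-key maximum across jobs is only one reading of "union".

```python
# Rank access levels so two declarations for the same scope can be compared.
LEVELS = {"none": 0, "read": 1, "write": 2}


def callee_required_set(workflow):
    """Compute the permissions a caller must grant for a reusable workflow.

    `workflow` is a parsed workflow as a plain dict (hypothetical shape):
    an optional top-level "permissions" map and a "jobs" map whose entries
    may carry their own "permissions" map.
    """
    workflow_level = workflow.get("permissions", {})
    required = {}
    for job in workflow.get("jobs", {}).values():
        # Seed with the workflow-level map, then let the job's own map
        # override per key, as the commit message describes.
        effective = {**workflow_level, **job.get("permissions", {})}
        for scope, level in effective.items():
            current = required.get(scope, "none")
            if LEVELS[level] > LEVELS[current]:
                required[scope] = level
    return required


# Hypothetical callee: one job inherits the workflow-level map, one adds a scope.
callee = {
    "permissions": {"contents": "write"},
    "jobs": {
        "build": {},
        "watchdog": {"permissions": {"issues": "write"}},
    },
}
print(callee_required_set(callee))
```

With only per-job maps considered, the `contents: write` requirement in this example would be missed, which is the gap the commit describes.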
Closes #1099
Problem
All 27 per-service deploy entrypoints (`deploy-azd-*.yml`) have been failing with `startup_failure` on every dispatch since PR #1097 merged. Concretely: 17 `startup_failure` + 29 cancelled + 0 successful runs across the last 1,000 entries in the workflow history. Every run died in ~7 seconds with no logs.
Root cause
PR #1097 added an `open-image-tag-bump-pr` job to the reusable callee `.github/workflows/deploy-azd.yml` that requested `permissions.pull-requests: write`. The 27 per-service entrypoints grant only `id-token | contents | issues: write` on their `uses:` job.
GitHub's nested-workflow rule:
"Permissions can only be maintained or reduced — not elevated — throughout
the chain." GitHub rejects the callee at the orchestrator before any
runner is allocated, which is why no logs exist for the failed runs.
`actionlint` cannot catch this because it is a cross-file semantic rule.
Bisection confirmed: a branch reverting `deploy-azd.yml` to PR #1097's parent commit ran the full 18-job pipeline successfully (only failed on expected environment protection on the deploy job).
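The cap rule behind this failure can be sketched as a small comparison. The snippet is an illustration only, not the code in `scripts/ci/lint_workflow_permissions.py`: the name `find_escalations`, the `LEVELS` ranking, and the dict shapes are assumptions made for the example.

```python
# Rank access levels so a callee request can be compared to a caller grant.
LEVELS = {"none": 0, "read": 1, "write": 2}


def find_escalations(caller_grant, callee_request):
    """Return scopes where the callee asks for more than the caller grants.

    Scopes missing from the caller's map are treated as "none", which holds
    when the caller declares an explicit permissions map.
    """
    escalations = {}
    for scope, wanted in callee_request.items():
        granted = caller_grant.get(scope, "none")
        if LEVELS[wanted] > LEVELS[granted]:
            escalations[scope] = (granted, wanted)
    return escalations


# Caller grant as described for the per-service entrypoints.
caller = {"id-token": "write", "contents": "write", "issues": "write"}
# Hypothetical callee request after the extra job was added.
callee = {"contents": "write", "issues": "write", "pull-requests": "write"}

print(find_escalations(caller, callee))
# → {'pull-requests': ('none', 'write')}
```

Because the comparison needs both the caller file and the callee file, a single-file check cannot see the mismatch.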
Fix
1. Remove the `open-image-tag-bump-pr` job from `deploy-azd.yml`. The 27 HelmRelease YAML re-pins from PR #1097 ("feat(deploy): align Flux HelmReleases with build path + add image-tag PR bridge (#990)") are kept because they are a legitimate ACR-path fix. The proper PR-bridge implementation (Phase 2b) moves to Flux's own components and lives outside the GHA pipeline. See the ADR-017 amendment below for the post-mortem and the next-attempt design.
2. Add a `permissions:` block (`id-token | contents | issues: write`) to `deploy-azd-prod.yml`'s `deploy` job. The `watchdog-apim-agc-swa-drift` job in the callee comments on issue #298 ("Watchdog: monitor APIM/AGC/SWA drift after deploy") when drift is detected, requiring `issues: write`. Prod tag pushes (`v*.*.*`) would have hit the same `startup_failure`. This was a latent bug, fixed in this PR.
3. Add `scripts/ci/lint_workflow_permissions.py`: a static permission-cap linter that diffs each caller's per-job `permissions:` map (with workflow-level fallback) against the union of per-job `permissions:` declared in the callee. It emits `::error file=...::` markers on mismatch. 4 unit tests cover the pass case, the issue #1099 regression, the read-only caller case, and the absent-per-job-permissions case.
4. Add `.github/workflows/lint-actions.yml`: runs actionlint plus the new linter on every workflow PR/push. This prevents this exact class of regression from shipping silently again.
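The `::error file=...::` markers mentioned for the linter are GitHub Actions workflow commands: a line printed in this form becomes an annotation on the named file. A minimal sketch of how such a line might be built follows. The helper name `format_error` and the message wording are hypothetical and not taken from the linter itself.

```python
def format_error(path, message):
    """Build a GitHub Actions `::error` workflow command for one file.

    Percent signs and line breaks in the message are percent-encoded,
    since a workflow command must fit on a single line of output.
    """
    escaped = (
        message.replace("%", "%25").replace("\r", "%0D").replace("\n", "%0A")
    )
    return f"::error file={path}::{escaped}"


# Example annotation for a caller that grants less than the callee requests.
line = format_error(
    ".github/workflows/deploy-azd-dev.yml",
    "callee requests pull-requests: write but caller grants none",
)
print(line)
```

Printing the line to standard output is enough for the runner to pick it up, so a linter can stay a plain script with no GitHub-specific dependencies.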
ADR-017 amendment
`docs/architecture/adrs/adr-017-deployment-strategy.md` § "Phase 2b" now documents the failed first attempt (PR #1097), the cross-file permission-cap defect, the scoped revert, and the next-attempt design (Flux `ImageRepository` + `ImagePolicy` + `ImageUpdateAutomation` + Notification Controller).
Validation
- `python scripts/ci/lint_workflow_permissions.py`
- `actionlint` on all 4 modified workflow files
- `pytest scripts/ci/tests/test_lint_workflow_permissions.py`
- Bisection run (reverting `deploy-azd.yml` to PR #1097's parent)
Risk
Low. The removed job has been a no-op since merge (never produced output). The added `permissions:` block on `deploy-azd-prod.yml` only takes effect on `v*.*.*` tag pushes, which have not happened during the broken window. The linter is additive and fails closed on workflow changes only.
Follow-ups (separate issues)
- Proper Phase 2b via Flux `ImageUpdateAutomation` + Notification Controller PR-bridge per the ADR-017 amendment.
- `apps/inventory-health-check/agent.yaml` needs `template.kind: direct-model` → `hosted-agent` once the dev deploy chain confirms that the hosted mount path from PR #1098 ("feat(framework): Foundry hosted-agent FastAPI-mount pilot + FOUNDRY_STREAM removal (#981)") works in AKS.