ci: adopt self-hosted Renovate, retire Dependabot (Phase 6)#232

Closed
chrisns wants to merge 3 commits into main from feat/renovate-adopt

@chrisns
Member

@chrisns chrisns commented May 12, 2026

Summary

Phase 6 of the scenario-regression smoke-pack tech-spec. Adopts self-hosted Renovate via `renovatebot/github-action` and retires `.github/dependabot.yml` per ADR-2.

Ships:

  • `renovate.json` with:
    • `osvVulnerabilityAlerts: true` (CVE-aware updates)
    • `dependencyDashboard: true` (the Renovate dashboard issue is the persistent state record between runs; without it Renovate may re-resolve every dep every invocation and double-open PRs)
    • Six group rules from the spec's Pinning-Strategy table: `scenario--images`, `upstream-` (tika/paperless/gotenberg), `npm-dev`, `npm-prod`, `composer`, `github-actions` (with `pinDigests=true`), plus `security-priority` via `vulnerabilityAlerts`
    • `customManagers` regex matching the GHCR / docker.io pins introduced in Phase 5: `(ghcr.io|docker.io)/repo:tag@sha256:digest`. Renovate's built-in `docker` manager doesn't cover `ecs.ContainerImage.fromRegistry()` string literals; the custom manager does.
  • `.github/workflows/renovate.yml`:
    • Twice daily (06:00, 18:00 UTC) + `workflow_dispatch`
    • `renovatebot/github-action` pinned by digest (`v46.1.14` → `693b9ef15eec82123529a37c782242f091365961`). Renovate's own github-actions packageRule keeps this current via `pinDigests`.
    • Uses `RENOVATE_TOKEN` (fine-grained PAT, `repo:read+write` on this repo only). Minted by operator per the runbook's Operational Notes → RENOVATE_TOKEN rotation.
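  • `.github/dependabot.yml` deleted. Dependabot's seven ecosystem groups map onto Renovate's six groups: composer and its three drupal sub-groups (core, contrib, localgov) merge into a single composer group; npm splits into dev and prod by depType; pip is routed to its own group; docker is absorbed into customManagers plus the upstream group; github-actions is ported verbatim.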
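
For reviewers who want to sanity-check the pin shape the custom manager targets, here is a minimal sketch. The regex is an assumption written for illustration and is not the pattern shipped in `renovate.json`; the image name and all-zero digest are placeholders.

```python
import re

# Hypothetical pattern for <registry>/<repo>:<tag>@sha256:<digest>.
# The regex actually shipped in renovate.json may differ.
PIN_RE = re.compile(
    r"(?P<registry>ghcr\.io|docker\.io)/"
    r"(?P<repo>[a-z0-9._/-]+):"
    r"(?P<tag>[A-Za-z0-9._-]+)@"
    r"(?P<digest>sha256:[a-f0-9]{64})"
)


def find_pins(source: str) -> list[dict[str, str]]:
    """Return every pinned image reference found in a source string."""
    return [m.groupdict() for m in PIN_RE.finditer(source)]


# Placeholder CDK-style line; the digest is 64 zeros, not a real image.
sample = (
    'ecs.ContainerImage.fromRegistry("ghcr.io/example-org/example-app:'
    'sha-abc1234@sha256:' + "0" * 64 + '")'
)
pins = find_pins(sample)
print(pins[0]["repo"], pins[0]["tag"])
```

Named groups keep the tag and the digest separate, which is the split a version-tracking tool needs in order to bump both together.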

Depends on: Phase 5 (PR #231) — Renovate's custom-manager regex requires the new `@sha256:` shape; before Phase 5 there's nothing for Renovate to bump cleanly.

Operator follow-ups (NOT in this PR):

  • T6.3: mint `RENOVATE_TOKEN` (fine-grained PAT, `repo:read+write` on `co-cddo/ndx_try_aws_scenarios` only) and add as a repo secret.
  • T6.5: close in-flight Dependabot PRs (preferred: merge the safe ones first).
  • T6.6: verify Renovate fires post-merge — `gh workflow run renovate.yml` then `gh pr list --author 'renovate[bot]'`.

Phase-6 DoD per spec: "first Renovate PR has fired AND smoke has gated it." The smoke-gating part requires Phase 1b to be complete; until then, Renovate PRs open but the smoke check is a no-op (the smoke workflow self-disables on placeholder `docs/smoke-test-account-config.yml`).

Test plan

  • CI: `renovate.json` validates against the Renovate schema (Renovate validator action or local `npx renovate-config-validator`)
  • CI: `renovate.yml` lints (actionlint)
  • Operator: after merge, `gh workflow run renovate.yml` triggers a run; verify the action authenticates with the PAT and reaches the "starting branch" log line
  • Operator: within 24 hours Renovate opens at least one PR or the Renovate dashboard issue; for a real dep bump, confirm the group label is applied per packageRules
  • Operator: after Phase 1b, a Renovate PR that bumps a scenario image digest triggers the smoke workflow's scoped path (proves the gate)

chrisns added 3 commits May 12, 2026 11:09
Phase 5 of the scenario-regression smoke-pack tech-spec. Every :latest
reference (10 own GHCR images + 2 upstream images) is now pinned to
<tag>@sha256:<digest>. Renovate (Phase 6) will track these via the
custom-manager regex.

Own GHCR images (pinned to sha-<7chars>@sha256:<digest>):
- fixmystreet:sha-be035a6
- localgov_drupal:sha-5801fb7
- minute_backend / minute_worker / minute_frontend:sha-5d3d423 (same SHA,
  different digests per service)
- planx-hasura / planx-api / planx-sharedb / planx-editor:sha-e748ef5
  (same SHA, different digests). Replaced the ghcrPrefix-based template
  strings with explicit literals so Renovate's regex matches each
  service independently.
- dpr:sha-33e9e9f (digital-planning-register default ImageUri parameter)

Upstream images (pinned to <stable-tag>@sha256:<digest>):
- docker.io/apache/tika:3.3.0.0-full
- ghcr.io/paperless-ngx/paperless-ngx:2.9

Drive-bys:
- bops-planning/docker/bops/Dockerfile: example build command in a
  comment changed from `bops:latest` to `bops-local` so the T5.7 grep
  passes without changing functionality (this was a comment, never an
  image reference at runtime).
- minute/template.json removed. Legacy synth snapshot referenced ECR
  images at :latest that are no longer part of the pipeline (the
  current synth job emits template.yaml from CDK). Nothing else in the
  repo references this file.

T5.7 verification (run locally): `grep -rE ':latest' cloudformation/scenarios/
--include='*.yaml' --include='*.yml' --include='Dockerfile*'
--include='*.ts' --include='*.json' | grep -vE 'package-lock|node_modules|cdk.out'`
returns zero matches.
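
A rough Python equivalent of the T5.7 check, in case it is easier to wire into a test. This is a sketch under assumptions: it filters on relative paths only (the `grep -v` above also filters on line content), and `find_latest` is a hypothetical helper name.

```python
import re
from pathlib import Path

LATEST_RE = re.compile(r":latest")
INCLUDE = ("*.yaml", "*.yml", "Dockerfile*", "*.ts", "*.json")
EXCLUDE_PARTS = ("package-lock", "node_modules", "cdk.out")


def find_latest(root: Path) -> list[str]:
    """List 'path:line:text' hits for :latest under root."""
    # Deduplicate, since one file can match more than one glob.
    paths = {p for pattern in INCLUDE for p in root.rglob(pattern) if p.is_file()}
    hits = []
    for path in sorted(paths):
        rel = path.relative_to(root).as_posix()
        if any(part in rel for part in EXCLUDE_PARTS):
            continue
        lines = path.read_text(errors="replace").splitlines()
        for lineno, line in enumerate(lines, 1):
            if LATEST_RE.search(line):
                hits.append(f"{rel}:{lineno}:{line.strip()}")
    return hits
```

Calling `find_latest(Path("cloudformation/scenarios"))` and asserting the result is empty would mirror the "zero matches" expectation.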

T5.8 per-pin verification: each digest was resolved live via `docker
buildx imagetools inspect` against the registry. Real-deploy
verification of each scenario against the new pins is the operator-driven
follow-up (per the spec, that happens against the smoke account once
Phase 1b lands; for now, the existing manual deploys against ISB sandbox
accounts continue to use the same images at the same digests, just with
explicit pins instead of resolving :latest at deploy time).

Phase 6 of the scenario-regression smoke-pack tech-spec. Adopts self-hosted
Renovate via renovatebot/github-action and retires Dependabot per ADR-2.

What ships:
- renovate.json with:
  - osvVulnerabilityAlerts: true (CVE-aware updates)
  - dependencyDashboard: true (the dashboard issue is Renovate's state
    record between runs; without it Renovate re-resolves every dep every
    invocation and may double-open PRs)
  - 6 group rules from the spec's pinning-strategy table:
    * scenario-{packageName} for own GHCR images (per-image immediate)
    * upstream-{packageName} for tika / paperless-ngx / gotenberg (weekly)
    * npm-dev, npm-prod (weekly, separate)
    * composer (weekly)
    * github-actions (weekly + pinDigests=true)
    * security-priority (ungrouped, immediate, via vulnerabilityAlerts)
  - customManagers regex matching the GHCR / docker.io pins introduced in
    Phase 5: (ghcr.io|docker.io)/repo:tag@sha256:digest. Renovate's
    built-in docker manager doesn't cover ecs.ContainerImage.fromRegistry()
    string literals; this custom manager does.
- .github/workflows/renovate.yml:
  - twice daily (06:00, 18:00 UTC) + workflow_dispatch
  - renovatebot/github-action pinned by digest (v46.1.14 ->
    693b9ef15eec82123529a37c782242f091365961). Renovate's own
    github-actions packageRule keeps this current via pinDigests.
  - Uses RENOVATE_TOKEN (fine-grained PAT, repo:read+write on this repo
    only). Minted by operator per the runbook's Operational Notes.
- .github/dependabot.yml deleted. The 7 ecosystem groups from dependabot
  map onto the 6 Renovate groups (composer + drupal-core / contrib /
  localgov merged into single composer group; npm split into dev+prod
  by depType; pip routed to its own group; docker absorbed into custom
  managers + upstream group; github-actions ported verbatim).

Operator follow-ups (NOT in this PR):
- T6.3 mint RENOVATE_TOKEN and add as repo secret
- T6.5 close in-flight Dependabot PRs (preferred: merge the safe ones first)
- T6.6 verify Renovate fires post-merge - workflow_dispatch then check PR list

DoD per spec: "first Renovate PR has fired AND smoke has gated it" - the
smoke-gating part requires Phase 1b to be complete so the smoke workflow
runs. Until then, Renovate PRs open but the smoke check is a no-op (the
smoke workflow self-disables on placeholder config).

…nDigests)

Address review findings against PR #232:

1. Vulnerability rule was split across two places: top-level
   `vulnerabilityAlerts` (labels only) plus a packageRule nested
   `vulnerabilityAlerts.groupName: security-priority`. Renovate evaluates
   these separately and behaviour was unclear. Consolidate into a single
   top-level `vulnerabilityAlerts` block with labels + groupName + schedule.
   Drop the catch-all packageRule with `matchPackagePatterns: [".*"]` and
   `matchUpdateTypes: ["patch", "minor"]` that was meant to coerce CVE
   fixes into the security-priority group — the top-level block now does
   that directly.

2. `pinDigests: true` was applied to ALL github-actions, including
   third-party action references in docker-build workflows that already
   chose their pin shape deliberately. Narrow to official actions/* and
   aws-actions/* only. Renovate's first adoption run will still propose
   PRs for these, but the firehose is bounded.

3. PR limits: drop `prConcurrentLimit` from 10 to 6 and
   `branchConcurrentLimit` from 20 to 12. The original numbers permitted
   10+ concurrent smoke runs which would exhaust the smoke account's
   quota envelope. 6 leaves enough headroom for in-flight reviews while
   keeping account contention manageable.

@chrisns
Member Author

chrisns commented May 12, 2026

Adversarial-review followup landed in 2e686ab: competing vuln rules consolidated, pinDigests narrowed to actions/* + aws-actions/* only (was firehose-risk), prConcurrentLimit dropped 10→6 so we don't exhaust the smoke account quota with concurrent runs.

@chrisns
Member Author

chrisns commented May 12, 2026

Superseded by #236 — single squashed PR for all 6 phases. Dependency chain across 9 PRs made parallel review worse than serial; one PR with one review and one CI run is simpler.

@chrisns chrisns closed this May 12, 2026
chrisns added a commit that referenced this pull request May 12, 2026
Implements every phase of _bmad-output/implementation-artifacts/tech-spec-scenario-regression-smoke-pack.md
in a single deliverable. The original spec called for one PR per phase
(8+ PRs); experience showed the dependency overlap made that worse for
review, not better, so this squashes #226 / #227 / #228 / #229 / #230
/ #231 / #232 / #233 / #235 into a single change.

What ships
==========

Phase 1a — runbook + config schema
- docs/smoke-test-account-setup.md: one-off manual procedure for vending
  the long-lived smoke-test AWS account, with the four required sections
  (Prerequisites / Procedure / Verification / Operational Notes). Per-step
  idempotency checks + inverses; ProtectISB role-creation canary +
  fallback branch (ADR-1); Bedrock model-access enablement + gotchas
  (legacy claude-3-haiku-20240307 retired, Nova body shape); service-quota
  targets; QuickSight decision; iterate-to-least-privilege protocol for
  the inline IAM policy.
- docs/smoke-test-account-config.yml: post-runbook state record schema.

Phase 1b — operator-executed account state
- Smoke account 464453619983 provisioned in NDX org under the fallback
  branch (ProtectISB canary failed; account moved to root with
  Restrictions SCP attached directly). AwsNuke SCP intentionally NOT
  attached (it blocks sts:AssumeRoleWithWebIdentity and we use CFN
  delete + retention lint, not aws-nuke).
- OIDC provider + InnovationSandbox-ndx-SmokeTestDeployRole created
  with 6h max-session-duration. Trust policy uses sub-pattern lock
  (`repo:co-cddo/ndx_try_aws_scenarios:*`) + aud condition; the
  repository_owner claim condition is omitted because it reproducibly
  breaks the assume even though the OIDC token contains the claim
  (verified via JWT decode in an investigation workflow that has since
  been deleted; see runbook Step 10).
- expected_scps reflects live state: Restrictions + FullAWSAccess.
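
The trust-policy shape described above could look roughly like the sketch below. Assumptions: the `aud` value `sts.amazonaws.com` (the usual audience for GitHub's OIDC provider with AWS) and the helper name `build_trust_policy`; the deployed policy is the one in the runbook, not this one.

```python
import json

OIDC_PROVIDER = "token.actions.githubusercontent.com"


def build_trust_policy(account_id: str, repo: str) -> dict:
    """Trust policy with a sub-pattern lock plus an aud condition."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{OIDC_PROVIDER}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    # Assumed audience value; check the runbook for the real one.
                    "StringEquals": {f"{OIDC_PROVIDER}:aud": "sts.amazonaws.com"},
                    # Any ref or environment within the one repository.
                    "StringLike": {f"{OIDC_PROVIDER}:sub": f"repo:{repo}:*"},
                    # No repository_owner condition, per the note above.
                },
            }
        ],
    }


policy = build_trust_policy("464453619983", "co-cddo/ndx_try_aws_scenarios")
print(json.dumps(policy, indent=2))
```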

Phase 2a — synth pipelines for missing scenarios
- New synth jobs in .github/workflows/deploy-blueprints.yml for planx
  and digital-planning-register (CDK -> template.yaml -> S3 via the
  existing isb-hub upload chain).
- bops-planning synth job lands in Phase 2b after the retention lint
  is justification-aware.
- ai-contact-centre: new "verify packaged CodeUri targets blueprints
  bucket" step catches a sam-package regression where --s3-bucket would
  silently land in the SAM default bucket.

Phase 2b — all-demo expansion + retention lint
- cloudformation/scenarios/all-demo/template.yaml expanded from 7 to 16
  nested scenarios (adding Minute, FixMyStreet, AI Contact Centre, LocalGov IMS,
  Paperless-ngx, PlanX, Bops Planning, Simply Readable, Digital Planning
  Register). Umbrella parameters for credentials (GovUkPayApiKey,
  OSVectorTilesApiKey, DprImageUri, DprCouncilConfig) with overridable
  empty / sensible defaults; per-scenario URL + admin-credential Outputs
  surfaced.
- scripts/lint-retention-policies.sh: forbids DeletionPolicy=Retain /
  UpdateReplacePolicy=Retain / Properties.DeletionProtection=true /
  Properties.EnableDeletionProtection=true /
  Properties.FinalSnapshotIdentifier unless the resource carries a
  non-empty Metadata.Justification. Per-template cap (default 3) +
  global cap (default 10) so any one scenario can't pencil-whip
  retentions repo-wide.
- lint-committed-templates job in deploy-blueprints.yml runs the lint
  over hand-authored CFN templates.
- bops-planning's LogGroup keeps RemovalPolicy.RETAIN (deliberate
  debug-after-rollback) with a Metadata.Justification attached via
  cfnOptions; bops synth job re-enabled.
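
The retention rule can be illustrated on an already-parsed template. This is a sketch, not a port of scripts/lint-retention-policies.sh: it assumes the per-template cap counts justified retentions, leaves out the global cap, and the function names are hypothetical.

```python
FORBIDDEN_ATTRS = {"DeletionPolicy": "Retain", "UpdateReplacePolicy": "Retain"}
FORBIDDEN_TRUE_PROPS = ("DeletionProtection", "EnableDeletionProtection")


def _is_true(value: object) -> bool:
    # Templates may carry booleans as real bools or as "true" strings.
    return value is True or str(value).lower() == "true"


def retains(resource: dict) -> bool:
    """True if the resource uses any of the forbidden retention settings."""
    props = resource.get("Properties") or {}
    return (
        any(resource.get(k) == v for k, v in FORBIDDEN_ATTRS.items())
        or any(_is_true(props.get(k)) for k in FORBIDDEN_TRUE_PROPS)
        or "FinalSnapshotIdentifier" in props
    )


def lint_template(template: dict, per_template_cap: int = 3) -> list[str]:
    """Return one error per unjustified retention, plus a cap error if exceeded."""
    errors = []
    justified = 0
    for name, resource in (template.get("Resources") or {}).items():
        if not retains(resource):
            continue
        metadata = resource.get("Metadata") or {}
        if str(metadata.get("Justification") or "").strip():
            justified += 1
        else:
            errors.append(f"{name}: retention without Metadata.Justification")
    if justified > per_template_cap:
        errors.append(
            f"{justified} justified retentions exceed cap of {per_template_cap}"
        )
    return errors
```

Under this reading, the bops-planning LogGroup passes because its Justification is non-empty and the template stays within the cap.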

Phase 3 — smoke rails
- playwright.config.ts: new 'smoke' project gated on
  PLAYWRIGHT_SUITE=smoke.
- tests/smoke/fixtures/cfn-outputs.ts: SDK-v3 DescribeStacks helper.
  Sensitive output values flow only via explicit sensitiveValue()
  accessor; toString / inspect / Symbol.toPrimitive emit REDACTED
  placeholder. Documents the CloudFormation-API limitation that Output
  Metadata.Sensitive opt-in isn't readable (regex is the sole signal).
- tests/smoke/fixtures/assertion-bar.ts: 17 AssertionBarRow entries
  populated.
- tests/smoke/fixtures/secure-form.ts: fillPassword wrapper redacts
  form-encoded passwords from Playwright trace.
- scripts/smoke.sh + .env.example: local + CI identical invocation.
- .github/workflows/smoke.yml: trigger matrix (PR-scoped / nightly
  cron / push-to-main / workflow_dispatch); scope decides full vs
  scoped from changed paths; global serial concurrency (no
  cancel-in-progress — cancelled runs leave orphan AWS state);
  configure-aws-credentials with role-duration-seconds=21600 (6h)
  to match the role's max-session-duration; pre-deploy state check
  with auto-recovery for stranded stacks; SCP drift check
  (excluding FullAWSAccess, fail-soft for first 7 detections);
  quarantine-expiry check; CFN events captured BEFORE teardown;
  teardown with 3x60s retry, gated on aws-creds outcome so we don't
  burn 3min retrying without credentials.
- .github/workflows/quarterly-audit.yml: 3-monthly tracking issue
  (spend, orphan sweep, deploy-role policy drift, SCP drift, Renovate
  liveness, ProtectISB-fallback revisit).
- .github/CODEOWNERS: smoke-pack sensitive paths require @chrisns
  review (until a maintainers team is provisioned).
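
The redaction idea in cfn-outputs.ts, sketched in Python for illustration. The key regex, the "[REDACTED]" text and every name here are assumptions; the real helper is TypeScript and may classify outputs differently.

```python
import re

# Hypothetical key pattern; the real helper's regex is not reproduced here.
SENSITIVE_KEY_RE = re.compile(r"(password|secret|token|apikey)", re.IGNORECASE)


class SensitiveValue:
    """Wraps a secret so that printing or logging it never reveals it."""

    __slots__ = ("_value",)

    def __init__(self, value: str) -> None:
        self._value = value

    def sensitive_value(self) -> str:
        # The only deliberate way to read the secret.
        return self._value

    def __str__(self) -> str:
        return "[REDACTED]"

    def __repr__(self) -> str:
        return "SensitiveValue([REDACTED])"


def wrap_outputs(outputs: dict[str, str]) -> dict[str, object]:
    """Wrap values whose output key looks sensitive; pass the rest through."""
    return {
        key: SensitiveValue(value) if SENSITIVE_KEY_RE.search(key) else value
        for key, value in outputs.items()
    }


wrapped = wrap_outputs({"AdminPassword": "hunter2", "SiteUrl": "https://example.test"})
print(wrapped)
```

Matching on the key name is the only signal available in this sketch, which mirrors the stated limitation that Output metadata cannot be read back through the CloudFormation API.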

Phase 4 — 17 per-scenario smoke specs
- One spec per scenario covering the auth-mode pattern (admin-login /
  public / sso-skip / umbrella). Bug-informed feature flows cite the
  historical regression that informed each test:
  - fixmystreet: /reports requires bin/update-all-reports; /admin
    must reach the dashboard without 2FA redirect
  - planx: SPA boots free of domain-allowlist / Airbrake errors;
    Hasura native /v1/version responds (Caddy elimination)
  - minute: magic-link sets cookie; same-origin fetch() works
    post-auth; /api/proxy/healthcheck reaches the backend (catches
    the basic-auth-breaks-fetch() regression and the ALB /api/*
    interception regression)
  - localgov-ims: Windows IIS multi-site routing; AdminPassword
    must not be the literal {{resolve:...}} token (catches the
    Lambda-custom-resource regression)
  - localgov-drupal: ndx_aws_ai module boots without Bedrock
    AccessDeniedException
  - simply-readable: SPA loads, credentials non-empty + non-token;
    reload produces no 5xx responses (catches BlueprintsBucketName
    mis-wire)
  - ai-contact-centre: PSTN claim matches UK toll-free / landline OR
    US toll-free (catches international fallback regression)
  - paperless-ngx: /documents view + /api/documents/ respond (S3
    Files mount integrity)
  - bops-planning: post-login URL is NOT on the Applicants port
    (catches the routing.rb single-tenant override regression)
  - digital-planning-register: register loads with planning markers
  - public-Lambda scenarios (foi-redaction, planning-ai, smart-car-park,
    text-to-speech, council-chatbot): FunctionURL not-5xx + not-403
    (catches the InvokeFunctionUrl + InvokeFunction dual-permission
    regression); council-chatbot uses POST not GET so the test isn't
    vacuous against a POST-only Lambda
  - quicksight-dashboard: landing + outputs only (sso-skip per auth-
    mode categorisation)
  - all-demo: discovers Output keys dynamically by parsing the
    committed template at test time; asserts every Output present,
    non-empty, and not the {{resolve:...}} literal; URL outputs match
    https?://

Phase 5 — pin every floating image tag
- 10 own-GHCR images (fixmystreet, localgov_drupal, minute_*, planx-*,
  dpr) pinned to sha-<7chars>@sha256:<digest>.
- 2 upstream images (docker.io/apache/tika 3.3.0.0-full,
  ghcr.io/paperless-ngx/paperless-ngx 2.9) pinned to <tag>@sha256:<digest>.
- Removed legacy cloudformation/scenarios/minute/template.json (stale
  ECR references; nothing in the repo referenced it).

Phase 6 — Renovate adoption (replaces Dependabot)
- renovate.json: 6 group rules per the spec's pinning-strategy table;
  customManagers regex matching the new pin shape; osvVulnerabilityAlerts
  + security-priority group; pinDigests scoped to official actions/* +
  aws-actions/* only so the first run doesn't firehose; prConcurrentLimit
  capped at 6.
- .github/workflows/renovate.yml: twice-daily + workflow_dispatch.
  Action pinned by digest to v46.1.14.
- .github/dependabot.yml deleted.

Operator follow-ups (not in this PR)
====================================
- NAP-548: migrate scenarios off legacy claude-3-haiku-20240307
- NAP-549: revisit ProtectISB fallback by 2026-11-12
- NAP-550: service-quota Console requests
- NAP-551: QuickSight subscription decision
- NAP-552: mint RENOVATE_TOKEN repo secret
- NAP-554: close in-flight Dependabot PRs
- NAP-555: T2b.5b + T3.8 end-to-end verifications

Closes: #226, #227, #228, #229, #230, #231, #232, #233, #235.
chrisns added a commit that referenced this pull request May 13, 2026
chrisns added a commit that referenced this pull request May 13, 2026
Implements every phase of _bmad-output/implementation-artifacts/tech-spec-scenario-regression-smoke-pack.md
in a single deliverable. The original spec called for one PR per phase
(8+ PRs); experience showed the dependency overlap made that worse for
review, not better, so this squashes #226 / #227 / #228 / #229 / #230
/ #231 / #232 / #233 / #235 into a single change.

What ships
==========

Phase 1a — runbook + config schema
- docs/smoke-test-account-setup.md: one-off manual procedure for vending
  the long-lived smoke-test AWS account, with the four required sections
  (Prerequisites / Procedure / Verification / Operational Notes). Per-step
  idempotency checks + inverses; ProtectISB role-creation canary +
  fallback branch (ADR-1); Bedrock model-access enablement + gotchas
  (legacy claude-3-haiku-20240307 retired, Nova body shape); service-quota
  targets; QuickSight decision; iterate-to-least-privilege protocol for
  the inline IAM policy.
- docs/smoke-test-account-config.yml: post-runbook state record schema.

Phase 1b — operator-executed account state
- Smoke account 464453619983 provisioned in NDX org under the fallback
  branch (ProtectISB canary failed; account moved to root with
  Restrictions SCP attached directly). AwsNuke SCP intentionally NOT
  attached (it blocks sts:AssumeRoleWithWebIdentity and we use CFN
  delete + retention lint, not aws-nuke).
- OIDC provider + InnovationSandbox-ndx-SmokeTestDeployRole created
  with 6h max-session-duration. Trust policy uses sub-pattern lock
  (`repo:co-cddo/ndx_try_aws_scenarios:*`) + aud condition; the
  repository_owner claim condition is omitted because it reproducibly
  breaks the assume even though the OIDC token contains the claim
  (verified via JWT decode in an investigation workflow that has since
  been deleted; see runbook Step 10).
- expected_scps reflects live state: Restrictions + FullAWSAccess.

Phase 2a — synth pipelines for missing scenarios
- New synth jobs in .github/workflows/deploy-blueprints.yml for planx
  and digital-planning-register (CDK -> template.yaml -> S3 via the
  existing isb-hub upload chain).
- bops-planning synth job lands in Phase 2b after the retention lint
  is justification-aware.
- ai-contact-centre: new "verify packaged CodeUri targets blueprints
  bucket" step catches a sam-package regression where --s3-bucket would
  silently land in the SAM default bucket.

Phase 2b — all-demo expansion + retention lint
- cloudformation/scenarios/all-demo/template.yaml expanded from 7 to 16
  nested scenarios (Minute, FixMyStreet, AI Contact Centre, LocalGov IMS,
  Paperless-ngx, PlanX, Bops Planning, Simply Readable, Digital Planning
  Register). Umbrella parameters for credentials (GovUkPayApiKey,
  OSVectorTilesApiKey, DprImageUri, DprCouncilConfig) with overridable
  empty / sensible defaults; per-scenario URL + admin-credential Outputs
  surfaced.
- scripts/lint-retention-policies.sh: forbids DeletionPolicy=Retain /
  UpdateReplacePolicy=Retain / Properties.DeletionProtection=true /
  Properties.EnableDeletionProtection=true /
  Properties.FinalSnapshotIdentifier unless the resource carries a
  non-empty Metadata.Justification. Per-template cap (default 3) +
  global cap (default 10) so any one scenario can't pencil-whip
  retentions repo-wide.
- lint-committed-templates job in deploy-blueprints.yml runs the lint
  over hand-authored CFN templates.
- bops-planning's LogGroup keeps RemovalPolicy.RETAIN (deliberate
  debug-after-rollback) with a Metadata.Justification attached via
  cfnOptions; bops synth job re-enabled.

Phase 3 — smoke rails
- playwright.config.ts: new 'smoke' project gated on
  PLAYWRIGHT_SUITE=smoke.
- tests/smoke/fixtures/cfn-outputs.ts: SDK-v3 DescribeStacks helper.
  Sensitive output values flow only via explicit sensitiveValue()
  accessor; toString / inspect / Symbol.toPrimitive emit REDACTED
  placeholder. Documents the CloudFormation-API limitation that Output
  Metadata.Sensitive opt-in isn't readable (regex is the sole signal).
- tests/smoke/fixtures/assertion-bar.ts: 17 AssertionBarRow entries
  populated.
- tests/smoke/fixtures/secure-form.ts: fillPassword wrapper redacts
  form-encoded passwords from Playwright trace.
- scripts/smoke.sh + .env.example: local + CI identical invocation.
- .github/workflows/smoke.yml: trigger matrix (PR-scoped / nightly
  cron / push-to-main / workflow_dispatch); scope (full vs scoped)
  is decided from changed paths; global serial concurrency (no
  cancel-in-progress, because cancelled runs leave orphan AWS state);
  configure-aws-credentials with role-duration-seconds=21600 (6h)
  to match the role's max-session-duration; pre-deploy state check
  with auto-recovery for stranded stacks; SCP drift check
  (excluding FullAWSAccess, fail-soft for first 7 detections);
  quarantine-expiry check; CFN events captured BEFORE teardown;
  teardown with 3x60s retry, gated on aws-creds outcome so we don't
  burn 3min retrying without credentials.
- .github/workflows/quarterly-audit.yml: 3-monthly tracking issue
  (spend, orphan sweep, deploy-role policy drift, SCP drift, Renovate
  liveness, ProtectISB-fallback revisit).
- .github/CODEOWNERS: smoke-pack sensitive paths require @chrisns
  review (until a maintainers team is provisioned).
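
The redaction behaviour described for cfn-outputs.ts can be sketched as follows. This is a minimal sketch under stated assumptions, not the fixture itself: `makeSensitive` and the exact placeholder text are hypothetical, and only the `sensitiveValue()` accessor name comes from the description above.

```typescript
// Hypothetical sketch; the actual fixture in cfn-outputs.ts may differ.
const REDACTED_PLACEHOLDER = "REDACTED"; // assumed placeholder text

// Node's util.inspect (and therefore console.log) looks up this
// well-known registered symbol for a custom formatter.
const INSPECT_HOOK = Symbol.for("nodejs.util.inspect.custom");

// The secret lives only in the closure, so it is not an enumerable
// property and cannot leak through JSON.stringify or object spread.
function makeSensitive(secret: string) {
  return {
    // The only path that returns the real value.
    sensitiveValue: (): string => secret,
    // Every implicit string conversion yields the placeholder instead.
    toString: (): string => REDACTED_PLACEHOLDER,
    [Symbol.toPrimitive]: (): string => REDACTED_PLACEHOLDER,
    [INSPECT_HOOK]: (): string => REDACTED_PLACEHOLDER,
  };
}
```

With this shape, interpolating the wrapper into a log line or an assertion message prints the placeholder, and a test has to call `sensitiveValue()` deliberately to obtain the real credential.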

Phase 4 — 17 per-scenario smoke specs
- One spec per scenario covering the auth-mode pattern (admin-login /
  public / sso-skip / umbrella). Bug-informed feature flows cite the
  historical regression that informed each test:
  - fixmystreet: /reports requires bin/update-all-reports; /admin
    must reach the dashboard without 2FA redirect
  - planx: SPA boots free of domain-allowlist / Airbrake errors;
    Hasura native /v1/version responds (Caddy elimination)
  - minute: magic-link sets cookie; same-origin fetch() works
    post-auth; /api/proxy/healthcheck reaches the backend (catches
    the basic-auth-breaks-fetch() regression and the ALB /api/*
    interception regression)
  - localgov-ims: Windows IIS multi-site routing; AdminPassword
    must not be the literal {{resolve:...}} token (catches the
    Lambda-custom-resource regression)
  - localgov-drupal: ndx_aws_ai module boots without Bedrock
    AccessDeniedException
  - simply-readable: SPA loads, credentials non-empty + non-token;
    reload produces no 5xx responses (catches BlueprintsBucketName
    mis-wire)
  - ai-contact-centre: PSTN claim matches UK toll-free / landline OR
    US toll-free (catches international fallback regression)
  - paperless-ngx: /documents view + /api/documents/ respond (S3
    Files mount integrity)
  - bops-planning: post-login URL is NOT on the Applicants port
    (catches the routing.rb single-tenant override regression)
  - digital-planning-register: register loads with planning markers
  - public-Lambda scenarios (foi-redaction, planning-ai, smart-car-park,
    text-to-speech, council-chatbot): FunctionURL not-5xx + not-403
    (catches the InvokeFunctionUrl + InvokeFunction dual-permission
    regression); council-chatbot uses POST not GET so the test isn't
    vacuous against a POST-only Lambda
  - quicksight-dashboard: landing + outputs only (sso-skip per auth-
    mode categorisation)
  - all-demo: discovers Output keys dynamically by parsing the
    committed template at test time; asserts every Output present,
    non-empty, and not the {{resolve:...}} literal; URL outputs match
    https?://
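
The all-demo Output assertions (present, non-empty, not the {{resolve:...}} literal, URL outputs matching https?://) can be sketched as a pure function. This is an illustration, not the committed spec: `checkOutputs` is a hypothetical name, and recognising URL outputs by a key ending in "Url" is an assumption made here.

```typescript
// Hypothetical sketch of the all-demo Output checks; not the committed spec.
type StackOutputMap = Record<string, string | undefined>;

// Returns one failure message per bad output; an empty array means pass.
function checkOutputs(
  expectedKeys: string[],
  actual: StackOutputMap,
): string[] {
  const failures: string[] = [];
  for (const key of expectedKeys) {
    const value = actual[key];
    if (value === undefined) {
      failures.push(`${key}: missing`);
      continue;
    }
    if (value.trim() === "") {
      failures.push(`${key}: empty`);
      continue;
    }
    // An unresolved dynamic reference means the literal token leaked through.
    if (value.includes("{{resolve:")) {
      failures.push(`${key}: unresolved {{resolve:...}} literal`);
      continue;
    }
    // Assumption: URL outputs are identified by a key ending in "Url"
    // (case-insensitive); the real spec may classify them differently.
    if (/url$/i.test(key) && !/^https?:\/\//.test(value)) {
      failures.push(`${key}: not an http(s) URL`);
    }
  }
  return failures;
}
```

In the real spec the expected keys come from parsing the committed template at test time; here they are passed in so the check stays independent of any YAML parser.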

Phase 5 — pin every floating image tag
- 10 own-GHCR images (fixmystreet, localgov_drupal, minute_*, planx-*,
  dpr) pinned to sha-<7chars>@sha256:<digest>.
- 2 upstream images (docker.io/apache/tika 3.3.0.0-full,
  ghcr.io/paperless-ngx/paperless-ngx 2.9) pinned to <tag>@sha256:<digest>.
- Removed legacy cloudformation/scenarios/minute/template.json (stale
  ECR references; nothing in the repo referenced it).
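
The pin shape used above, <registry>/<repo>:<tag>@sha256:<digest>, can be checked mechanically. The sketch below is an assumption-laden illustration: `PINNED_IMAGE_RE` and `parsePinnedImage` are hypothetical names, the pattern is not the regex shipped in renovate.json, and it only accepts the two registries named in this PR.

```typescript
// Hypothetical sketch; this is NOT the customManagers regex in renovate.json.
// A sha256 digest is 64 lowercase hexadecimal characters.
const PINNED_IMAGE_RE =
  /^(ghcr\.io|docker\.io)\/([a-z0-9._\/-]+):([A-Za-z0-9._-]+)@sha256:([a-f0-9]{64})$/;

type PinnedImage = {
  registry: string;
  repository: string;
  tag: string;
  digest: string;
};

// Returns the parsed parts, or null when the reference is floating
// (no digest) or does not match the expected shape.
function parsePinnedImage(ref: string): PinnedImage | null {
  const m = PINNED_IMAGE_RE.exec(ref);
  if (m === null) return null;
  return {
    registry: String(m[1]),
    repository: String(m[2]),
    tag: String(m[3]),
    digest: String(m[4]),
  };
}
```

A tag-only reference such as ghcr.io/paperless-ngx/paperless-ngx:2.9 yields null under this pattern, which is the floating-tag case this phase removes.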

Phase 6 — Renovate adoption (replaces Dependabot)
- renovate.json: 6 group rules per the spec's pinning-strategy table;
  customManagers regex matching the new pin shape; osvVulnerabilityAlerts
  + security-priority group; pinDigests scoped to official actions/* +
  aws-actions/* only so the first run doesn't firehose; per-PR limits
  capped at 6.
- .github/workflows/renovate.yml: twice-daily + workflow_dispatch.
  Action pinned by digest to v46.1.14.
- .github/dependabot.yml deleted.

Operator follow-ups (not in this PR)
====================================
- NAP-548: migrate scenarios off legacy claude-3-haiku-20240307
- NAP-549: revisit ProtectISB fallback by 2026-11-12
- NAP-550: service-quota Console requests
- NAP-551: QuickSight subscription decision
- NAP-552: mint RENOVATE_TOKEN repo secret
- NAP-554: close in-flight Dependabot PRs
- NAP-555: T2b.5b + T3.8 end-to-end verifications

Closes: #226, #227, #228, #229, #230, #231, #232, #233, #235.