ci: smoke-pack rails (Phase 3) — workflow, fixtures, entrypoint, CODEOWNERS#229

Closed
chrisns wants to merge 6 commits into main from feat/smoke-rails

chrisns (Member) commented May 12, 2026

Summary

Phase 3 of the scenario-regression smoke-pack tech-spec. Lays the rails so Phase 4 can ship one PR per scenario without re-deriving the workflow shape.

Ships:

  • `playwright.config.ts`: adds the `smoke` project gated on `PLAYWRIGHT_SUITE=smoke`. No webServer, no baseURL, trace=retain-on-failure. Existing desktop/mobile projects unchanged.
  • `tests/smoke/fixtures/cfn-outputs.ts`: SDK-v3 DescribeStacks helper implementing the secret-redaction contract. Sensitive output values only flow via an explicit `sensitiveValue()` accessor; `toString`/`inspect`/`Symbol.toPrimitive` emit a REDACTED placeholder.
  • `tests/smoke/fixtures/assertion-bar.ts`: `AssertionBarRow` type + empty Map. Phase 4 PRs populate one row each, citing the historical regression that informed `featureFlow`.
  • `tests/smoke/fixtures/secure-form.ts`: `fillPassword(page, selector, value)` wraps `page.fill` so the Playwright trace records a `REDACTED-<sha>` hash instead of the cleartext form-encoded password.
  • `scripts/smoke.sh`: identical-invocation contract for local + CI; required env vars asserted at top.
  • `.env.example`: documents the smoke env vars.
  • `package.json`: adds `test:smoke`.
  • `.github/workflows/smoke.yml`: the smoke workflow. Trigger matrix (PR-scoped, nightly cron, push-to-main, workflow_dispatch); scope job that flips full vs scoped based on changed paths; OIDC role assumption via the ARN committed in `docs/smoke-test-account-config.yml`; deployment-environment gate (`smoke-test-deploy`); SCP drift check with issue-body counter + fail-at-7 escalation (AC3.6 / AC3.6b); pre-deploy state check with auto-recovery (AC3.7b / AC3.13); quarantine-expiry check parsing `assertion-bar.ts`; CFN-events capture into the artefact bundle; artefact upload BEFORE teardown (AC3.9); teardown with 3×60s retry then `stranded-stack` issue (AC3.7); cron-only `smoke-failed` issue (AC3.5).
  • `.github/CODEOWNERS`: requires review on the smoke-sensitive paths so a PR cannot run with deploy credentials until a CODEOWNERS reviewer approves the deployment environment (closes the in-repo-contributor exfiltration attack).
  • `.github/workflows/quarterly-audit.yml`: opens a tracking issue every 3 months covering the six audit items in the runbook (spend, orphan sweep, deploy-role policy drift, SCP drift, Renovate liveness, ProtectISB-fallback revisit). Plus a daily auto-escalation step that nudges any quarterly-audit issue open >30 days.
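
The secret-redaction contract described for `cfn-outputs.ts` can be illustrated with a minimal sketch. This is not the shipped fixture: the names `wrapOutput`, `makeSensitive`, and the `[REDACTED]` placeholder text are assumptions, and the key pattern is built from the key names listed in the Phase 3 commit message (case-insensitive matching is also an assumption).

```typescript
// Hypothetical sketch of the redaction contract; the real cfn-outputs.ts may differ.
// Key names taken from the Phase 3 commit message; case-insensitivity is assumed.
const SENSITIVE_KEY =
  /(Password|Secret|Token|Credentials|Creds|Login|ApiKey|ConnectionString|PrivateKey|Passphrase)/i;

// Placeholder text is an assumption; the source only says "a REDACTED placeholder".
const REDACTED = "[REDACTED]";

// Node's well-known symbol for customising util.inspect / console.log output.
const inspectCustom = Symbol.for("nodejs.util.inspect.custom");

interface SensitiveValue {
  sensitiveValue(): string; // the only path to cleartext
  toString(): string;
  toJSON(): string;
}

// The cleartext lives only in the closure, so it never appears as an
// enumerable property that a logger or serializer could pick up.
function makeSensitive(cleartext: string): SensitiveValue {
  return Object.freeze({
    sensitiveValue: () => cleartext,
    toString: () => REDACTED,
    toJSON: () => REDACTED,
    [Symbol.toPrimitive]: () => REDACTED,
    [inspectCustom]: () => REDACTED,
  });
}

// Wrap a stack output value when its key looks sensitive; pass others through.
function wrapOutput(key: string, value: string): string | SensitiveValue {
  return SENSITIVE_KEY.test(key) ? makeSensitive(value) : value;
}
```

With this shape, string interpolation, `JSON.stringify`, and `console.log` of a wrapped value all yield the placeholder, while a spec that needs the real password must call `sensitiveValue()` explicitly, which keeps every cleartext use easy to find in review.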

Deferred:

  • T3.8 (end-to-end rails integration test): requires Phase 1b's smoke account to exist and `docs/smoke-test-account-config.yml` to have real values. `workflow_dispatch` verification is the post-1b acceptance gate; this PR ships rails ready for that.
  • Per-scenario specs are Phase 4 (17 follow-up PRs).

The `co-cddo/ndx-try-maintainers` team referenced in CODEOWNERS is a placeholder; once the team is provisioned, the same path patterns will work. Until then, branch protection plus standard PR review remain the operative gate.

Depends on: PRs #226 (Phase 1a), #227 (Phase 2a), #228 (Phase 2b) via the branch chain. The smoke workflow's `Read smoke-test-account-config.yml` step yq-asserts the placeholder values are gone before running — i.e. it self-disables until Phase 1b populates the config file with real values, so it is safe to merge ahead of 1b.

Test plan

  • CI: smoke.yml lint passes (`actionlint`)
  • CI: existing portal Playwright suites (desktop, mobile) still pick up their tests and run unchanged
  • Local: `PLAYWRIGHT_SUITE=smoke npx playwright test --list --project=smoke` reports zero tests (Phase 4 populates)
  • Local: `SMOKE_STACK_NAME=x SMOKE_AWS_REGION=us-east-1 SMOKE_AWS_PROFILE=NDX/foo ./scripts/smoke.sh --list` exits 0
  • Local: `SMOKE_AWS_REGION=us-east-1 ./scripts/smoke.sh` (missing SMOKE_STACK_NAME) fails fast with a clear error
  • Operator: after Phase 1b lands, `gh workflow run smoke.yml` against the smoke-test account completes (no specs yet, but rails verify)

chrisns added 3 commits May 12, 2026 10:34
…2a partial)

Phase 2a of the scenario-regression smoke-pack tech-spec. Two new synth jobs in
deploy-blueprints.yml plus a packaged-CodeUri verification step for the
existing ai-contact-centre SAM packaging.

What ships:
- synth-planx: cdk synth, strip bootstrap, write template.yaml, upload
  artifact. Stack name PlanxStack. Template currently produces cleanly under
  the existing strict synth lint (all RemovalPolicies are DESTROY).
- synth-digital-planning-register: same pattern. Stack name
  DigitalPlanningRegisterStack. The LogGroup in compute.ts switched from
  RemovalPolicy.RETAIN to DESTROY; there was no comment justifying RETAIN and
  the smoke pack wants clean teardown between runs.
- New downloads + S3 upload step in the deploy job: planx,
  digital-planning-register, and the existing-committed bops-planning template
  all land at s3://ndx-try-isb-blueprints-568672915267/scenarios/<name>/template.yaml.
- ai-contact-centre packaging gains a verification step: greps the packaged
  template for s3:// references and fails if any point at a bucket other than
  the blueprints bucket. Catches a sam-package regression where the
  --s3-bucket flag gets dropped and CodeUri silently lands in
  samclisourcebucket.

What's deferred:
- synth-bops-planning: deferred to Phase 2b. The bops-planning CDK creates a
  LogGroup with RemovalPolicy.RETAIN, deliberately, with the comment "RETAIN
  so we can debug failures after rollback" in compute.ts:48. The existing
  strict synth lint would fail this. Phase 2b introduces a justification-aware
  retention lint that handles intentional retention via Metadata.Justification
  on the resource, at which point the bops synth job can ship. The bops
  template.yaml is committed manually until then.

T2a.1 audit: existing pipeline had synth jobs for localgov-drupal,
localgov-ims, simply-readable, minute, fixmystreet, paperless-ngx. The 7
non-synth scenarios (council-chatbot, foi-redaction, planning-ai,
smart-car-park, text-to-speech, quicksight-dashboard, all-demo) ship committed
templates directly. With this PR, planx and digital-planning-register join the
synth pipeline.

T2a.6 verification (all 17 scenarios in S3) happens after the workflow runs
on push to main; it is the merge-time success criterion for this PR.
Phase 2b of the scenario-regression smoke-pack tech-spec.

What ships:
- scripts/lint-retention-policies.sh: linter for synthesized CFN. Targets:
  DeletionPolicy/UpdateReplacePolicy=Retain, Properties.DeletionProtection=true,
  Properties.EnableDeletionProtection=true, Properties.FinalSnapshotIdentifier
  set. A resource is exempt if Metadata.Justification is non-empty. Plus a
  second-order cap (MAX_JUSTIFICATIONS, default 5) so the exemption mechanism
  doesn't degenerate into rubber-stamping. Plus existing CDK-residue checks
  (AssetParameters, cdk-bootstrap reference) and template-size limit (460KB).
  Handles both synth-output JSON-in-YAML and committed real YAML (via yq).

- synth-bops-planning workflow job: now produces template.yaml in CI. The
  bops-planning LogGroup in the CDK source picks up a Metadata.Justification
  (via cfnOptions.metadata) explaining the RETAIN choice for debug-after-
  rollback so the new lint passes. The previously-committed bops template.yaml
  is removed; the synth job is now the single source of truth.

- lint-committed-templates workflow job: top-level pass over the hand-authored
  CFN templates that have no synth step (council-chatbot, foi-redaction,
  planning-ai, smart-car-park, text-to-speech, quicksight-dashboard,
  ai-contact-centre, all-demo).

- all-demo/template.yaml expanded from 7 to 16 nested scenarios. The 9 new
  nested stacks (simply-readable, ai-contact-centre, localgov-ims, minute,
  fixmystreet, paperless-ngx, planx, bops-planning, digital-planning-register)
  follow the existing pattern: TemplateURL via the blueprints-bucket convention,
  TimeoutInMinutes calibrated to observed deploy times, AppRegistry tag.
  Required parameters (GovUkPayApiKey, OSVectorTilesApiKey, DprImageUri,
  DprCouncilConfig) are exposed at the umbrella level with empty / sensible
  defaults so all-demo deploys cleanly without per-scenario credentials but
  remain overridable for full-functionality deploys. The Outputs block
  surfaces the primary URL + admin credentials for every new child stack.

- Upload step in deploy job extends to cover the non-StackSet scenarios
  (planx, bops-planning, digital-planning-register, minute, fixmystreet,
  paperless-ngx) so their synthed templates reach scenarios/<name>/template.yaml
  in the blueprints bucket for all-demo to nest.
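
The retention-lint rules listed above can be sketched as a pure function over a template's `Resources` map. The real linter is a shell script (`scripts/lint-retention-policies.sh`); this TypeScript version only illustrates the decision logic, and the names `lintRetention`, `retentionReasons`, and the result shape are assumptions. The CDK-residue and template-size checks are left out.

```typescript
// Hypothetical sketch of the retention-lint decision logic (not the shell script itself).
interface CfnResource {
  Type: string;
  DeletionPolicy?: string;
  UpdateReplacePolicy?: string;
  Properties?: Record<string, unknown>;
  Metadata?: Record<string, unknown>;
}

interface LintResult {
  violations: string[]; // human-readable failures
  justified: string[];  // logical IDs exempted via Metadata.Justification
  ok: boolean;
}

// Collect every retention-style setting the lint targets on one resource.
function retentionReasons(resource: CfnResource): string[] {
  const reasons: string[] = [];
  const props = resource.Properties || {};
  if (resource.DeletionPolicy === "Retain") reasons.push("DeletionPolicy=Retain");
  if (resource.UpdateReplacePolicy === "Retain") reasons.push("UpdateReplacePolicy=Retain");
  if (props["DeletionProtection"] === true) reasons.push("DeletionProtection=true");
  if (props["EnableDeletionProtection"] === true) reasons.push("EnableDeletionProtection=true");
  if (props["FinalSnapshotIdentifier"] !== undefined) reasons.push("FinalSnapshotIdentifier set");
  return reasons;
}

// maxJustifications mirrors the MAX_JUSTIFICATIONS cap (default 5) described above.
function lintRetention(
  resources: Record<string, CfnResource>,
  maxJustifications: number = 5,
): LintResult {
  const violations: string[] = [];
  const justified: string[] = [];
  for (const id of Object.keys(resources)) {
    const resource = resources[id];
    const reasons = retentionReasons(resource);
    if (reasons.length === 0) continue;
    const justification = resource.Metadata ? resource.Metadata["Justification"] : undefined;
    if (typeof justification === "string" && justification.trim() !== "") {
      justified.push(id); // exempt, but counted against the cap
    } else {
      violations.push(`${id}: ${reasons.join(", ")} without Metadata.Justification`);
    }
  }
  if (justified.length > maxJustifications) {
    violations.push(`justified retentions ${justified.length} exceed cap ${maxJustifications}`);
  }
  return { violations, justified, ok: violations.length === 0 };
}
```

Counting justified resources against a cap is what stops the exemption from becoming a rubber stamp: each justification still passes individually, but the template fails once there are too many of them.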

Deferred:
- T2b.4 (verification PR that introduces a DeletionPolicy: Retain without
  justification and confirms the lint fails) is documented in this PR but
  not opened as a separate PR; the lint script's local tests against
  synthetic templates and the real bops template cover the same surface.
- T2b.5a (pre-deploy quota matrix) lands in the runbook via a follow-up PR
  to docs/smoke-test-account-setup.md once Phase 4 produces real usage data.
- T2b.5b (manual all-demo deploy) is operator-driven against the smoke-test
  account once Phase 1b lands.

Existing synth jobs (localgov-drupal, localgov-ims, simply-readable, minute,
fixmystreet, paperless-ngx, planx, digital-planning-register) keep their
inline strict synth-time lint, which still works for them because none of
those templates carry retain. Migration to the new lint script for those
jobs is deferred to a follow-up PR (zero behavioural change required).
…ase 3)

Phase 3 of the scenario-regression smoke-pack tech-spec. Lays the rails so
Phase 4 can ship one PR per scenario without re-deriving the workflow shape.

Ships:
- playwright.config.ts: adds the `smoke` project gated on
  PLAYWRIGHT_SUITE=smoke. No webServer, no baseURL, trace=retain-on-failure.
  Existing desktop/mobile projects keep their tests and webServer.
- tests/smoke/fixtures/cfn-outputs.ts: SDK-v3 DescribeStacks helper with the
  secret-redaction contract from the spec. Output keys matching the
  sensitivity regex (broad: Password / Secret / Token / Credentials / Creds
  / Login / ApiKey / ConnectionString / PrivateKey / Passphrase) return a
  SensitiveValue whose toString / inspect / Symbol.toPrimitive emit a
  REDACTED placeholder. Cleartext only flows via the explicit
  sensitiveValue() accessor; never logged.
- tests/smoke/fixtures/assertion-bar.ts: AssertionBarRow type + empty Map.
  Phase 4 PRs populate one row per scenario citing the historical
  regression that informed featureFlow.
- tests/smoke/fixtures/secure-form.ts: fillPassword(page, selector, value)
  wraps page.fill so the Playwright trace records a REDACTED-<sha> hash
  instead of the cleartext form-encoded password value.
- scripts/smoke.sh: identical-invocation contract for local + CI. Required
  env vars asserted at top with helpful errors. Local = SMOKE_AWS_PROFILE
  SSO; CI = OIDC credentials already exported by configure-aws-credentials
  upstream.
- .env.example: documents the smoke env vars.
- package.json: adds the test:smoke script.
- .github/workflows/smoke.yml: the smoke workflow. Trigger matrix
  (PR-scoped, nightly cron, push-to-main, workflow_dispatch); scope
  pre-flight that flips full-vs-scoped based on changed paths;
  configure-aws-credentials via OIDC against the role committed in
  docs/smoke-test-account-config.yml; deployment-environment gate
  (smoke-test-deploy); SCP drift check with issue-body counter and
  fail-at-7 escalation; pre-deploy state check with auto-recovery
  (continue-update-rollback / retry-delete / recovery stack name);
  quarantine-expiry check parsing assertion-bar.ts; deploy of all-demo
  (the smoke account's authoritative target); smoke-pack execution; CFN
  events captured into the artefact bundle; artefact upload (Playwright
  + CFN events + 30-day retention) BEFORE teardown so live-stack state is
  recorded; teardown with 3 × 60s retry then stranded-stack issue;
  cron-only smoke-failed issue.
- .github/CODEOWNERS: requires review on the sensitive smoke paths so a
  PR cannot run with deploy credentials until a CODEOWNERS reviewer
  approves the deployment environment.
- .github/workflows/quarterly-audit.yml: opens a tracking issue every 3
  months covering the six audit items in the runbook's Operational Notes
  (spend, orphan sweep, deploy-role policy drift, SCP drift, Renovate
  liveness, ProtectISB-fallback revisit). Plus a daily auto-escalation
  step that nudges any quarterly-audit issue open >30 days.

Deferred:
- T3.8 (end-to-end rails integration test): requires Phase 1b's smoke
  account to exist and docs/smoke-test-account-config.yml to have real
  values. Workflow_dispatch verification is the post-1b acceptance gate;
  this PR ships the rails ready for that gate.
- Phase 4 per-scenario specs come in 17 follow-up PRs.

The 'co-cddo/ndx-try-maintainers' team referenced in CODEOWNERS is a
placeholder; when the team is provisioned, the same path patterns will
work. Until then branch-protection + standard PR review remains the
operative gate.
Address review findings against PR #229:

1. Concurrency: serialise all smoke runs globally (cancel-in-progress: false
   everywhere). The previous per-PR group with cancel-in-progress=true for
   PRs cancelled the runner but left CFN deploys running on AWS side,
   leaving half-deployed stacks the next run had to clean. Queuing is
   slower but correct.

2. yq install: pin to v4.45.4 instead of `latest`. Latest is a soft supply-
   chain attack vector; deliberate version bumps via PR are safer. SHA256
   verification is informational so Renovate-bumped versions still work.

3. configure-aws-credentials: drop the explicit `audience: sts.amazonaws.com`
   parameter. Default is already sts.amazonaws.com; specifying it explicitly
   misled readers into thinking it was load-bearing. Add an `id: aws-creds`
   anchor so later steps can gate on its outcome.

4. Pre-deploy state check `*_IN_PROGRESS` branch: only attempt
   cancel-update-stack on states the API supports (UPDATE_IN_PROGRESS,
   UPDATE_COMPLETE_CLEANUP_IN_PROGRESS, UPDATE_ROLLBACK_IN_PROGRESS).
   For CREATE_IN_PROGRESS / DELETE_IN_PROGRESS / ROLLBACK_IN_PROGRESS, the
   cancel API does not apply and silently errors; the previous code
   submitted the call regardless, polluting logs.

5. Capture CFN events + Teardown: gate on
   `steps.aws-creds.outcome == 'success'`. With no creds, every aws CLI
   call errored instantly and the 3x60s teardown retry just wasted 3 min
   of CI per failed run (observed during T3.8 verification). Also fall
   back to predeploy.stack_name when deploy.stack_name is empty (deploy
   step never set its output because deploy itself failed) so we don't
   try to describe an empty stack name.

6. SCP drift check: exclude p-FullAWSAccess from both expected and actual
   sets. It's AWS-managed and implicitly present on every account; it
   cannot drift in any meaningful sense, and leaving it in inflated the
   diff signal.

7. CODEOWNERS: replace @co-cddo/ndx-try-maintainers (which doesn't exist
   as a GitHub team) with @chrisns. The previous handle resolved to no
   one, which silently bypassed CODEOWNERS review on the smoke-pack
   sensitive paths. A team should be provisioned later and the handle
   swapped out then.
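
Finding 5 above (gate teardown on the credentials step, and fall back to the pre-deploy stack name) can be sketched as follows. The workflow itself expresses this in YAML and shell; the function names, the injected `deleteStack` and `wait` callbacks, and the choice to pause after every failed attempt are assumptions made for illustration.

```typescript
// Hypothetical sketch of the teardown gating and retry logic; not the workflow's own code.

// Values GitHub Actions reports in steps.<id>.outcome.
type StepOutcome = "success" | "failure" | "cancelled" | "skipped";

interface TeardownInputs {
  credsOutcome: StepOutcome;   // steps.aws-creds.outcome
  deployStackName: string;     // empty if the deploy step failed before setting its output
  predeployStackName: string;  // set earlier by the pre-deploy state check
}

// Decide which stack (if any) teardown should target.
function resolveTeardownTarget(inputs: TeardownInputs): string | null {
  // Without credentials every AWS call fails instantly, so skip teardown entirely.
  if (inputs.credsOutcome !== "success") return null;
  const name = inputs.deployStackName || inputs.predeployStackName;
  return name === "" ? null : name;
}

// Try to delete the stack up to `attempts` times, pausing `delaySeconds`
// after each failed attempt. Returns false when the stack is still stranded,
// at which point the caller would open the stranded-stack issue.
function teardownWithRetry(
  stackName: string,
  deleteStack: (name: string) => boolean,
  wait: (seconds: number) => void,
  attempts: number = 3,
  delaySeconds: number = 60,
): boolean {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    if (deleteStack(stackName)) return true;
    wait(delaySeconds);
  }
  return false;
}
```

Returning `null` from `resolveTeardownTarget` when credentials are missing avoids the three-minute credential-less retry described in the finding, and the name fallback avoids describing an empty stack name.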
@chrisns chrisns had a problem deploying to smoke-test-deploy May 12, 2026 12:26 — with GitHub Actions Failure
chrisns (Member, Author) commented May 12, 2026

Adversarial-review followups landed in b4ee230: concurrency globalised (no cancel-in-progress), yq pinned to v4.45.4, audience param dropped (default), cancel-update-stack now applies only to UPDATE_* states, CFN events + teardown gated on aws-creds.outcome == 'success' to skip the 3-min credential-less retry storm, SCP drift excludes p-FullAWSAccess, CODEOWNERS now @chrisns (real handle) until a team is provisioned.

…eanup)

Two findings from the second adversarial pass:

1. The previous fix for runbook_version drift (PR #233 commit aa93a06)
   re-introduced the same drift one commit later — the SHA captured
   referred to the PREVIOUS commit because the same commit was also
   modifying setup.md. Add a CI check in the smoke workflow's config-
   reading step that asserts runbook_version equals `git log -1 -- setup.md`.
   Also add fetch-depth: 0 to checkout so the log resolves, and add
   setup.md to the trigger paths so the check fires on runbook-only
   changes.

2. The yq SHA256 I'd added (ce4b67c3...) was fabricated and did NOT
   match the actual v4.45.4 release. Discovered when I sha256sum'd
   the actual binary mid-review. Since the check was set non-fatal,
   it produced a misleading "WARN" line on every run. Drop the SHA
   verification entirely; the version pin is sufficient (same trust
   model as the rest of CI's third-party-action references). Fixing
   the fake SHA without dropping the check would just shift the
   maintenance burden onto every Renovate bump.
@chrisns chrisns had a problem deploying to smoke-test-deploy May 12, 2026 12:41 — with GitHub Actions Failure
…ion-duration)

Third-pass finding: bumping the role's max-session-duration to 6h
(commit acd3e6e) was necessary but not sufficient. configure-aws-credentials
defaults role-duration-seconds to 1h regardless of the role's max. Explicitly
request 6h so the workflow uses the longer session that the role now permits.
chrisns (Member, Author) commented May 12, 2026

Superseded by #236 — single squashed PR for all 6 phases. Dependency chain across 9 PRs made parallel review worse than serial; one PR with one review and one CI run is simpler.

@chrisns chrisns closed this May 12, 2026
chrisns added a commit that referenced this pull request May 12, 2026
Implements every phase of _bmad-output/implementation-artifacts/tech-spec-scenario-regression-smoke-pack.md
in a single deliverable. The original spec called for one PR per phase
(8+ PRs); experience showed the dependency overlap made that worse for
review, not better, so this squashes #226 / #227 / #228 / #229 / #230
/ #231 / #232 / #233 / #235 into a single change.

What ships
==========

Phase 1a — runbook + config schema
- docs/smoke-test-account-setup.md: one-off manual procedure for vending
  the long-lived smoke-test AWS account, with the four required sections
  (Prerequisites / Procedure / Verification / Operational Notes). Per-step
  idempotency checks + inverses; ProtectISB role-creation canary +
  fallback branch (ADR-1); Bedrock model-access enablement + gotchas
  (legacy claude-3-haiku-20240307 retired, Nova body shape); service-quota
  targets; QuickSight decision; iterate-to-least-privilege protocol for
  the inline IAM policy.
- docs/smoke-test-account-config.yml: post-runbook state record schema.

Phase 1b — operator-executed account state
- Smoke account 464453619983 provisioned in NDX org under the fallback
  branch (ProtectISB canary failed; account moved to root with
  Restrictions SCP attached directly). AwsNuke SCP intentionally NOT
  attached (it blocks sts:AssumeRoleWithWebIdentity and we use CFN
  delete + retention lint, not aws-nuke).
- OIDC provider + InnovationSandbox-ndx-SmokeTestDeployRole created
  with 6h max-session-duration. Trust policy uses sub-pattern lock
  (`repo:co-cddo/ndx_try_aws_scenarios:*`) + aud condition; the
  repository_owner claim condition is omitted because it reproducibly
  breaks the assume even though the OIDC token contains the claim
  (verified via JWT decode in an investigation workflow that has since
  been deleted; see runbook Step 10).
- expected_scps reflects live state: Restrictions + FullAWSAccess.

Phase 2a — synth pipelines for missing scenarios
- New synth jobs in .github/workflows/deploy-blueprints.yml for planx
  and digital-planning-register (CDK -> template.yaml -> S3 via the
  existing isb-hub upload chain).
- bops-planning synth job lands in Phase 2b after the retention lint
  is justification-aware.
- ai-contact-centre: new "verify packaged CodeUri targets blueprints
  bucket" step catches a sam-package regression where --s3-bucket would
  silently land in the SAM default bucket.

Phase 2b — all-demo expansion + retention lint
- cloudformation/scenarios/all-demo/template.yaml expanded from 7 to 16
  nested scenarios (Minute, FixMyStreet, AI Contact Centre, LocalGov IMS,
  Paperless-ngx, PlanX, Bops Planning, Simply Readable, Digital Planning
  Register). Umbrella parameters for credentials (GovUkPayApiKey,
  OSVectorTilesApiKey, DprImageUri, DprCouncilConfig) with overridable
  empty / sensible defaults; per-scenario URL + admin-credential Outputs
  surfaced.
- scripts/lint-retention-policies.sh: forbids DeletionPolicy=Retain /
  UpdateReplacePolicy=Retain / Properties.DeletionProtection=true /
  Properties.EnableDeletionProtection=true /
  Properties.FinalSnapshotIdentifier unless the resource carries a
  non-empty Metadata.Justification. Per-template cap (default 3) +
  global cap (default 10) so any one scenario can't pencil-whip
  retentions repo-wide.
- lint-committed-templates job in deploy-blueprints.yml runs the lint
  over hand-authored CFN templates.
- bops-planning's LogGroup keeps RemovalPolicy.RETAIN (deliberate
  debug-after-rollback) with a Metadata.Justification attached via
  cfnOptions; bops synth job re-enabled.

Phase 3 — smoke rails
- playwright.config.ts: new 'smoke' project gated on
  PLAYWRIGHT_SUITE=smoke.
- tests/smoke/fixtures/cfn-outputs.ts: SDK-v3 DescribeStacks helper.
  Sensitive output values flow only via explicit sensitiveValue()
  accessor; toString / inspect / Symbol.toPrimitive emit REDACTED
  placeholder. Documents the CloudFormation-API limitation that Output
  Metadata.Sensitive opt-in isn't readable (regex is the sole signal).
- tests/smoke/fixtures/assertion-bar.ts: 17 AssertionBarRow entries
  populated.
- tests/smoke/fixtures/secure-form.ts: fillPassword wrapper redacts
  form-encoded passwords from Playwright trace.
- scripts/smoke.sh + .env.example: local + CI identical invocation.
- .github/workflows/smoke.yml: trigger matrix (PR-scoped / nightly
  cron / push-to-main / workflow_dispatch); scope decides full vs
  scoped from changed paths; global serial concurrency (no
  cancel-in-progress — cancelled runs leave orphan AWS state);
  configure-aws-credentials with role-duration-seconds=21600 (6h)
  to match the role's max-session-duration; pre-deploy state check
  with auto-recovery for stranded stacks; SCP drift check
  (excluding FullAWSAccess, fail-soft for first 7 detections);
  quarantine-expiry check; CFN events captured BEFORE teardown;
  teardown with 3x60s retry, gated on aws-creds outcome so we don't
  burn 3min retrying without credentials.
- .github/workflows/quarterly-audit.yml: 3-monthly tracking issue
  (spend, orphan sweep, deploy-role policy drift, SCP drift, Renovate
  liveness, ProtectISB-fallback revisit).
- .github/CODEOWNERS: smoke-pack sensitive paths require @chrisns
  review (until a maintainers team is provisioned).

Phase 4 — 17 per-scenario smoke specs
- One spec per scenario covering the auth-mode pattern (admin-login /
  public / sso-skip / umbrella). Bug-informed feature flows cite the
  historical regression that informed each test:
  - fixmystreet: /reports requires bin/update-all-reports; /admin
    must reach the dashboard without 2FA redirect
  - planx: SPA boots free of domain-allowlist / Airbrake errors;
    Hasura native /v1/version responds (Caddy elimination)
  - minute: magic-link sets cookie; same-origin fetch() works
    post-auth; /api/proxy/healthcheck reaches the backend (catches
    the basic-auth-breaks-fetch() regression and the ALB /api/*
    interception regression)
  - localgov-ims: Windows IIS multi-site routing; AdminPassword
    must not be the literal {{resolve:...}} token (catches the
    Lambda-custom-resource regression)
  - localgov-drupal: ndx_aws_ai module boots without Bedrock
    AccessDeniedException
  - simply-readable: SPA loads, credentials non-empty + non-token;
    reload produces no 5xx responses (catches BlueprintsBucketName
    mis-wire)
  - ai-contact-centre: PSTN claim matches UK toll-free / landline OR
    US toll-free (catches international fallback regression)
  - paperless-ngx: /documents view + /api/documents/ respond (S3
    Files mount integrity)
  - bops-planning: post-login URL is NOT on the Applicants port
    (catches the routing.rb single-tenant override regression)
  - digital-planning-register: register loads with planning markers
  - public-Lambda scenarios (foi-redaction, planning-ai, smart-car-park,
    text-to-speech, council-chatbot): FunctionURL not-5xx + not-403
    (catches the InvokeFunctionUrl + InvokeFunction dual-permission
    regression); council-chatbot uses POST not GET so the test isn't
    vacuous against a POST-only Lambda
  - quicksight-dashboard: landing + outputs only (sso-skip per auth-
    mode categorisation)
  - all-demo: discovers Output keys dynamically by parsing the
    committed template at test time; asserts every Output present,
    non-empty, and not the {{resolve:...}} literal; URL outputs match
    https?://
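
The all-demo output assertions described above can be sketched as a pure check over the expected keys and the live stack outputs. The function name `checkOutputs` and the rule that a key ending in "url" (any case) marks a URL output are assumptions; the source does not say how URL outputs are identified, and the template parsing that produces the expected keys is omitted.

```typescript
// Hypothetical sketch of the all-demo output assertions; not the shipped spec.

// Matches an unresolved CloudFormation dynamic reference left in an output value.
const RESOLVE_TOKEN = /\{\{resolve:[^}]*\}\}/;

// Returns one message per failed expectation; an empty array means all checks passed.
function checkOutputs(
  expectedKeys: string[],
  actual: Record<string, string | undefined>,
): string[] {
  const problems: string[] = [];
  for (const key of expectedKeys) {
    const value = actual[key];
    if (value === undefined) {
      problems.push(`${key}: missing`);
    } else if (value.trim() === "") {
      problems.push(`${key}: empty`);
    } else if (RESOLVE_TOKEN.test(value)) {
      problems.push(`${key}: unresolved {{resolve:...}} literal`);
    } else if (/url$/i.test(key) && !/^https?:\/\//.test(value)) {
      // Assumption: keys ending in "url" are the URL outputs.
      problems.push(`${key}: not an http(s) URL`);
    }
  }
  return problems;
}
```

Deriving `expectedKeys` from the committed template at test time means a newly added nested scenario is covered without anyone having to update the spec by hand.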

Phase 5 — pin every floating image tag
- 10 own-GHCR images (fixmystreet, localgov_drupal, minute_*, planx-*,
  dpr) pinned to sha-<7chars>@sha256:<digest>.
- 2 upstream images (docker.io/apache/tika 3.3.0.0-full,
  ghcr.io/paperless-ngx/paperless-ngx 2.9) pinned to <tag>@sha256:<digest>.
- Removed legacy cloudformation/scenarios/minute/template.json (stale
  ECR references; nothing in the repo referenced it).
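
The two pin shapes named above can be checked with a small parser. This is a sketch only: the function names are hypothetical, the 7 characters in `sha-<7chars>` are assumed to be lowercase hex (a short Git SHA), and registries with an explicit port are not handled.

```typescript
// Hypothetical sketch for validating pinned image references.
interface ImagePin {
  image: string;  // e.g. registry/org/name
  tag: string;    // e.g. sha-1a2b3c4 or 2.9
  digest: string; // 64 lowercase hex characters
}

// <image>:<tag>@sha256:<digest>; does not support registries with a port.
const PIN_SHAPE = /^([a-z0-9._\/-]+):([A-Za-z0-9._-]+)@sha256:([0-9a-f]{64})$/;

// Own-image tags: "sha-" plus 7 hex characters (hex is an assumption).
const OWN_TAG_SHAPE = /^sha-[0-9a-f]{7}$/;

// Parse a pinned reference, or return null when it is floating or malformed.
function parsePin(ref: string): ImagePin | null {
  const match = PIN_SHAPE.exec(ref);
  if (match === null) return null;
  return { image: match[1], tag: match[2], digest: match[3] };
}

// True when the reference is pinned in the own-GHCR shape sha-<7chars>@sha256:<digest>.
function isOwnImagePin(ref: string): boolean {
  const pin = parsePin(ref);
  return pin !== null && OWN_TAG_SHAPE.test(pin.tag);
}
```

A check like this rejects a reference such as `name:latest`, which carries a tag but no digest, so a floating tag cannot slip back in unnoticed.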

Phase 6 — Renovate adoption (replaces Dependabot)
- renovate.json: 6 group rules per the spec's pinning-strategy table;
  customManagers regex matching the new pin shape; osvVulnerabilityAlerts
  + security-priority group; pinDigests scoped to official actions/* +
  aws-actions/* only so the first run doesn't firehose; per-PR limits
  capped at 6.
- .github/workflows/renovate.yml: twice-daily + workflow_dispatch.
  Action pinned by digest to v46.1.14.
- .github/dependabot.yml deleted.

Operator follow-ups (not in this PR)
====================================
- NAP-548: migrate scenarios off legacy claude-3-haiku-20240307
- NAP-549: revisit ProtectISB fallback by 2026-11-12
- NAP-550: service-quota Console requests
- NAP-551: QuickSight subscription decision
- NAP-552: mint RENOVATE_TOKEN repo secret
- NAP-554: close in-flight Dependabot PRs
- NAP-555: T2b.5b + T3.8 end-to-end verifications

Closes: #226, #227, #228, #229, #230, #231, #232, #233, #235.
chrisns added a commit that referenced this pull request May 13, 2026
Implements every phase of _bmad-output/implementation-artifacts/tech-spec-scenario-regression-smoke-pack.md
in a single deliverable. The original spec called for one PR per phase
(8+ PRs); experience showed the dependency overlap made that worse for
review, not better, so this squashes #226 / #227 / #228 / #229 / #230
/ #231 / #232 / #233 / #235 into a single change.

What ships
==========

Phase 1a — runbook + config schema
- docs/smoke-test-account-setup.md: one-off manual procedure for vending
  the long-lived smoke-test AWS account, with the four required sections
  (Prerequisites / Procedure / Verification / Operational Notes). Per-step
  idempotency checks + inverses; ProtectISB role-creation canary +
  fallback branch (ADR-1); Bedrock model-access enablement + gotchas
  (legacy claude-3-haiku-20240307 retired, Nova body shape); service-quota
  targets; QuickSight decision; iterate-to-least-privilege protocol for
  the inline IAM policy.
- docs/smoke-test-account-config.yml: post-runbook state record schema.

Phase 1b — operator-executed account state
- Smoke account 464453619983 provisioned in NDX org under the fallback
  branch (ProtectISB canary failed; account moved to root with
  Restrictions SCP attached directly). AwsNuke SCP intentionally NOT
  attached (it blocks sts:AssumeRoleWithWebIdentity and we use CFN
  delete + retention lint, not aws-nuke).
- OIDC provider + InnovationSandbox-ndx-SmokeTestDeployRole created
  with 6h max-session-duration. Trust policy uses sub-pattern lock
  (`repo:co-cddo/ndx_try_aws_scenarios:*`) + aud condition; the
  repository_owner claim condition is omitted because it reproducibly
  breaks the assume even though the OIDC token contains the claim
  (verified via JWT decode in an investigation workflow that has since
  been deleted; see runbook Step 10).
- expected_scps reflects live state: Restrictions + FullAWSAccess.

Phase 2a — synth pipelines for missing scenarios
- New synth jobs in .github/workflows/deploy-blueprints.yml for planx
  and digital-planning-register (CDK -> template.yaml -> S3 via the
  existing isb-hub upload chain).
- bops-planning synth job lands in Phase 2b after the retention lint
  is justification-aware.
- ai-contact-centre: new "verify packaged CodeUri targets blueprints
  bucket" step catches a sam-package regression where --s3-bucket would
  silently land in the SAM default bucket.

Phase 2b — all-demo expansion + retention lint
- cloudformation/scenarios/all-demo/template.yaml expanded from 7 to 16
  nested scenarios (Minute, FixMyStreet, AI Contact Centre, LocalGov IMS,
  Paperless-ngx, PlanX, Bops Planning, Simply Readable, Digital Planning
  Register). Umbrella parameters for credentials (GovUkPayApiKey,
  OSVectorTilesApiKey, DprImageUri, DprCouncilConfig) with overridable
  empty / sensible defaults; per-scenario URL + admin-credential Outputs
  surfaced.
- scripts/lint-retention-policies.sh: forbids DeletionPolicy=Retain /
  UpdateReplacePolicy=Retain / Properties.DeletionProtection=true /
  Properties.EnableDeletionProtection=true /
  Properties.FinalSnapshotIdentifier unless the resource carries a
  non-empty Metadata.Justification. Per-template cap (default 3) +
  global cap (default 10) so any one scenario can't pencil-whip
  retentions repo-wide.
- lint-committed-templates job in deploy-blueprints.yml runs the lint
  over hand-authored CFN templates.
- bops-planning's LogGroup keeps RemovalPolicy.RETAIN (deliberate
  debug-after-rollback) with a Metadata.Justification attached via
  cfnOptions; bops synth job re-enabled.

Phase 3 — smoke rails
- playwright.config.ts: new 'smoke' project gated on
  PLAYWRIGHT_SUITE=smoke.
- tests/smoke/fixtures/cfn-outputs.ts: SDK-v3 DescribeStacks helper.
  Sensitive output values flow only via explicit sensitiveValue()
  accessor; toString / inspect / Symbol.toPrimitive emit REDACTED
  placeholder. Documents the CloudFormation-API limitation that Output
  Metadata.Sensitive opt-in isn't readable (regex is the sole signal).
- tests/smoke/fixtures/assertion-bar.ts: 17 AssertionBarRow entries
  populated.
- tests/smoke/fixtures/secure-form.ts: fillPassword wrapper redacts
  form-encoded passwords from Playwright trace.
- scripts/smoke.sh + .env.example: local + CI identical invocation.
- .github/workflows/smoke.yml: trigger matrix (PR-scoped / nightly
  cron / push-to-main / workflow_dispatch); scope decides full vs
  scoped from changed paths; global serial concurrency (no
  cancel-in-progress — cancelled runs leave orphan AWS state);
  configure-aws-credentials with role-duration-seconds=21600 (6h)
  to match the role's max-session-duration; pre-deploy state check
  with auto-recovery for stranded stacks; SCP drift check
  (excluding FullAWSAccess, fail-soft for first 7 detections);
  quarantine-expiry check; CFN events captured BEFORE teardown;
  teardown with 3x60s retry, gated on aws-creds outcome so we don't
  burn 3min retrying without credentials.
- .github/workflows/quarterly-audit.yml: 3-monthly tracking issue
  (spend, orphan sweep, deploy-role policy drift, SCP drift, Renovate
  liveness, ProtectISB-fallback revisit).
- .github/CODEOWNERS: smoke-pack sensitive paths require @chrisns
  review (until a maintainers team is provisioned).
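
The redaction contract described for cfn-outputs.ts can be sketched as
below. This is not the fixture itself: the placeholder text, the
key-matching regex, and the names `sensitiveOutput` / `wrapOutput` are
assumptions chosen for illustration. Only the `sensitiveValue()`
accessor and the toString / inspect / Symbol.toPrimitive behaviour come
from the description above.

```typescript
// Assumption: the exact placeholder text used by the real fixture is unknown.
const REDACTED = "[REDACTED]";

// Assumption: a hypothetical key pattern. The description only says that a
// regex is the sole signal for sensitivity.
const SENSITIVE_KEY = /password|secret|token|api_?key/i;

interface SensitiveOutput {
  readonly key: string;
  sensitiveValue(): string; // the only path to the cleartext
  toString(): string;
  toJSON(): string;
}

function sensitiveOutput(key: string, value: string): SensitiveOutput {
  const redact = (): string => REDACTED;
  // The cleartext lives only in this closure, so it never appears as an
  // enumerable property that a logger or serializer could walk.
  return {
    key,
    sensitiveValue: () => value,
    toString: redact,
    toJSON: redact,
    // Covers template literals, String(x) and string concatenation.
    [Symbol.toPrimitive]: redact,
    // Node's util.inspect (and therefore console.log) looks up this symbol.
    [Symbol.for("nodejs.util.inspect.custom")]: redact,
  };
}

// Wraps a stack output only when its key looks sensitive.
function wrapOutput(key: string, value: string): string | SensitiveOutput {
  return SENSITIVE_KEY.test(key) ? sensitiveOutput(key, value) : value;
}
```

With this shape, an accidental `console.log(output)` or
`expect(...).toBe(...)` failure message prints the placeholder, and a
test has to call `sensitiveValue()` deliberately to get the cleartext.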

Phase 4 — 17 per-scenario smoke specs
- One spec per scenario covering the auth-mode pattern (admin-login /
  public / sso-skip / umbrella). Bug-informed feature flows cite the
  historical regression that informed each test:
  - fixmystreet: /reports requires bin/update-all-reports; /admin
    must reach the dashboard without 2FA redirect
  - planx: SPA boots free of domain-allowlist / Airbrake errors;
    Hasura native /v1/version responds (Caddy elimination)
  - minute: magic-link sets cookie; same-origin fetch() works
    post-auth; /api/proxy/healthcheck reaches the backend (catches
    the basic-auth-breaks-fetch() regression and the ALB /api/*
    interception regression)
  - localgov-ims: Windows IIS multi-site routing; AdminPassword
    must not be the literal {{resolve:...}} token (catches the
    Lambda-custom-resource regression)
  - localgov-drupal: ndx_aws_ai module boots without Bedrock
    AccessDeniedException
  - simply-readable: SPA loads, credentials non-empty + non-token;
    reload produces no 5xx responses (catches BlueprintsBucketName
    mis-wire)
  - ai-contact-centre: PSTN claim matches UK toll-free / landline OR
    US toll-free (catches international fallback regression)
  - paperless-ngx: /documents view + /api/documents/ respond (S3
    Files mount integrity)
  - bops-planning: post-login URL is NOT on the Applicants port
    (catches the routing.rb single-tenant override regression)
  - digital-planning-register: register loads with planning markers
  - public-Lambda scenarios (foi-redaction, planning-ai, smart-car-park,
    text-to-speech, council-chatbot): FunctionURL not-5xx + not-403
    (catches the InvokeFunctionUrl + InvokeFunction dual-permission
    regression); council-chatbot uses POST not GET so the test isn't
    vacuous against a POST-only Lambda
  - quicksight-dashboard: landing + outputs only (sso-skip per auth-
    mode categorisation)
  - all-demo: discovers Output keys dynamically by parsing the
    committed template at test time; asserts every Output present,
    non-empty, and not the {{resolve:...}} literal; URL outputs match
    https?://
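
The all-demo assertions (present, non-empty, not the {{resolve:...}}
literal, URL outputs matching https?://) can be sketched as a pure
check. Assumptions: the function name, the rule that a key ending in
"Url" marks a URL output, and the expected keys being passed in already
parsed. The real spec discovers the keys from the committed template.

```typescript
// Matches an unresolved CloudFormation dynamic reference left in a value.
const RESOLVE_TOKEN = /\{\{resolve:[^}]*\}\}/;
const URL_SHAPE = /^https?:\/\//;

// Assumption: URL outputs are recognised by a key ending in "Url".
const URL_KEY = /url$/i;

// Returns one human-readable problem per failing output; an empty array
// means every expected output passed.
function checkOutputs(
  expectedKeys: string[],
  actual: Record<string, string | undefined>,
): string[] {
  const problems: string[] = [];
  for (const key of expectedKeys) {
    const value = actual[key];
    if (value === undefined) {
      problems.push(`${key}: missing`);
    } else if (value.trim() === "") {
      problems.push(`${key}: empty`);
    } else if (RESOLVE_TOKEN.test(value)) {
      problems.push(`${key}: unresolved dynamic reference`);
    } else if (URL_KEY.test(key) && !URL_SHAPE.test(value)) {
      problems.push(`${key}: not an http(s) URL`);
    }
  }
  return problems;
}
```

Collecting every problem before failing, rather than stopping at the
first, lets a single run report all broken Outputs of the umbrella
stack at once.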

Phase 5 — pin every floating image tag
- 10 own-GHCR images (fixmystreet, localgov_drupal, minute_*, planx-*,
  dpr) pinned to sha-<7chars>@sha256:<digest>.
- 2 upstream images (docker.io/apache/tika 3.3.0.0-full,
  ghcr.io/paperless-ngx/paperless-ngx 2.9) pinned to <tag>@sha256:<digest>.
- Removed legacy cloudformation/scenarios/minute/template.json (stale
  ECR references; nothing in the repo referenced it).
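
The two pin shapes can be checked mechanically. A minimal sketch,
assuming the 7 characters are lowercase hex (a git short SHA), the
digest is 64 lowercase hex characters, and the image name carries no
registry port; `parsePin` and `isOwnImagePin` are hypothetical names,
and the image in the usage note is a placeholder.

```typescript
interface Pin {
  image: string;
  tag: string;
  digest: string;
}

// Parses <image>:<tag>@sha256:<digest>. Returns null for anything that is
// not pinned by digest, including a bare <image>:<tag> reference.
function parsePin(ref: string): Pin | null {
  const m = /^([^:@\s]+):([\w.-]+)@(sha256:[0-9a-f]{64})$/.exec(ref);
  if (m === null) return null;
  return { image: m[1], tag: m[2], digest: m[3] };
}

// Own-GHCR shape: the tag itself must be sha-<7 hex chars>.
function isOwnImagePin(ref: string): boolean {
  const pin = parsePin(ref);
  return pin !== null && /^sha-[0-9a-f]{7}$/.test(pin.tag);
}
```

For example, `ghcr.io/example/app:sha-0a1b2c3@sha256:<64 hex chars>`
satisfies `isOwnImagePin`, while a reference that keeps a floating tag
and no digest makes `parsePin` return null.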

Phase 6 — Renovate adoption (replaces Dependabot)
- renovate.json: 6 group rules per the spec's pinning-strategy table;
  customManagers regex matching the new pin shape; osvVulnerabilityAlerts
  + security-priority group; pinDigests scoped to official actions/* +
  aws-actions/* only so the first run doesn't firehose; per-PR limits
  capped at 6.
- .github/workflows/renovate.yml: twice-daily + workflow_dispatch.
  Action pinned by digest to v46.1.14.
- .github/dependabot.yml deleted.

Operator follow-ups (not in this PR)
====================================
- NAP-548: migrate scenarios off legacy claude-3-haiku-20240307
- NAP-549: revisit ProtectISB fallback by 2026-11-12
- NAP-550: service-quota Console requests
- NAP-551: QuickSight subscription decision
- NAP-552: mint RENOVATE_TOKEN repo secret
- NAP-554: close in-flight Dependabot PRs
- NAP-555: T2b.5b + T3.8 end-to-end verifications

Closes: #226, #227, #228, #229, #230, #231, #232, #233, #235.