
smoke pack DRY, workflow extraction, CI unblock #238

Merged
chrisns merged 43 commits into main from refactor/smoke-pack-dryrun-and-fixes
May 18, 2026


@chrisns chrisns commented May 13, 2026

Summary

Three threads bundled because they share files:

  1. Smoke pack consolidation — per-scenario tests/smoke/*.spec.ts files (avg ~70 lines, lots of repetition) collapsed into 17 co-located cloudformation/scenarios/*/smoke.ts configs (avg 25 lines). A single tests/smoke/runner.ts drives all of them; tests/smoke/helpers.ts::adminLogin() plus get/getSecret accessors on the smoke context absorb the login-form boilerplate. assertion-bar.ts is gone — quarantine state moves into each scenario's own smoke.ts. Playwright's smoke project re-targeted at cloudformation/scenarios with testMatch: /[^/]+\/smoke\.ts$/ so new scenarios auto-discover. All 17 tests still found by playwright test --list.

  2. Workflow de-bloat (1255 → 601 lines):

    • smoke.yml 467 → 204: scope decision, pre-deploy state, SCP drift, capture-events, teardown extracted to scripts/smoke-*.sh
    • deploy-blueprints.yml 726 → 354: 9 per-scenario synth jobs collapsed to a single matrix; CDK strip + SAM-bucket verify pulled into scripts/
    • renovate.yml 62 → 43: narrative stripped (RENOVATE_TOKEN PAT retained per operator decision)
    • Phase / AC / T*.* / tech-spec narrative removed throughout
  3. Pre-existing CI unblock:

    • LocalGov IMS CDK dropped minLength: 1 on GovUkPayApiKey so smoke deploys (no key) succeed; payment portal degrades cleanly
    • Retention lint extended to cover DeletionPolicy/UpdateReplacePolicy=Snapshot (parity with the prior inline check)

CODEOWNERS switched from @chrisns to @co-cddo/ndx and broadened to cover the new scripts and per-scenario smoke.ts files.

Operator actions taken out-of-band (state already reconciled)

  • IsbHubStack had been UPDATE_ROLLBACK_FAILED for 12 days (transient AiContactCentreStackSet issue from May 1); recovered cleanly via continue-update-rollback.
  • 6 orphan StackSets (bops-planning, digital-planning-register, fixmystreet, minute, paperless-ngx, planx) imported back into IsbHubStack via change-set-type IMPORT. UUIDs preserved → ISB lease bindings intact. The first cdk deploy after merge will create the 6 missing BucketDeployment resources and reconcile any property drift on the imported StackSets.
  • All 17 scenario templates refreshed in s3://ndx-try-isb-blueprints-568672915267/scenarios/. Planx had been at template.json (manual May-8 upload); now at the canonical template.yaml.

Test plan

  • CI: deploy-blueprints.yml green (the matrix + the deploy job, including the 6 new BucketDeployment creates from the import reconciliation)
  • CI: smoke workflow green (templates fresh in S3, LocalgovIms accepts empty key, hub stack healthy)
  • Spot-check: playwright test --project=smoke --list shows 17 tests after checkout
  • Spot-check: git diff main -- .github/workflows/*.yml shows shrunk YAML with no inline scripts > ~15 lines

chrisns added 2 commits May 13, 2026 09:28
Implements every phase of _bmad-output/implementation-artifacts/tech-spec-scenario-regression-smoke-pack.md
in a single deliverable. The original spec called for one PR per phase
(8+ PRs); experience showed the dependency overlap made that worse for
review, not better, so this squashes #226 / #227 / #228 / #229 / #230
/ #231 / #232 / #233 / #235 into a single change.

What ships
==========

Phase 1a — runbook + config schema
- docs/smoke-test-account-setup.md: one-off manual procedure for vending
  the long-lived smoke-test AWS account, with the four required sections
  (Prerequisites / Procedure / Verification / Operational Notes). Per-step
  idempotency checks + inverses; ProtectISB role-creation canary +
  fallback branch (ADR-1); Bedrock model-access enablement + gotchas
  (legacy claude-3-haiku-20240307 retired, Nova body shape); service-quota
  targets; QuickSight decision; iterate-to-least-privilege protocol for
  the inline IAM policy.
- docs/smoke-test-account-config.yml: post-runbook state record schema.

Phase 1b — operator-executed account state
- Smoke account 464453619983 provisioned in NDX org under the fallback
  branch (ProtectISB canary failed; account moved to root with
  Restrictions SCP attached directly). AwsNuke SCP intentionally NOT
  attached (it blocks sts:AssumeRoleWithWebIdentity and we use CFN
  delete + retention lint, not aws-nuke).
- OIDC provider + InnovationSandbox-ndx-SmokeTestDeployRole created
  with 6h max-session-duration. Trust policy uses sub-pattern lock
  (`repo:co-cddo/ndx_try_aws_scenarios:*`) + aud condition; the
  repository_owner claim condition is omitted because it reproducibly
  breaks the assume even though the OIDC token contains the claim
  (verified via JWT decode in an investigation workflow that has since
  been deleted; see runbook Step 10).
- expected_scps reflects live state: Restrictions + FullAWSAccess.

Phase 2a — synth pipelines for missing scenarios
- New synth jobs in .github/workflows/deploy-blueprints.yml for planx
  and digital-planning-register (CDK -> template.yaml -> S3 via the
  existing isb-hub upload chain).
- bops-planning synth job lands in Phase 2b after the retention lint
  is justification-aware.
- ai-contact-centre: new "verify packaged CodeUri targets blueprints
  bucket" step catches a sam-package regression where --s3-bucket would
  silently land in the SAM default bucket.

Phase 2b — all-demo expansion + retention lint
- cloudformation/scenarios/all-demo/template.yaml expanded from 7 to 16
  nested scenarios (Minute, FixMyStreet, AI Contact Centre, LocalGov IMS,
  Paperless-ngx, PlanX, Bops Planning, Simply Readable, Digital Planning
  Register). Umbrella parameters for credentials (GovUkPayApiKey,
  OSVectorTilesApiKey, DprImageUri, DprCouncilConfig) with overridable
  empty / sensible defaults; per-scenario URL + admin-credential Outputs
  surfaced.
- scripts/lint-retention-policies.sh: forbids DeletionPolicy=Retain /
  UpdateReplacePolicy=Retain / Properties.DeletionProtection=true /
  Properties.EnableDeletionProtection=true /
  Properties.FinalSnapshotIdentifier unless the resource carries a
  non-empty Metadata.Justification. Per-template cap (default 3) +
  global cap (default 10) so any one scenario can't pencil-whip
  retentions repo-wide.
- lint-committed-templates job in deploy-blueprints.yml runs the lint
  over hand-authored CFN templates.
- bops-planning's LogGroup keeps RemovalPolicy.RETAIN (deliberate
  debug-after-rollback) with a Metadata.Justification attached via
  cfnOptions; bops synth job re-enabled.
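
The retention rule above can be sketched as a pure function over a parsed template. This is an illustrative TypeScript model, not the contents of scripts/lint-retention-policies.sh: the `Resource` type, the `lintTemplate` helper, and the reading of the per-template cap as a limit on justified retentions are all assumptions.

```typescript
// Hypothetical model of a CloudFormation resource: only the fields the lint inspects.
interface Resource {
  DeletionPolicy?: string;
  UpdateReplacePolicy?: string;
  Properties?: Record<string, unknown>;
  Metadata?: { Justification?: string };
}

interface LintResult {
  violations: string[]; // retained resources with no justification
  justified: string[];  // retained resources that carry a justification
  overCap: boolean;     // true when justified retentions exceed the per-template cap
}

// True when the resource uses any retention mechanism named in the commit message.
function retains(r: Resource): boolean {
  const p = r.Properties ?? {};
  return (
    r.DeletionPolicy === "Retain" ||
    r.UpdateReplacePolicy === "Retain" ||
    p["DeletionProtection"] === true ||
    p["EnableDeletionProtection"] === true ||
    p["FinalSnapshotIdentifier"] !== undefined
  );
}

function lintTemplate(resources: Record<string, Resource>, perTemplateCap = 3): LintResult {
  const violations: string[] = [];
  const justified: string[] = [];
  for (const [id, r] of Object.entries(resources)) {
    if (!retains(r)) continue;
    const why = r.Metadata?.Justification?.trim() ?? "";
    (why.length > 0 ? justified : violations).push(id);
  }
  return { violations, justified, overCap: justified.length > perTemplateCap };
}

const lintResult = lintTemplate({
  Logs: { DeletionPolicy: "Retain", Metadata: { Justification: "debug after rollback" } },
  Db: { Properties: { DeletionProtection: true } },
  Queue: {},
});
console.log(lintResult);
```

Under the same assumption, a global cap (default 10) would sum `justified.length` across all templates.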

Phase 3 — smoke rails
- playwright.config.ts: new 'smoke' project gated on
  PLAYWRIGHT_SUITE=smoke.
- tests/smoke/fixtures/cfn-outputs.ts: SDK-v3 DescribeStacks helper.
  Sensitive output values flow only via explicit sensitiveValue()
  accessor; toString / inspect / Symbol.toPrimitive emit REDACTED
  placeholder. Documents the CloudFormation-API limitation that Output
  Metadata.Sensitive opt-in isn't readable (regex is the sole signal).
- tests/smoke/fixtures/assertion-bar.ts: 17 AssertionBarRow entries
  populated.
- tests/smoke/fixtures/secure-form.ts: fillPassword wrapper redacts
  form-encoded passwords from Playwright trace.
- scripts/smoke.sh + .env.example: local + CI identical invocation.
- .github/workflows/smoke.yml: trigger matrix (PR-scoped / nightly
  cron / push-to-main / workflow_dispatch); scope decides full vs
  scoped from changed paths; global serial concurrency (no
  cancel-in-progress — cancelled runs leave orphan AWS state);
  configure-aws-credentials with role-duration-seconds=21600 (6h)
  to match the role's max-session-duration; pre-deploy state check
  with auto-recovery for stranded stacks; SCP drift check
  (excluding FullAWSAccess, fail-soft for first 7 detections);
  quarantine-expiry check; CFN events captured BEFORE teardown;
  teardown with 3x60s retry, gated on aws-creds outcome so we don't
  burn 3min retrying without credentials.
- .github/workflows/quarterly-audit.yml: 3-monthly tracking issue
  (spend, orphan sweep, deploy-role policy drift, SCP drift, Renovate
  liveness, ProtectISB-fallback revisit).
- .github/CODEOWNERS: smoke-pack sensitive paths require @chrisns
  review (until a maintainers team is provisioned).
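
The sensitive-output contract described for cfn-outputs.ts can be sketched as below. This is a minimal model under stated assumptions: the key pattern, the `wrapOutput` helper and the exact placeholder string are hypothetical, and the real fixture also reads outputs via DescribeStacks, which is omitted here.

```typescript
// Hypothetical pattern; the real SENSITIVE_KEY_PATTERN lives in the fixture.
const SENSITIVE_KEY_PATTERN = /password|secret|token|apikey|login/i;

// Raw secrets are kept off the instance so inspecting the object shows nothing.
const rawSecrets = new WeakMap<SensitiveValue, string>();

class SensitiveValue {
  constructor(raw: string) {
    rawSecrets.set(this, raw);
  }
  // The only way to read the secret: callers must opt in explicitly.
  sensitiveValue(): string {
    return rawSecrets.get(this) ?? "";
  }
  toString(): string {
    return "REDACTED";
  }
  toJSON(): string {
    return "REDACTED";
  }
  [Symbol.toPrimitive](): string {
    return "REDACTED";
  }
}

// Wrap an output value when its key looks sensitive; pass it through otherwise.
function wrapOutput(key: string, value: string): string | SensitiveValue {
  return SENSITIVE_KEY_PATTERN.test(key) ? new SensitiveValue(value) : value;
}

const adminPw = wrapOutput("AdminPassword", "hunter2");
const siteUrl = wrapOutput("SiteUrl", "https://example.test");
console.log(`${adminPw}`, JSON.stringify({ adminPw }), String(siteUrl));
```

Accidental stringification (template literals, JSON, concatenation) yields the placeholder, while `adminPw.sensitiveValue()` is the deliberate escape hatch.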

Phase 4 — 17 per-scenario smoke specs
- One spec per scenario covering the auth-mode pattern (admin-login /
  public / sso-skip / umbrella). Bug-informed feature flows cite the
  historical regression that informed each test:
  - fixmystreet: /reports requires bin/update-all-reports; /admin
    must reach the dashboard without 2FA redirect
  - planx: SPA boots free of domain-allowlist / Airbrake errors;
    Hasura native /v1/version responds (Caddy elimination)
  - minute: magic-link sets cookie; same-origin fetch() works
    post-auth; /api/proxy/healthcheck reaches the backend (catches
    the basic-auth-breaks-fetch() regression and the ALB /api/*
    interception regression)
  - localgov-ims: Windows IIS multi-site routing; AdminPassword
    must not be the literal {{resolve:...}} token (catches the
    Lambda-custom-resource regression)
  - localgov-drupal: ndx_aws_ai module boots without Bedrock
    AccessDeniedException
  - simply-readable: SPA loads, credentials non-empty + non-token;
    reload produces no 5xx responses (catches BlueprintsBucketName
    mis-wire)
  - ai-contact-centre: PSTN claim matches UK toll-free / landline OR
    US toll-free (catches international fallback regression)
  - paperless-ngx: /documents view + /api/documents/ respond (S3
    Files mount integrity)
  - bops-planning: post-login URL is NOT on the Applicants port
    (catches the routing.rb single-tenant override regression)
  - digital-planning-register: register loads with planning markers
  - public-Lambda scenarios (foi-redaction, planning-ai, smart-car-park,
    text-to-speech, council-chatbot): FunctionURL not-5xx + not-403
    (catches the InvokeFunctionUrl + InvokeFunction dual-permission
    regression); council-chatbot uses POST not GET so the test isn't
    vacuous against a POST-only Lambda
  - quicksight-dashboard: landing + outputs only (sso-skip per auth-
    mode categorisation)
  - all-demo: discovers Output keys dynamically by parsing the
    committed template at test time; asserts every Output present,
    non-empty, and not the {{resolve:...}} literal; URL outputs match
    https?://

Phase 5 — pin every floating image tag
- 10 own-GHCR images (fixmystreet, localgov_drupal, minute_*, planx-*,
  dpr) pinned to sha-<7chars>@sha256:<digest>.
- 2 upstream images (docker.io/apache/tika 3.3.0.0-full,
  ghcr.io/paperless-ngx/paperless-ngx 2.9) pinned to <tag>@sha256:<digest>.
- Removed legacy cloudformation/scenarios/minute/template.json (stale
  ECR references; nothing in the repo referenced it).
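
For reference, the two pin shapes can be expressed as regular expressions. This is a sketch only: the patterns and the example image names are assumptions, not the regex used by renovate.json's customManagers.

```typescript
// Own-GHCR pins: <image>:sha-<7 hex chars>@sha256:<64 hex digest>
const OWN_PIN = /^[\w.\-\/]+:sha-[0-9a-f]{7}@sha256:[0-9a-f]{64}$/;
// Upstream pins: <image>:<tag>@sha256:<64 hex digest>
const UPSTREAM_PIN = /^[\w.\-\/]+:[\w.\-]+@sha256:[0-9a-f]{64}$/;

type PinKind = "own" | "upstream" | "floating";

// Check the stricter own-image shape first, since it is also a valid upstream shape.
function classifyPin(ref: string): PinKind {
  if (OWN_PIN.test(ref)) return "own";
  if (UPSTREAM_PIN.test(ref)) return "upstream";
  return "floating";
}

const fakeDigest = "a".repeat(64); // placeholder digest, not a real image
const pinKinds = [
  classifyPin(`ghcr.io/example/app:sha-0123abc@sha256:${fakeDigest}`),
  classifyPin(`docker.io/apache/tika:3.3.0.0-full@sha256:${fakeDigest}`),
  classifyPin("ghcr.io/example/app:latest"),
];
console.log(pinKinds);
```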

Phase 6 — Renovate adoption (replaces Dependabot)
- renovate.json: 6 group rules per the spec's pinning-strategy table;
  customManagers regex matching the new pin shape; osvVulnerabilityAlerts
  + security-priority group; pinDigests scoped to official actions/* +
  aws-actions/* only so the first run doesn't firehose; per-PR limits
  capped at 6.
- .github/workflows/renovate.yml: twice-daily + workflow_dispatch.
  Action pinned by digest to v46.1.14.
- .github/dependabot.yml deleted.

Operator follow-ups (not in this PR)
====================================
- NAP-548: migrate scenarios off legacy claude-3-haiku-20240307
- NAP-549: revisit ProtectISB fallback by 2026-11-12
- NAP-550: service-quota Console requests
- NAP-551: QuickSight subscription decision
- NAP-552: mint RENOVATE_TOKEN repo secret
- NAP-554: close in-flight Dependabot PRs
- NAP-555: T2b.5b + T3.8 end-to-end verifications

Closes: #226, #227, #228, #229, #230, #231, #232, #233, #235.
… issues

Smoke pack:
- 17 per-scenario .spec.ts files collapsed to 17 cloudformation/scenarios/*/smoke.ts
  configs (avg 25 lines, down from ~70) driven by a single tests/smoke/runner.ts
- adminLogin() helper + get()/getSecret() accessors absorb login-form boilerplate
- assertion-bar.ts removed; quarantine state lives in each scenario's smoke.ts
- Playwright smoke project's testDir points at cloudformation/scenarios with
  testMatch /[^/]+\/smoke\.ts$/ so new scenarios auto-discover
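
To see why new scenarios auto-discover, the testMatch pattern can be applied to a few candidate paths. The path list below is illustrative; Playwright performs the real discovery under testDir.

```typescript
// The pattern from the smoke project: any file named exactly smoke.ts inside a directory.
const testMatch = /[^/]+\/smoke\.ts$/;

const candidatePaths = [
  "cloudformation/scenarios/planx/smoke.ts",
  "cloudformation/scenarios/planx/template.yaml",
  "cloudformation/scenarios/minute/smoke.ts",
  "cloudformation/scenarios/minute/smoke.test.ts",
  "tests/smoke/runner.ts",
];

// Only the per-scenario smoke.ts files survive the filter.
const discovered = candidatePaths.filter((p) => testMatch.test(p));
console.log(discovered);
```

Adding a directory under cloudformation/scenarios with its own smoke.ts is enough for the file to match; no central list needs updating.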

Workflows (1255 → 601 lines):
- smoke.yml 467 → 204: scope decision, pre-deploy state, SCP drift,
  capture-events, teardown extracted to scripts/smoke-*.sh
- deploy-blueprints.yml 726 → 354: 9 per-scenario synth jobs collapsed to a
  single matrix; CDK strip + SAM-bucket verify extracted to scripts/
- renovate.yml 62 → 43: narrative stripped (PAT kept per operator decision)

Pre-existing CI fixes:
- LocalGov IMS: dropped minLength:1 on GovUkPayApiKey so empty values are
  accepted (smoke deploys don't have a real key); doc clarified
- Retention lint: added DeletionPolicy/UpdateReplacePolicy=Snapshot to the
  rule set so the central lint matches the prior inline check

CODEOWNERS: @co-cddo/ndx (was @chrisns); broadened patterns to cover the new
scripts and the per-scenario smoke.ts files.

Operator note: IsbHubStack was UPDATE_ROLLBACK_FAILED for 12 days; recovered
via continue-update-rollback. The 6 orphan StackSets (bops-planning,
digital-planning-register, fixmystreet, minute, paperless-ngx, planx) were
imported back into IsbHubStack via change-set-type IMPORT — IDs preserved, no
ISB lease bindings broken.
@chrisns chrisns had a problem deploying to smoke-test-deploy May 13, 2026 16:07 — with GitHub Actions Failure
@chrisns chrisns had a problem deploying to smoke-test-deploy May 14, 2026 08:34 — with GitHub Actions Failure
…leanup works

The deploy role's iam:DeleteRolePolicy was gated to
arn:aws:iam::*:role/InnovationSandbox-ndx-* — but scenario templates use
their own role naming (ndx-try-foi-role-*, etc.). Every smoke teardown left
those roles stranded, putting the all-demo stack in DELETE_FAILED and
poisoning the next run via stale AppRegistry applications.

Resource broadened to arn:aws:iam::*:role/*. SCPs still constrain what the
deploy role can actually do in practice; this just makes its identity-based
policy permissive enough to manage scenario IAM. Live policy already
updated; this commit syncs the runbook.
@chrisns chrisns had a problem deploying to smoke-test-deploy May 14, 2026 08:48 — with GitHub Actions Failure
- smoke-capture-events.sh: nested-stack PhysicalResourceId is a full
  CloudFormation ARN with colons. The artefact uploader rejected those.
  Strip to the short stack name and sanitise the rest. The prior recursion
  added the broken filenames; without this, the artefact bundle never
  uploads when there are nested stacks (= every smoke failure).

- runbook IAM: replace efs:* with elasticfilesystem:* (efs isn't a real
  IAM action prefix), drop aurora:* (covered by rds:*), and add
  servicediscovery:* (Minute's PrivateDnsNamespace creation needs it).
  Live policy already updated.
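
The filename fix in smoke-capture-events.sh can be illustrated as follows. The real script is shell; this TypeScript sketch only models the transformation, and the helper name, sample ARN and allowed-character set are assumptions.

```typescript
// A nested-stack PhysicalResourceId has the form:
//   arn:aws:cloudformation:<region>:<account>:stack/<stack-name>/<uuid>
function artefactName(physicalResourceId: string): string {
  // Pull out the short stack name when the value is a stack ARN.
  const match = /:stack\/([^/]+)\//.exec(physicalResourceId);
  const shortName = match?.[1] ?? physicalResourceId;
  // Replace anything outside a conservative filename alphabet (colons included).
  return shortName.replace(/[^A-Za-z0-9._-]/g, "_");
}

// Placeholder ARN with a made-up account and UUID.
const fromArn = artefactName(
  "arn:aws:cloudformation:eu-west-2:123456789012:stack/all-demo-Planx-ABC123/0a1b2c3d-0000-0000-0000-000000000000",
);
const fromPlain = artefactName("my stack:1");
console.log(fromArn, fromPlain);
```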

Pre-existing issue not fixed yet: VPC quota in the smoke account is 5 (AWS
default) but all-demo needs ~9 simultaneous VPCs. Service quota increase
to 20 requested — AWS-side ticket pending.
@chrisns chrisns had a problem deploying to smoke-test-deploy May 14, 2026 09:05 — with GitHub Actions Failure
@chrisns chrisns had a problem deploying to smoke-test-deploy May 14, 2026 09:25 — with GitHub Actions Failure
…ck to *

Smoke deploys uncovered missing services after VPC quota cleared:
- wisdom:*           (Amazon Q in Connect — ai-contact-centre)
- s3vectors:*        (S3 Vector Buckets — council-chatbot, ai-contact-centre KBs)
- cognito-idp:*      (simply-readable user pool)
- cognito-identity:* (paired with idp)
- iot:*              (smart-car-park IoT Things)
- bedrock:*          (was an explicit-action list; Guardrails not covered)

Live policy already updated.
@chrisns chrisns had a problem deploying to smoke-test-deploy May 14, 2026 09:49 — with GitHub Actions Failure
Curating service-by-service IAM allow-lists for the SmokeTestDeployRole
proved fragile — every new scenario surfaced another missing permission
(wisdom, s3vectors, appsync, s3files, iot, cognito-*, servicediscovery,
elasticfilesystem...). The inline policy can't ergonomically keep up.

PowerUserAccess allows every AWS service except IAM, Organizations and
account management. The custom inline SmokeTestDeployInline still
constrains IAM specifically. The Restrictions
SCP attached to the smoke account remains in force as the outer guard.
Net effect: same authorisation envelope, far less maintenance.

Live role updated.
@chrisns chrisns had a problem deploying to smoke-test-deploy May 14, 2026 10:04 — with GitHub Actions Failure
Connect releases a phone number on stack-delete; the released number is held
in a 30-day cooldown and consumes UK DID claim quota during that window.
Long-lived smoke deploys would exhaust the quota in days.

ai-contact-centre/template.yaml gains two opt-in parameters:
  ExistingPhoneNumberArn  ARN of a pre-claimed number, OR '' to claim a new DID
  ExistingPhoneNumber     the dialable E.164 string for that ARN

Both default to '' → ClaimNewPhoneNumber=true → behaviour identical to today
for ISB pool deploys (StackSet doesn't override the defaults). When set, the
GeoNumber resource and the GeoFlowAssoc custom resource are skipped; the
ExistingPhoneNumber is surfaced as the PstnNumber output for the smoke
regex check.
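
The effect of the two parameters can be modelled as a small decision function. This is a sketch of the condition logic only, not the CloudFormation template itself: the type names, the `planPhone` helper and the sample ARN are hypothetical.

```typescript
interface PhoneParams {
  ExistingPhoneNumberArn: string; // '' means "claim a new DID"
  ExistingPhoneNumber: string;    // dialable E.164 string for that ARN
}

interface PhonePlan {
  claimNewPhoneNumber: boolean;
  createsGeoNumber: boolean;
  createsGeoFlowAssoc: boolean;
  pstnNumberOutput: string;
}

function planPhone(params: PhoneParams): PhonePlan {
  // Mirrors the ClaimNewPhoneNumber condition: true only when no ARN is supplied.
  const claim = params.ExistingPhoneNumberArn === "";
  return {
    claimNewPhoneNumber: claim,
    createsGeoNumber: claim,
    createsGeoFlowAssoc: claim,
    // Placeholder text stands in for the number Connect would assign at deploy time.
    pstnNumberOutput: claim ? "<claimed at deploy time>" : params.ExistingPhoneNumber,
  };
}

// ISB pool deploy: defaults untouched, so a new number is claimed.
const poolPlan = planPhone({ ExistingPhoneNumberArn: "", ExistingPhoneNumber: "" });
// Smoke deploy: both values supplied (placeholder ARN), so nothing is claimed.
const smokePlan = planPhone({
  ExistingPhoneNumberArn: "arn:aws:connect:eu-west-2:123456789012:phone-number/EXAMPLE",
  ExistingPhoneNumber: "+442012345678",
});
console.log(poolPlan, smokePlan);
```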

all-demo umbrella plumbs both values through (AiccExistingPhoneNumberArn,
AiccExistingPhoneNumber). smoke.yml reads from
docs/smoke-test-account-config.yml and passes via --parameter-overrides.

Config holds placeholders today; ai-contact-centre's smoke spec is
quarantined until the operator completes runbook Step 13 (one-time: create a
holder Connect instance, claim a number against it, record the values).
That step needs to run as the SmokeTestDeployRole because the Restrictions
SCP blocks connect:CreateInstance from non-InnovationSandbox-ndx-* principals.
@chrisns chrisns had a problem deploying to smoke-test-deploy May 14, 2026 10:55 — with GitHub Actions Failure
@chrisns chrisns had a problem deploying to smoke-test-deploy May 14, 2026 11:03 — with GitHub Actions Failure
A single flaky scenario (Planx Hasura ECS circuit-breaker, etc.) shouldn't
unwind 16 healthy nested stacks — Aurora cold-start, ALB warm-up and
similar make full rebuilds slow and expensive.

aws cloudformation deploy now passes --disable-rollback. On CREATE failure
the umbrella stays in CREATE_FAILED with successful child stacks intact.
The next run's pre-deploy state check recognises CREATE_FAILED and proceeds
to update-stack against the same name; CFN's update-stack handles
CREATE_FAILED stacks (since ~2020) by replacing only the failed resources.

Matches the established fix-forward pattern (memory:feedback_cfn_fix_forward_failed_stack).
@chrisns chrisns had a problem deploying to smoke-test-deploy May 14, 2026 11:33 — with GitHub Actions Failure
…d state)

Pairs with --disable-rollback. Previously the teardown step ran on every
'always()' path including failures, which wiped the CREATE_FAILED state we
want to keep so the next run can update-stack-fix-forward instead of
rebuilding 16 healthy nested stacks.

Now: teardown runs when the job succeeded OR when the event is the nightly
schedule. PR failures leave the stack in CREATE_FAILED; the next push picks
up where the last attempt fell over. Nightly cron still cleans up so the
smoke account doesn't accumulate debris over weeks.
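
The teardown gate reduces to a two-input decision. A minimal sketch, assuming the job outcome and the GitHub event name are the only inputs (the real condition lives in the workflow's `if:` expression):

```typescript
type JobStatus = "success" | "failure" | "cancelled";

// Tear down on success, or unconditionally on the nightly schedule.
function shouldTeardown(jobStatus: JobStatus, eventName: string): boolean {
  return jobStatus === "success" || eventName === "schedule";
}

const teardownCases = {
  prSuccess: shouldTeardown("success", "pull_request"),
  prFailure: shouldTeardown("failure", "pull_request"), // stack is left for fix-forward
  nightlyFailure: shouldTeardown("failure", "schedule"), // nightly still cleans up
};
console.log(teardownCases);
```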
@chrisns chrisns had a problem deploying to smoke-test-deploy May 14, 2026 11:48 — with GitHub Actions Failure
LogGroups with fixed names declared in CFN race the AWS-Lambda implicit
LogGroup creator. When the explicit one fails to clean up (rollback gap,
race, etc.) the orphan blocks the next deploy with AlreadyExists. With
fix-forward (--disable-rollback) we re-attempt against the same name on
every run, so the orphans recur.

Proactive prune in pre-deploy: best-effort delete of the known set
(/ndx-bops/production and the ndx-* Lambda log groups). Failures are
tolerated, since on most runs the log group won't exist. The 6 sequential
calls add ~3s to pre-deploy.
@chrisns chrisns had a problem deploying to smoke-test-deploy May 14, 2026 12:26 — with GitHub Actions Failure
CREATE_FAILED was the only fix-forward path. Once the umbrella stack
exists, subsequent failures land in UPDATE_FAILED, not CREATE_FAILED. CFN's
update-stack accepts both as starting states, so we should too.
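
The widened pre-deploy decision can be sketched as below. Only the CREATE_FAILED / UPDATE_FAILED handling comes from this commit; the function name and the treatment of every other state are assumptions.

```typescript
// States from which the next run updates in place instead of rebuilding.
const FIX_FORWARD_STATES = new Set(["CREATE_FAILED", "UPDATE_FAILED"]);

type NextAction = "create" | "fix-forward-update" | "other-handling";

// stackStatus is undefined when the umbrella stack does not exist yet.
function nextAction(stackStatus: string | undefined): NextAction {
  if (stackStatus === undefined) return "create";
  if (FIX_FORWARD_STATES.has(stackStatus)) return "fix-forward-update";
  // Anything else (healthy or stranded states) is outside what this commit describes.
  return "other-handling";
}

const actions = [
  nextAction(undefined),
  nextAction("CREATE_FAILED"),
  nextAction("UPDATE_FAILED"),
  nextAction("ROLLBACK_COMPLETE"),
];
console.log(actions);
```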
@chrisns chrisns had a problem deploying to smoke-test-deploy May 14, 2026 12:43 — with GitHub Actions Failure
@chrisns chrisns had a problem deploying to smoke-test-deploy May 14, 2026 12:54 — with GitHub Actions Failure
@chrisns chrisns had a problem deploying to smoke-test-deploy May 14, 2026 13:14 — with GitHub Actions Failure
All four planx custom images (hasura, api, editor, sharedb) are amd64-only.
Fargate task definitions specifying ARM64 fail with CannotPullContainerError:
'image Manifest does not contain descriptor matching platform linux/arm64 v8'.

ARM64 was selected for cost (~20% cheaper) but the image pipeline produces
single-arch amd64 builds; the manifest list has no arm64 entry. Restoring
ARM64 needs docker buildx multi-arch builds in the planx image CI.
@chrisns chrisns had a problem deploying to smoke-test-deploy May 14, 2026 13:53 — with GitHub Actions Failure
The smoke account's Restrictions SCP blocks connect:CreateInstance from
any principal whose name doesn't start with InnovationSandbox-ndx-*, so
operator-side AWS sessions (SSO admin etc.) can't run the setup directly.
This workflow runs as the SmokeTestDeployRole (same OIDC trust as smoke.yml)
and is therefore allowed.

The script is idempotent: re-running reuses an existing holder + number.

Manually triggered (workflow_dispatch). After it runs, paste the printed
values into docs/smoke-test-account-config.yml so every subsequent smoke
deploy reuses the same number — avoids the 30-day release-cooldown that
exhausts UK DID claim quota.
@chrisns chrisns temporarily deployed to smoke-test-deploy May 17, 2026 08:21 — with GitHub Actions Inactive
…umberArn)

UK DID claim quota is 5 per 30 days. Sustained smoke runs would exhaust it.
Pass a non-empty placeholder ExistingPhoneNumberArn so the AICC template's
ClaimNewPhoneNumber condition is false: no GeoNumber, no GeoFlowAssoc, no
release-on-teardown. AICC's ConnectInstance + Lex + Wisdom + KB + companion
UI still deploy for real; smoke checks the companion URL's HTTP status and
that PstnNumber matches a +44 format string ("+442012345678" satisfies the
regex).

Holder Connect instance deleted (the 1 quota slot is now AICC's). Reverts the
DeployAiContactCentre gate that was a stopgap while the holder existed.
@chrisns chrisns temporarily deployed to smoke-test-deploy May 18, 2026 09:44 — with GitHub Actions Inactive
UK DID claim quota is 5/30 days so smoke never claims a real number. The
holder/pre-claim flow that smoke-pstn-setup.{yml,sh} implemented is no longer
used — a non-empty DUMMY ExistingPhoneNumberArn keeps AICC's ClaimNewPhoneNumber
condition false, so the template skips GeoNumber + GeoFlowAssoc entirely.
AICC's ConnectInstance + Lex + Wisdom + KB + companion UI still deploy for
real and exercise the rest of the scenario.

- delete .github/workflows/smoke-pstn-setup.yml
- delete scripts/smoke-pstn-setup.sh
- runbook Step 13 rewritten to describe the dummy-PSTN approach (no operator
  setup required)
- config comments rewritten to reflect the new model
@chrisns chrisns temporarily deployed to smoke-test-deploy May 18, 2026 10:13 — with GitHub Actions Inactive
The S3 templates carry several patches applied by hand during yesterday's
smoke debugging. Unless the same changes land in the CDK source and the
committed templates, the next deploy-blueprints run would silently revert
every fix and reintroduce the orphan-name collisions.

Sources brought in line with S3:
- paperless-ngx: bucket name `ndx-try-paperless-archive-v2-…` (orphan v1
  FS in smoke account holds the un-suffixed name hostage)
- bops-planning: 8 fixed names suffixed with `-v2-` (cluster, ALB, services,
  roles, VPC, SGs, LG, bucket-empty role) — orphan stack stuck in
  UPDATE_ROLLBACK_COMPLETE_CLEANUP holds the un-suffixed set
- council-chatbot, foi-redaction, planning-ai, text-to-speech,
  quicksight-dashboard, smart-car-park: LogGroupName suffixed with
  `-${AWS::StackName}` so an orphan LG from a previous rollback doesn't
  collide on AlreadyExists
- minute, fixmystreet, localgov-drupal, localgov-ims, planx,
  digital-planning-register, paperless-ngx: same stack-name suffix on the
  CDK LogGroup name + its console-link CloudWatchLogsUrl output

Stale-comment cleanups:
- scripts/smoke-pre-deploy-state.sh: drop references to --disable-rollback
  (removed in 553a556)
- cloudformation/scenarios/all-demo/smoke.ts: drop dead Condition-skip
  peek-loop (every Condition was removed in eab2a02)
- cloudformation/scenarios/{ai-contact-centre,all-demo}/template.yaml:
  rewrite ExistingPhoneNumberArn param descriptions; drop "holder" sentence
  on GeoFlowAssoc (holder doesn't exist; smoke passes DUMMY)
- .github/workflows/smoke.yml: tighten the SmokeRun-tag comment
- tests/smoke/fixtures/cfn-outputs.ts: explain why `Login` is in the
  SENSITIVE_KEY_PATTERN (LoginUrl can carry pre-auth tokens)
- tests/smoke/fixtures/secure-form.ts: drop the 10s consumed-poll (smoke
  specs always click submit AFTER fillPassword returns, so the original
  unroute-before-submit semantics couldn't redact anything; leave the
  routeHandler armed instead so the first post-fill POST is rewritten)

bops-planning-stack.test.ts updated to expect the new -v2 names.
@chrisns chrisns had a problem deploying to smoke-test-deploy May 18, 2026 10:45 — with GitHub Actions Failure
…iants

The .json was an accidental check-in from yesterday's CDK exploration.
The deployed template is template.yaml; the .json was dead.
@chrisns chrisns had a problem deploying to smoke-test-deploy May 18, 2026 10:52 — with GitHub Actions Failure
The route handler rewrote the form-encoded POST body to "REDACTED-<hash>"
to keep cleartext out of the Playwright trace, but `route.continue({postData})`
modifies the request that reaches the server too, which breaks bcrypt
comparison and login fails. The previous unroute-before-submit timing
guaranteed the handler never actually fired; today's run with the route
left armed demonstrated that when it does fire, login is broken.

Simplify to a plain `page.fill` wrapper. The SensitiveValue contract via
.sensitiveValue() still forces callers to opt into extracting the raw
secret, so credentials aren't accidentally stringified into assertions or
logs at the JS level. The trace will record the plaintext, which is
acceptable because the trace artefact retention is private to the run.
@chrisns chrisns temporarily deployed to smoke-test-deploy May 18, 2026 11:01 — with GitHub Actions Inactive
chrisns added 2 commits May 18, 2026 13:48
- tests/smoke/fixtures/assertion-bar.ts: 17 AssertionBarRow entries (one per
  scenario incl. all-demo umbrella) indexing what each spec asserts and the
  historical regression that motivated it. Smoke specs remain the source of
  truth; this is the reviewer-facing index.
- .github/workflows/smoke.yml: scope job emits `override=true` when the PR
  carries the `smoke-override-emergency` label; smoke job's `if:` skips
  when override is active, so the gate clears. CODEOWNERS approval is
  enforced by repo branch-protection (out of band).
- .github/workflows/smoke-override-followup.yml: hourly cron opens a
  `smoke-override-followup` issue 48h after the merge so the underlying
  regression doesn't get forgotten. Idempotent on PR number.
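
The override gate and the follow-up timing can be sketched as two small predicates. The label name comes from the commit message; the function names and the `alreadyOpened` flag (standing in for the idempotency check on PR number) are assumptions.

```typescript
const OVERRIDE_LABEL = "smoke-override-emergency";
const FOLLOWUP_DELAY_MS = 48 * 60 * 60 * 1000; // 48 hours

// The scope job would emit override=true when this returns true.
function overrideActive(prLabels: string[]): boolean {
  return prLabels.includes(OVERRIDE_LABEL);
}

// The hourly cron opens a follow-up once 48h have passed since merge, and only once.
function followupDue(mergedAt: Date, now: Date, alreadyOpened: boolean): boolean {
  return !alreadyOpened && now.getTime() - mergedAt.getTime() >= FOLLOWUP_DELAY_MS;
}

const mergedAt = new Date("2026-05-18T15:00:00Z");
const overrideChecks = {
  labelled: overrideActive(["bug", OVERRIDE_LABEL]),
  unlabelled: overrideActive(["bug"]),
  after47h: followupDue(mergedAt, new Date("2026-05-20T14:00:00Z"), false),
  after48h: followupDue(mergedAt, new Date("2026-05-20T15:00:00Z"), false),
  after48hOpened: followupDue(mergedAt, new Date("2026-05-20T15:00:00Z"), true),
};
console.log(overrideChecks);
```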
@chrisns chrisns temporarily deployed to smoke-test-deploy May 18, 2026 12:51 — with GitHub Actions Inactive
Each scenario now asserts something beyond "landing returned 200 / login form
submitted". Specifically the new assertions surface bug-shaped regressions that
the old surface-only probes would have missed.

Scenario depth-adds (one bullet per spec; the spec is the source of truth):

- ai-contact-centre: companion SPA renders the seeded Aldershire welcome +
  the Ask/Call-via-browser entry points (Connect bootstrap + SPA assets)
- bops-planning: post-login dashboard has the 5-tab nav; /planning_applications/all
  lists seeded applications by numeric-id href (seed_sample_data.rb regression)
- council-chatbot: parse the JSON response, assert success/citations/model
  fields, then a 2nd POST with the returned sessionId round-trips (session
  store regression)
- digital-planning-register: council selector renders > 0 council links;
  drilling into a council shows "Recently published applications" list with
  application-ref links (planning-data API regression)
- fixmystreet: /reports dashboard shows > 0 reports across categories
  (seed/DB/cron); /admin/reports moderation queue has report links
- foi-redaction: POST sample text with full PII set; assert redactionCount > 0
  AND original strings are gone AND NAME+EMAIL entities detected
- localgov-drupal: /admin/modules confirms ndx_aws_ai + ndx_council_generator
  modules enabled (Bedrock IAM regression silently disables them); /admin/content
  has seeded demo content
- localgov-ims: post-login dashboard shows IIS nav (Dashboard/Transactions/
  Payment/Users); /Payment/Create renders the GOV.UK Pay payment basket form
- minute: landing shows "AI transcription and drafting service"; /templates
  lists the seeded Document + Form template types
- paperless-ngx: /api/documents/?page=1 parsed; count > 0, first doc has
  > 50 chars of OCR content, at least one doc has an "AI summary" note
  (Bedrock post-consume hook regression)
- planning-ai: POST {useSample:true} triggers full Textract+Bedrock pipeline;
  assert wordCount > 100, OCR confidence > 80%, AI extraction populates
  applicationRef + summary + classification
- planx: post-login editor renders team/flow links or > 5 interactive elements
  (seed migrations + API/Postgres reachability)
- quicksight-dashboard: portal URL returns < 400, body mentions QuickSight/Sign in;
  data bucket CSV path responds 200 or 403 (existence, not anon access)
- simply-readable: app redirects to Cognito hosted UI with username +
  password inputs visible; no 5xx on reload (BlueprintsBucketName regression)
- smart-car-park: dashboard shows Total Spaces > 0 (DynamoDB seed + aggregation);
  zone breakdown renders 3+ zone headings
- text-to-speech: POST {text,voice}, fetch returned signed audioUrl, assert
  content-type audio/mpeg + body > 1KB + MP3 magic byte at offset 0

Token-redaction in fillPassword stayed disabled (AC3.12 deviation documented
earlier). secure-form.ts already simplified.
@chrisns chrisns had a problem deploying to smoke-test-deploy May 18, 2026 13:40 — with GitHub Actions Failure
@chrisns chrisns had a problem deploying to smoke-test-deploy May 18, 2026 13:50 — with GitHub Actions Failure
@chrisns chrisns temporarily deployed to smoke-test-deploy May 18, 2026 13:57 — with GitHub Actions Inactive
The CI workflow's inline Dockerfile set VITE_APP_AIRBRAKE_PROJECT_ID=0
and VITE_APP_AIRBRAKE_PROJECT_KEY=unused. Both are truthy strings, so
upstream's `hasConfig` check passes and `new Notifier({projectId: 0,
projectKey: "unused"})` is called. Airbrake then requires projectId to be
truthy, the numeric 0 is falsy, and it throws "projectId and projectKey
are required", blanking the editor SPA.
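
The failure mode can be reproduced in isolation. This is a sketch with stand-ins: `StubNotifier` imitates only the validation described above and is not the Airbrake client, and the string-to-number conversion is an assumption inferred from the numeric `projectId: 0` in the call.

```typescript
// Hypothetical stand-in for the Notifier's constructor validation.
class StubNotifier {
  constructor(opts: { projectId: number; projectKey: string }) {
    if (!opts.projectId || !opts.projectKey) {
      throw new Error("projectId and projectKey are required");
    }
  }
}

function initAirbrake(env: Record<string, string | undefined>): string {
  const id = env["VITE_APP_AIRBRAKE_PROJECT_ID"];
  const key = env["VITE_APP_AIRBRAKE_PROJECT_KEY"];
  // String check: "0" and "unused" are non-empty strings, hence truthy.
  const hasConfig = Boolean(id && key);
  if (!hasConfig || id === undefined || key === undefined) return "skipped";
  try {
    // Number("0") is 0, which is falsy, so the validation rejects it.
    new StubNotifier({ projectId: Number(id), projectKey: key });
    return "initialised";
  } catch (e) {
    return `threw: ${(e as Error).message}`;
  }
}

const withPlaceholders = initAirbrake({
  VITE_APP_AIRBRAKE_PROJECT_ID: "0",
  VITE_APP_AIRBRAKE_PROJECT_KEY: "unused",
});
const withoutVars = initAirbrake({});
console.log(withPlaceholders, withoutVars);
```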

Fix: stop passing the env vars, and apply a build-time overlay that
replaces airbrake.ts with an unconditional no-op stub so the import
path is safe regardless of upstream drift. Same pattern as the existing
validateDomain overlay, now also applied in CI (previously only in the
local build.sh).

Also fixes VITE_APP_HASURA_URL drift between CI (/hasura/v1/graphql) and
the local Dockerfile (/v1/graphql) - CloudFront routes /v1/* and
/console/* directly to Hasura, no /hasura prefix.
@chrisns chrisns temporarily deployed to smoke-test-deploy May 18, 2026 14:22 — with GitHub Actions Inactive
@chrisns chrisns temporarily deployed to smoke-test-deploy May 18, 2026 14:29 — with GitHub Actions Inactive
CDK pin update for the new editor image (builds without
VITE_APP_AIRBRAKE_* and with the airbrake.ts no-op overlay).

Smoke spec tightened: previous version only asserted that the SPA bundle
was served by CloudFront because the React tree was crashing in init.
Now require the editor dashboard ("Select a team" heading + "My teams"
section + at least one team card link) to render, and fail if the
Airbrake bootstrap error message appears in the browser console.
@chrisns chrisns temporarily deployed to smoke-test-deploy May 18, 2026 14:54 — with GitHub Actions Inactive
@chrisns chrisns temporarily deployed to smoke-test-deploy May 18, 2026 14:58 — with GitHub Actions Inactive
@chrisns chrisns added this pull request to the merge queue May 18, 2026
Merged via the queue into main with commit 531ac0d May 18, 2026
21 checks passed
@chrisns chrisns deleted the refactor/smoke-pack-dryrun-and-fixes branch May 18, 2026 15:22