scenario-regression smoke pack (Phases 1-6 squashed) #236

Closed
chrisns wants to merge 1 commit into main from feat/smoke-pack

@chrisns (Member) commented May 12, 2026

Summary

Single PR replacing #226 / #227 / #228 / #229 / #230 / #231 / #232 / #233 / #235. Implements every phase of _bmad-output/implementation-artifacts/tech-spec-scenario-regression-smoke-pack.md plus reviewer-asked fixes. See the commit message on the head commit for the per-phase breakdown.

Architecture correction (vs earlier drafts)

Earlier drafts of this work added a hand-rolled aws s3 cp step in deploy-blueprints.yml to upload templates for what I called "non-StackSet scenarios" (planx, bops-planning, digital-planning-register, minute, fixmystreet, paperless-ngx). That was wrong. Every one of these scenarios already has an ACTIVE ISB StackSet in the hub account, structurally identical to the rest (same InnovationSandbox-ndx-IntermediateRole admin / InnovationSandbox-ndx-SandboxAccountRole execution roles). They were just orphaned from CDK ownership.

This PR puts them under CDK ownership:

  • cloudformation/isb-hub/lib/isb-hub-stack.ts SCENARIOS array now lists all 16 scenarios + all-demo
  • OSVectorTilesApiKey is plumbed through the hub like GovUkPayApiKey (for the bops-planning scenario)
  • all-demo declares parameterKeys: ['GovUkPayApiKey', 'OSVectorTilesApiKey'] so the umbrella forwards them to its nested children
  • The hand-rolled Upload non-StackSet scenario templates to S3 step is removed; all 17 scenarios go through the same BucketDeployment + CfnStackSet path

Operator action required BEFORE this PR's first post-merge deploy-blueprints run

The six newly included scenarios have existing ACTIVE StackSets in the hub (visible via aws cloudformation list-stack-sets --region us-west-2), so a CDK deploy that tries to create them would fail with AlreadyExists. Each must be imported into the isb-hub CFN stack via a one-off IMPORT change-set.

For each of ndx-try-minute, ndx-try-fixmystreet, ndx-try-paperless-ngx, ndx-try-planx, ndx-try-bops-planning, ndx-try-digital-planning-register:

# 1. cdk synth to produce the target template
cd cloudformation/isb-hub
npx cdk synth > /tmp/isb-hub-target.template.json

# 2. Construct the import map (resources-to-import file)
# This is JSON listing the StackSet logical IDs CDK gives them
# (e.g. `MinuteStackSet`) and the physical IDs in the hub
# (e.g. `ndx-try-minute:<existing-uuid>`).

# 3. Create an IMPORT change-set
aws cloudformation create-change-set \
  --stack-name isb-hub \
  --change-set-name import-orphan-stacksets-2026-05 \
  --change-set-type IMPORT \
  --resources-to-import file:///tmp/resources-to-import.json \
  --template-body file:///tmp/isb-hub-target.template.json \
  --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM CAPABILITY_AUTO_EXPAND \
  --region us-west-2 \
  --profile NDX/InnovationSandboxHub

# 4. Review the change-set, then execute

This is a one-off operator step. Once imported, subsequent deploy-blueprints runs work normally.
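
As a rough illustration of step 2, the resources-to-import file could be generated along these lines. This is a sketch, not the procedure actually used: `FixMyStreetStackSet` is a hypothetical logical ID (only `MinuteStackSet` appears above; the real IDs come from the cdk synth output), the `<existing-uuid>` placeholders are deliberately left unfilled, and it assumes StackSets are imported by their `StackSetId` identifier.

```typescript
// Shape of one entry in a CloudFormation --resources-to-import file.
interface ResourceToImport {
  ResourceType: string;
  LogicalResourceId: string;
  ResourceIdentifier: { [key: string]: string };
}

interface OrphanStackSet {
  logicalId: string; // logical ID CDK assigns in the synthesized template
  stackSetId: string; // physical ID in the hub: "<name>:<uuid>"
}

// Hypothetical entries: look up the real values before use.
const orphanStackSets: OrphanStackSet[] = [
  { logicalId: "MinuteStackSet", stackSetId: "ndx-try-minute:<existing-uuid>" },
  { logicalId: "FixMyStreetStackSet", stackSetId: "ndx-try-fixmystreet:<existing-uuid>" },
];

function buildImportMap(entries: OrphanStackSet[]): ResourceToImport[] {
  return entries.map((entry) => {
    // A StackSet ID has the form "<name>:<uuid>"; reject anything else early.
    if (entry.stackSetId.indexOf(":") < 0) {
      throw new Error("Not a StackSet ID: " + entry.stackSetId);
    }
    return {
      ResourceType: "AWS::CloudFormation::StackSet",
      LogicalResourceId: entry.logicalId,
      ResourceIdentifier: { StackSetId: entry.stackSetId },
    };
  });
}

// JSON to save as /tmp/resources-to-import.json for step 3.
const importMapJson = JSON.stringify(buildImportMap(orphanStackSets), null, 2);
```

The existing StackSet IDs can be read from aws cloudformation describe-stack-set for each of the six names.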

If you'd rather defer the import work, don't run deploy-blueprints post-merge until it's done: the CDK deploy will fail with AlreadyExists on the six StackSets and roll the whole isb-hub stack back.

What's in here per phase

| Phase | Deliverable |
|-------|-------------|
| 1a | Runbook (docs/smoke-test-account-setup.md) + config schema (docs/smoke-test-account-config.yml) |
| 1b | Smoke account provisioned (464453619983); config populated; deploy role + OIDC + 6h session |
| 2a | Synth pipelines for planx + dpr (and bops-planning, with justification-aware retention lint); ai-contact-centre packaged-CodeUri verification |
| 2b | all-demo expanded from 7 to 16 scenarios; scripts/lint-retention-policies.sh (per-template + global caps); bops-planning LogGroup justification |
| 3 | Smoke workflow + fixtures + entrypoint + CODEOWNERS |
| 4 | 17 per-scenario smoke specs with bug-informed feature flows + assertion-bar rows; all-demo umbrella spec discovers Outputs dynamically |
| 5 | Every :latest pinned (10 own + 2 upstream); legacy minute/template.json removed |
| 6 | Self-hosted Renovate replaces Dependabot |

Plus this revision: orphan StackSets adopted by CDK (see above).

Things to know before approving

  1. CI on smoke: the smoke workflow self-fires because we touched its trigger paths. Either the smoke-test-deploy environment gate blocks it pending CODEOWNERS approval, or it attempts a deploy. With the previous run's stranded ROLLBACK_COMPLETE stack now cleaned up + the workflow's pre-deploy state check fixed to handle that state, a fresh attempt should proceed further.

  2. CODEOWNERS is @chrisns: the co-cddo/ndx-try-maintainers team referenced in the spec doesn't exist. Once provisioned, swap the handles via a separate PR.

  3. Operator follow-ups tracked in Jira:

    • NAP-548 migrate scenarios off legacy claude-3-haiku-20240307 (chatbot / foi-redaction / planning-ai will fail smoke until done)
    • NAP-549 revisit ProtectISB fallback by 2026-11-12
    • NAP-550 service-quota Console requests
    • NAP-551 QuickSight Enterprise subscription decision
    • NAP-552 mint RENOVATE_TOKEN repo secret
    • NAP-554 close in-flight Dependabot PRs
    • NAP-555 T2b.5b real-deploy verification + T3.8 end-to-end smoke verification
    • NEW import the six orphan StackSets into isb-hub CDK ownership (this PR's prerequisite — file a ticket if not already)

  4. The runbook_version SHA in the config is informational, not enforced. The previous strict drift lint created a chicken-and-egg problem with squashed PRs, so the check now warns but doesn't fail.

  5. First Renovate run will fire 6+ PRs once RENOVATE_TOKEN is provisioned: pinDigests: true for actions/* + aws-actions/* will propose digest pins for the workflow action references currently at tag form.

Test plan

  • CI: smoke may go yellow (waiting for env approval) or attempt deploy; other checks should pass
  • Lint check passes locally: scripts/lint-retention-policies.sh cloudformation/scenarios/{council-chatbot,foi-redaction,planning-ai,smart-car-park,text-to-speech,quicksight-dashboard,ai-contact-centre,all-demo}/template.yaml
  • Playwright detects all 17 smoke specs: PLAYWRIGHT_SUITE=smoke npx playwright test --list --project=smoke → "Total: 17 tests in 17 files"
  • Operator: complete the StackSet IMPORT change-sets BEFORE the first post-merge deploy-blueprints run
  • Post-import + post-merge: gh workflow run smoke.yml triggers full smoke against the smoke account

Implements every phase of _bmad-output/implementation-artifacts/tech-spec-scenario-regression-smoke-pack.md
in a single deliverable. The original spec called for one PR per phase
(8+ PRs); experience showed the dependency overlap made that worse for
review, not better, so this squashes #226 / #227 / #228 / #229 / #230
/ #231 / #232 / #233 / #235 into a single change.

What ships
==========

Phase 1a — runbook + config schema
- docs/smoke-test-account-setup.md: one-off manual procedure for vending
  the long-lived smoke-test AWS account, with the four required sections
  (Prerequisites / Procedure / Verification / Operational Notes). Per-step
  idempotency checks + inverses; ProtectISB role-creation canary +
  fallback branch (ADR-1); Bedrock model-access enablement + gotchas
  (legacy claude-3-haiku-20240307 retired, Nova body shape); service-quota
  targets; QuickSight decision; iterate-to-least-privilege protocol for
  the inline IAM policy.
- docs/smoke-test-account-config.yml: post-runbook state record schema.

Phase 1b — operator-executed account state
- Smoke account 464453619983 provisioned in NDX org under the fallback
  branch (ProtectISB canary failed; account moved to root with
  Restrictions SCP attached directly). AwsNuke SCP intentionally NOT
  attached (it blocks sts:AssumeRoleWithWebIdentity and we use CFN
  delete + retention lint, not aws-nuke).
- OIDC provider + InnovationSandbox-ndx-SmokeTestDeployRole created
  with 6h max-session-duration. Trust policy uses sub-pattern lock
  (`repo:co-cddo/ndx_try_aws_scenarios:*`) + aud condition; the
  repository_owner claim condition is omitted because it reproducibly
  breaks the assume even though the OIDC token contains the claim
  (verified via JWT decode in an investigation workflow that has since
  been deleted; see runbook Step 10).
- expected_scps reflects live state: Restrictions + FullAWSAccess.
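
For orientation, the trust-policy shape described in Phase 1b looks roughly like the sketch below. It is not copied from the account: the `sts.amazonaws.com` audience is an assumption (it is the usual audience for aws-actions/configure-aws-credentials), and `buildTrustPolicy` is an illustrative helper name. The repository_owner condition is deliberately absent, matching the note above.

```typescript
// GitHub's OIDC issuer host, used both in the provider ARN and the claim keys.
const GITHUB_OIDC_HOST = "token.actions.githubusercontent.com";

function buildTrustPolicy(accountId: string, subPattern: string, audience: string) {
  // Exact-match condition on the aud claim (audience value is an assumption).
  const stringEquals: { [key: string]: string } = {};
  stringEquals[GITHUB_OIDC_HOST + ":aud"] = audience;

  // Wildcard condition locking the sub claim to one repository.
  const stringLike: { [key: string]: string } = {};
  stringLike[GITHUB_OIDC_HOST + ":sub"] = subPattern;

  return {
    Version: "2012-10-17",
    Statement: [
      {
        Effect: "Allow",
        Principal: {
          Federated: "arn:aws:iam::" + accountId + ":oidc-provider/" + GITHUB_OIDC_HOST,
        },
        Action: "sts:AssumeRoleWithWebIdentity",
        // No repository_owner condition: see the note above.
        Condition: { StringEquals: stringEquals, StringLike: stringLike },
      },
    ],
  };
}

const trustPolicy = buildTrustPolicy(
  "464453619983",
  "repo:co-cddo/ndx_try_aws_scenarios:*",
  "sts.amazonaws.com"
);
```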

Phase 2a — synth pipelines for missing scenarios
- New synth jobs in .github/workflows/deploy-blueprints.yml for planx
  and digital-planning-register (CDK -> template.yaml -> S3 via the
  existing isb-hub upload chain).
- bops-planning synth job lands in Phase 2b after the retention lint
  is justification-aware.
- ai-contact-centre: new "verify packaged CodeUri targets blueprints
  bucket" step catches a sam-package regression where --s3-bucket would
  silently land in the SAM default bucket.

Phase 2b — all-demo expansion + retention lint
- cloudformation/scenarios/all-demo/template.yaml expanded from 7 to 16
  nested scenarios (Minute, FixMyStreet, AI Contact Centre, LocalGov IMS,
  Paperless-ngx, PlanX, Bops Planning, Simply Readable, Digital Planning
  Register). Umbrella parameters for credentials (GovUkPayApiKey,
  OSVectorTilesApiKey, DprImageUri, DprCouncilConfig) with overridable
  empty / sensible defaults; per-scenario URL + admin-credential Outputs
  surfaced.
- scripts/lint-retention-policies.sh: forbids DeletionPolicy=Retain /
  UpdateReplacePolicy=Retain / Properties.DeletionProtection=true /
  Properties.EnableDeletionProtection=true /
  Properties.FinalSnapshotIdentifier unless the resource carries a
  non-empty Metadata.Justification. Per-template cap (default 3) +
  global cap (default 10) so any one scenario can't pencil-whip
  retentions repo-wide.
- lint-committed-templates job in deploy-blueprints.yml runs the lint
  over hand-authored CFN templates.
- bops-planning's LogGroup keeps RemovalPolicy.RETAIN (deliberate
  debug-after-rollback) with a Metadata.Justification attached via
  cfnOptions; bops synth job re-enabled.
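
The retention rule can be sketched as follows. The real check is a shell script; this TypeScript version only illustrates the logic on an already-parsed template. It assumes the caps count justified retentions (the text above does not say what exactly is counted), and all function names are illustrative.

```typescript
interface CfnResource {
  Type: string;
  DeletionPolicy?: string;
  UpdateReplacePolicy?: string;
  Properties?: { [key: string]: unknown };
  Metadata?: { [key: string]: unknown };
}

interface CfnTemplate {
  Resources: { [logicalId: string]: CfnResource };
}

interface LintResult {
  justified: string[]; // retaining resources with a non-empty Justification
  unjustified: string[]; // retaining resources without one
}

// True when the resource uses any of the five forbidden retention settings.
function retainsData(r: CfnResource): boolean {
  const p = r.Properties || {};
  return (
    r.DeletionPolicy === "Retain" ||
    r.UpdateReplacePolicy === "Retain" ||
    p["DeletionProtection"] === true ||
    p["EnableDeletionProtection"] === true ||
    p["FinalSnapshotIdentifier"] !== undefined
  );
}

function lintTemplate(t: CfnTemplate): LintResult {
  const result: LintResult = { justified: [], unjustified: [] };
  for (const id of Object.keys(t.Resources)) {
    const r = t.Resources[id];
    if (!retainsData(r)) continue;
    const j = (r.Metadata || {})["Justification"];
    const ok = typeof j === "string" && j.trim().length > 0;
    (ok ? result.justified : result.unjustified).push(id);
  }
  return result;
}

// Applies the per-template and global caps (assumed to count justified retentions).
function lintAll(
  templates: { [path: string]: CfnTemplate },
  perTemplateCap: number = 3,
  globalCap: number = 10
): string[] {
  const errors: string[] = [];
  let total = 0;
  for (const path of Object.keys(templates)) {
    const r = lintTemplate(templates[path]);
    for (const id of r.unjustified) {
      errors.push(path + ": " + id + " retains data without Metadata.Justification");
    }
    if (r.justified.length > perTemplateCap) {
      errors.push(path + ": justified retentions exceed per-template cap " + perTemplateCap);
    }
    total += r.justified.length;
  }
  if (total > globalCap) {
    errors.push("justified retentions exceed global cap " + globalCap);
  }
  return errors;
}
```

An empty array from `lintAll` means the templates pass.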

Phase 3 — smoke rails
- playwright.config.ts: new 'smoke' project gated on
  PLAYWRIGHT_SUITE=smoke.
- tests/smoke/fixtures/cfn-outputs.ts: SDK-v3 DescribeStacks helper.
  Sensitive output values flow only via explicit sensitiveValue()
  accessor; toString / inspect / Symbol.toPrimitive emit REDACTED
  placeholder. Documents the CloudFormation-API limitation that Output
  Metadata.Sensitive opt-in isn't readable (regex is the sole signal).
- tests/smoke/fixtures/assertion-bar.ts: 17 AssertionBarRow entries
  populated.
- tests/smoke/fixtures/secure-form.ts: fillPassword wrapper redacts
  form-encoded passwords from Playwright trace.
- scripts/smoke.sh + .env.example: local + CI identical invocation.
- .github/workflows/smoke.yml: trigger matrix (PR-scoped / nightly
  cron / push-to-main / workflow_dispatch); scope decides full vs
  scoped from changed paths; global serial concurrency (no
  cancel-in-progress — cancelled runs leave orphan AWS state);
  configure-aws-credentials with role-duration-seconds=21600 (6h)
  to match the role's max-session-duration; pre-deploy state check
  with auto-recovery for stranded stacks; SCP drift check
  (excluding FullAWSAccess, fail-soft for first 7 detections);
  quarantine-expiry check; CFN events captured BEFORE teardown;
  teardown with 3x60s retry, gated on aws-creds outcome so we don't
  burn 3min retrying without credentials.
- .github/workflows/quarterly-audit.yml: 3-monthly tracking issue
  (spend, orphan sweep, deploy-role policy drift, SCP drift, Renovate
  liveness, ProtectISB-fallback revisit).
- .github/CODEOWNERS: smoke-pack sensitive paths require @chrisns
  review (until a maintainers team is provisioned).
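
The redaction behaviour described for cfn-outputs.ts can be sketched like this. It is not the fixture's actual code: the class name, the `[REDACTED]` placeholder text, the key regex, and the extra `toJSON` guard are all assumptions made for illustration.

```typescript
const REDACTED = "[REDACTED]"; // placeholder text is an assumption

// Symbol that Node's util.inspect (and so console.log) looks up for custom rendering.
const NODE_INSPECT = Symbol.for("nodejs.util.inspect.custom");

// Hypothetical pattern: output keys matching it are treated as sensitive.
const SENSITIVE_KEY = /password|secret|token|apikey/i;

class SensitiveOutput {
  // The only way to read the secret: an explicit, greppable accessor.
  readonly sensitiveValue: () => string;

  constructor(secret: string) {
    // Keep the secret in a closure rather than a data property.
    this.sensitiveValue = () => secret;
    Object.defineProperty(this, NODE_INSPECT, {
      value: () => REDACTED,
      enumerable: false,
    });
  }

  toString(): string {
    return REDACTED;
  }

  // Not mentioned in the PR text; added here so JSON.stringify is covered too.
  toJSON(): string {
    return REDACTED;
  }

  [Symbol.toPrimitive](): string {
    return REDACTED;
  }
}

// Wraps a stack output when its key looks sensitive, otherwise returns it unchanged.
function wrapOutput(key: string, value: string): string | SensitiveOutput {
  return SENSITIVE_KEY.test(key) ? new SensitiveOutput(value) : value;
}
```

With this shape, string interpolation, logging, and serialisation of a wrapped output all yield the placeholder, and a test has to call `sensitiveValue()` deliberately to get the real value.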

Phase 4 — 17 per-scenario smoke specs
- One spec per scenario covering the auth-mode pattern (admin-login /
  public / sso-skip / umbrella). Bug-informed feature flows cite the
  historical regression that informed each test:
  - fixmystreet: /reports requires bin/update-all-reports; /admin
    must reach the dashboard without 2FA redirect
  - planx: SPA boots free of domain-allowlist / Airbrake errors;
    Hasura native /v1/version responds (Caddy elimination)
  - minute: magic-link sets cookie; same-origin fetch() works
    post-auth; /api/proxy/healthcheck reaches the backend (catches
    the basic-auth-breaks-fetch() regression and the ALB /api/*
    interception regression)
  - localgov-ims: Windows IIS multi-site routing; AdminPassword
    must not be the literal {{resolve:...}} token (catches the
    Lambda-custom-resource regression)
  - localgov-drupal: ndx_aws_ai module boots without Bedrock
    AccessDeniedException
  - simply-readable: SPA loads, credentials non-empty + non-token;
    reload produces no 5xx responses (catches BlueprintsBucketName
    mis-wire)
  - ai-contact-centre: PSTN claim matches UK toll-free / landline OR
    US toll-free (catches international fallback regression)
  - paperless-ngx: /documents view + /api/documents/ respond (S3
    Files mount integrity)
  - bops-planning: post-login URL is NOT on the Applicants port
    (catches the routing.rb single-tenant override regression)
  - digital-planning-register: register loads with planning markers
  - public-Lambda scenarios (foi-redaction, planning-ai, smart-car-park,
    text-to-speech, council-chatbot): FunctionURL not-5xx + not-403
    (catches the InvokeFunctionUrl + InvokeFunction dual-permission
    regression); council-chatbot uses POST not GET so the test isn't
    vacuous against a POST-only Lambda
  - quicksight-dashboard: landing + outputs only (sso-skip per auth-
    mode categorisation)
  - all-demo: discovers Output keys dynamically by parsing the
    committed template at test time; asserts every Output present,
    non-empty, and not the {{resolve:...}} literal; URL outputs match
    https?://
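
The all-demo output assertions amount to something like the sketch below. It assumes the expected keys have already been parsed from the committed template and the actual values fetched from the deployed stack; `validateOutputs` and the rule that URL outputs are identified by a key predicate are illustrative assumptions, not the spec's real code.

```typescript
// Matches an unresolved CloudFormation dynamic reference left in an output value.
const RESOLVE_TOKEN = /\{\{resolve:[^}]*\}\}/;
const URL_SHAPE = /^https?:\/\//;

function validateOutputs(
  expectedKeys: string[],
  actual: { [key: string]: string | undefined },
  isUrlKey: (key: string) => boolean
): string[] {
  const problems: string[] = [];
  for (const key of expectedKeys) {
    const value = actual[key];
    if (value === undefined) {
      problems.push(key + ": missing");
      continue;
    }
    if (value.trim() === "") {
      problems.push(key + ": empty");
      continue;
    }
    if (RESOLVE_TOKEN.test(value)) {
      problems.push(key + ": unresolved {{resolve:...}} token");
      continue;
    }
    if (isUrlKey(key) && !URL_SHAPE.test(value)) {
      problems.push(key + ": not an http(s) URL");
    }
  }
  return problems;
}
```

A spec would then assert that the returned list is empty, so a failure message names every broken output at once.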

Phase 5 — pin every floating image tag
- 10 own-GHCR images (fixmystreet, localgov_drupal, minute_*, planx-*,
  dpr) pinned to sha-<7chars>@sha256:<digest>.
- 2 upstream images (docker.io/apache/tika 3.3.0.0-full,
  ghcr.io/paperless-ngx/paperless-ngx 2.9) pinned to <tag>@sha256:<digest>.
- Removed legacy cloudformation/scenarios/minute/template.json (stale
  ECR references; nothing in the repo referenced it).
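
A check for the pin shape could look like the sketch below. The regex is an approximation written for illustration, not the customManagers pattern from renovate.json, and it assumes the 7 characters in sha-<7chars> are lowercase hex (a short git SHA).

```typescript
interface PinnedImage {
  image: string;
  tag: string;
  digest: string;
}

// <image>:<tag>@sha256:<64 lowercase hex chars>
const PINNED_REF = /^([^\s@]+):([^\s@:\/]+)@sha256:([0-9a-f]{64})$/;

// Assumed shape of the tag on own-GHCR images: "sha-" plus a 7-char short SHA.
const OWN_TAG = /^sha-[0-9a-f]{7}$/;

// Returns the parsed reference, or null for floating or malformed references.
function parsePinnedImage(ref: string): PinnedImage | null {
  const match = PINNED_REF.exec(ref);
  if (match === null) return null;
  return { image: match[1], tag: match[2], digest: match[3] };
}

function isOwnImagePin(pin: PinnedImage): boolean {
  return OWN_TAG.test(pin.tag);
}
```

A reference such as `image:latest` with no digest parses to null, which is the case Phase 5 removes.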

Phase 6 — Renovate adoption (replaces Dependabot)
- renovate.json: 6 group rules per the spec's pinning-strategy table;
  customManagers regex matching the new pin shape; osvVulnerabilityAlerts
  + security-priority group; pinDigests scoped to official actions/* +
  aws-actions/* only so the first run doesn't firehose; per-PR limits
  capped at 6.
- .github/workflows/renovate.yml: twice-daily + workflow_dispatch.
  Action pinned by digest to v46.1.14.
- .github/dependabot.yml deleted.

Operator follow-ups (not in this PR)
====================================
- NAP-548: migrate scenarios off legacy claude-3-haiku-20240307
- NAP-549: revisit ProtectISB fallback by 2026-11-12
- NAP-550: service-quota Console requests
- NAP-551: QuickSight subscription decision
- NAP-552: mint RENOVATE_TOKEN repo secret
- NAP-554: close in-flight Dependabot PRs
- NAP-555: T2b.5b + T3.8 end-to-end verifications

Closes: #226, #227, #228, #229, #230, #231, #232, #233, #235.
@chrisns chrisns force-pushed the feat/smoke-pack branch from 7a85048 to 30e7e90 Compare May 13, 2026 08:28
@chrisns chrisns had a problem deploying to smoke-test-deploy May 13, 2026 08:29 — with GitHub Actions Failure

@chrisns (Member, Author) commented May 14, 2026

Superseded by #238 — all of this PR's commits are in #238's tree, plus the workflow extraction, smoke-pack DRY, fix-forward deploy semantics, PowerUserAccess broadening, ai-contact-centre PSTN holder support, and the planx arm64 fix.

@chrisns chrisns closed this May 14, 2026
@chrisns chrisns deleted the feat/smoke-pack branch May 14, 2026 14:27