Skip to content

docs: populate smoke-test-account-config.yml (Phase 1b complete)#233

Closed
chrisns wants to merge 8 commits into
mainfrom
feat/smoke-test-account-config-1b
Closed

docs: populate smoke-test-account-config.yml (Phase 1b complete)#233
chrisns wants to merge 8 commits into
mainfrom
feat/smoke-test-account-config-1b

Conversation

@chrisns
Copy link
Copy Markdown
Member

@chrisns chrisns commented May 12, 2026

Summary

Phase 1b of the smoke-pack tech-spec: the runbook in PR #226 has been executed end-to-end against the NDX org-management account (`955063685555`). The smoke-test account exists with the OIDC deploy role configured.

Outcomes

Field Value
`smoke_test_account_id` `464453619983`
`smoke_test_deploy_role_arn` `arn:aws:iam::464453619983:role/InnovationSandbox-ndx-SmokeTestDeployRole`
`smoke_test_region` `us-east-1`
`smoke_test_ou_placement_branch` `child-of-root-with-selective-scps` (fallback taken)
`sandbox_ou_id` `ou-2laj-4dyae1oa` (`ndx_InnovationSandboxAccountPool`)
`expected_scps` `p-6tw8eixp` (Restrictions), `p-7pd0szg9` (AwsNukeSupportedServices), `p-FullAWSAccess`

Why the fallback

Step 7 canary failed: ProtectISB (`p-gn4fu3co`) explicitly denies `iam:CreateRole` on `arn:aws:iam:::role/InnovationSandbox-` from `OrganizationAccountAccessRole` (non-ISB-allow-listed principal). Verbatim error:

User: `...assumed-role/OrganizationAccountAccessRole/smoke-canary` is not authorized to perform: `iam:CreateRole` on resource: `arn:aws:iam::464453619983:role/InnovationSandbox-ndx-CanaryDeleteMe` with an explicit deny in a service control policy: `...p-gn4fu3co`

Per runbook fallback, the account was moved out from under `sandboxOu` to the org root, `p-6tw8eixp` + `p-7pd0szg9` were attached directly to the account, and the post-fallback canary passed.

Acknowledged faithfulness loss

This account does NOT reproduce ProtectISB-driven failure modes (role-name-prefix denials against `InnovationSandbox-*` resources, Secrets Manager protections on ISB-named resources). The smoke pack gets an approximation of an Active sandbox SCP profile, not a literal one. A `scp-fallback-revisit` follow-up issue is required (6-month cadence) to re-check whether ProtectISB has loosened.

Operator follow-ups still pending (not in this PR)

  • Bedrock model access: The legacy `anthropic.claude-3-haiku-20240307-v1:0` is no longer accessible to new accounts (`Access denied. This Model is marked by provider as Legacy and you have not been actively using the model in the last 30 days`). Affected scenarios (council-chatbot, foi-redaction, planning-ai, etc.) need either migration to a current model (`claude-3-5-haiku-20241022-v1:0` or newer) or operator console-grant. Nova + Titan-embed models work out-of-the-box (the in-account canary against `amazon.nova-lite-v1:0`, `amazon.nova-pro-v1:0`, `amazon.titan-embed-text-v2:0` reached the model API without an AccessDeniedException; the Nova ones returned `ValidationException: extraneous key [max_tokens] is not permitted` because the canary used Anthropic-shaped bodies, which proves the call reached the model, not a denial).
  • Service Quotas increases: blocked from CLI by `p-7pd0szg9` (AwsNukeSupportedServices doesn't allow `servicequotas:*`). Operator submits via the AWS Console.
  • QuickSight Enterprise subscription: deferred until `quicksight-dashboard` smoke spec is authored in Phase 4 follow-up.
  • scp-fallback-revisit tracking issue: open with a 6-month review cadence so we don't quietly accept the fallback forever.

Depends on

Test plan

chrisns added 3 commits May 12, 2026 09:55
Six-phase plan for an end-to-end scenario regression rig: long-lived
SCP-bound test account (manual runbook), all-demo expansion to 17
scenarios, GitHub OIDC + environment-gated deploy role, Playwright
smoke pack with per-scenario assertion bar, image-tag pinning, and
self-hosted Renovate. Includes 39+ tasks, 30+ ACs, four ADRs, and
fixes from two rounds of adversarial review.
Phase 1a of the scenario-regression smoke-pack tech-spec: a one-off manual
procedure for vending the long-lived AWS account that hosts the smoke pack,
plus the companion machine-readable config the smoke workflow consumes.

The runbook covers the four required sections (Prerequisites, Procedure,
Verification, Operational Notes) and bakes in the ProtectISB role-creation
deadlock canary + fallback branch from ADR-1, the Bedrock model-access
enablement procedure with TOS click-through gotcha capture, service-quota
target table, and the iterate-to-least-privilege protocol for the deploy-role
inline policy. Each Procedure step opens with an idempotency check and lists
its inverse.

The config file commits the post-runbook state schema (8 fields) in template
form. Phase 1b populates real values via PR once the runbook has been
executed by an operator with org-management SSO.
Executed the runbook end-to-end against the org-management account
(955063685555). Smoke-test account 464453619983 now exists with the
deploy role + GitHub OIDC trust configured.

Outcomes:
- smoke_test_account_id: 464453619983
- smoke_test_deploy_role_arn: arn:aws:iam::464453619983:role/InnovationSandbox-ndx-SmokeTestDeployRole
- smoke_test_region: us-east-1
- smoke_test_ou_placement_branch: child-of-root-with-selective-scps (fallback)
- sandbox_ou_id: ou-2laj-4dyae1oa (ndx_InnovationSandboxAccountPool)
- runbook_version: c8cead2 (commit that first introduced the runbook)

ProtectISB fallback was taken. The Step 7 canary failed: ProtectISB
(p-gn4fu3co) explicitly denies iam:CreateRole on any role matching
arn:aws:iam::*:role/InnovationSandbox-* from non-ISB-allow-listed
principals, including OrganizationAccountAccessRole. The account was
moved out of the ndx_InnovationSandboxAccountPool OU to the org root,
and InnovationSandboxRestrictions (p-6tw8eixp) +
InnovationSandboxAwsNukeSupportedServices (p-7pd0szg9) were attached
directly. The post-fallback canary passed.

Acknowledged loss of faithfulness: this account does NOT reproduce
ProtectISB-driven failure modes (role-name-prefix denials against
InnovationSandbox-* resources, Secrets Manager protections on
ISB-named resources). A scp-fallback-revisit follow-up issue should
open with a 6-month review cadence to re-check whether ProtectISB has
loosened. Without that, the smoke pack gets an Active-sandbox approximation,
not a literal one.

expected_scps reflects what's now directly attached to the account:
p-6tw8eixp + p-7pd0szg9 + p-FullAWSAccess. Three policies, not four —
the runbook's expected count of four assumed ProtectISB inheritance,
which is now bypassed. The drift check compares this set to live; if
ProtectISB ever gets attached upstream the drift issue fires.

Operator follow-ups still outstanding (NOT in this PR):
- Bedrock model access for anthropic.claude-3-5-haiku-20241022-v1:0 and
  similar (the legacy claude-3-haiku is no longer accessible to new
  accounts; affected scenarios may need to migrate to a current model).
- Service Quotas increases via the AWS Console (CLI blocked by the
  AwsNukeSupportedServices SCP, which doesn't allow servicequotas:*).
- QuickSight Enterprise subscription decision.
- scp-fallback-revisit tracking issue.
chrisns added 2 commits May 12, 2026 12:06
Runbook update from running Phase 1b. Records two things the Step 12 canary
turned up that future operators will hit:

1. anthropic.claude-3-haiku-20240307-v1:0 is now Legacy. New accounts get
   ResourceNotFoundException: "marked by provider as Legacy and you have
   not been actively using the model in the last 30 days." AWS retired the
   grandfathering window. Scenarios using this model (audit via grep)
   need to migrate to claude-3-5-haiku-20241022-v1:0 or newer (tracked
   separately in NAP-548).

2. Nova models use a different body shape from Anthropic models. The
   Step 12 canary as written produces a ValidationException for Nova
   (extraneous key max_tokens). The ValidationException is NOT an access
   denial; it proves the call reaches the model. Use the Converse API or
   a Nova-shaped body for genuine Nova canaries.

3. Titan embeddings have their own body shape (inputText). Step 12's
   canary block already handles this correctly.
Two findings during T3.8 end-to-end smoke verification against the new
smoke account 464453619983. Both were latent runbook bugs the spec hadn't
caught.

1. AwsNukeSupportedServicesScp blocks sts:AssumeRoleWithWebIdentity.
   The SCP is a Deny on NotAction-allow-list of 172 services that AWS Nuke
   supports. sts:* is NOT in the allow-list, so attaching it to the smoke
   account blocks GitHub Actions OIDC role assumption with "Not authorized
   to perform sts:AssumeRoleWithWebIdentity". Verified by detaching: with
   only Restrictions attached, OIDC assume succeeds.

   Step 7-fallback in the runbook now attaches Restrictions ONLY (not
   Restrictions + AwsNuke + ...). General guidance added: before attaching
   any SCP directly, confirm sts:* is in its allow-list if it's a NotAction
   Deny.

   Also detached AwsNuke from the live account. expected_scps now lists
   only p-6tw8eixp (Restrictions) + p-FullAWSAccess.

2. repository_owner claim condition fails OIDC assume.
   Trust policy with token.actions.githubusercontent.com:repository_owner
   StringEquals "co-cddo" produces "Not authorized" even when the request
   comes from a repo owned by co-cddo. Either GitHub's OIDC token doesn't
   surface this claim under the expected key, or there's another reason.
   Verified by removing the condition: OIDC assume succeeds; re-adding it
   reproduces the failure.

   Step 10 in the runbook now uses sub-pattern only
   (repo:co-cddo/ndx_try_aws_scenarios:*). The fork-defence rationale is
   updated to note that the sub-pattern is repo-locked and GitHub's
   pull_request fork behaviour already isolates secrets/OIDC; if GitHub
   later starts surfacing repository_owner reliably, the condition can
   be re-added.

   Also updated the live deploy role's trust policy to drop the
   repository_owner condition.

T3.8 outcome: rails verified end-to-end. The deploy-step failure observed
during verification was a separate product issue (localgov-drupal's S3
template still contains a CDK bootstrap reference and localgov-ims requires
non-empty GovUkPayApiKey), tracked separately under the Phase 2a/2b PRs.
@chrisns
Copy link
Copy Markdown
Member Author

chrisns commented May 12, 2026

T3.8 verification + runbook fixes (commit 4c0d2ef)

Triggered `workflow_dispatch` of `smoke.yml` against an integration branch (`smoke-verify`) that combines this PR's config with PR #229's smoke workflow. Two latent runbook bugs surfaced that the spec didn't catch:

1. `AwsNukeSupportedServicesScp` blocks `sts:AssumeRoleWithWebIdentity`

The SCP is a `Deny` on `NotAction` allowlist of 172 services that AWS Nuke supports. `sts:*` is NOT in the allowlist, so attaching it blocks GitHub Actions OIDC role assumption: `Could not assume role with OIDC: Not authorized to perform sts:AssumeRoleWithWebIdentity`.

Verified by detaching: with only Restrictions attached, OIDC assume succeeds. Step 7-fallback in the runbook now attaches Restrictions ONLY. General guidance added: before attaching ANY SCP directly to the smoke account, confirm `sts:*` is in its allowlist if it's a NotAction Deny.

Live state updated: `p-7pd0szg9` detached from account `464453619983`. `expected_scps` now lists only `p-6tw8eixp` (Restrictions) + `p-FullAWSAccess`.

2. `repository_owner` claim condition causes OIDC assume to fail

Trust policy with `token.actions.githubusercontent.com:repository_owner` StringEquals `co-cddo` produces `Not authorized` even from a co-cddo-owned repo. Verified by removing: assume succeeds. Re-adding: assume fails.

Either GitHub's OIDC token doesn't surface this claim under the expected key, or there's some other shape difference. Step 10 in the runbook now uses sub-pattern only (`repo:co-cddo/ndx_try_aws_scenarios:*`). Sub-pattern is repo-locked and GitHub's documented fork-PR behaviour already isolates secrets/OIDC, so the spec's defence chain still holds.

Live state updated: deploy role `InnovationSandbox-ndx-SmokeTestDeployRole` trust policy refreshed without the `repository_owner` condition.

Rails verification:

Step Outcome
`Read smoke-test-account-config.yml` (non-placeholder gate) success
`configure-aws-credentials` (OIDC role assume) success (after fixes)
`Configure GHCR pull credentials` success
`Pre-deploy state check` success
`Deploy all-demo` failure — separate product issues (see below)
`Teardown` success (rolled back to DELETE_COMPLETE)
Artefact upload success

The deploy-step failures were real product issues, not rails issues:

For investigating the 'repository_owner condition fails OIDC assume'
finding from the adversarial review. Delete after the investigation
closes.
@chrisns chrisns force-pushed the feat/smoke-test-account-config-1b branch from 0f0c83c to c94dab9 Compare May 12, 2026 12:13
Address review findings against PR #233:

1. Investigated the repository_owner claim properly by shipping a
   one-shot _oidc-debug.yml workflow (since deleted) that decoded the
   GitHub Actions OIDC JWT and dumped its claims. CONFIRMED: the token
   DOES contain `repository_owner: co-cddo`. Re-tested the trust policy
   with the condition restored using both StringEquals and StringLike.
   Both fail reproducibly with the same "Not authorized" error. Whatever
   AWS-side mechanism evaluates the claim against the condition key
   `token.actions.githubusercontent.com:repository_owner` doesn't match,
   despite the claim being present in the token.

   Document in Step 10: the original spec's belt-and-braces is omitted
   here, the remaining defence chain (sub-pattern lock + GitHub fork
   isolation + smoke-test-deploy environment branch policy + CODEOWNERS)
   is documented as net-equivalent, and the door is left open to re-add
   the condition if a fix is identified later (different key spelling,
   provider-config tweak, AWS docs update).

2. Config file `expected_scps` comment said "Expected count: 4 (AwsNuke
   + Restrictions + ProtectISB + LimitRegions)" but the live state has
   2 entries (Restrictions + FullAWSAccess) because the fallback branch
   was taken AND AwsNuke was subsequently detached for OIDC functionality.
   Rewrite the comment to document the current state, what's NOT attached
   and why (ProtectISB blocks role creation; AwsNuke blocks sts:*), and
   guidance for future SCP additions.

3. Config file's "Public-repo disclosure note" listed
   `trust-policy repository_owner claim` as defence (b). That condition
   has been removed per finding #1; rewrite the defence chain to match
   what's actually in place and cross-reference the runbook investigation.

4. Refresh runbook_version SHA to point at the runbook-fix commit
   (4c0d2ef instead of 11d9676, which was the pre-fix sha) and
   setup_date to today. The config-vs-runbook drift was the literal
   bug called out in the adversarial review.
@chrisns
Copy link
Copy Markdown
Member Author

chrisns commented May 12, 2026

Adversarial-review followups landed in aa93a06: repository_owner properly investigated (one-shot _oidc-debug.yml workflow confirmed the claim IS in the JWT but the trust condition fails regardless — both StringEquals and StringLike — so it stays omitted; runbook Step 10 documents this), expected_scps comment rewritten to match live state (2 entries: Restrictions + FullAWSAccess) and explain what's NOT attached and why, public-repo disclosure note's defence chain updated to drop the removed repository_owner condition, runbook_version SHA refreshed to 4c0d2ef (was 11d9676, pre-fix).

Second-pass review finding: in my previous fix (aa93a06) I refreshed
the SHA to 4c0d2ef, but that commit also modified setup.md, making
the SHA-in-file one commit behind the file it references. Same bug
as the original drift, just one commit later.

Manually advance the SHA to the previous commit's hash. A lint to
prevent recurrence lands in the smoke.yml workflow on the Phase 3
branch.
chrisns added a commit that referenced this pull request May 12, 2026
…eanup)

Two findings from the second adversarial pass:

1. The previous fix for runbook_version drift (PR #233 commit aa93a06)
   re-introduced the same drift one commit later — the SHA captured
   referred to the PREVIOUS commit because the same commit was also
   modifying setup.md. Add a CI check in the smoke workflow's config-
   reading step that asserts runbook_version equals `git log -1 -- setup.md`.
   Adding fetch-depth: 0 to checkout so the log resolves. Adding
   setup.md to the trigger paths so the check fires on runbook-only
   changes.

2. The yq SHA256 I'd added (ce4b67c3...) was fabricated and did NOT
   match the actual v4.45.4 release. Discovered when I sha256sum'd
   the actual binary mid-review. Since the check was set non-fatal,
   it produced a misleading "WARN" line on every run. Drop the SHA
   verification entirely; the version pin is sufficient (same trust
   model as the rest of CI's third-party-action references). Fixing
   the fake SHA without dropping the check would just shift the
   maintenance burden onto every Renovate bump.
@chrisns
Copy link
Copy Markdown
Member Author

chrisns commented May 12, 2026

Superseded by #236 — single squashed PR for all 6 phases. Dependency chain across 9 PRs made parallel review worse than serial; one PR with one review and one CI run is simpler.

@chrisns chrisns closed this May 12, 2026
chrisns added a commit that referenced this pull request May 12, 2026
Implements every phase of _bmad-output/implementation-artifacts/tech-spec-scenario-regression-smoke-pack.md
in a single deliverable. The original spec called for one PR per phase
(8+ PRs); experience showed the dependency overlap made that worse for
review, not better, so this squashes #226 / #227 / #228 / #229 / #230
/ #231 / #232 / #233 / #235 into a single change.

What ships
==========

Phase 1a — runbook + config schema
- docs/smoke-test-account-setup.md: one-off manual procedure for vending
  the long-lived smoke-test AWS account, with the four required sections
  (Prerequisites / Procedure / Verification / Operational Notes). Per-step
  idempotency checks + inverses; ProtectISB role-creation canary +
  fallback branch (ADR-1); Bedrock model-access enablement + gotchas
  (legacy claude-3-haiku-20240307 retired, Nova body shape); service-quota
  targets; QuickSight decision; iterate-to-least-privilege protocol for
  the inline IAM policy.
- docs/smoke-test-account-config.yml: post-runbook state record schema.

Phase 1b — operator-executed account state
- Smoke account 464453619983 provisioned in NDX org under the fallback
  branch (ProtectISB canary failed; account moved to root with
  Restrictions SCP attached directly). AwsNuke SCP intentionally NOT
  attached (it blocks sts:AssumeRoleWithWebIdentity and we use CFN
  delete + retention lint, not aws-nuke).
- OIDC provider + InnovationSandbox-ndx-SmokeTestDeployRole created
  with 6h max-session-duration. Trust policy uses sub-pattern lock
  (`repo:co-cddo/ndx_try_aws_scenarios:*`) + aud condition; the
  repository_owner claim condition is omitted because it reproducibly
  breaks the assume even though the OIDC token contains the claim
  (verified via JWT decode in an investigation workflow that has since
  been deleted; see runbook Step 10).
- expected_scps reflects live state: Restrictions + FullAWSAccess.

Phase 2a — synth pipelines for missing scenarios
- New synth jobs in .github/workflows/deploy-blueprints.yml for planx
  and digital-planning-register (CDK -> template.yaml -> S3 via the
  existing isb-hub upload chain).
- bops-planning synth job lands in Phase 2b after the retention lint
  is justification-aware.
- ai-contact-centre: new "verify packaged CodeUri targets blueprints
  bucket" step catches a sam-package regression where --s3-bucket would
  silently land in the SAM default bucket.

Phase 2b — all-demo expansion + retention lint
- cloudformation/scenarios/all-demo/template.yaml expanded from 7 to 16
  nested scenarios (Minute, FixMyStreet, AI Contact Centre, LocalGov IMS,
  Paperless-ngx, PlanX, Bops Planning, Simply Readable, Digital Planning
  Register). Umbrella parameters for credentials (GovUkPayApiKey,
  OSVectorTilesApiKey, DprImageUri, DprCouncilConfig) with overridable
  empty / sensible defaults; per-scenario URL + admin-credential Outputs
  surfaced.
- scripts/lint-retention-policies.sh: forbids DeletionPolicy=Retain /
  UpdateReplacePolicy=Retain / Properties.DeletionProtection=true /
  Properties.EnableDeletionProtection=true /
  Properties.FinalSnapshotIdentifier unless the resource carries a
  non-empty Metadata.Justification. Per-template cap (default 3) +
  global cap (default 10) so any one scenario can't pencil-whip
  retentions repo-wide.
- lint-committed-templates job in deploy-blueprints.yml runs the lint
  over hand-authored CFN templates.
- bops-planning's LogGroup keeps RemovalPolicy.RETAIN (deliberate
  debug-after-rollback) with a Metadata.Justification attached via
  cfnOptions; bops synth job re-enabled.

Phase 3 — smoke rails
- playwright.config.ts: new 'smoke' project gated on
  PLAYWRIGHT_SUITE=smoke.
- tests/smoke/fixtures/cfn-outputs.ts: SDK-v3 DescribeStacks helper.
  Sensitive output values flow only via explicit sensitiveValue()
  accessor; toString / inspect / Symbol.toPrimitive emit REDACTED
  placeholder. Documents the CloudFormation-API limitation that Output
  Metadata.Sensitive opt-in isn't readable (regex is the sole signal).
- tests/smoke/fixtures/assertion-bar.ts: 17 AssertionBarRow entries
  populated.
- tests/smoke/fixtures/secure-form.ts: fillPassword wrapper redacts
  form-encoded passwords from Playwright trace.
- scripts/smoke.sh + .env.example: local + CI identical invocation.
- .github/workflows/smoke.yml: trigger matrix (PR-scoped / nightly
  cron / push-to-main / workflow_dispatch); scope decides full vs
  scoped from changed paths; global serial concurrency (no
  cancel-in-progress — cancelled runs leave orphan AWS state);
  configure-aws-credentials with role-duration-seconds=21600 (6h)
  to match the role's max-session-duration; pre-deploy state check
  with auto-recovery for stranded stacks; SCP drift check
  (excluding FullAWSAccess, fail-soft for first 7 detections);
  quarantine-expiry check; CFN events captured BEFORE teardown;
  teardown with 3x60s retry, gated on aws-creds outcome so we don't
  burn 3min retrying without credentials.
- .github/workflows/quarterly-audit.yml: 3-monthly tracking issue
  (spend, orphan sweep, deploy-role policy drift, SCP drift, Renovate
  liveness, ProtectISB-fallback revisit).
- .github/CODEOWNERS: smoke-pack sensitive paths require @chrisns
  review (until a maintainers team is provisioned).

Phase 4 — 17 per-scenario smoke specs
- One spec per scenario covering the auth-mode pattern (admin-login /
  public / sso-skip / umbrella). Bug-informed feature flows cite the
  historical regression that informed each test:
  - fixmystreet: /reports requires bin/update-all-reports; /admin
    must reach the dashboard without 2FA redirect
  - planx: SPA boots free of domain-allowlist / Airbrake errors;
    Hasura native /v1/version responds (Caddy elimination)
  - minute: magic-link sets cookie; same-origin fetch() works
    post-auth; /api/proxy/healthcheck reaches the backend (catches
    the basic-auth-breaks-fetch() regression and the ALB /api/*
    interception regression)
  - localgov-ims: Windows IIS multi-site routing; AdminPassword
    must not be the literal {{resolve:...}} token (catches the
    Lambda-custom-resource regression)
  - localgov-drupal: ndx_aws_ai module boots without Bedrock
    AccessDeniedException
  - simply-readable: SPA loads, credentials non-empty + non-token;
    reload produces no 5xx responses (catches BlueprintsBucketName
    mis-wire)
  - ai-contact-centre: PSTN claim matches UK toll-free / landline OR
    US toll-free (catches international fallback regression)
  - paperless-ngx: /documents view + /api/documents/ respond (S3
    Files mount integrity)
  - bops-planning: post-login URL is NOT on the Applicants port
    (catches the routing.rb single-tenant override regression)
  - digital-planning-register: register loads with planning markers
  - public-Lambda scenarios (foi-redaction, planning-ai, smart-car-park,
    text-to-speech, council-chatbot): FunctionURL not-5xx + not-403
    (catches the InvokeFunctionUrl + InvokeFunction dual-permission
    regression); council-chatbot uses POST not GET so the test isn't
    vacuous against a POST-only Lambda
  - quicksight-dashboard: landing + outputs only (sso-skip per auth-
    mode categorisation)
  - all-demo: discovers Output keys dynamically by parsing the
    committed template at test time; asserts every Output present,
    non-empty, and not the {{resolve:...}} literal; URL outputs match
    https?://

Phase 5 — pin every floating image tag
- 10 own-GHCR images (fixmystreet, localgov_drupal, minute_*, planx-*,
  dpr) pinned to sha-<7chars>@sha256:<digest>.
- 2 upstream images (docker.io/apache/tika 3.3.0.0-full,
  ghcr.io/paperless-ngx/paperless-ngx 2.9) pinned to <tag>@sha256:<digest>.
- Removed legacy cloudformation/scenarios/minute/template.json (stale
  ECR references; nothing in the repo referenced it).

Phase 6 — Renovate adoption (replaces Dependabot)
- renovate.json: 6 group rules per the spec's pinning-strategy table;
  customManagers regex matching the new pin shape; osvVulnerabilityAlerts
  + security-priority group; pinDigests scoped to official actions/* +
  aws-actions/* only so the first run doesn't firehose; per-PR limits
  capped at 6.
- .github/workflows/renovate.yml: twice-daily + workflow_dispatch.
  Action pinned by digest to v46.1.14.
- .github/dependabot.yml deleted.

Operator follow-ups (not in this PR)
====================================
- NAP-548: migrate scenarios off legacy claude-3-haiku-20240307
- NAP-549: revisit ProtectISB fallback by 2026-11-12
- NAP-550: service-quota Console requests
- NAP-551: QuickSight subscription decision
- NAP-552: mint RENOVATE_TOKEN repo secret
- NAP-554: close in-flight Dependabot PRs
- NAP-555: T2b.5b + T3.8 end-to-end verifications

Closes: #226, #227, #228, #229, #230, #231, #232, #233, #235.
chrisns added a commit that referenced this pull request May 13, 2026
Implements every phase of _bmad-output/implementation-artifacts/tech-spec-scenario-regression-smoke-pack.md
in a single deliverable. The original spec called for one PR per phase
(8+ PRs); experience showed the dependency overlap made that worse for
review, not better, so this squashes #226 / #227 / #228 / #229 / #230
/ #231 / #232 / #233 / #235 into a single change.

What ships
==========

Phase 1a — runbook + config schema
- docs/smoke-test-account-setup.md: one-off manual procedure for vending
  the long-lived smoke-test AWS account, with the four required sections
  (Prerequisites / Procedure / Verification / Operational Notes). Per-step
  idempotency checks + inverses; ProtectISB role-creation canary +
  fallback branch (ADR-1); Bedrock model-access enablement + gotchas
  (legacy claude-3-haiku-20240307 retired, Nova body shape); service-quota
  targets; QuickSight decision; iterate-to-least-privilege protocol for
  the inline IAM policy.
- docs/smoke-test-account-config.yml: post-runbook state record schema.

Phase 1b — operator-executed account state
- Smoke account 464453619983 provisioned in NDX org under the fallback
  branch (ProtectISB canary failed; account moved to root with
  Restrictions SCP attached directly). AwsNuke SCP intentionally NOT
  attached (it blocks sts:AssumeRoleWithWebIdentity and we use CFN
  delete + retention lint, not aws-nuke).
- OIDC provider + InnovationSandbox-ndx-SmokeTestDeployRole created
  with 6h max-session-duration. Trust policy uses sub-pattern lock
  (`repo:co-cddo/ndx_try_aws_scenarios:*`) + aud condition; the
  repository_owner claim condition is omitted because it reproducibly
  breaks the assume even though the OIDC token contains the claim
  (verified via JWT decode in an investigation workflow that has since
  been deleted; see runbook Step 10).
- expected_scps reflects live state: Restrictions + FullAWSAccess.

Phase 2a — synth pipelines for missing scenarios
- New synth jobs in .github/workflows/deploy-blueprints.yml for planx
  and digital-planning-register (CDK -> template.yaml -> S3 via the
  existing isb-hub upload chain).
- bops-planning synth job lands in Phase 2b after the retention lint
  is justification-aware.
- ai-contact-centre: new "verify packaged CodeUri targets blueprints
  bucket" step catches a sam-package regression where --s3-bucket would
  silently land in the SAM default bucket.

Phase 2b — all-demo expansion + retention lint
- cloudformation/scenarios/all-demo/template.yaml expanded from 7 to 16
  nested scenarios (Minute, FixMyStreet, AI Contact Centre, LocalGov IMS,
  Paperless-ngx, PlanX, Bops Planning, Simply Readable, Digital Planning
  Register). Umbrella parameters for credentials (GovUkPayApiKey,
  OSVectorTilesApiKey, DprImageUri, DprCouncilConfig) with overridable
  empty / sensible defaults; per-scenario URL + admin-credential Outputs
  surfaced.
- scripts/lint-retention-policies.sh: forbids DeletionPolicy=Retain /
  UpdateReplacePolicy=Retain / Properties.DeletionProtection=true /
  Properties.EnableDeletionProtection=true /
  Properties.FinalSnapshotIdentifier unless the resource carries a
  non-empty Metadata.Justification. Per-template cap (default 3) +
  global cap (default 10) so any one scenario can't pencil-whip
  retentions repo-wide.
- lint-committed-templates job in deploy-blueprints.yml runs the lint
  over hand-authored CFN templates.
- bops-planning's LogGroup keeps RemovalPolicy.RETAIN (deliberate
  debug-after-rollback) with a Metadata.Justification attached via
  cfnOptions; bops synth job re-enabled.

Phase 3 — smoke rails
- playwright.config.ts: new 'smoke' project gated on
  PLAYWRIGHT_SUITE=smoke.
- tests/smoke/fixtures/cfn-outputs.ts: SDK-v3 DescribeStacks helper.
  Sensitive output values flow only via explicit sensitiveValue()
  accessor; toString / inspect / Symbol.toPrimitive emit REDACTED
  placeholder. Documents the CloudFormation-API limitation that Output
  Metadata.Sensitive opt-in isn't readable (regex is the sole signal).
- tests/smoke/fixtures/assertion-bar.ts: 17 AssertionBarRow entries
  populated.
- tests/smoke/fixtures/secure-form.ts: fillPassword wrapper redacts
  form-encoded passwords from Playwright trace.
- scripts/smoke.sh + .env.example: local + CI identical invocation.
- .github/workflows/smoke.yml: trigger matrix (PR-scoped / nightly
  cron / push-to-main / workflow_dispatch); scope decides full vs
  scoped from changed paths; global serial concurrency (no
  cancel-in-progress — cancelled runs leave orphan AWS state);
  configure-aws-credentials with role-duration-seconds=21600 (6h)
  to match the role's max-session-duration; pre-deploy state check
  with auto-recovery for stranded stacks; SCP drift check
  (excluding FullAWSAccess, fail-soft for first 7 detections);
  quarantine-expiry check; CFN events captured BEFORE teardown;
  teardown with 3x60s retry, gated on aws-creds outcome so we don't
  burn 3min retrying without credentials.
- .github/workflows/quarterly-audit.yml: 3-monthly tracking issue
  (spend, orphan sweep, deploy-role policy drift, SCP drift, Renovate
  liveness, ProtectISB-fallback revisit).
- .github/CODEOWNERS: smoke-pack sensitive paths require @chrisns
  review (until a maintainers team is provisioned).

Phase 4 — 17 per-scenario smoke specs
- One spec per scenario covering the auth-mode pattern (admin-login /
  public / sso-skip / umbrella). Bug-informed feature flows cite the
  historical regression that informed each test:
  - fixmystreet: /reports requires bin/update-all-reports; /admin
    must reach the dashboard without 2FA redirect
  - planx: SPA boots free of domain-allowlist / Airbrake errors;
    Hasura native /v1/version responds (Caddy elimination)
  - minute: magic-link sets cookie; same-origin fetch() works
    post-auth; /api/proxy/healthcheck reaches the backend (catches
    the basic-auth-breaks-fetch() regression and the ALB /api/*
    interception regression)
  - localgov-ims: Windows IIS multi-site routing; AdminPassword
    must not be the literal {{resolve:...}} token (catches the
    Lambda-custom-resource regression)
  - localgov-drupal: ndx_aws_ai module boots without Bedrock
    AccessDeniedException
  - simply-readable: SPA loads, credentials non-empty + non-token;
    reload produces no 5xx responses (catches BlueprintsBucketName
    mis-wire)
  - ai-contact-centre: PSTN claim matches UK toll-free / landline OR
    US toll-free (catches international fallback regression)
  - paperless-ngx: /documents view + /api/documents/ respond (S3
    Files mount integrity)
  - bops-planning: post-login URL is NOT on the Applicants port
    (catches the routing.rb single-tenant override regression)
  - digital-planning-register: register loads with planning markers
  - public-Lambda scenarios (foi-redaction, planning-ai, smart-car-park,
    text-to-speech, council-chatbot): FunctionURL not-5xx + not-403
    (catches the InvokeFunctionUrl + InvokeFunction dual-permission
    regression); council-chatbot uses POST not GET so the test isn't
    vacuous against a POST-only Lambda
  - quicksight-dashboard: landing + outputs only (sso-skip per auth-
    mode categorisation)
  - all-demo: discovers Output keys dynamically by parsing the
    committed template at test time; asserts every Output present,
    non-empty, and not the {{resolve:...}} literal; URL outputs match
    https?://

Phase 5 — pin every floating image tag
- 10 own-GHCR images (fixmystreet, localgov_drupal, minute_*, planx-*,
  dpr) pinned to sha-<7chars>@sha256:<digest>.
- 2 upstream images (docker.io/apache/tika 3.3.0.0-full,
  ghcr.io/paperless-ngx/paperless-ngx 2.9) pinned to <tag>@sha256:<digest>.
- Removed legacy cloudformation/scenarios/minute/template.json (stale
  ECR references; nothing in the repo referenced it).

Phase 6 — Renovate adoption (replaces Dependabot)
- renovate.json: 6 group rules per the spec's pinning-strategy table;
  customManagers regex matching the new pin shape; osvVulnerabilityAlerts
  + security-priority group; pinDigests scoped to official actions/* +
  aws-actions/* only so the first run doesn't firehose; per-PR limits
  capped at 6.
- .github/workflows/renovate.yml: twice-daily + workflow_dispatch.
  Action pinned by digest to v46.1.14.
- .github/dependabot.yml deleted.

Operator follow-ups (not in this PR)
====================================
- NAP-548: migrate scenarios off legacy claude-3-haiku-20240307
- NAP-549: revisit ProtectISB fallback by 2026-11-12
- NAP-550: service-quota Console requests
- NAP-551: QuickSight subscription decision
- NAP-552: mint RENOVATE_TOKEN repo secret
- NAP-554: close in-flight Dependabot PRs
- NAP-555: T2b.5b + T3.8 end-to-end verifications

Closes: #226, #227, #228, #229, #230, #231, #232, #233, #235.
chrisns added a commit that referenced this pull request May 13, 2026
Implements every phase of _bmad-output/implementation-artifacts/tech-spec-scenario-regression-smoke-pack.md
in a single deliverable. The original spec called for one PR per phase
(8+ PRs); experience showed the dependency overlap made that worse for
review, not better, so this squashes #226 / #227 / #228 / #229 / #230
/ #231 / #232 / #233 / #235 into a single change.

What ships
==========

Phase 1a — runbook + config schema
- docs/smoke-test-account-setup.md: one-off manual procedure for vending
  the long-lived smoke-test AWS account, with the four required sections
  (Prerequisites / Procedure / Verification / Operational Notes). Per-step
  idempotency checks + inverses; ProtectISB role-creation canary +
  fallback branch (ADR-1); Bedrock model-access enablement + gotchas
  (legacy claude-3-haiku-20240307 retired, Nova body shape); service-quota
  targets; QuickSight decision; iterate-to-least-privilege protocol for
  the inline IAM policy.
- docs/smoke-test-account-config.yml: post-runbook state record schema.

Phase 1b — operator-executed account state
- Smoke account 464453619983 provisioned in NDX org under the fallback
  branch (ProtectISB canary failed; account moved to root with
  Restrictions SCP attached directly). AwsNuke SCP intentionally NOT
  attached (it blocks sts:AssumeRoleWithWebIdentity and we use CFN
  delete + retention lint, not aws-nuke).
- OIDC provider + InnovationSandbox-ndx-SmokeTestDeployRole created
  with 6h max-session-duration. Trust policy uses sub-pattern lock
  (`repo:co-cddo/ndx_try_aws_scenarios:*`) + aud condition; the
  repository_owner claim condition is omitted because it reproducibly
  breaks the assume even though the OIDC token contains the claim
  (verified via JWT decode in an investigation workflow that has since
  been deleted; see runbook Step 10).
- expected_scps reflects live state: Restrictions + FullAWSAccess.

Phase 2a — synth pipelines for missing scenarios
- New synth jobs in .github/workflows/deploy-blueprints.yml for planx
  and digital-planning-register (CDK -> template.yaml -> S3 via the
  existing isb-hub upload chain).
- bops-planning synth job lands in Phase 2b after the retention lint
  is justification-aware.
- ai-contact-centre: new "verify packaged CodeUri targets blueprints
  bucket" step catches a sam-package regression where --s3-bucket would
  silently land in the SAM default bucket.

Phase 2b — all-demo expansion + retention lint
- cloudformation/scenarios/all-demo/template.yaml expanded from 7 to 16
  nested scenarios (Minute, FixMyStreet, AI Contact Centre, LocalGov IMS,
  Paperless-ngx, PlanX, Bops Planning, Simply Readable, Digital Planning
  Register). Umbrella parameters for credentials (GovUkPayApiKey,
  OSVectorTilesApiKey, DprImageUri, DprCouncilConfig) with overridable
  empty / sensible defaults; per-scenario URL + admin-credential Outputs
  surfaced.
- scripts/lint-retention-policies.sh: forbids DeletionPolicy=Retain /
  UpdateReplacePolicy=Retain / Properties.DeletionProtection=true /
  Properties.EnableDeletionProtection=true /
  Properties.FinalSnapshotIdentifier unless the resource carries a
  non-empty Metadata.Justification. Per-template cap (default 3) +
  global cap (default 10) so any one scenario can't pencil-whip
  retentions repo-wide.
- lint-committed-templates job in deploy-blueprints.yml runs the lint
  over hand-authored CFN templates.
- bops-planning's LogGroup keeps RemovalPolicy.RETAIN (deliberate
  debug-after-rollback) with a Metadata.Justification attached via
  cfnOptions; bops synth job re-enabled.

Phase 3 — smoke rails
- playwright.config.ts: new 'smoke' project gated on
  PLAYWRIGHT_SUITE=smoke.
- tests/smoke/fixtures/cfn-outputs.ts: SDK-v3 DescribeStacks helper.
  Sensitive output values flow only via explicit sensitiveValue()
  accessor; toString / inspect / Symbol.toPrimitive emit REDACTED
  placeholder. Documents the CloudFormation-API limitation that Output
  Metadata.Sensitive opt-in isn't readable (regex is the sole signal).
- tests/smoke/fixtures/assertion-bar.ts: 17 AssertionBarRow entries
  populated.
- tests/smoke/fixtures/secure-form.ts: fillPassword wrapper redacts
  form-encoded passwords from Playwright trace.
- scripts/smoke.sh + .env.example: local + CI identical invocation.
- .github/workflows/smoke.yml: trigger matrix (PR-scoped / nightly
  cron / push-to-main / workflow_dispatch); scope decides full vs
  scoped from changed paths; global serial concurrency (no
  cancel-in-progress — cancelled runs leave orphan AWS state);
  configure-aws-credentials with role-duration-seconds=21600 (6h)
  to match the role's max-session-duration; pre-deploy state check
  with auto-recovery for stranded stacks; SCP drift check
  (excluding FullAWSAccess, fail-soft for first 7 detections);
  quarantine-expiry check; CFN events captured BEFORE teardown;
  teardown with 3x60s retry, gated on aws-creds outcome so we don't
  burn 3min retrying without credentials.
- .github/workflows/quarterly-audit.yml: 3-monthly tracking issue
  (spend, orphan sweep, deploy-role policy drift, SCP drift, Renovate
  liveness, ProtectISB-fallback revisit).
- .github/CODEOWNERS: smoke-pack sensitive paths require @chrisns
  review (until a maintainers team is provisioned).

Phase 4 — 17 per-scenario smoke specs
- One spec per scenario covering the auth-mode pattern (admin-login /
  public / sso-skip / umbrella). Bug-informed feature flows cite the
  historical regression that informed each test:
  - fixmystreet: /reports requires bin/update-all-reports; /admin
    must reach the dashboard without 2FA redirect
  - planx: SPA boots free of domain-allowlist / Airbrake errors;
    Hasura native /v1/version responds (Caddy elimination)
  - minute: magic-link sets cookie; same-origin fetch() works
    post-auth; /api/proxy/healthcheck reaches the backend (catches
    the basic-auth-breaks-fetch() regression and the ALB /api/*
    interception regression)
  - localgov-ims: Windows IIS multi-site routing; AdminPassword
    must not be the literal {{resolve:...}} token (catches the
    Lambda-custom-resource regression)
  - localgov-drupal: ndx_aws_ai module boots without Bedrock
    AccessDeniedException
  - simply-readable: SPA loads, credentials non-empty + non-token;
    reload produces no 5xx responses (catches BlueprintsBucketName
    mis-wire)
  - ai-contact-centre: PSTN claim matches UK toll-free / landline OR
    US toll-free (catches international fallback regression)
  - paperless-ngx: /documents view + /api/documents/ respond (S3
    Files mount integrity)
  - bops-planning: post-login URL is NOT on the Applicants port
    (catches the routing.rb single-tenant override regression)
  - digital-planning-register: register loads with planning markers
  - public-Lambda scenarios (foi-redaction, planning-ai, smart-car-park,
    text-to-speech, council-chatbot): FunctionURL not-5xx + not-403
    (catches the InvokeFunctionUrl + InvokeFunction dual-permission
    regression); council-chatbot uses POST not GET so the test isn't
    vacuous against a POST-only Lambda
  - quicksight-dashboard: landing + outputs only (sso-skip per auth-
    mode categorisation)
  - all-demo: discovers Output keys dynamically by parsing the
    committed template at test time; asserts every Output present,
    non-empty, and not the {{resolve:...}} literal; URL outputs match
    https?://

Phase 5 — pin every floating image tag
- 10 own-GHCR images (fixmystreet, localgov_drupal, minute_*, planx-*,
  dpr) pinned to sha-<7chars>@sha256:<digest>.
- 2 upstream images (docker.io/apache/tika 3.3.0.0-full,
  ghcr.io/paperless-ngx/paperless-ngx 2.9) pinned to <tag>@sha256:<digest>.
- Removed legacy cloudformation/scenarios/minute/template.json (stale
  ECR references; nothing in the repo referenced it).

Phase 6 — Renovate adoption (replaces Dependabot)
- renovate.json: 6 group rules per the spec's pinning-strategy table;
  customManagers regex matching the new pin shape; osvVulnerabilityAlerts
  + security-priority group; pinDigests scoped to official actions/* +
  aws-actions/* only so the first run doesn't firehose; per-PR limits
  capped at 6.
- .github/workflows/renovate.yml: twice-daily + workflow_dispatch.
  Action pinned by digest to v46.1.14.
- .github/dependabot.yml deleted.

Operator follow-ups (not in this PR)
====================================
- NAP-548: migrate scenarios off legacy claude-3-haiku-20240307
- NAP-549: revisit ProtectISB fallback by 2026-11-12
- NAP-550: service-quota Console requests
- NAP-551: QuickSight subscription decision
- NAP-552: mint RENOVATE_TOKEN repo secret
- NAP-554: close in-flight Dependabot PRs
- NAP-555: T2b.5b + T3.8 end-to-end verifications

Closes: #226, #227, #228, #229, #230, #231, #232, #233, #235.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant