docs: populate smoke-test-account-config.yml (Phase 1b complete)#233
docs: populate smoke-test-account-config.yml (Phase 1b complete)#233chrisns wants to merge 8 commits into
Conversation
Six-phase plan for an end-to-end scenario regression rig: long-lived SCP-bound test account (manual runbook), all-demo expansion to 17 scenarios, GitHub OIDC + environment-gated deploy role, Playwright smoke pack with per-scenario assertion bar, image-tag pinning, and self-hosted Renovate. Includes 39+ tasks, 30+ ACs, four ADRs, and fixes from two rounds of adversarial review.
Phase 1a of the scenario-regression smoke-pack tech-spec: a one-off manual procedure for vending the long-lived AWS account that hosts the smoke pack, plus the companion machine-readable config the smoke workflow consumes. The runbook covers the four required sections (Prerequisites, Procedure, Verification, Operational Notes) and bakes in the ProtectISB role-creation deadlock canary + fallback branch from ADR-1, the Bedrock model-access enablement procedure with TOS click-through gotcha capture, service-quota target table, and the iterate-to-least-privilege protocol for the deploy-role inline policy. Each Procedure step opens with an idempotency check and lists its inverse. The config file commits the post-runbook state schema (8 fields) in template form. Phase 1b populates real values via PR once the runbook has been executed by an operator with org-management SSO.
Executed the runbook end-to-end against the org-management account (955063685555). Smoke-test account 464453619983 now exists with the deploy role + GitHub OIDC trust configured. Outcomes: - smoke_test_account_id: 464453619983 - smoke_test_deploy_role_arn: arn:aws:iam::464453619983:role/InnovationSandbox-ndx-SmokeTestDeployRole - smoke_test_region: us-east-1 - smoke_test_ou_placement_branch: child-of-root-with-selective-scps (fallback) - sandbox_ou_id: ou-2laj-4dyae1oa (ndx_InnovationSandboxAccountPool) - runbook_version: c8cead2 (commit that first introduced the runbook) ProtectISB fallback was taken. The Step 7 canary failed: ProtectISB (p-gn4fu3co) explicitly denies iam:CreateRole on any role matching arn:aws:iam::*:role/InnovationSandbox-* from non-ISB-allow-listed principals, including OrganizationAccountAccessRole. The account was moved out of the ndx_InnovationSandboxAccountPool OU to the org root, and InnovationSandboxRestrictions (p-6tw8eixp) + InnovationSandboxAwsNukeSupportedServices (p-7pd0szg9) were attached directly. The post-fallback canary passed. Acknowledged loss of faithfulness: this account does NOT reproduce ProtectISB-driven failure modes (role-name-prefix denials against InnovationSandbox-* resources, Secrets Manager protections on ISB-named resources). A scp-fallback-revisit follow-up issue should open with a 6-month review cadence to re-check whether ProtectISB has loosened. Without that, the smoke pack gets an Active-sandbox approximation, not a literal one. expected_scps reflects what's now directly attached to the account: p-6tw8eixp + p-7pd0szg9 + p-FullAWSAccess. Three policies, not four — the runbook's expected count of four assumed ProtectISB inheritance, which is now bypassed. The drift check compares this set to live; if ProtectISB ever gets attached upstream the drift issue fires. Operator follow-ups still outstanding (NOT in this PR): - Bedrock model access for anthropic.claude-3-5-haiku-20241022-v1:0 and similar (the legacy claude-3-haiku is no longer accessible to new accounts; affected scenarios may need to migrate to a current model). - Service Quotas increases via the AWS Console (CLI blocked by the AwsNukeSupportedServices SCP, which doesn't allow servicequotas:*). - QuickSight Enterprise subscription decision. - scp-fallback-revisit tracking issue.
Runbook update from running Phase 1b. Records two things the Step 12 canary turned up that future operators will hit: 1. anthropic.claude-3-haiku-20240307-v1:0 is now Legacy. New accounts get ResourceNotFoundException: "marked by provider as Legacy and you have not been actively using the model in the last 30 days." AWS retired the grandfathering window. Scenarios using this model (audit via grep) need to migrate to claude-3-5-haiku-20241022-v1:0 or newer (tracked separately in NAP-548). 2. Nova models use a different body shape from Anthropic models. The Step 12 canary as written produces a ValidationException for Nova (extraneous key max_tokens). The ValidationException is NOT an access denial; it proves the call reaches the model. Use the Converse API or a Nova-shaped body for genuine Nova canaries. 3. Titan embeddings have their own body shape (inputText). Step 12's canary block already handles this correctly.
Two findings during T3.8 end-to-end smoke verification against the new smoke account 464453619983. Both were latent runbook bugs the spec hadn't caught. 1. AwsNukeSupportedServicesScp blocks sts:AssumeRoleWithWebIdentity. The SCP is a Deny on NotAction-allow-list of 172 services that AWS Nuke supports. sts:* is NOT in the allow-list, so attaching it to the smoke account blocks GitHub Actions OIDC role assumption with "Not authorized to perform sts:AssumeRoleWithWebIdentity". Verified by detaching: with only Restrictions attached, OIDC assume succeeds. Step 7-fallback in the runbook now attaches Restrictions ONLY (not Restrictions + AwsNuke + ...). General guidance added: before attaching any SCP directly, confirm sts:* is in its allow-list if it's a NotAction Deny. Also detached AwsNuke from the live account. expected_scps now lists only p-6tw8eixp (Restrictions) + p-FullAWSAccess. 2. repository_owner claim condition fails OIDC assume. Trust policy with token.actions.githubusercontent.com:repository_owner StringEquals "co-cddo" produces "Not authorized" even when the request comes from a repo owned by co-cddo. Either GitHub's OIDC token doesn't surface this claim under the expected key, or there's another reason. Verified by removing the condition: OIDC assume succeeds; re-adding it reproduces the failure. Step 10 in the runbook now uses sub-pattern only (repo:co-cddo/ndx_try_aws_scenarios:*). The fork-defence rationale is updated to note that the sub-pattern is repo-locked and GitHub's pull_request fork behaviour already isolates secrets/OIDC; if GitHub later starts surfacing repository_owner reliably, the condition can be re-added. Also updated the live deploy role's trust policy to drop the repository_owner condition. T3.8 outcome: rails verified end-to-end. The deploy-step failure observed during verification was a separate product issue (localgov-drupal's S3 template still contains a CDK bootstrap reference and localgov-ims requires non-empty GovUkPayApiKey), tracked separately under the Phase 2a/2b PRs.
T3.8 verification + runbook fixes (commit 4c0d2ef)Triggered `workflow_dispatch` of `smoke.yml` against an integration branch (`smoke-verify`) that combines this PR's config with PR #229's smoke workflow. Two latent runbook bugs surfaced that the spec didn't catch: 1. `AwsNukeSupportedServicesScp` blocks `sts:AssumeRoleWithWebIdentity`The SCP is a `Deny` on `NotAction` allowlist of 172 services that AWS Nuke supports. `sts:*` is NOT in the allowlist, so attaching it blocks GitHub Actions OIDC role assumption: `Could not assume role with OIDC: Not authorized to perform sts:AssumeRoleWithWebIdentity`. Verified by detaching: with only Restrictions attached, OIDC assume succeeds. Step 7-fallback in the runbook now attaches Restrictions ONLY. General guidance added: before attaching ANY SCP directly to the smoke account, confirm `sts:*` is in its allowlist if it's a NotAction Deny. Live state updated: `p-7pd0szg9` detached from account `464453619983`. `expected_scps` now lists only `p-6tw8eixp` (Restrictions) + `p-FullAWSAccess`. 2. `repository_owner` claim condition causes OIDC assume to failTrust policy with `token.actions.githubusercontent.com:repository_owner` StringEquals `co-cddo` produces `Not authorized` even from a co-cddo-owned repo. Verified by removing: assume succeeds. Re-adding: assume fails. Either GitHub's OIDC token doesn't surface this claim under the expected key, or there's some other shape difference. Step 10 in the runbook now uses sub-pattern only (`repo:co-cddo/ndx_try_aws_scenarios:*`). Sub-pattern is repo-locked and GitHub's documented fork-PR behaviour already isolates secrets/OIDC, so the spec's defence chain still holds. Live state updated: deploy role `InnovationSandbox-ndx-SmokeTestDeployRole` trust policy refreshed without the `repository_owner` condition. Rails verification:
The deploy-step failures were real product issues, not rails issues:
|
For investigating the 'repository_owner condition fails OIDC assume' finding from the adversarial review. Delete after the investigation closes.
0f0c83c to
c94dab9
Compare
Address review findings against PR #233: 1. Investigated the repository_owner claim properly by shipping a one-shot _oidc-debug.yml workflow (since deleted) that decoded the GitHub Actions OIDC JWT and dumped its claims. CONFIRMED: the token DOES contain `repository_owner: co-cddo`. Re-tested the trust policy with the condition restored using both StringEquals and StringLike. Both fail reproducibly with the same "Not authorized" error. Whatever AWS-side mechanism evaluates the claim against the condition key `token.actions.githubusercontent.com:repository_owner` doesn't match, despite the claim being present in the token. Document in Step 10: the original spec's belt-and-braces is omitted here, the remaining defence chain (sub-pattern lock + GitHub fork isolation + smoke-test-deploy environment branch policy + CODEOWNERS) is documented as net-equivalent, and the door is left open to re-add the condition if a fix is identified later (different key spelling, provider-config tweak, AWS docs update). 2. Config file `expected_scps` comment said "Expected count: 4 (AwsNuke + Restrictions + ProtectISB + LimitRegions)" but the live state has 2 entries (Restrictions + FullAWSAccess) because the fallback branch was taken AND AwsNuke was subsequently detached for OIDC functionality. Rewrite the comment to document the current state, what's NOT attached and why (ProtectISB blocks role creation; AwsNuke blocks sts:*), and guidance for future SCP additions. 3. Config file's "Public-repo disclosure note" listed `trust-policy repository_owner claim` as defence (b). That condition has been removed per finding #1; rewrite the defence chain to match what's actually in place and cross-reference the runbook investigation. 4. Refresh runbook_version SHA to point at the runbook-fix commit (4c0d2ef instead of 11d9676, which was the pre-fix sha) and setup_date to today. The config-vs-runbook drift was the literal bug called out in the adversarial review.
|
Adversarial-review followups landed in aa93a06: repository_owner properly investigated (one-shot _oidc-debug.yml workflow confirmed the claim IS in the JWT but the trust condition fails regardless — both StringEquals and StringLike — so it stays omitted; runbook Step 10 documents this), expected_scps comment rewritten to match live state (2 entries: Restrictions + FullAWSAccess) and explain what's NOT attached and why, public-repo disclosure note's defence chain updated to drop the removed repository_owner condition, runbook_version SHA refreshed to 4c0d2ef (was 11d9676, pre-fix). |
Second-pass review finding: in my previous fix (aa93a06) I refreshed the SHA to 4c0d2ef, but that commit also modified setup.md, making the SHA-in-file one commit behind the file it references. Same bug as the original drift, just one commit later. Manually advance the SHA to the previous commit's hash. A lint to prevent recurrence lands in the smoke.yml workflow on the Phase 3 branch.
…eanup) Two findings from the second adversarial pass: 1. The previous fix for runbook_version drift (PR #233 commit aa93a06) re-introduced the same drift one commit later — the SHA captured referred to the PREVIOUS commit because the same commit was also modifying setup.md. Add a CI check in the smoke workflow's config- reading step that asserts runbook_version equals `git log -1 -- setup.md`. Adding fetch-depth: 0 to checkout so the log resolves. Adding setup.md to the trigger paths so the check fires on runbook-only changes. 2. The yq SHA256 I'd added (ce4b67c3...) was fabricated and did NOT match the actual v4.45.4 release. Discovered when I sha256sum'd the actual binary mid-review. Since the check was set non-fatal, it produced a misleading "WARN" line on every run. Drop the SHA verification entirely; the version pin is sufficient (same trust model as the rest of CI's third-party-action references). Fixing the fake SHA without dropping the check would just shift the maintenance burden onto every Renovate bump.
|
Superseded by #236 — single squashed PR for all 6 phases. Dependency chain across 9 PRs made parallel review worse than serial; one PR with one review and one CI run is simpler. |
Implements every phase of _bmad-output/implementation-artifacts/tech-spec-scenario-regression-smoke-pack.md in a single deliverable. The original spec called for one PR per phase (8+ PRs); experience showed the dependency overlap made that worse for review, not better, so this squashes #226 / #227 / #228 / #229 / #230 / #231 / #232 / #233 / #235 into a single change. What ships ========== Phase 1a — runbook + config schema - docs/smoke-test-account-setup.md: one-off manual procedure for vending the long-lived smoke-test AWS account, with the four required sections (Prerequisites / Procedure / Verification / Operational Notes). Per-step idempotency checks + inverses; ProtectISB role-creation canary + fallback branch (ADR-1); Bedrock model-access enablement + gotchas (legacy claude-3-haiku-20240307 retired, Nova body shape); service-quota targets; QuickSight decision; iterate-to-least-privilege protocol for the inline IAM policy. - docs/smoke-test-account-config.yml: post-runbook state record schema. Phase 1b — operator-executed account state - Smoke account 464453619983 provisioned in NDX org under the fallback branch (ProtectISB canary failed; account moved to root with Restrictions SCP attached directly). AwsNuke SCP intentionally NOT attached (it blocks sts:AssumeRoleWithWebIdentity and we use CFN delete + retention lint, not aws-nuke). - OIDC provider + InnovationSandbox-ndx-SmokeTestDeployRole created with 6h max-session-duration. Trust policy uses sub-pattern lock (`repo:co-cddo/ndx_try_aws_scenarios:*`) + aud condition; the repository_owner claim condition is omitted because it reproducibly breaks the assume even though the OIDC token contains the claim (verified via JWT decode in an investigation workflow that has since been deleted; see runbook Step 10). - expected_scps reflects live state: Restrictions + FullAWSAccess. Phase 2a — synth pipelines for missing scenarios - New synth jobs in .github/workflows/deploy-blueprints.yml for planx and digital-planning-register (CDK -> template.yaml -> S3 via the existing isb-hub upload chain). - bops-planning synth job lands in Phase 2b after the retention lint is justification-aware. - ai-contact-centre: new "verify packaged CodeUri targets blueprints bucket" step catches a sam-package regression where --s3-bucket would silently land in the SAM default bucket. Phase 2b — all-demo expansion + retention lint - cloudformation/scenarios/all-demo/template.yaml expanded from 7 to 16 nested scenarios (Minute, FixMyStreet, AI Contact Centre, LocalGov IMS, Paperless-ngx, PlanX, Bops Planning, Simply Readable, Digital Planning Register). Umbrella parameters for credentials (GovUkPayApiKey, OSVectorTilesApiKey, DprImageUri, DprCouncilConfig) with overridable empty / sensible defaults; per-scenario URL + admin-credential Outputs surfaced. - scripts/lint-retention-policies.sh: forbids DeletionPolicy=Retain / UpdateReplacePolicy=Retain / Properties.DeletionProtection=true / Properties.EnableDeletionProtection=true / Properties.FinalSnapshotIdentifier unless the resource carries a non-empty Metadata.Justification. Per-template cap (default 3) + global cap (default 10) so any one scenario can't pencil-whip retentions repo-wide. - lint-committed-templates job in deploy-blueprints.yml runs the lint over hand-authored CFN templates. - bops-planning's LogGroup keeps RemovalPolicy.RETAIN (deliberate debug-after-rollback) with a Metadata.Justification attached via cfnOptions; bops synth job re-enabled. Phase 3 — smoke rails - playwright.config.ts: new 'smoke' project gated on PLAYWRIGHT_SUITE=smoke. - tests/smoke/fixtures/cfn-outputs.ts: SDK-v3 DescribeStacks helper. Sensitive output values flow only via explicit sensitiveValue() accessor; toString / inspect / Symbol.toPrimitive emit REDACTED placeholder. Documents the CloudFormation-API limitation that Output Metadata.Sensitive opt-in isn't readable (regex is the sole signal). - tests/smoke/fixtures/assertion-bar.ts: 17 AssertionBarRow entries populated. - tests/smoke/fixtures/secure-form.ts: fillPassword wrapper redacts form-encoded passwords from Playwright trace. - scripts/smoke.sh + .env.example: local + CI identical invocation. - .github/workflows/smoke.yml: trigger matrix (PR-scoped / nightly cron / push-to-main / workflow_dispatch); scope decides full vs scoped from changed paths; global serial concurrency (no cancel-in-progress — cancelled runs leave orphan AWS state); configure-aws-credentials with role-duration-seconds=21600 (6h) to match the role's max-session-duration; pre-deploy state check with auto-recovery for stranded stacks; SCP drift check (excluding FullAWSAccess, fail-soft for first 7 detections); quarantine-expiry check; CFN events captured BEFORE teardown; teardown with 3x60s retry, gated on aws-creds outcome so we don't burn 3min retrying without credentials. - .github/workflows/quarterly-audit.yml: 3-monthly tracking issue (spend, orphan sweep, deploy-role policy drift, SCP drift, Renovate liveness, ProtectISB-fallback revisit). - .github/CODEOWNERS: smoke-pack sensitive paths require @chrisns review (until a maintainers team is provisioned). Phase 4 — 17 per-scenario smoke specs - One spec per scenario covering the auth-mode pattern (admin-login / public / sso-skip / umbrella). Bug-informed feature flows cite the historical regression that informed each test: - fixmystreet: /reports requires bin/update-all-reports; /admin must reach the dashboard without 2FA redirect - planx: SPA boots free of domain-allowlist / Airbrake errors; Hasura native /v1/version responds (Caddy elimination) - minute: magic-link sets cookie; same-origin fetch() works post-auth; /api/proxy/healthcheck reaches the backend (catches the basic-auth-breaks-fetch() regression and the ALB /api/* interception regression) - localgov-ims: Windows IIS multi-site routing; AdminPassword must not be the literal {{resolve:...}} token (catches the Lambda-custom-resource regression) - localgov-drupal: ndx_aws_ai module boots without Bedrock AccessDeniedException - simply-readable: SPA loads, credentials non-empty + non-token; reload produces no 5xx responses (catches BlueprintsBucketName mis-wire) - ai-contact-centre: PSTN claim matches UK toll-free / landline OR US toll-free (catches international fallback regression) - paperless-ngx: /documents view + /api/documents/ respond (S3 Files mount integrity) - bops-planning: post-login URL is NOT on the Applicants port (catches the routing.rb single-tenant override regression) - digital-planning-register: register loads with planning markers - public-Lambda scenarios (foi-redaction, planning-ai, smart-car-park, text-to-speech, council-chatbot): FunctionURL not-5xx + not-403 (catches the InvokeFunctionUrl + InvokeFunction dual-permission regression); council-chatbot uses POST not GET so the test isn't vacuous against a POST-only Lambda - quicksight-dashboard: landing + outputs only (sso-skip per auth- mode categorisation) - all-demo: discovers Output keys dynamically by parsing the committed template at test time; asserts every Output present, non-empty, and not the {{resolve:...}} literal; URL outputs match https?:// Phase 5 — pin every floating image tag - 10 own-GHCR images (fixmystreet, localgov_drupal, minute_*, planx-*, dpr) pinned to sha-<7chars>@sha256:<digest>. - 2 upstream images (docker.io/apache/tika 3.3.0.0-full, ghcr.io/paperless-ngx/paperless-ngx 2.9) pinned to <tag>@sha256:<digest>. - Removed legacy cloudformation/scenarios/minute/template.json (stale ECR references; nothing in the repo referenced it). Phase 6 — Renovate adoption (replaces Dependabot) - renovate.json: 6 group rules per the spec's pinning-strategy table; customManagers regex matching the new pin shape; osvVulnerabilityAlerts + security-priority group; pinDigests scoped to official actions/* + aws-actions/* only so the first run doesn't firehose; per-PR limits capped at 6. - .github/workflows/renovate.yml: twice-daily + workflow_dispatch. Action pinned by digest to v46.1.14. - .github/dependabot.yml deleted. Operator follow-ups (not in this PR) ==================================== - NAP-548: migrate scenarios off legacy claude-3-haiku-20240307 - NAP-549: revisit ProtectISB fallback by 2026-11-12 - NAP-550: service-quota Console requests - NAP-551: QuickSight subscription decision - NAP-552: mint RENOVATE_TOKEN repo secret - NAP-554: close in-flight Dependabot PRs - NAP-555: T2b.5b + T3.8 end-to-end verifications Closes: #226, #227, #228, #229, #230, #231, #232, #233, #235.
Implements every phase of _bmad-output/implementation-artifacts/tech-spec-scenario-regression-smoke-pack.md in a single deliverable. The original spec called for one PR per phase (8+ PRs); experience showed the dependency overlap made that worse for review, not better, so this squashes #226 / #227 / #228 / #229 / #230 / #231 / #232 / #233 / #235 into a single change. What ships ========== Phase 1a — runbook + config schema - docs/smoke-test-account-setup.md: one-off manual procedure for vending the long-lived smoke-test AWS account, with the four required sections (Prerequisites / Procedure / Verification / Operational Notes). Per-step idempotency checks + inverses; ProtectISB role-creation canary + fallback branch (ADR-1); Bedrock model-access enablement + gotchas (legacy claude-3-haiku-20240307 retired, Nova body shape); service-quota targets; QuickSight decision; iterate-to-least-privilege protocol for the inline IAM policy. - docs/smoke-test-account-config.yml: post-runbook state record schema. Phase 1b — operator-executed account state - Smoke account 464453619983 provisioned in NDX org under the fallback branch (ProtectISB canary failed; account moved to root with Restrictions SCP attached directly). AwsNuke SCP intentionally NOT attached (it blocks sts:AssumeRoleWithWebIdentity and we use CFN delete + retention lint, not aws-nuke). - OIDC provider + InnovationSandbox-ndx-SmokeTestDeployRole created with 6h max-session-duration. Trust policy uses sub-pattern lock (`repo:co-cddo/ndx_try_aws_scenarios:*`) + aud condition; the repository_owner claim condition is omitted because it reproducibly breaks the assume even though the OIDC token contains the claim (verified via JWT decode in an investigation workflow that has since been deleted; see runbook Step 10). - expected_scps reflects live state: Restrictions + FullAWSAccess. Phase 2a — synth pipelines for missing scenarios - New synth jobs in .github/workflows/deploy-blueprints.yml for planx and digital-planning-register (CDK -> template.yaml -> S3 via the existing isb-hub upload chain). - bops-planning synth job lands in Phase 2b after the retention lint is justification-aware. - ai-contact-centre: new "verify packaged CodeUri targets blueprints bucket" step catches a sam-package regression where --s3-bucket would silently land in the SAM default bucket. Phase 2b — all-demo expansion + retention lint - cloudformation/scenarios/all-demo/template.yaml expanded from 7 to 16 nested scenarios (Minute, FixMyStreet, AI Contact Centre, LocalGov IMS, Paperless-ngx, PlanX, Bops Planning, Simply Readable, Digital Planning Register). Umbrella parameters for credentials (GovUkPayApiKey, OSVectorTilesApiKey, DprImageUri, DprCouncilConfig) with overridable empty / sensible defaults; per-scenario URL + admin-credential Outputs surfaced. - scripts/lint-retention-policies.sh: forbids DeletionPolicy=Retain / UpdateReplacePolicy=Retain / Properties.DeletionProtection=true / Properties.EnableDeletionProtection=true / Properties.FinalSnapshotIdentifier unless the resource carries a non-empty Metadata.Justification. Per-template cap (default 3) + global cap (default 10) so any one scenario can't pencil-whip retentions repo-wide. - lint-committed-templates job in deploy-blueprints.yml runs the lint over hand-authored CFN templates. - bops-planning's LogGroup keeps RemovalPolicy.RETAIN (deliberate debug-after-rollback) with a Metadata.Justification attached via cfnOptions; bops synth job re-enabled. Phase 3 — smoke rails - playwright.config.ts: new 'smoke' project gated on PLAYWRIGHT_SUITE=smoke. - tests/smoke/fixtures/cfn-outputs.ts: SDK-v3 DescribeStacks helper. Sensitive output values flow only via explicit sensitiveValue() accessor; toString / inspect / Symbol.toPrimitive emit REDACTED placeholder. Documents the CloudFormation-API limitation that Output Metadata.Sensitive opt-in isn't readable (regex is the sole signal). - tests/smoke/fixtures/assertion-bar.ts: 17 AssertionBarRow entries populated. - tests/smoke/fixtures/secure-form.ts: fillPassword wrapper redacts form-encoded passwords from Playwright trace. - scripts/smoke.sh + .env.example: local + CI identical invocation. - .github/workflows/smoke.yml: trigger matrix (PR-scoped / nightly cron / push-to-main / workflow_dispatch); scope decides full vs scoped from changed paths; global serial concurrency (no cancel-in-progress — cancelled runs leave orphan AWS state); configure-aws-credentials with role-duration-seconds=21600 (6h) to match the role's max-session-duration; pre-deploy state check with auto-recovery for stranded stacks; SCP drift check (excluding FullAWSAccess, fail-soft for first 7 detections); quarantine-expiry check; CFN events captured BEFORE teardown; teardown with 3x60s retry, gated on aws-creds outcome so we don't burn 3min retrying without credentials. - .github/workflows/quarterly-audit.yml: 3-monthly tracking issue (spend, orphan sweep, deploy-role policy drift, SCP drift, Renovate liveness, ProtectISB-fallback revisit). - .github/CODEOWNERS: smoke-pack sensitive paths require @chrisns review (until a maintainers team is provisioned). Phase 4 — 17 per-scenario smoke specs - One spec per scenario covering the auth-mode pattern (admin-login / public / sso-skip / umbrella). Bug-informed feature flows cite the historical regression that informed each test: - fixmystreet: /reports requires bin/update-all-reports; /admin must reach the dashboard without 2FA redirect - planx: SPA boots free of domain-allowlist / Airbrake errors; Hasura native /v1/version responds (Caddy elimination) - minute: magic-link sets cookie; same-origin fetch() works post-auth; /api/proxy/healthcheck reaches the backend (catches the basic-auth-breaks-fetch() regression and the ALB /api/* interception regression) - localgov-ims: Windows IIS multi-site routing; AdminPassword must not be the literal {{resolve:...}} token (catches the Lambda-custom-resource regression) - localgov-drupal: ndx_aws_ai module boots without Bedrock AccessDeniedException - simply-readable: SPA loads, credentials non-empty + non-token; reload produces no 5xx responses (catches BlueprintsBucketName mis-wire) - ai-contact-centre: PSTN claim matches UK toll-free / landline OR US toll-free (catches international fallback regression) - paperless-ngx: /documents view + /api/documents/ respond (S3 Files mount integrity) - bops-planning: post-login URL is NOT on the Applicants port (catches the routing.rb single-tenant override regression) - digital-planning-register: register loads with planning markers - public-Lambda scenarios (foi-redaction, planning-ai, smart-car-park, text-to-speech, council-chatbot): FunctionURL not-5xx + not-403 (catches the InvokeFunctionUrl + InvokeFunction dual-permission regression); council-chatbot uses POST not GET so the test isn't vacuous against a POST-only Lambda - quicksight-dashboard: landing + outputs only (sso-skip per auth- mode categorisation) - all-demo: discovers Output keys dynamically by parsing the committed template at test time; asserts every Output present, non-empty, and not the {{resolve:...}} literal; URL outputs match https?:// Phase 5 — pin every floating image tag - 10 own-GHCR images (fixmystreet, localgov_drupal, minute_*, planx-*, dpr) pinned to sha-<7chars>@sha256:<digest>. - 2 upstream images (docker.io/apache/tika 3.3.0.0-full, ghcr.io/paperless-ngx/paperless-ngx 2.9) pinned to <tag>@sha256:<digest>. - Removed legacy cloudformation/scenarios/minute/template.json (stale ECR references; nothing in the repo referenced it). Phase 6 — Renovate adoption (replaces Dependabot) - renovate.json: 6 group rules per the spec's pinning-strategy table; customManagers regex matching the new pin shape; osvVulnerabilityAlerts + security-priority group; pinDigests scoped to official actions/* + aws-actions/* only so the first run doesn't firehose; per-PR limits capped at 6. - .github/workflows/renovate.yml: twice-daily + workflow_dispatch. Action pinned by digest to v46.1.14. - .github/dependabot.yml deleted. Operator follow-ups (not in this PR) ==================================== - NAP-548: migrate scenarios off legacy claude-3-haiku-20240307 - NAP-549: revisit ProtectISB fallback by 2026-11-12 - NAP-550: service-quota Console requests - NAP-551: QuickSight subscription decision - NAP-552: mint RENOVATE_TOKEN repo secret - NAP-554: close in-flight Dependabot PRs - NAP-555: T2b.5b + T3.8 end-to-end verifications Closes: #226, #227, #228, #229, #230, #231, #232, #233, #235.
Implements every phase of _bmad-output/implementation-artifacts/tech-spec-scenario-regression-smoke-pack.md in a single deliverable. The original spec called for one PR per phase (8+ PRs); experience showed the dependency overlap made that worse for review, not better, so this squashes #226 / #227 / #228 / #229 / #230 / #231 / #232 / #233 / #235 into a single change. What ships ========== Phase 1a — runbook + config schema - docs/smoke-test-account-setup.md: one-off manual procedure for vending the long-lived smoke-test AWS account, with the four required sections (Prerequisites / Procedure / Verification / Operational Notes). Per-step idempotency checks + inverses; ProtectISB role-creation canary + fallback branch (ADR-1); Bedrock model-access enablement + gotchas (legacy claude-3-haiku-20240307 retired, Nova body shape); service-quota targets; QuickSight decision; iterate-to-least-privilege protocol for the inline IAM policy. - docs/smoke-test-account-config.yml: post-runbook state record schema. Phase 1b — operator-executed account state - Smoke account 464453619983 provisioned in NDX org under the fallback branch (ProtectISB canary failed; account moved to root with Restrictions SCP attached directly). AwsNuke SCP intentionally NOT attached (it blocks sts:AssumeRoleWithWebIdentity and we use CFN delete + retention lint, not aws-nuke). - OIDC provider + InnovationSandbox-ndx-SmokeTestDeployRole created with 6h max-session-duration. Trust policy uses sub-pattern lock (`repo:co-cddo/ndx_try_aws_scenarios:*`) + aud condition; the repository_owner claim condition is omitted because it reproducibly breaks the assume even though the OIDC token contains the claim (verified via JWT decode in an investigation workflow that has since been deleted; see runbook Step 10). - expected_scps reflects live state: Restrictions + FullAWSAccess. Phase 2a — synth pipelines for missing scenarios - New synth jobs in .github/workflows/deploy-blueprints.yml for planx and digital-planning-register (CDK -> template.yaml -> S3 via the existing isb-hub upload chain). - bops-planning synth job lands in Phase 2b after the retention lint is justification-aware. - ai-contact-centre: new "verify packaged CodeUri targets blueprints bucket" step catches a sam-package regression where --s3-bucket would silently land in the SAM default bucket. Phase 2b — all-demo expansion + retention lint - cloudformation/scenarios/all-demo/template.yaml expanded from 7 to 16 nested scenarios (Minute, FixMyStreet, AI Contact Centre, LocalGov IMS, Paperless-ngx, PlanX, Bops Planning, Simply Readable, Digital Planning Register). Umbrella parameters for credentials (GovUkPayApiKey, OSVectorTilesApiKey, DprImageUri, DprCouncilConfig) with overridable empty / sensible defaults; per-scenario URL + admin-credential Outputs surfaced. - scripts/lint-retention-policies.sh: forbids DeletionPolicy=Retain / UpdateReplacePolicy=Retain / Properties.DeletionProtection=true / Properties.EnableDeletionProtection=true / Properties.FinalSnapshotIdentifier unless the resource carries a non-empty Metadata.Justification. Per-template cap (default 3) + global cap (default 10) so any one scenario can't pencil-whip retentions repo-wide. - lint-committed-templates job in deploy-blueprints.yml runs the lint over hand-authored CFN templates. - bops-planning's LogGroup keeps RemovalPolicy.RETAIN (deliberate debug-after-rollback) with a Metadata.Justification attached via cfnOptions; bops synth job re-enabled. Phase 3 — smoke rails - playwright.config.ts: new 'smoke' project gated on PLAYWRIGHT_SUITE=smoke. - tests/smoke/fixtures/cfn-outputs.ts: SDK-v3 DescribeStacks helper. Sensitive output values flow only via explicit sensitiveValue() accessor; toString / inspect / Symbol.toPrimitive emit REDACTED placeholder. Documents the CloudFormation-API limitation that Output Metadata.Sensitive opt-in isn't readable (regex is the sole signal). - tests/smoke/fixtures/assertion-bar.ts: 17 AssertionBarRow entries populated. - tests/smoke/fixtures/secure-form.ts: fillPassword wrapper redacts form-encoded passwords from Playwright trace. - scripts/smoke.sh + .env.example: local + CI identical invocation. - .github/workflows/smoke.yml: trigger matrix (PR-scoped / nightly cron / push-to-main / workflow_dispatch); scope decides full vs scoped from changed paths; global serial concurrency (no cancel-in-progress — cancelled runs leave orphan AWS state); configure-aws-credentials with role-duration-seconds=21600 (6h) to match the role's max-session-duration; pre-deploy state check with auto-recovery for stranded stacks; SCP drift check (excluding FullAWSAccess, fail-soft for first 7 detections); quarantine-expiry check; CFN events captured BEFORE teardown; teardown with 3x60s retry, gated on aws-creds outcome so we don't burn 3min retrying without credentials. - .github/workflows/quarterly-audit.yml: 3-monthly tracking issue (spend, orphan sweep, deploy-role policy drift, SCP drift, Renovate liveness, ProtectISB-fallback revisit). - .github/CODEOWNERS: smoke-pack sensitive paths require @chrisns review (until a maintainers team is provisioned). Phase 4 — 17 per-scenario smoke specs - One spec per scenario covering the auth-mode pattern (admin-login / public / sso-skip / umbrella). Bug-informed feature flows cite the historical regression that informed each test: - fixmystreet: /reports requires bin/update-all-reports; /admin must reach the dashboard without 2FA redirect - planx: SPA boots free of domain-allowlist / Airbrake errors; Hasura native /v1/version responds (Caddy elimination) - minute: magic-link sets cookie; same-origin fetch() works post-auth; /api/proxy/healthcheck reaches the backend (catches the basic-auth-breaks-fetch() regression and the ALB /api/* interception regression) - localgov-ims: Windows IIS multi-site routing; AdminPassword must not be the literal {{resolve:...}} token (catches the Lambda-custom-resource regression) - localgov-drupal: ndx_aws_ai module boots without Bedrock AccessDeniedException - simply-readable: SPA loads, credentials non-empty + non-token; reload produces no 5xx responses (catches BlueprintsBucketName mis-wire) - ai-contact-centre: PSTN claim matches UK toll-free / landline OR US toll-free (catches international fallback regression) - paperless-ngx: /documents view + /api/documents/ respond (S3 Files mount integrity) - bops-planning: post-login URL is NOT on the Applicants port (catches the routing.rb single-tenant override regression) - digital-planning-register: register loads with planning markers - public-Lambda scenarios (foi-redaction, planning-ai, smart-car-park, text-to-speech, council-chatbot): FunctionURL not-5xx + not-403 (catches the InvokeFunctionUrl + InvokeFunction dual-permission regression); council-chatbot uses POST not GET so the test isn't vacuous against a POST-only Lambda - quicksight-dashboard: landing + outputs only (sso-skip per auth- mode categorisation) - all-demo: discovers Output keys dynamically by parsing the committed template at test time; asserts every Output present, non-empty, and not the {{resolve:...}} literal; URL outputs match https?:// Phase 5 — pin every floating image tag - 10 own-GHCR images (fixmystreet, localgov_drupal, minute_*, planx-*, dpr) pinned to sha-<7chars>@sha256:<digest>. - 2 upstream images (docker.io/apache/tika 3.3.0.0-full, ghcr.io/paperless-ngx/paperless-ngx 2.9) pinned to <tag>@sha256:<digest>. - Removed legacy cloudformation/scenarios/minute/template.json (stale ECR references; nothing in the repo referenced it). Phase 6 — Renovate adoption (replaces Dependabot) - renovate.json: 6 group rules per the spec's pinning-strategy table; customManagers regex matching the new pin shape; osvVulnerabilityAlerts + security-priority group; pinDigests scoped to official actions/* + aws-actions/* only so the first run doesn't firehose; per-PR limits capped at 6. - .github/workflows/renovate.yml: twice-daily + workflow_dispatch. Action pinned by digest to v46.1.14. - .github/dependabot.yml deleted. Operator follow-ups (not in this PR) ==================================== - NAP-548: migrate scenarios off legacy claude-3-haiku-20240307 - NAP-549: revisit ProtectISB fallback by 2026-11-12 - NAP-550: service-quota Console requests - NAP-551: QuickSight subscription decision - NAP-552: mint RENOVATE_TOKEN repo secret - NAP-554: close in-flight Dependabot PRs - NAP-555: T2b.5b + T3.8 end-to-end verifications Closes: #226, #227, #228, #229, #230, #231, #232, #233, #235.
Summary
Phase 1b of the smoke-pack tech-spec: the runbook in PR #226 has been executed end-to-end against the NDX org-management account (`955063685555`). The smoke-test account exists with the OIDC deploy role configured.
Outcomes
Why the fallback
Step 7 canary failed: ProtectISB (`p-gn4fu3co`) explicitly denies `iam:CreateRole` on `arn:aws:iam:::role/InnovationSandbox-` from `OrganizationAccountAccessRole` (non-ISB-allow-listed principal). Verbatim error:
Per runbook fallback, the account was moved out from under `sandboxOu` to the org root, `p-6tw8eixp` + `p-7pd0szg9` were attached directly to the account, and the post-fallback canary passed.
Acknowledged faithfulness loss
This account does NOT reproduce ProtectISB-driven failure modes (role-name-prefix denials against `InnovationSandbox-*` resources, Secrets Manager protections on ISB-named resources). The smoke pack gets an approximation of an Active sandbox SCP profile, not a literal one. A `scp-fallback-revisit` follow-up issue is required (6-month cadence) to re-check whether ProtectISB has loosened.
Operator follow-ups still pending (not in this PR)
Depends on
Test plan