spec + Phase 1a: scenario-regression smoke-pack tech-spec + setup runbook #226
Closed
chrisns wants to merge 5 commits into
Six-phase plan for an end-to-end scenario regression rig: long-lived SCP-bound test account (manual runbook), all-demo expansion to 17 scenarios, GitHub OIDC + environment-gated deploy role, Playwright smoke pack with per-scenario assertion bar, image-tag pinning, and self-hosted Renovate. Includes 39+ tasks, 30+ ACs, four ADRs, and fixes from two rounds of adversarial review.
Phase 1a of the scenario-regression smoke-pack tech-spec: a one-off manual procedure for vending the long-lived AWS account that hosts the smoke pack, plus the companion machine-readable config the smoke workflow consumes. The runbook covers the four required sections (Prerequisites, Procedure, Verification, Operational Notes) and bakes in the ProtectISB role-creation deadlock canary + fallback branch from ADR-1, the Bedrock model-access enablement procedure with TOS click-through gotcha capture, service-quota target table, and the iterate-to-least-privilege protocol for the deploy-role inline policy. Each Procedure step opens with an idempotency check and lists its inverse. The config file commits the post-runbook state schema (8 fields) in template form. Phase 1b populates real values via PR once the runbook has been executed by an operator with org-management SSO.
This was referenced May 12, 2026
Address review finding: the runbook's Step 12 canary used the Anthropic-shaped body for all three model families. Sending it to Nova produces ValidationException (extraneous key max_tokens); sending it via --body 'JSON' instead of --body fileb://path errors with InvalidBase64. Future operators following the runbook would hit two red herrings before reaching real access denials.

Fixes:
- Use --body fileb://<file> for all three families (CLI requires base64 or fileb:// for binary args).
- Provide per-family body files: claude-body.json, nova-body.json, titan-embed-body.json.
- Run each model with its native body shape.
- Replace the now-Legacy anthropic.claude-3-haiku-20240307-v1:0 with anthropic.claude-3-5-haiku-20241022-v1:0 (the recommended Anthropic drop-in per the Bedrock model-access gotcha section). The legacy model's "ResourceNotFoundException: marked as Legacy" was caught by T3.8 and is tracked in NAP-548; the runbook gotcha section warns about it, so the canary itself can stop using it.
- Drop the redundant "Note for the Nova embedding models" block; the verification block above already demonstrates the embed shape.
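To make the per-family body shapes concrete, here is a minimal sketch of what the three body files might contain and how the CLI call is assembled. The file names come from the commit message above; `canaryBody`, `invokeCommand`, the prompt text, and the `max_tokens` value are hypothetical, and the runbook's actual bodies may carry more fields.

```typescript
// Sketch only: per-family InvokeModel bodies for the Step 12 canary.
// canaryBody / invokeCommand are hypothetical helper names, not runbook code.
type ModelFamily = "claude" | "nova" | "titan-embed";

const BODY_FILES: Record<ModelFamily, string> = {
  claude: "claude-body.json",
  nova: "nova-body.json",
  "titan-embed": "titan-embed-body.json",
};

function canaryBody(family: ModelFamily, prompt: string): Record<string, unknown> {
  if (family === "claude") {
    // Anthropic Messages shape: anthropic_version and max_tokens are required.
    return {
      anthropic_version: "bedrock-2023-05-31",
      max_tokens: 16, // illustrative value
      messages: [{ role: "user", content: [{ type: "text", text: prompt }] }],
    };
  }
  if (family === "nova") {
    // Nova shape: no top-level max_tokens (the key that caused ValidationException).
    return { messages: [{ role: "user", content: [{ text: prompt }] }] };
  }
  // Titan text embeddings take a single inputText field.
  return { inputText: prompt };
}

// Builds the CLI invocation; the body is always passed as fileb://<file>.
function invokeCommand(family: ModelFamily, modelId: string, outFile: string): string {
  return `aws bedrock-runtime invoke-model --model-id ${modelId} --body fileb://${BODY_FILES[family]} ${outFile}`;
}
```

Writing `JSON.stringify(canaryBody(family, "ping"))` to the matching file and running `invokeCommand(...)` per family keeps each model on its native shape, so a failure is more likely to be a genuine access denial than a body-format error.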
Adversarial-review followup landed in 7255ee4: Bedrock canary now uses per-family body files (fileb://), switches from the now-Legacy claude-3-haiku-20240307 to claude-3-5-haiku-20241022, and drops the duplicated titan-embed note.
OIDC-issued temporary credentials don't auto-refresh. The default 1h max-session-duration expires mid-run for any smoke deploy that exceeds 1 hour (single scenarios like localgov-drupal / planx take 45-60 min; all-demo nests 16 in parallel and historically takes 60-90 min). Once the credentials expire, every subsequent AWS call in the workflow fails NoCredentials, including teardown. 6h (21600s) covers any realistic smoke run with margin. The same fix was applied to the live role separately.
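The arithmetic behind the 6h choice can be sketched as below. The two durations are the ones quoted above; `sessionCoversRun`, `headroomMinutes`, and the separate teardown allowance are hypothetical and only illustrate the check.

```typescript
// Values from the commit message: the 1h default versus the 6h fix.
const DEFAULT_MAX_SESSION_SECONDS = 3600;
const SMOKE_MAX_SESSION_SECONDS = 21600;

// Hypothetical helper: true when credentials issued at the start of the run
// are still valid once teardown has finished.
function sessionCoversRun(
  maxSessionSeconds: number,
  deployMinutes: number,
  teardownMinutes: number, // assumed allowance, not a figure from the runbook
): boolean {
  return (deployMinutes + teardownMinutes) * 60 <= maxSessionSeconds;
}

// Remaining headroom in minutes; a negative result means expiry mid-run.
function headroomMinutes(
  maxSessionSeconds: number,
  deployMinutes: number,
  teardownMinutes: number,
): number {
  return maxSessionSeconds / 60 - (deployMinutes + teardownMinutes);
}
```

With a 90-minute all-demo deploy and an assumed 10-minute teardown, the 3600s default leaves negative headroom while 21600s leaves several hours, which is the margin the commit message refers to.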
Third-pass finding: the previous fix to max-session-duration only applied to the create-role path. Re-running the runbook against an existing role (created at the older 3600s default) would silently keep the short session. Add an aws iam update-role call so re-runs refresh the duration in lockstep with the trust policy refresh.
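A rough sketch of the create-versus-re-run branching described above, expressed as a planner that returns the CLI commands to run. `planRoleCommands`, the `ExistingRole` shape, and the trust-policy file name are hypothetical; the runbook's real step may order or guard these calls differently.

```typescript
// Hypothetical description of what a lookup of the role returned
// (undefined when the role does not exist yet).
interface ExistingRole {
  maxSessionDuration: number;
}

const DESIRED_MAX_SESSION_SECONDS = 21600; // 6h, per the earlier fix

function planRoleCommands(
  roleName: string,
  existing: ExistingRole | undefined,
  trustPolicyFile: string,
): string[] {
  if (existing === undefined) {
    // First run: the duration is set at creation time.
    return [
      `aws iam create-role --role-name ${roleName} --assume-role-policy-document file://${trustPolicyFile} --max-session-duration ${DESIRED_MAX_SESSION_SECONDS}`,
    ];
  }
  // Re-run: refresh the trust policy and the session duration together, so a
  // role created at the older 3600s default does not silently keep it.
  return [
    `aws iam update-assume-role-policy --role-name ${roleName} --policy-document file://${trustPolicyFile}`,
    `aws iam update-role --role-name ${roleName} --max-session-duration ${DESIRED_MAX_SESSION_SECONDS}`,
  ];
}
```

Issuing `update-role` unconditionally on the re-run path keeps the step idempotent: running it against a role that already has the 6h duration changes nothing.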
5 tasks
Superseded by #236: single squashed PR for all 6 phases. Dependency chain across 9 PRs made parallel review worse than serial; one PR with one review and one CI run is simpler.
chrisns added a commit that referenced this pull request on May 12, 2026
Implements every phase of _bmad-output/implementation-artifacts/tech-spec-scenario-regression-smoke-pack.md in a single deliverable. The original spec called for one PR per phase (8+ PRs); experience showed the dependency overlap made that worse for review, not better, so this squashes #226 / #227 / #228 / #229 / #230 / #231 / #232 / #233 / #235 into a single change.

What ships
==========

Phase 1a: runbook + config schema
- docs/smoke-test-account-setup.md: one-off manual procedure for vending the long-lived smoke-test AWS account, with the four required sections (Prerequisites / Procedure / Verification / Operational Notes). Per-step idempotency checks + inverses; ProtectISB role-creation canary + fallback branch (ADR-1); Bedrock model-access enablement + gotchas (legacy claude-3-haiku-20240307 retired, Nova body shape); service-quota targets; QuickSight decision; iterate-to-least-privilege protocol for the inline IAM policy.
- docs/smoke-test-account-config.yml: post-runbook state record schema.

Phase 1b: operator-executed account state
- Smoke account 464453619983 provisioned in NDX org under the fallback branch (ProtectISB canary failed; account moved to root with Restrictions SCP attached directly). AwsNuke SCP intentionally NOT attached (it blocks sts:AssumeRoleWithWebIdentity and we use CFN delete + retention lint, not aws-nuke).
- OIDC provider + InnovationSandbox-ndx-SmokeTestDeployRole created with 6h max-session-duration. Trust policy uses sub-pattern lock (`repo:co-cddo/ndx_try_aws_scenarios:*`) + aud condition; the repository_owner claim condition is omitted because it reproducibly breaks the assume even though the OIDC token contains the claim (verified via JWT decode in an investigation workflow that has since been deleted; see runbook Step 10).
- expected_scps reflects live state: Restrictions + FullAWSAccess.

Phase 2a: synth pipelines for missing scenarios
- New synth jobs in .github/workflows/deploy-blueprints.yml for planx and digital-planning-register (CDK -> template.yaml -> S3 via the existing isb-hub upload chain).
- bops-planning synth job lands in Phase 2b after the retention lint is justification-aware.
- ai-contact-centre: new "verify packaged CodeUri targets blueprints bucket" step catches a sam-package regression where --s3-bucket would silently land in the SAM default bucket.

Phase 2b: all-demo expansion + retention lint
- cloudformation/scenarios/all-demo/template.yaml expanded from 7 to 16 nested scenarios (Minute, FixMyStreet, AI Contact Centre, LocalGov IMS, Paperless-ngx, PlanX, Bops Planning, Simply Readable, Digital Planning Register). Umbrella parameters for credentials (GovUkPayApiKey, OSVectorTilesApiKey, DprImageUri, DprCouncilConfig) with overridable empty / sensible defaults; per-scenario URL + admin-credential Outputs surfaced.
- scripts/lint-retention-policies.sh: forbids DeletionPolicy=Retain / UpdateReplacePolicy=Retain / Properties.DeletionProtection=true / Properties.EnableDeletionProtection=true / Properties.FinalSnapshotIdentifier unless the resource carries a non-empty Metadata.Justification. Per-template cap (default 3) + global cap (default 10) so any one scenario can't pencil-whip retentions repo-wide.
- lint-committed-templates job in deploy-blueprints.yml runs the lint over hand-authored CFN templates.
- bops-planning's LogGroup keeps RemovalPolicy.RETAIN (deliberate debug-after-rollback) with a Metadata.Justification attached via cfnOptions; bops synth job re-enabled.

Phase 3: smoke rails
- playwright.config.ts: new 'smoke' project gated on PLAYWRIGHT_SUITE=smoke.
- tests/smoke/fixtures/cfn-outputs.ts: SDK-v3 DescribeStacks helper. Sensitive output values flow only via explicit sensitiveValue() accessor; toString / inspect / Symbol.toPrimitive emit REDACTED placeholder. Documents the CloudFormation-API limitation that Output Metadata.Sensitive opt-in isn't readable (regex is the sole signal).
- tests/smoke/fixtures/assertion-bar.ts: 17 AssertionBarRow entries populated.
- tests/smoke/fixtures/secure-form.ts: fillPassword wrapper redacts form-encoded passwords from Playwright trace.
- scripts/smoke.sh + .env.example: local + CI identical invocation.
- .github/workflows/smoke.yml: trigger matrix (PR-scoped / nightly cron / push-to-main / workflow_dispatch); scope decides full vs scoped from changed paths; global serial concurrency (no cancel-in-progress; cancelled runs leave orphan AWS state); configure-aws-credentials with role-duration-seconds=21600 (6h) to match the role's max-session-duration; pre-deploy state check with auto-recovery for stranded stacks; SCP drift check (excluding FullAWSAccess, fail-soft for first 7 detections); quarantine-expiry check; CFN events captured BEFORE teardown; teardown with 3x60s retry, gated on aws-creds outcome so we don't burn 3min retrying without credentials.
- .github/workflows/quarterly-audit.yml: 3-monthly tracking issue (spend, orphan sweep, deploy-role policy drift, SCP drift, Renovate liveness, ProtectISB-fallback revisit).
- .github/CODEOWNERS: smoke-pack sensitive paths require @chrisns review (until a maintainers team is provisioned).

Phase 4: 17 per-scenario smoke specs
- One spec per scenario covering the auth-mode pattern (admin-login / public / sso-skip / umbrella). Bug-informed feature flows cite the historical regression that informed each test:
  - fixmystreet: /reports requires bin/update-all-reports; /admin must reach the dashboard without 2FA redirect
  - planx: SPA boots free of domain-allowlist / Airbrake errors; Hasura native /v1/version responds (Caddy elimination)
  - minute: magic-link sets cookie; same-origin fetch() works post-auth; /api/proxy/healthcheck reaches the backend (catches the basic-auth-breaks-fetch() regression and the ALB /api/* interception regression)
  - localgov-ims: Windows IIS multi-site routing; AdminPassword must not be the literal {{resolve:...}} token (catches the Lambda-custom-resource regression)
  - localgov-drupal: ndx_aws_ai module boots without Bedrock AccessDeniedException
  - simply-readable: SPA loads, credentials non-empty + non-token; reload produces no 5xx responses (catches BlueprintsBucketName mis-wire)
  - ai-contact-centre: PSTN claim matches UK toll-free / landline OR US toll-free (catches international fallback regression)
  - paperless-ngx: /documents view + /api/documents/ respond (S3 Files mount integrity)
  - bops-planning: post-login URL is NOT on the Applicants port (catches the routing.rb single-tenant override regression)
  - digital-planning-register: register loads with planning markers
  - public-Lambda scenarios (foi-redaction, planning-ai, smart-car-park, text-to-speech, council-chatbot): FunctionURL not-5xx + not-403 (catches the InvokeFunctionUrl + InvokeFunction dual-permission regression); council-chatbot uses POST not GET so the test isn't vacuous against a POST-only Lambda
  - quicksight-dashboard: landing + outputs only (sso-skip per auth-mode categorisation)
  - all-demo: discovers Output keys dynamically by parsing the committed template at test time; asserts every Output present, non-empty, and not the {{resolve:...}} literal; URL outputs match https?://

Phase 5: pin every floating image tag
- 10 own-GHCR images (fixmystreet, localgov_drupal, minute_*, planx-*, dpr) pinned to sha-<7chars>@sha256:<digest>.
- 2 upstream images (docker.io/apache/tika 3.3.0.0-full, ghcr.io/paperless-ngx/paperless-ngx 2.9) pinned to <tag>@sha256:<digest>.
- Removed legacy cloudformation/scenarios/minute/template.json (stale ECR references; nothing in the repo referenced it).

Phase 6: Renovate adoption (replaces Dependabot)
- renovate.json: 6 group rules per the spec's pinning-strategy table; customManagers regex matching the new pin shape; osvVulnerabilityAlerts + security-priority group; pinDigests scoped to official actions/* + aws-actions/* only so the first run doesn't firehose; per-PR limits capped at 6.
- .github/workflows/renovate.yml: twice-daily + workflow_dispatch. Action pinned by digest to v46.1.14.
- .github/dependabot.yml deleted.

Operator follow-ups (not in this PR)
====================================
- NAP-548: migrate scenarios off legacy claude-3-haiku-20240307
- NAP-549: revisit ProtectISB fallback by 2026-11-12
- NAP-550: service-quota Console requests
- NAP-551: QuickSight subscription decision
- NAP-552: mint RENOVATE_TOKEN repo secret
- NAP-554: close in-flight Dependabot PRs
- NAP-555: T2b.5b + T3.8 end-to-end verifications

Closes: #226, #227, #228, #229, #230, #231, #232, #233, #235.
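For reviewers unfamiliar with the redaction approach described for tests/smoke/fixtures/cfn-outputs.ts, the idea might look roughly like the sketch below. This is not the committed fixture: `wrapOutput`, the `CfnOutputValue` shape, the key-name regex, and the `[REDACTED]` string are all assumptions chosen for illustration.

```typescript
// Assumed key-name pattern; the commit message only says a regex is the sole signal.
const SENSITIVE_KEY = /password|secret|token|apikey/i;
const REDACTED = "[REDACTED]"; // assumed placeholder text

// Node's util.inspect.custom is this registered symbol, so no import is needed.
const INSPECT = Symbol.for("nodejs.util.inspect.custom");

interface CfnOutputValue {
  readonly key: string;
  readonly sensitive: boolean;
  sensitiveValue(): string; // the only way to read the raw value
  toString(): string;
  toJSON(): string;
}

function wrapOutput(key: string, raw: string): CfnOutputValue {
  const sensitive = SENSITIVE_KEY.test(key);
  const shown = sensitive ? REDACTED : raw;
  // The raw value lives only in this closure, never as an enumerable property.
  const value = {
    key,
    sensitive,
    sensitiveValue: () => raw,
    toString: () => shown,
    toJSON: () => shown,
    [Symbol.toPrimitive]: () => shown,
    [INSPECT]: () => shown,
  };
  return value;
}
```

Because every implicit conversion path (string interpolation, JSON serialisation, console inspection) returns the placeholder, a spec has to call `sensitiveValue()` deliberately, for example when handing a password to the fillPassword wrapper.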
chrisns added a commit that referenced this pull request on May 13, 2026
Summary
Bundles the multi-PR plan and its first implementation deliverable into a single review.
5b53c89): 6-phase plan to gate scenario regressions via a single `all-demo` smoke pack running against a long-lived test account that inherits ISB SCPs. Five ADRs commit the headline decisions (OU placement, Renovate vs Dependabot, runbook-not-IaC, retention-lint not nuke-tool). See `_bmad-output/implementation-artifacts/tech-spec-scenario-regression-smoke-pack.md`.

The runbook is reviewed FIRST (this PR) so the procedure is correct before any AWS state changes. Phase 1b (operator executes the merged runbook + PRs the populated config) follows once a reviewer has signed off here.
Required reviewer profile
Security / org-admin: someone who can scrutinise the org-management CLI procedure, the deploy-role trust policy (OIDC subject + repository_owner claim), the baseline inline policy, and the ProtectISB fallback branch. The Procedure section reads end-to-end; if any step seems unclear or unsafe, comment inline.
Test plan
This PR is documentation. Substantive verification happens in Phase 1b when an operator follows the runbook against the real AWS Organizations API.