Context
Follow-up to #1235 (INF-05 fix). That PR correctly tightened the ECR delete loop scope and removed the silent || 'cudly' fallback from rollback.yml's validate job. Two adjacent hardening opportunities in the same two files weren't part of INF-05's surgical scope but are the same shape of risk.
1. cleanup-staging.yml: AWS error is silently swallowed at the ECR cleanup loop's edge
Both copies of the loop (lines 78-95 and 148-165) start with:
for REPO in $(aws ecr describe-repositories \
--query "repositories[?starts_with(repositoryName,'cudly-staging')].repositoryName" \
--output text 2>/dev/null); do
2>/dev/null swallows any AWS error (auth failure, network, throttle, region typo). On error, the command substitution evaluates to empty, the for loop body never executes, the post-PR defense-in-depth case guard never fires, and the workflow proceeds to the terraform destroy step. The error surfaces later (during destroy) but at the wrong layer — it looks like a destroy problem, not a credential/perm problem.
Cleaner shape: capture the listing into a temp var with set -o pipefail semantics, and exit 1 on listing failure. The cleanup is already gated on a typed "destroy" confirmation, so failing loud here is the right move (memory feedback_no_silent_fallbacks.md).
2. rollback.yml verify-image step still uses inline ${{ vars.ECR_REPOSITORY }} instead of the env-var pattern
In #1235, the validate job was hardened from ${{ vars.ECR_REPOSITORY }} inline-interpolation to a proper env-var indirection:
env:
ECR_REPOSITORY: ${{ vars.ECR_REPOSITORY }}
run: |
...
IMAGE_URI="...${ECR_REPOSITORY}:..."
The parallel verify-image > Verify AWS image step (.github/workflows/rollback.yml:133-154) was given the same presence check but kept the inline pattern:
- name: Verify AWS image
if: startsWith(inputs.cloud, 'aws')
run: |
IMAGE_TAG="${{ inputs.image_tag }}"
REPO="${{ vars.ECR_REPOSITORY }}"
...
aws ecr describe-images \
--repository-name $REPO \
--image-ids imageTag=$IMAGE_TAG \
Repo variables come from a closed admin allowlist, so the realistic exploit surface is small, but the inconsistency within the same file is a footgun for future copy-paste. Two micro-changes for consistency with #1235:
- Move
vars.ECR_REPOSITORY (and arguably vars.AWS_REGION) into a step-level env: block; reference as ${ECR_REPOSITORY} / ${AWS_REGION} in the script body.
- Quote
"$REPO" and "$IMAGE_TAG" in the aws ecr describe-images invocation (both are shell-safe today thanks to the ^[a-zA-Z0-9._-]+$ regex on inputs.image_tag, but quoting is free defense-in-depth).
Why now
Both are low-severity hygiene items in the same two files PR #1235 already opened. Bundling them with the same blast-radius rationale (cleanup of force-delete loops + the rollback path) keeps the audit trail tight.
Out of scope
- GCP
--filter="name:cudly-staging" in destroy-gcp uses GCP filter-substring semantics; a tighter regex/exact filter would parallel the same hardening shape but is owned by a separate cleanup path. Not bundling here.
Context
Follow-up to #1235 (INF-05 fix). That PR correctly tightened the ECR delete loop scope and removed the silent
|| 'cudly'fallback fromrollback.yml'svalidatejob. Two adjacent hardening opportunities in the same two files weren't part of INF-05's surgical scope but are the same shape of risk.1.
cleanup-staging.yml: AWS error is silently swallowed at the ECR cleanup loop's edgeBoth copies of the loop (lines 78-95 and 148-165) start with:
2>/dev/nullswallows any AWS error (auth failure, network, throttle, region typo). On error, the command substitution evaluates to empty, theforloop body never executes, the post-PR defense-in-depthcaseguard never fires, and the workflow proceeds to theterraform destroystep. The error surfaces later (duringdestroy) but at the wrong layer — it looks like a destroy problem, not a credential/perm problem.Cleaner shape: capture the listing into a temp var with
set -o pipefailsemantics, andexit 1on listing failure. The cleanup is already gated on a typed "destroy" confirmation, so failing loud here is the right move (memoryfeedback_no_silent_fallbacks.md).2.
rollback.ymlverify-imagestep still uses inline${{ vars.ECR_REPOSITORY }}instead of the env-var patternIn #1235, the
validatejob was hardened from${{ vars.ECR_REPOSITORY }}inline-interpolation to a proper env-var indirection:The parallel
verify-image > Verify AWS imagestep (.github/workflows/rollback.yml:133-154) was given the same presence check but kept the inline pattern:Repo variables come from a closed admin allowlist, so the realistic exploit surface is small, but the inconsistency within the same file is a footgun for future copy-paste. Two micro-changes for consistency with #1235:
vars.ECR_REPOSITORY(and arguablyvars.AWS_REGION) into a step-levelenv:block; reference as${ECR_REPOSITORY}/${AWS_REGION}in the script body."$REPO"and"$IMAGE_TAG"in theaws ecr describe-imagesinvocation (both are shell-safe today thanks to the^[a-zA-Z0-9._-]+$regex oninputs.image_tag, but quoting is free defense-in-depth).Why now
Both are low-severity hygiene items in the same two files PR #1235 already opened. Bundling them with the same blast-radius rationale (cleanup of force-delete loops + the rollback path) keeps the audit trail tight.
Out of scope
--filter="name:cudly-staging"indestroy-gcpuses GCP filter-substring semantics; a tighter regex/exact filter would parallel the same hardening shape but is owned by a separate cleanup path. Not bundling here.