Skip to content

ci: harden remaining staging cleanup + rollback hardening (silent AWS error swallowing, verify-image var-interpolation) #1330

Description

@cristim

Context

Follow-up to #1235 (INF-05 fix). That PR correctly tightened the ECR delete loop scope and removed the silent || 'cudly' fallback from rollback.yml's validate job. Two adjacent hardening opportunities in the same two files weren't part of INF-05's surgical scope but are the same shape of risk.

1. cleanup-staging.yml: AWS error is silently swallowed at the ECR cleanup loop's edge

Both copies of the loop (lines 78-95 and 148-165) start with:

for REPO in $(aws ecr describe-repositories \
  --query "repositories[?starts_with(repositoryName,'cudly-staging')].repositoryName" \
  --output text 2>/dev/null); do

2>/dev/null swallows any AWS error (auth failure, network, throttle, region typo). On error, the command substitution evaluates to empty, the for loop body never executes, the post-PR defense-in-depth case guard never fires, and the workflow proceeds to the terraform destroy step. The error surfaces later (during destroy) but at the wrong layer — it looks like a destroy problem, not a credential/perm problem.

Cleaner shape: capture the listing into a temp var with set -o pipefail semantics, and exit 1 on listing failure. The cleanup is already gated on a typed "destroy" confirmation, so failing loud here is the right move (memory feedback_no_silent_fallbacks.md).

2. rollback.yml verify-image step still uses inline ${{ vars.ECR_REPOSITORY }} instead of the env-var pattern

In #1235, the validate job was hardened from ${{ vars.ECR_REPOSITORY }} inline-interpolation to a proper env-var indirection:

env:
  ECR_REPOSITORY: ${{ vars.ECR_REPOSITORY }}
run: |
  ...
  IMAGE_URI="...${ECR_REPOSITORY}:..."

The parallel verify-image > Verify AWS image step (.github/workflows/rollback.yml:133-154) was given the same presence check but kept the inline pattern:

- name: Verify AWS image
  if: startsWith(inputs.cloud, 'aws')
  run: |
    IMAGE_TAG="${{ inputs.image_tag }}"
    REPO="${{ vars.ECR_REPOSITORY }}"
    ...
    aws ecr describe-images \
        --repository-name $REPO \
        --image-ids imageTag=$IMAGE_TAG \

Repo variables come from a closed admin allowlist, so the realistic exploit surface is small, but the inconsistency within the same file is a footgun for future copy-paste. Two micro-changes for consistency with #1235:

  • Move vars.ECR_REPOSITORY (and arguably vars.AWS_REGION) into a step-level env: block; reference as ${ECR_REPOSITORY} / ${AWS_REGION} in the script body.
  • Quote "$REPO" and "$IMAGE_TAG" in the aws ecr describe-images invocation (both are shell-safe today thanks to the ^[a-zA-Z0-9._-]+$ regex on inputs.image_tag, but quoting is free defense-in-depth).

Why now

Both are low-severity hygiene items in the same two files PR #1235 already opened. Bundling them with the same blast-radius rationale (cleanup of force-delete loops + the rollback path) keeps the audit trail tight.

Out of scope

  • GCP --filter="name:cudly-staging" in destroy-gcp uses GCP filter-substring semantics; a tighter regex/exact filter would parallel the same hardening shape but is owned by a separate cleanup path. Not bundling here.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions