Commit 53c53ff

ci(e2e-cloud-test): purge stale e2e-test-box rows + use signal timeout
Two suite-quality fixes that surfaced in the last Tokyo run:

(1) The 'Run E2E suite' step's first failure was a 409 'Box with name e2e-test-box already exists' on test_create_named_box. The test uses a stable fixture name ('e2e-test-box') and only deletes the row in its `finally` block. Any prior run that died mid-flight (e.g. before the libboxlite image-pull fix landed) left the row behind, and the next run 409s at [1%]. Add a surgical pre-pytest DELETE on box rows with that exact name + admin-default-org id; it preserves all other in-flight state and clears the leftover before the suite starts.

(2) pytest-timeout's thread method dumps a stack and then calls `os._exit(99)` when a test hangs. test_exec_timeout_kills_long_command tripped that and killed pytest itself at [43%], so the FAILURES section was never printed and the back half of the suite never ran. Switch to `--timeout-method=signal` (SIGALRM, GHA runners are Linux) so the timed-out test fails in place and the suite continues, giving us one shot at the full failure picture instead of needing to iterate per hanging test.
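The purge in (1) is only trusted once a follow-up `SELECT count(*)` prints `0` on a line of its own. Below is a minimal sketch of that check, run against a made-up log rather than a live ECS session. The helper name `purge_verified` and the sample log contents are hypothetical, and the CRLF line endings are an assumption about how the interactive session output is captured.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: succeeds only if the log contains a line that is
# exactly "0" once carriage returns are stripped.
purge_verified() {
  local log="$1"
  tr -d '\r' < "$log" | grep -E '^0$' > /dev/null
}

log=$(mktemp)
# Made-up sample output: some status text, then the count, CRLF-terminated.
printf 'DELETE 1\r\n0\r\n' > "$log"

if purge_verified "$log"; then
  echo "purge verified"
else
  echo "stale rows remain"
fi
rm -f "$log"
```

Without the `tr -d '\r'`, a CRLF-terminated `0` would read as `0\r` and never match `^0$`, so the check would fail even after a successful purge.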
1 parent d0fd23c commit 53c53ff

1 file changed

Lines changed: 44 additions & 1 deletion

.github/workflows/e2e-cloud-test.yml

@@ -246,6 +246,41 @@ jobs:
             | grep -E '\|4\|8192\|20$' \
             || { echo "::error::admin-org quota SELECT did not show target values (|4|8192|20)"; exit 1; }
 
+      - name: Purge stale named-fixture box rows (idempotent)
+        # test_create_named_box uses a stable name ('e2e-test-box') and
+        # cleans up via `auto_remove=True` only when the test reaches
+        # its `finally`. Any prior run that failed before that point
+        # (e.g. runs from before the image-pull fix in libboxlite/store)
+        # leaves the row behind, and the next run then 409s with
+        # "Box with name e2e-test-box already exists" at [1%].
+        # Surgical delete: only the named-fixture rows on the admin's
+        # default org (no broad wipe — preserves any in-flight test
+        # state for unrelated suites). box_last_activity / ssh_access
+        # cascade-delete via FK.
+        run: |
+          set -euo pipefail
+          CLUSTER="${{ steps.resources.outputs.cluster }}"
+          PRIMARY_TD=$(aws ecs describe-services --cluster "$CLUSTER" --services Api \
+            --query 'services[0].deployments[?status==`PRIMARY`]|[0].taskDefinition' --output text)
+          TASK=$(aws ecs list-tasks --cluster "$CLUSTER" --service-name Api \
+            --query 'taskArns[]' --output text \
+            | tr '\t' '\n' \
+            | while read -r arn; do
+                TD=$(aws ecs describe-tasks --cluster "$CLUSTER" --tasks "$arn" \
+                  --query 'tasks[0].taskDefinitionArn' --output text)
+                if [ "$TD" = "$PRIMARY_TD" ]; then echo "$arn"; break; fi
+              done)
+          [ -n "$TASK" ] || { echo "::error::No Api task on PRIMARY deployment"; exit 1; }
+          SQL='DELETE FROM "box" WHERE "name" = '\''e2e-test-box'\'' AND "organizationId" = (SELECT "organizationId" FROM "organization_user" WHERE "userId" = '\''boxlite-admin'\'' AND "isDefaultForUser" = true LIMIT 1); SELECT count(*) FROM "box" WHERE "name" = '\''e2e-test-box'\'' AND "organizationId" = (SELECT "organizationId" FROM "organization_user" WHERE "userId" = '\''boxlite-admin'\'' AND "isDefaultForUser" = true LIMIT 1);'
+          SQL_B64=$(printf '%s' "$SQL" | base64 -w0)
+          OUT=/tmp/ecs_purge.log
+          aws ecs execute-command --cluster "$CLUSTER" --task "$TASK" --container Api --interactive \
+            --command "sh -c \"echo ${SQL_B64} | base64 -d | PAGER=cat PGPASSWORD=\\\"\\\$DB_PASSWORD\\\" psql -h \\\"\\\$DB_HOST\\\" -U \\\"\\\$DB_USERNAME\\\" -d \\\"\\\$DB_DATABASE\\\" -A -t -P pager=off -v ON_ERROR_STOP=1 -f -\"" \
+            2>&1 | tee "$OUT"
+          # Post-purge: zero rows with the fixture name on this org.
+          tr -d '\r' < "$OUT" | grep -E '^0$' \
+            || { echo "::error::post-purge SELECT count for e2e-test-box did not show 0"; exit 1; }
+
       # ──────────────────────────────────────────────────────────────────
       # Build SDK from THIS checkout (path_prefix and other source-tree
       # additions may be ahead of the PyPI release with the same version
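The purge step base64-encodes its SQL before handing it to `aws ecs execute-command`, so the statement's single and double quotes never pass through the nested `sh -c` quoting. Below is a minimal local sketch of the same round trip. A plain `sh -c` stands in for the remote shell and `cat` stands in for `psql -f -`; both stand-ins are assumptions for illustration. `base64 -w0` is the GNU coreutils flag the workflow already uses.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Same quoting style as the workflow: '\'' embeds a single quote inside a
# single-quoted string.
SQL='SELECT count(*) FROM "box" WHERE "name" = '\''e2e-test-box'\'';'

# Encode to a single line drawn from [A-Za-z0-9+/=], which needs no escaping
# in any of the shell layers it passes through.
SQL_B64=$(printf '%s' "$SQL" | base64 -w0)

# The inner shell decodes and pipes the statement to its consumer
# (cat here, psql -f - in the workflow).
DECODED=$(sh -c "echo ${SQL_B64} | base64 -d | cat")

echo "$DECODED"
```

Because the encoded payload contains no quotes, spaces, or `$`, only the fixed parts of the command string need escaping, which keeps the SQL itself readable where it is defined.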
@@ -334,9 +369,17 @@ jobs:
           # pytest-timeout caps individual test wall time so a hung VM /
           # wedged exec doesn't burn the full 45-minute job timeout (which
           # would prevent the on-failure log capture step from running).
+          #
+          # --timeout-method=signal raises a Timeout exception in the test
+          # via SIGALRM (UNIX-only, GHA runners are Linux). The previous
+          # `thread` method dumps a stack and then `os._exit(99)`s, which
+          # killed pytest itself the first time a single test hung —
+          # leaving the rest of the suite uncovered and the FAILURES
+          # section unprinted. Signal lets the timed-out test fail in
+          # place and pytest carries on to the next test.
           timeout 35m python3 -m pytest scripts/test/e2e/cases/ \
             -v --tb=short --no-header -p no:cacheprovider \
-            --timeout=180 --timeout-method=thread \
+            --timeout=180 --timeout-method=signal \
             --junit-xml=pytest-junit.xml
 
       - name: Upload pytest junit XML
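The pytest invocation sits under three nested limits: the 45-minute job timeout, the `timeout 35m` wrapper, and the 180-second per-test cap. GNU `timeout` exits with status 124 when it has to stop the command, so a wrapper can tell a suite-level timeout apart from ordinary test failures. Below is a minimal sketch, assuming GNU coreutils. `sleep` and `true` stand in for the pytest run, and the `classify` helper is hypothetical, not part of the workflow.

```shell
#!/usr/bin/env bash
set -uo pipefail   # no -e: the non-zero statuses below are expected

# Hypothetical helper: map the exit status of `timeout <limit> <cmd>` to a label.
classify() {
  case "$1" in
    0)   echo "passed" ;;
    124) echo "suite-level timeout" ;;   # GNU timeout's status when the limit expires
    *)   echo "failed with status $1" ;;
  esac
}

# Stand-in for a run that exceeds the outer limit.
timeout 1 sleep 5
classify "$?"

# Stand-in for a run that finishes in time.
timeout 5 true
classify "$?"
```

With the signal method, a hung test is expected to show up as an ordinary pytest failure, so status 124 would then point at the suite as a whole exceeding its 35-minute budget.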
