
Commit 3b29644

sdesalas and cursoragent authored and committed
[CI] Rescue ES Docker container logs on failed snapshot-verify steps (elastic#270460)
**Related: elastic#270430** (same pipeline, complementary).

## Summary

The `kibana-elasticsearch-snapshot-verify` pipeline has been failing due to a suspected ES snapshot performance regression. When an FTR test times out, we no longer get the ES logs that could help us diagnose the problem further.

This PR adds logic to capture `docker logs` for ES containers (`es01`/`es02`/`es03` and their `-linked` variants) in the Buildkite `post-command` hook, before the existing artifact upload. The captured files land at `.es/<name>-<short-id>.log` (the same path as the existing logs that are captured successfully) and are picked up by the existing `.es/es*.log` glob during artifact upload.

## The gap

`extractAndArchiveLogs` only writes those files when FTR's **async** teardown reaches it. On failed steps the **sync** `process.on('exit')` handler `teardownServerlessClusterSync` runs instead - it `docker kill`s the containers without capturing logs. Containers are killed but not removed, so `docker logs` still works from the post-command hook.

<img width="1392" height="1105" alt="image" src="https://github.com/user-attachments/assets/3857c233-6a9c-4c65-acf0-d0fdc6b4f4e6" />

## The fix

```bash
if command -v docker >/dev/null 2>&1; then
  for cname in es01 es02 es03 es01-linked es02-linked es03-linked; do
    if ! docker container inspect "$cname" >/dev/null 2>&1; then
      continue
    fi
    cid=$(docker inspect --format '{{.Id}}' "$cname" 2>/dev/null | cut -c1-12)
    out=".es/${cname}-${cid}.log"
    if [[ -n "$cid" && ! -s "$out" ]]; then
      mkdir -p .es
      docker logs "$cname" > "$out" 2>&1 || true
    fi
  done
fi
```

- Pipeline-scoped via `BUILDKITE_PIPELINE_SLUG`.
- Idempotent via `! -s "$out"` - never overwrites the success-path capture.
- Same filename pattern as `extractAndArchiveLogs`, so the existing `.es/es*.log` glob uploads it.

## Failure mode coverage

| Scenario | Before | After |
|---|---|---|
| Tests pass | Yes (via `extractAndArchiveLogs`) | Yes, unchanged (rescue skipped by `-s` guard) |
| Test assertion fails | Usually, races with async teardown | Yes |
| Uncaught exception / `process.exit()` | No | Yes |
| Step timeout (`SIGTERM`) | No | Yes |
| Test process `SIGKILL` (hard cancel, OOM-kill) | No | Yes |
| Agent VM lost (spot preemption) | No | No (hook can't run) |

Co-authored-by: Cursor <cursoragent@cursor.com>
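The `-s` guard described above can be exercised without Docker. The following is a minimal sketch, not code from the PR: `rescue_log` and `fake_docker_logs` are hypothetical names, the container IDs are made up, and the stub merely stands in for `docker logs "$cname"`. It shows the three cases the guard has to handle: an existing non-empty capture, a missing file, and an empty container ID.

```bash
#!/usr/bin/env bash
# Sketch of the "-s" idempotency guard; a stub replaces `docker logs`.
set -u

# Hypothetical helper (not in the PR): writes logs for <name>-<short-id>
# only when the target file is missing or empty.
rescue_log() {
  local cname="$1" cid="$2" dir="$3"
  local out="${dir}/${cname}-${cid}.log"
  if [[ -n "$cid" && ! -s "$out" ]]; then
    mkdir -p "$dir"
    fake_docker_logs "$cname" > "$out" 2>&1 || true
  fi
}

# Stub standing in for `docker logs "$cname"`; prints a marker line.
fake_docker_logs() {
  echo "rescued logs for $1"
}

workdir="$(mktemp -d)"
es_dir="${workdir}/.es"

# Case 1: the success-path capture already exists, so it must be left alone.
mkdir -p "$es_dir"
echo "captured by async teardown" > "${es_dir}/es01-0123456789ab.log"
rescue_log es01 0123456789ab "$es_dir"
kept="$(cat "${es_dir}/es01-0123456789ab.log")"

# Case 2: no file yet (as on a failed step), so the rescue writes one.
rescue_log es02 ba9876543210 "$es_dir"
rescued="$(cat "${es_dir}/es02-ba9876543210.log")"

# Case 3: empty container ID (inspect produced nothing), so nothing is written.
rescue_log es03 "" "$es_dir"
if [[ -e "${es_dir}/es03-.log" ]]; then skipped="no"; else skipped="yes"; fi

echo "$kept"
echo "$rescued"
echo "skipped=$skipped"
rm -rf "$workdir"
```

Because `-s` is true only for files that exist and are non-empty, an empty file left behind by an interrupted capture would still be replaced by the rescue.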
1 parent cefdb30 commit 3b29644

1 file changed

Lines changed: 17 additions & 0 deletions


.buildkite/scripts/lifecycle/post_command.sh

@@ -47,6 +47,23 @@ if [[ "$IS_TEST_EXECUTION_STEP" == "true" ]]; then
     '.es/uiam*.log'
   )
 
+  if [[ "${BUILDKITE_PIPELINE_SLUG:-}" == "kibana-elasticsearch-snapshot-verify" ]]; then
+    # Rescue ES Docker container logs the in-process teardown may have missed.
+    if command -v docker >/dev/null 2>&1; then
+      for cname in es01 es02 es03 es01-linked es02-linked es03-linked; do
+        if ! docker container inspect "$cname" >/dev/null 2>&1; then
+          continue
+        fi
+        cid=$(docker inspect --format '{{.Id}}' "$cname" 2>/dev/null | cut -c1-12)
+        out=".es/${cname}-${cid}.log"
+        if [[ -n "$cid" && ! -s "$out" ]]; then
+          mkdir -p .es
+          docker logs "$cname" > "$out" 2>&1 || true
+        fi
+      done
+    fi
+  fi
+
   buildkite-agent artifact upload "$(printf '%s;' "${ARTIFACT_PATTERNS[@]}")"
 
   if [[ $BUILDKITE_COMMAND_EXIT_STATUS -ne 0 ]]; then
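The rescued files are uploaded only because their names fall under a pattern already present in `ARTIFACT_PATTERNS`. The sketch below illustrates how the `printf '%s;'` call joins the array into the single semicolon-separated argument passed to `buildkite-agent artifact upload`, and checks a rescued filename against the patterns. It makes several assumptions: the array is a two-entry subset of the real one, `matches_any` is a hypothetical helper, the filenames are invented, and Bash's `[[ == ]]` pattern matching only approximates the agent's glob handling (for instance, `*` in Bash also matches `/`).

```bash
#!/usr/bin/env bash
# Subset of the real array, for illustration only.
ARTIFACT_PATTERNS=(
  '.es/es*.log'
  '.es/uiam*.log'
)

# printf reuses its format for every argument, so each pattern gets a
# trailing ';' and the result is one semicolon-separated string.
joined="$(printf '%s;' "${ARTIFACT_PATTERNS[@]}")"
echo "$joined"

# Hypothetical helper: succeeds if the path matches any pattern.
matches_any() {
  local path="$1" pattern
  for pattern in "${ARTIFACT_PATTERNS[@]}"; do
    # The unquoted right-hand side of == inside [[ ]] is treated as a glob.
    if [[ "$path" == $pattern ]]; then
      return 0
    fi
  done
  return 1
}

# A rescued file for a "-linked" container still starts with ".es/es".
if matches_any ".es/es01-linked-0123456789ab.log"; then
  rescued_match="yes"
else
  rescued_match="no"
fi

# An unrelated file name is not covered by these two patterns.
if matches_any ".es/kibana.log"; then
  other_match="yes"
else
  other_match="no"
fi

echo "rescued_match=$rescued_match other_match=$other_match"
```

Reusing the `<name>-<short-id>.log` naming means the hook needs no new upload pattern, which keeps the change confined to the capture step.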
