Commit 3b29644
[CI] Rescue ES Docker container logs on failed snapshot-verify steps (elastic#270460)
**Related: elastic#270430** (same pipeline, complementary).
## Summary
The `kibana-elasticsearch-snapshot-verify` pipeline has been failing due
to a suspected ES snapshot performance regression. When an FTR test
times out, the ES logs that would help diagnose the problem are no
longer captured.
This PR captures `docker logs` for the ES containers
(`es01`/`es02`/`es03` and their `-linked` variants) in the Buildkite
`post-command` hook, before the existing artifact upload. The captured
files land at `.es/<name>-<short-id>.log` (the same pattern as the logs
captured on the success path) and are picked up by the existing
`.es/es*.log` glob during artifact upload.
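
As a rough illustration of the naming scheme described above, the sketch below builds a rescue path from a container name and ID and checks it against the upload glob. `rescue_log_path` and `matches_upload_glob` are hypothetical helpers introduced here for illustration; they are not part of the PR.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the rescue log path from a container name and
# a full container ID, mirroring `.es/<name>-<short-id>.log`.
rescue_log_path() {
  local name="$1" full_id="$2"
  # Keep the first 12 characters of the ID, like `cut -c1-12` in the hook.
  printf '.es/%s-%s.log\n' "$name" "${full_id:0:12}"
}

# Hypothetical helper: succeed if the path would match the `.es/es*.log` glob.
matches_upload_glob() {
  [[ "$1" == .es/es*.log ]]
}

rescue_log_path es01 0123456789abcdef0123   # prints .es/es01-0123456789ab.log
```

Because every container name starts with `es`, including the `-linked` variants, no change to the upload glob is needed.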
## The gap
`extractAndArchiveLogs` only writes those files when FTR's **async**
teardown reaches it. On failed steps the **sync** `process.on('exit')`
handler `teardownServerlessClusterSync` runs instead - it `docker kill`s
the containers without capturing logs. Containers are killed but not
removed, so `docker logs` still works from the post-command hook.
<img width="1392" height="1105" alt="image"
src="https://github.com/user-attachments/assets/3857c233-6a9c-4c65-acf0-d0fdc6b4f4e6"
/>
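
One way to see why the rescue works is to ask Docker for the container's state: a killed container still exists (normally with status `exited`) and keeps its logs until it is removed. The sketch below assumes only the standard `docker container inspect --format` call; `container_log_state` is a hypothetical helper, not code from the PR.

```bash
# Hypothetical helper: report whether a container's logs are still reachable.
container_log_state() {
  local name="$1" status
  # `docker container inspect` fails once the container has been removed.
  if ! status=$(docker container inspect --format '{{.State.Status}}' "$name" 2>/dev/null); then
    echo "removed"   # no container object left, so `docker logs` cannot work
    return 1
  fi
  # A killed container normally reports "exited"; `docker logs` still works.
  echo "$status"
}
```

For example, `container_log_state es01` in the post-command hook would distinguish a killed-but-present container from one that was already removed.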
## The fix
```bash
if command -v docker >/dev/null 2>&1; then
  for cname in es01 es02 es03 es01-linked es02-linked es03-linked; do
    if ! docker container inspect "$cname" >/dev/null 2>&1; then continue; fi
    cid=$(docker inspect --format '{{.Id}}' "$cname" 2>/dev/null | cut -c1-12)
    out=".es/${cname}-${cid}.log"
    if [[ -n "$cid" && ! -s "$out" ]]; then
      mkdir -p .es
      docker logs "$cname" > "$out" 2>&1 || true
    fi
  done
fi
```
- Pipeline-scoped via `BUILDKITE_PIPELINE_SLUG`.
- Idempotent via `! -s "$out"` - never overwrites the success-path
capture.
- Same filename pattern as `extractAndArchiveLogs`, so the existing
`.es/es*.log` glob uploads it.
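
The first two properties can be sketched as follows. This is a minimal illustration, not the PR's actual code: it assumes the slug equals the pipeline name given in the summary, and `should_rescue_es_logs` and `capture_if_empty` are hypothetical names.

```bash
# ASSUMPTION: the Buildkite slug matches the pipeline name from the summary.
RESCUE_PIPELINE_SLUG="kibana-elasticsearch-snapshot-verify"

# Hypothetical helper: succeed only inside the targeted pipeline.
should_rescue_es_logs() {
  # BUILDKITE_PIPELINE_SLUG is set by the Buildkite agent; empty elsewhere.
  [[ "${BUILDKITE_PIPELINE_SLUG:-}" == "$RESCUE_PIPELINE_SLUG" ]]
}

# Hypothetical helper: run a command and store its output in "$1",
# but only when that file is missing or empty (the `! -s` guard).
capture_if_empty() {
  local out="$1"; shift
  if [[ ! -s "$out" ]]; then
    mkdir -p "$(dirname "$out")"
    "$@" > "$out" 2>&1 || true   # never fail the hook on a capture error
  fi
}
```

A call such as `capture_if_empty ".es/${cname}-${cid}.log" docker logs "$cname"` would then leave a log written by the success path untouched.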
## Failure mode coverage
| Scenario | Before | After |
|---|---|---|
| Tests pass | Yes (via `extractAndArchiveLogs`) | Yes, unchanged (rescue skipped by `-s` guard) |
| Test assertion fails | Usually, races with async teardown | Yes |
| Uncaught exception / `process.exit()` | No | Yes |
| Step timeout (`SIGTERM`) | No | Yes |
| Test process `SIGKILL` (hard cancel, OOM-kill) | No | Yes |
| Agent VM lost (spot preemption) | No | No (hook can't run) |
Co-authored-by: Cursor <cursoragent@cursor.com>

1 parent cefdb30 commit 3b29644
1 file changed
Lines changed: 17 additions & 0 deletions