Commit f538398

leifericf and claude committed
diag: SIGABRT watchdog on CI test hang so backtrace survives
v0.255.9 fixed one real bug at transient-survives-gc-yield (use-after-free reproducible locally under ASan), but mino's CI matrix still hangs at the same test on macos-14 / ubuntu-24.04{,-arm}. Without a stack trace from the hung process we can't tell whether mino is in a tight loop, a pthread cv_wait, a GC mark-stack drain, or somewhere else.

.github/workflows/ci.yml's Test step now wraps `./mino tests/run.clj` in a watchdog:

* Streams stderr to /tmp/test_trace.log (already an artifact on failure since v0.255.8).
* Backgrounds mino via `(exec ./mino ...)` so the subshell's $! is mino's pid directly.
* 7m30s into the run, if mino is still alive, sends SIGABRT.
* mino's existing crash_handler (main.c:711) prints fatal signal, GC stats (minor / major / live / alloc / freed / phase / remset), and a libc-backtrace stack frame list.
* The 30s buffer before GHA's own SIGKILL gives the handler time to flush and exit cleanly with code 134.

Local behaviour unchanged. The wrapper only lives in CI.

This release contains no runtime change; v0.255.9's fix is still load-bearing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
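The exit code 134 in the message follows the usual shell convention of 128 plus the signal number, and SIGABRT is signal 6 on Linux and macOS. A minimal sketch of that arithmetic, using a throwaway child shell that aborts itself as a stand-in for mino (the variable names are illustrative and do not come from the commit):

```shell
# Stand-in for a process that dies from SIGABRT: a child shell that
# sends the signal to itself. The parent sees 128 + signal number.
abrt_rc=0
sh -c 'kill -ABRT $$' || abrt_rc=$?

# Recover the signal number and name from the exit status.
sig_num=$((abrt_rc - 128))
sig_name=$(kill -l "$abrt_rc")

echo "exit status $abrt_rc = 128 + $sig_num (SIG$sig_name)"
```

Whether mino's crash_handler re-raises the signal or calls exit with 134 itself is not visible in this commit; either way the calling shell observes the same status.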
1 parent b9603f9 commit f538398

3 files changed

Lines changed: 77 additions & 4 deletions

.github/workflows/ci.yml

Lines changed: 42 additions & 3 deletions
@@ -89,14 +89,53 @@ jobs:
 # an opaque "Test timed out" actionable. Keeps trace off
 # locally (env-gated) so a normal `./mino tests/run.clj`
 # produces the same output as before.
+#
+# Watchdog wrapper: GHA's timeout-minutes sends a SIGKILL after
+# the cap, which gives no diagnostic at all on a hang. We
+# spawn mino in the background, sleep just inside the cap,
+# then SIGABRT it -- mino's crash_handler (main.c:711) prints
+# a backtrace + gc stats on SIGABRT, so a hang now leaves a
+# readable stack in the log instead of a silent kill. mino
+# exits non-zero after the dump, which fails the step
+# normally (no continue-on-error masking).
 env:
   MINO_TEST_TRACE: "1"
 run: |
-  set -o pipefail
-  ./mino tests/run.clj 2> >(tee /tmp/test_trace.log >&2)
+  set +e
+  # Pre-create + tail the trace file so its lines stream to
+  # the live job log as mino emits them. Without the tail,
+  # the trace only appears via the failure artifact, which
+  # makes a live `gh run watch` opaque.
+  : > /tmp/test_trace.log
+  (tail -F /tmp/test_trace.log 2>/dev/null) &
+  TAIL_PID=$!
+  # exec replaces the subshell with mino so $! is mino's
+  # pid directly -- the watchdog's kill -ABRT then lands
+  # on mino, not on an outer shell wrapper.
+  (exec ./mino tests/run.clj) 2> /tmp/test_trace.log &
+  MINO_PID=$!
+  # Wake at 7m30s (the cap is 8m) so SIGABRT has time to
+  # run mino's handler before GHA's own SIGKILL lands.
+  # mino's crash_handler (main.c:711) prints a backtrace +
+  # GC stats on SIGABRT, so a hang now leaves a readable
+  # stack in the log + trace artifact instead of a silent
+  # kill. mino exits 134 (128 + SIGABRT) after the dump.
+  (sleep 450; if kill -0 $MINO_PID 2>/dev/null; then
+    echo "##[warning]Watchdog firing SIGABRT on hung mino (pid $MINO_PID)"
+    kill -ABRT $MINO_PID
+  fi) &
+  WD_PID=$!
+  wait $MINO_PID
+  RC=$?
+  # Give the trace tail a moment to flush mino's last lines.
+  sleep 1
+  kill $WD_PID 2>/dev/null || true
+  kill $TAIL_PID 2>/dev/null || true
+  exit $RC
 # Tests usually finish in seconds; a hang means a deadlock, not
 # a slow runner. Cap so we get diagnostic output instead of
-# waiting on the 6h job-default timeout.
+# waiting on the 6h job-default timeout. The watchdog above
+# fires 30s before this cap so we keep the stack trace.
 timeout-minutes: 8
 # The Windows test suite has documented divergence: cmd.exe's
 # echo emits a trailing space before \n, which the proc-test
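
The watchdog pattern in this hunk can be tried locally at a smaller scale. The sketch below is an illustration and is not part of the commit: `run_with_watchdog` is a hypothetical helper name, `sleep 30` stands in for a hung `./mino tests/run.clj`, and the limit is one second instead of 450.

```shell
# Hypothetical helper (not in the commit): run a command in the
# background and send it SIGABRT if it is still alive after
# <limit> seconds. Returns the command's exit status.
run_with_watchdog() {
  local limit=$1
  shift
  # exec replaces the subshell with the command, so $! is the
  # command's own pid and the signal lands on it directly.
  (exec "$@") &
  local cmd_pid=$!
  (
    sleep "$limit"
    if kill -0 "$cmd_pid" 2>/dev/null; then
      echo "watchdog: sending SIGABRT to pid $cmd_pid" >&2
      kill -ABRT "$cmd_pid"
    fi
  ) &
  local wd_pid=$!
  wait "$cmd_pid"
  local rc=$?
  # The command finished (or was aborted); the watchdog is no
  # longer needed.
  kill "$wd_pid" 2>/dev/null || true
  return "$rc"
}

# A "hung" command: the watchdog fires after 1 second.
hung_rc=0
run_with_watchdog 1 sleep 30 || hung_rc=$?

# A command that finishes well inside the limit.
ok_rc=0
run_with_watchdog 2 true || ok_rc=$?

echo "hung=$hung_rc ok=$ok_rc"
```

As in the workflow, killing the watchdog subshell does not stop the `sleep` it started; that child simply runs out its remaining time in the background. This is harmless here, but worth knowing if the pattern is reused in a long-lived shell.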

CHANGELOG.md

Lines changed: 34 additions & 0 deletions
@@ -1,5 +1,39 @@
 # Changelog
 
+## v0.255.10 — Diagnostic: SIGABRT Watchdog on CI Test Hang
+
+A diagnostic-only release that converts the remaining CI test
+hang from "silent SIGKILL at the 8-min cap" into "SIGABRT 30s
+before the cap, mino's crash_handler dumps a backtrace + GC
+stats". v0.255.9 fixed one real bug at `transient-survives-gc-
+yield` (use-after-free reproducible locally under ASan) but
+mino's CI matrix still hangs at the same test on macos-14 /
+ubuntu-24.04 / ubuntu-24.04-arm. Without a stack trace from the
+hung process we can't tell whether mino is in a tight loop, a
+pthread cv_wait, a GC mark-stack drain, or somewhere else.
+
+`.github/workflows/ci.yml`'s Test step now wraps `./mino
+tests/run.clj` in a watchdog:
+
+* Streams stderr to `/tmp/test_trace.log` (already an artifact
+on failure since v0.255.8).
+* Backgrounds mino via `(exec ./mino ...)` so the subshell's
+`$!` is mino's pid directly.
+* 7m30s into the run, if mino is still alive, sends SIGABRT.
+* mino's existing crash_handler (`main.c:711`) prints
+`[mino] fatal SIGABRT (signal 6)`, GC stats (minor / major /
+live / alloc / freed / phase / remset), and a libc-backtrace
+stack frame list.
+* The 30s buffer before GHA's own SIGKILL gives the handler
+time to flush stdout/stderr and exit cleanly with code 134.
+
+Local behaviour unchanged: a plain `./mino tests/run.clj` still
+exits 0 in ~1.6s without the watchdog firing. The wrapper only
+lives in CI.
+
+This release does not include any runtime change; v0.255.9's
+fix is still load-bearing for the use-after-free path.
+
 ## v0.255.9 — Fix: `(gc!)` During In-Flight Major Mark Use-After-Free
 
 Root cause of the v0.255.6 / .7 / .8 CI hang: `mino_gc_collect(MINO_GC_FULL)`

src/mino.h

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@
  */
 #define MINO_VERSION_MAJOR 0
 #define MINO_VERSION_MINOR 255
-#define MINO_VERSION_PATCH 9
+#define MINO_VERSION_PATCH 10
 
 /*
  * Human-readable version string of the *linked* runtime, e.g. "0.48.0".
