Problem
GHA runners crash with System.IO.IOException: No space left on device while writing to the runner's internal diagnostic log (/home/runner/actions-runner/cached/2.334.0/_diag/Worker_*.log), not to build storage. The exception kills the runner process before the build step completes.
The crash consistently affects [rhoai] hermetic builds (which prefetch ~3GB of dependencies and run subscription setup) but not [odh] builds of the same commit. It affects both PR builds and manually triggered push builds that include [rhoai].
Evidence
PR builds (symlink PR #3633): Multiple [rhoai] jobs crash with an identical stack trace:
- runtime-minimal-ubi9-python-3.12 · linux/amd64 [rhoai]
- runtime-datascience-ubi9-python-3.12 · linux/amd64 [rhoai]
- jupyter-minimal-ubi9-python-3.12 · linux/s390x [rhoai]
Main push builds (manual dispatch): The same crash occurs on run 25937471902:
- jupyter-minimal-ubi9-python-3.12 · linux/ppc64le / rhoai
- jupyter-minimal-ubi9-python-3.12 · linux/amd64 / rhoai
Auto push builds (odh only): No _diag crashes, only unrelated pytest failures.
Disk layout at crash time
From the LVM overlay step logs:
Filesystem                   Size  Used  Avail  Use%  Mounted on
/dev/root                    72G   58G   15G    80%   /
/dev/sdb1                    74G   8.1G  62G    12%   /mnt
/dev/mapper/buildvg-buildlv  72G   5.8M  70G    1%    /home/runner/.local/share/containers
The build storage (buildvg-buildlv) has 70GB free, while root has only 15GB free. The _diag/Worker_*.log files live on root, and something fills those 15GB during the build.
The visible step logs for the crashed jobs total only ~150-280KB, so build stdout alone cannot be what fills root. The runner's internal _diag tracing appears to capture more than what is visible in the step logs.
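To double-check that _diag really sits on root rather than on the LVM volume, the mount points of the two paths can be compared. This is a minimal sketch; mount_of and same_fs are hypothetical helper names, not part of the runner or of the repository.

```shell
# Print the mount point of the filesystem that contains path $1.
mount_of() {
  # df -P prints a header line and one POSIX-format data line;
  # the last field of the data line is the mount point.
  df -P "$1" | awk 'NR == 2 { print $NF }'
}

# Print "same" if both paths live on the same filesystem, else "different".
same_fs() {
  if [ "$(mount_of "$1")" = "$(mount_of "$2")" ]; then
    echo same
  else
    echo different
  fi
}
```

On a runner this could be called as same_fs /home/runner/actions-runner/cached/2.334.0/_diag /home/runner/.local/share/containers; "different" would be consistent with the df output above.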
Diagnostic monitoring added
PR #3633 commit 9d99f4c3b adds background monitoring to the build template:
- df -h + _diag directory size every 30s
- fatrace: traces which processes write to root fs
- bpftrace: sums bytes written per process every 30s
- Each long-running step (prefetch, build) tails the monitor inline so output is visible even if the runner crashes
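The df and directory-size part of such a monitor could look roughly like the sketch below. It is not the code from commit 9d99f4c3b; snapshot and monitor are hypothetical function names, and the fatrace and bpftrace parts are left out.

```shell
# Print one snapshot: usage of the filesystem holding $1, plus the size of $1 itself.
snapshot() {
  local dir="$1"
  echo "=== $(date -u +%H:%M:%S) ==="
  df -h "$dir"     # usage of the filesystem that contains the directory
  du -sh "$dir"    # total size of the directory
}

# Take a snapshot of $1 every $2 seconds (default 30), $3 times (0 = until killed).
monitor() {
  local dir="$1" interval="${2:-30}" count="${3:-0}" i=0
  while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
    snapshot "$dir"
    i=$((i + 1))
    # Skip the sleep after the final iteration of a bounded run.
    if [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; then
      sleep "$interval"
    fi
  done
}
```

A step could start it with monitor "$DIAG_DIR" 30 > "$MONITOR_LOG" 2>&1 & and then tail "$MONITOR_LOG" from the long-running steps; both variable names are placeholders.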
Hypotheses
- Runner diagnostic tracing bloat: The runner traces all stdout/stderr to _diag/Worker_*.log internally, not just what's shown in step logs. Hermetic prefetch + subscription setup may produce verbose internal tracing that fills root.
- Container storage leaking to root: Despite the LVM overlay, some container operations (layer extraction, cache) may write to root before the overlay is mounted.
- Go module cache / tool downloads: go build for bin/buildinputs, uv sync, and other setup steps download to ~/go/, ~/.cache/uv/ etc. on root.
- QEMU emulation overhead: Cross-arch builds (s390x, ppc64le via qemu-user) may produce additional core dumps or temp files on root.
- Runner image got larger: If the base runner image grew, the 15GB remaining after LVM setup may no longer be enough. The LVM allocation may need adjusting to leave more space on root.
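Several of these hypotheses can be narrowed down by measuring the candidate directories directly at a few points during the build. A minimal sketch, assuming the paths named above; report_sizes is a hypothetical helper.

```shell
# Print "<size in KiB> <path>" for each existing directory argument, largest first.
report_sizes() {
  local d
  for d in "$@"; do
    # Skip candidates that do not exist on this runner.
    if [ -d "$d" ]; then
      du -sk "$d" 2>/dev/null
    fi
  done | sort -rn
}
```

For example: report_sizes ~/go ~/.cache/uv /home/runner/actions-runner/cached/2.334.0/_diag. Running it before and after the prefetch step would show which directory grows.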
Potential fixes
- Move _diag to LVM volume: Symlink the _diag directory to the 70GB build volume
- Leave more root space in LVM: Adjust gha_lvm_overlay.sh to take less from root
- Truncate _diag logs periodically: Background job that truncates Worker_*.log when it exceeds a threshold
- Reduce runner tracing: Check if ACTIONS_RUNNER_DEBUG is somehow enabled, or if there's a way to reduce internal trace verbosity
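The symlink and truncation options could be sketched as follows. Both functions are hypothetical and untested against a live runner; the threshold and destination path are assumptions for whoever picks this up.

```shell
# Truncate every Worker_*.log in directory $1 that is larger than $2 bytes.
truncate_large_logs() {
  local dir="$1" max_bytes="$2" f size
  for f in "$dir"/Worker_*.log; do
    [ -f "$f" ] || continue      # the glob matched nothing
    size=$(wc -c < "$f")
    if [ "$size" -gt "$max_bytes" ]; then
      : > "$f"                   # truncate in place; the path and inode stay the same
    fi
  done
}

# Move directory $1 to $2 and leave a symlink at the old path.
relocate_dir() {
  local src="$1" dest="$2"
  mkdir -p "$(dirname "$dest")"
  mv "$src" "$dest"
  ln -s "$dest" "$src"
}
```

Two caveats apply. If the runner does not open the log in append mode, a truncated file may regain its old apparent size as a sparse file, although the disk blocks are still freed. And a Worker_*.log that is already open when relocate_dir runs keeps writing to the old inode on root, so the relocation would likely have to happen before the worker opens its log; whether that is possible here is an open question.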
Related issues
- [...] _diag crash stack trace
Next steps
- [...] (_diag, adjust LVM, or reduce tracing)