
ci: GHA runner crashes with "No space left on device" on _diag/Worker_*.log during [rhoai] builds #3641

@jiridanek

Problem

GHA runners crash with System.IO.IOException: No space left on device while writing the runner's internal diagnostic log (/home/runner/actions-runner/cached/2.334.0/_diag/Worker_*.log), not while writing to build storage. The exception kills the runner process before the build step completes.

The crash consistently affects [rhoai] hermetic builds (which download ~3GB prefetch deps + subscription setup) but not [odh] builds on the same commit. It affects both PR builds and manually-triggered push builds that include [rhoai].

Evidence

PR builds (symlink PR #3633): Multiple [rhoai] jobs crash with an identical stack trace:

  • runtime-minimal-ubi9-python-3.12 · linux/amd64 [rhoai]
  • runtime-datascience-ubi9-python-3.12 · linux/amd64 [rhoai]
  • jupyter-minimal-ubi9-python-3.12 · linux/s390x [rhoai]

Main push builds (manual dispatch): Same crash on run 25937471902:

  • jupyter-minimal-ubi9-python-3.12 · linux/ppc64le / rhoai
  • jupyter-minimal-ubi9-python-3.12 · linux/amd64 / rhoai

Auto push builds (odh only): No _diag crashes — only unrelated pytest failures.

Disk layout at crash time

From the LVM overlay step logs:

Filesystem                   Size  Used Avail Use% Mounted on
/dev/root                     72G   58G   15G  80% /
/dev/sdb1                     74G  8.1G   62G  12% /mnt
/dev/mapper/buildvg-buildlv   72G  5.8M   70G   1% /home/runner/.local/share/containers

The build storage (buildvg-buildlv) has 70GB free and root has 15GB free. The _diag/Worker_*.log file lives on root, so something must fill those 15GB during the build.

The visible step logs for the crashed jobs total only ~150-280KB, so it's not the build stdout filling root. The runner's internal _diag tracing captures more than what's visible in the step logs.
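
To confirm which filesystem a path such as the _diag directory actually lives on, and how much room is left there, something like the following could be run in a step. This is a minimal sketch: mount_of and avail_kb are hypothetical helper names, not part of the runner or of this repository, and the parsing assumes mount points without spaces.

```shell
# Hypothetical helpers (not from the repo): report the mount point that
# backs a path and the space still available on that filesystem.

# Print the mount point of the filesystem containing $1.
# df -P forces one line per filesystem, so the mount point is field 6.
mount_of() {
  df -P "$1" | awk 'NR == 2 { print $6 }'
}

# Print the available space, in KiB, on the filesystem containing $1.
avail_kb() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Example with the path from the crash message:
# diag=/home/runner/actions-runner/cached/2.334.0/_diag
# echo "_diag is on $(mount_of "$diag") with $(avail_kb "$diag") KiB free"
```

Running the same check on /home/runner/.local/share/containers before and after the LVM overlay step would also show whether that path is still backed by root at any point.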

Diagnostic monitoring added

PR #3633 commit 9d99f4c3b adds background monitoring to the build template:

  • df -h + _diag directory size every 30s
  • fatrace — traces which processes write to root fs
  • bpftrace — sums bytes written per process every 30s
  • Each long-running step (prefetch, build) tails the monitor inline so output is visible even if the runner crashes
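
The df and _diag size part of the monitoring above could look roughly like the sketch below. It is not the code from commit 9d99f4c3b; the function names, the log path, and the output format are assumptions made for illustration, and the fatrace and bpftrace parts are not covered.

```shell
# Print one snapshot: a UTC timestamp, free space for the filesystem holding
# $1, and the size in KiB of directory $2. Both paths are caller-supplied.
disk_snapshot() {
  local fs_path="$1" diag_dir="$2"
  echo "=== $(date -u +%Y-%m-%dT%H:%M:%SZ) ==="
  df -h "$fs_path"
  # du -sk prints "<KiB><TAB><path>"; keep only the number.
  echo "diag_kb=$(du -sk "$diag_dir" 2>/dev/null | cut -f1)"
}

# Append a snapshot to log file $4 every $3 seconds until the job is killed.
monitor_disk() {
  local fs_path="$1" diag_dir="$2" interval="$3" log="$4"
  while true; do
    disk_snapshot "$fs_path" "$diag_dir" >> "$log" 2>&1
    sleep "$interval"
  done
}

# Example usage in a step (paths are assumptions):
# monitor_disk / /home/runner/actions-runner/cached/2.334.0/_diag 30 /tmp/disk-monitor.log &
# tail -f /tmp/disk-monitor.log &   # stream monitor output into the step log
```

Writing to a separate log and tailing it from each long-running step keeps the last snapshots visible in the step output even when the runner dies mid-build.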

Hypotheses

  1. Runner diagnostic tracing bloat: The runner traces all stdout/stderr to _diag/Worker_*.log internally, not just what's shown in step logs. Hermetic prefetch + subscription setup may produce verbose internal tracing that fills root.

  2. Container storage leaking to root: Despite the LVM overlay, some container operations (layer extraction, cache) may write to root before the overlay is mounted.

  3. Go module cache / tool downloads: go build for bin/buildinputs, uv sync, and other setup steps download to ~/go/, ~/.cache/uv/ etc. on root.

  4. QEMU emulation overhead: Cross-arch builds (s390x, ppc64le via qemu-user) may produce additional core dumps or temp files on root.

  5. Runner image got larger: If the base runner image grew, the 15GB left on root after LVM setup may no longer be enough. The LVM allocation may need adjusting to leave more space on root.
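
Several of these hypotheses can be narrowed down by measuring the candidate directories at the end of a failing step. The sketch below assumes a hypothetical helper named dir_sizes_kb; the directory list in the example is a guess based on the paths named above.

```shell
# Print "<KiB><TAB><path>" for each existing directory argument, largest first.
# -x keeps du on a single filesystem, so separately mounted volumes nested
# under a directory are not counted against it.
dir_sizes_kb() {
  local d
  for d in "$@"; do
    if [ -d "$d" ]; then
      du -skx "$d" 2>/dev/null
    fi
  done | sort -rn
}

# Example (directory list is an assumption):
# dir_sizes_kb "$HOME/go" "$HOME/.cache/uv" /tmp /var/tmp \
#   /home/runner/actions-runner/cached/2.334.0/_diag
```

Comparing the output at the start and end of the prefetch and build steps would show which directory, if any, accounts for the roughly 15GB that disappears from root.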

Potential fixes

  • Move _diag to LVM volume: Symlink _diag directory to the 70GB build volume
  • Leave more root space in LVM: Adjust gha_lvm_overlay.sh to take less from root
  • Truncate _diag logs periodically: Background job that truncates Worker_*.log when it exceeds a threshold
  • Reduce runner tracing: Check if ACTIONS_RUNNER_DEBUG is somehow enabled, or if there's a way to reduce internal trace verbosity
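
The first and third options could be sketched as below. Both functions are hypothetical and untested against a live runner. A process that already has a log file open keeps writing to the original inode, so relocating the directory mid-job may not free any space on root. Truncating a file that the writer addresses by explicit offset can also leave a sparse file rather than a small one.

```shell
# Truncate every Worker_*.log directly under $1 that is larger than $2 bytes.
truncate_big_logs() {
  local dir="$1" max_bytes="$2"
  find "$dir" -maxdepth 1 -type f -name 'Worker_*.log' \
    -size +"${max_bytes}c" -exec truncate -s 0 {} +
}

# Copy directory $1 to $2, remove the original, and leave a symlink at $1.
# If $1 is already a symlink, it is simply repointed at $2.
relocate_dir() {
  local src="$1" dest="$2"
  mkdir -p "$dest"
  if [ -d "$src" ] && [ ! -L "$src" ]; then
    cp -a "$src"/. "$dest"/
    rm -rf "$src"
  fi
  ln -sfn "$dest" "$src"
}

# Example (paths and the 1 GiB threshold are assumptions):
# diag=/home/runner/actions-runner/cached/2.334.0/_diag
# relocate_dir "$diag" /home/runner/.local/share/containers/_diag
# while true; do truncate_big_logs "$diag" 1073741824; sleep 30; done &
```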

Related issues

Next steps

Labels: kind/bug