ci: add OOM kill detection to integration test runs #1028

Draft
cpuguy83 wants to merge 2 commits into project-dalec:main from cpuguy83:fix_test_memory_usage

cpuguy83 (Collaborator) commented Apr 8, 2026

Summary

  • Adds a background dmesg monitor to integration test jobs that watches for kernel OOM killer messages during the test run
  • After tests complete (pass or fail), checks if any OOM kills occurred and emits a ::warning:: annotation if so
  • Captures the tail of dmesg output into the existing test reports artifact for post-mortem analysis
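
A minimal sketch of what such a background monitor could look like, for illustration only. The function names, the match pattern, and the log path convention are assumptions and are not taken from this PR's diff; `--line-buffered` assumes GNU grep, as found on typical Linux runners.

```shell
#!/usr/bin/env bash
# Sketch only: names, paths, and the pattern below are illustrative assumptions.

# Case-insensitive pattern for common kernel OOM-killer messages, such as
# "Out of memory: Killed process ..." and "... invoked oom-killer: ...".
OOM_PATTERN='out of memory|oom-kill'

# Keep only lines from stdin that look like OOM-killer activity.
# `|| true` avoids a non-zero status when nothing matches.
filter_oom_lines() {
  grep -iE "$OOM_PATTERN" || true
}

# Follow the kernel log in the background and append matching lines to the
# given file. The PID of the background grep is written next to the log so a
# later step can stop the monitor.
start_oom_monitor() {
  local log_file="$1"
  : > "$log_file"
  sudo dmesg --follow 2>/dev/null \
    | grep --line-buffered -iE "$OOM_PATTERN" >> "$log_file" &
  echo "$!" > "$log_file.pid"
}

# Example (not executed here):
#   start_oom_monitor /tmp/oom-monitor.log
```

Because the monitor only reads the kernel log, it stays passive: it does not change how the tests run, and an empty log file simply means no matching messages were seen.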

Motivation

We've been seeing flaky CI failures on deb distros where dpkg reports a subprocess segfault (exit status 139) during postinst scripts (e.g. libc-bin). The suspicion is that these are OOM kills on the CI runners (2 vCPUs, 7 GB RAM), particularly during QEMU-emulated arm64 builds. There's currently no way to confirm this.
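
For reading these failures, it may help to recall the usual shell convention that a process terminated by a signal is reported as 128 plus the signal number. Under that convention 139 corresponds to signal 11 (SIGSEGV), while a process killed directly by the OOM killer receives SIGKILL and would typically surface as 137. The helper below is a hypothetical illustration, not part of this PR:

```shell
#!/usr/bin/env bash
# Sketch only: `describe_status` is an illustrative helper name.

# Print the signal name for an exit status above 128, or "exit" for a
# status that does not encode a signal.
describe_status() {
  local status="$1"
  if [ "$status" -gt 128 ]; then
    kill -l "$((status - 128))"
  else
    echo "exit"
  fi
}

# Example (not executed here):
#   describe_status 139   # signal 11
#   describe_status 137   # signal 9
```

This is one reason kernel-log evidence is useful: the exit status alone cannot show whether memory pressure was involved.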

This PR adds passive detection so we can correlate test failures with OOM events and decide on the right mitigation (e.g. limiting BuildKit parallelism).

Copilot AI review requested due to automatic review settings April 8, 2026 22:59
Copilot AI (Contributor) left a comment

Pull request overview

This PR improves CI observability for flaky integration test runs by adding passive detection/reporting of kernel OOM-killer events during the integration test job, and attaching relevant kernel log context to the existing artifacts.

Changes:

  • Adds a background dmesg follower that filters and records potential OOM-killer messages during integration tests.
  • Adds an always-run post-test step that stops the monitor, emits a GitHub Actions warning if OOM activity was detected, and copies logs into /tmp/reports.
  • Captures the last 200 lines of dmesg into the reports artifact for post-mortem debugging.
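
The post-test step described above could be sketched roughly as follows. This is an assumption-laden illustration rather than the PR's actual workflow code: the function names and file names are hypothetical, and only `/tmp/reports`, the `::warning::` annotation, and the 200-line tail come from the description.

```shell
#!/usr/bin/env bash
# Sketch only: function names and file names are illustrative assumptions.

# Stop the background monitor if its PID file exists. Errors are ignored so
# the step still succeeds when the monitor has already exited.
stop_oom_monitor() {
  local pid_file="$1"
  if [ -f "$pid_file" ]; then
    kill "$(cat "$pid_file")" 2>/dev/null || true
  fi
}

# Copy the OOM log into the reports directory when it exists, and emit a
# GitHub Actions warning annotation when it is non-empty.
report_oom() {
  local log_file="$1" report_dir="$2"
  mkdir -p "$report_dir"
  if [ -f "$log_file" ]; then
    cp "$log_file" "$report_dir/oom-monitor.log"
  fi
  if [ -s "$log_file" ]; then
    echo "::warning::Kernel OOM killer activity detected during the test run"
  fi
}

# Save the last 200 lines of the kernel log for post-mortem debugging.
capture_dmesg_tail() {
  local report_dir="$1"
  mkdir -p "$report_dir"
  sudo dmesg | tail -n 200 > "$report_dir/dmesg-tail.log"
}

# Example wiring for a step that runs regardless of test outcome
# (not executed here):
#   stop_oom_monitor /tmp/oom-monitor.log.pid
#   report_oom /tmp/oom-monitor.log /tmp/reports
#   capture_dmesg_tail /tmp/reports
```

Emitting a warning instead of failing the job keeps the detection passive, which matches the stated goal of correlating failures with OOM events before choosing a mitigation.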

cpuguy83 marked this pull request as draft April 8, 2026 23:11
Monitor kernel dmesg for OOM killer messages during integration tests
to help diagnose flaky dpkg segfaults (exit status 139) on deb distros
that may be caused by memory pressure on the CI runners.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
cpuguy83 force-pushed the fix_test_memory_usage branch from 663e68e to 5879ce3 April 8, 2026 23:24
Add timeout signaling from test2json2gha to GITHUB_OUTPUT so subsequent
CI steps can detect when tests timed out. On timeout, the dump logs step
now collects goroutine stacks, a binary heap profile, and the dockerd
binary from the runner for offline analysis with go tool pprof.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
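
As a rough illustration of the signaling mechanism mentioned in this commit message: GitHub Actions lets a step publish outputs by appending `name=value` lines to the file named by `GITHUB_OUTPUT`. The sketch below shows that mechanism in shell. The output name `timeout` and the function name are assumptions, and test2json2gha itself may write the value differently.

```shell
#!/usr/bin/env bash
# Sketch only: the output name "timeout" and this helper are hypothetical.

# Append a timeout flag to the step's output file so later steps can read it.
# Fails with a message if GITHUB_OUTPUT is not set (for example, when run
# outside GitHub Actions).
signal_timeout() {
  local timed_out="$1"
  echo "timeout=${timed_out}" >> "${GITHUB_OUTPUT:?GITHUB_OUTPUT is not set}"
}

# Example (not executed here):
#   signal_timeout true
```

A later step could then gate its diagnostics on that output with an `if:` condition that reads the value through the `steps` context.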