ci: add OOM kill detection to integration test runs #1028
Draft
cpuguy83 wants to merge 2 commits into project-dalec:main
Pull request overview
This PR improves CI observability for flaky integration test runs by adding passive detection/reporting of kernel OOM-killer events during the integration test job, and attaching relevant kernel log context to the existing artifacts.
Changes:
- Adds a background `dmesg` follower that filters and records potential OOM-killer messages during integration tests.
- Adds an always-run post-test step that stops the monitor, emits a GitHub Actions warning if OOM activity was detected, and copies logs into `/tmp/reports`.
- Captures the last 200 lines of `dmesg` into the reports artifact for post-mortem debugging.
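The monitoring flow listed above could be sketched roughly as follows. This is a minimal illustration and not the workflow code from this PR: the function names, the grep pattern, and the `/tmp/oom.log` and `dmesg-tail.log` paths are assumptions made for the example. Only `dmesg --follow`, `tail -n`, and the `::warning::` workflow command are existing interfaces.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of passive OOM detection for a CI job.
# Function names, the match pattern, and file paths are assumptions.

# Read kernel log lines on stdin and print only those that look like
# OOM-killer activity. --line-buffered keeps output flowing when the
# input is a long-running `dmesg --follow` stream.
filter_oom() {
  grep --line-buffered -iE 'out of memory|oom-kill|oom_reaper' || true
}

# Emit a GitHub Actions warning annotation if the given log file is
# non-empty; otherwise print a plain status line.
report_oom() {
  local log="$1"
  if [ -s "$log" ]; then
    echo "::warning::Possible OOM kill detected during integration tests"
  else
    echo "No OOM activity detected"
  fi
}

# Example use inside CI steps (needs permission to read the kernel log):
#   sudo dmesg --follow | filter_oom > /tmp/oom.log &
#   monitor_pid=$!
#   ... run the integration tests ...
#   kill "$monitor_pid" || true
#   report_oom /tmp/oom.log
#   sudo dmesg | tail -n 200 > /tmp/reports/dmesg-tail.log
```

Keeping the filter and the report as separate functions means the detection logic can be exercised with canned input, without needing access to the real kernel log.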
Monitor kernel dmesg for OOM killer messages during integration tests to help diagnose flaky dpkg segfaults (exit status 139) on deb distros that may be caused by memory pressure on the CI runners. Signed-off-by: Brian Goff <cpuguy83@gmail.com>
663e68e to 5879ce3
Add timeout signaling from test2json2gha to GITHUB_OUTPUT so subsequent CI steps can detect when tests timed out. On timeout, the dump logs step now collects goroutine stacks, a binary heap profile, and the dockerd binary from the runner for offline analysis with go tool pprof. Signed-off-by: Brian Goff <cpuguy83@gmail.com>
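For illustration, signaling a timeout through `GITHUB_OUTPUT` might look like the sketch below. In this PR the signal is written by test2json2gha itself; the shell function, the output name `timed_out`, and the step id `test` are hypothetical and only show the mechanism of appending a `name=value` line to the file that `GITHUB_OUTPUT` points to.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: record a timeout flag as a step output.
# The output name `timed_out` is an assumption, not taken from the PR.

# Append `timed_out=<value>` to the step output file. Outside of GitHub
# Actions, GITHUB_OUTPUT is unset, so the line is discarded.
signal_timeout() {
  local value="$1"
  local out="${GITHUB_OUTPUT:-/dev/null}"
  echo "timed_out=${value}" >> "$out"
}

# Example use at the end of a test step:
#   signal_timeout true
```

A later step could then be gated on the value with a condition such as `if: steps.test.outputs.timed_out == 'true'`, so the more expensive diagnostics are collected only when a timeout actually occurred.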
Summary
- Adds a background `dmesg` monitor to integration test jobs that watches for kernel OOM killer messages during the test run
- Emits a `::warning::` annotation if any such messages were detected
- Captures `dmesg` output into the existing test reports artifact for post-mortem analysis

Motivation
We've been seeing flaky CI failures on deb distros where dpkg reports a subprocess segfault (exit status 139) during postinst scripts (e.g. `libc-bin`). The suspicion is that these are OOM kills on the CI runners (2 vCPUs, 7 GB RAM), particularly during QEMU-emulated arm64 builds. There's currently no way to confirm this.

This PR adds passive detection so we can correlate test failures with OOM events and decide on the right mitigation (e.g. limiting BuildKit parallelism).