MINOR Capture heap dump after OOM on CI #19031
base: trunk
Conversation
It looks like the UserQuotaTest is the likely culprit. From https://github.com/apache/kafka/actions/runs/13523225366/job/37791507296
and from https://github.com/apache/kafka/actions/runs/13534291834/job/37823471305
it appears that the Gradle worker is trying to send results to the main process, which causes a long GC pause, which in turn triggers the "GC overhead limit exceeded" error (the JVM raises this when, by default, it spends more than 98% of its time in GC while recovering less than 2% of the heap).
build.gradle (Outdated)
@@ -54,7 +54,7 @@ ext {
   buildVersionFileName = "kafka-version.properties"

   defaultMaxHeapSize = "2g"
-  defaultJvmArgs = ["-Xss4m", "-XX:+UseParallelGC"]
+  defaultJvmArgs = ["-Xss4m", "-XX:+UseParallelGC", "-XX:-UseGCOverheadLimit"]
@ijuma WDYT about disabling this feature? From what I can tell, this will prevent a long GC pause from triggering an OOM. Instead, the build would likely just time out (which it's doing anyway, with the OOM happening in the Gradle worker).
As you said, the build is unlikely to succeed in either case. The GC overhead thing at least gives a hint that there is a memory leak or the heap is too small. Isn't that better than a timeout with no information?
Seems to have reproduced here: https://github.com/apache/kafka/actions/runs/13550598471/job/37873138268?pr=19031 No activity for a while after
This suggests that
I've not been able to reproduce the Gradle OOM that we're seeing on trunk; however, I saw a different OOM over on my fork: https://github.com/mumrah/kafka/actions/runs/13639578283/job/38126853058
(The heap dump was uploaded here: https://github.com/mumrah/kafka/actions/runs/13639578283/artifacts/2685005476) This at least shows that the OOM arguments and heap dump archiving are working.
mkdir -p heap-dumps
HEAP_DUMP_DIR=$(readlink -f heap-dumps)
timeout ${TIMEOUT_MINUTES}m ./gradlew --continue --no-scan \
  -Dorg.gradle.jvmargs="-Xmx4g -Xss4m -XX:+UseParallelGC -XX:+UseGCOverheadLimit -XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=$HEAP_DUMP_DIR" \
We need to pass the heap dump path directly to Gradle as well as to JUnit (inside build.gradle). That's why we have this apparent duplication.
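For context, a rough sketch of the build.gradle half of that duplication, assuming a test-task block along these lines (the heapDumpPath property name and the fallback path are illustrative, not the actual code):

tasks.withType(Test).configureEach {
    maxHeapSize = defaultMaxHeapSize
    // The forked JUnit test JVMs do not inherit org.gradle.jvmargs,
    // so the heap dump flags have to be repeated here.
    jvmArgs defaultJvmArgs + [
        "-XX:+HeapDumpOnOutOfMemoryError",
        "-XX:HeapDumpPath=" + (project.findProperty("heapDumpPath") ?: "$buildDir/heap-dumps")
    ]
}

The Gradle daemon/worker half is the -Dorg.gradle.jvmargs value passed by the CI step above; the two have to be kept in sync by hand.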
@chia7712 Even with the recent JUnit heap increase, we are seeing OOMs on trunk: https://github.com/apache/kafka/actions/runs/13807500150/job/38622104924#step:6:616
Can you take a look at this one?
Roger that.
However, the memory offered by a GitHub Actions runner is 16 GB. Our CI executes four workers, each running four tests in parallel. Consequently, if a PR happens to hit a set of "expensive" tests simultaneously, it may exceed the memory limit.
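As a rough, back-of-the-envelope illustration (assuming the 2g defaultMaxHeapSize per test JVM from build.gradle and the 4g daemon heap from the CI script above): four test JVMs plus the daemon already account for 4 × 2 GB + 4 GB = 12 GB of configured heap, before metaspace, thread stacks, and the OS itself, so a handful of unusually memory-hungry tests running at the same time can plausibly push the runner past its 16 GB limit.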
@chia7712 Thanks for investigating. The metric creation inside a loop sounds like the likely culprit.
WDYT about merging this PR, @chia7712? I think it would help us diagnose OOMs much faster if we had a heap dump (which of course includes a thread dump in it).
I agree that collecting the OOM logs is a good idea. I left two small comments, PTAL.
mkdir -p heap-dumps
HEAP_DUMP_DIR=$(readlink -f heap-dumps)
timeout ${TIMEOUT_MINUTES}m ./gradlew --continue --no-scan \
  -Dorg.gradle.jvmargs="-Xmx4g -Xss4m -XX:+UseParallelGC -XX:+ExitOnOutOfMemoryError -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=$HEAP_DUMP_DIR" \
Maybe we can add a comment to gradle.properties to remind us to keep the memory configs consistent?
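Something along these lines, perhaps (exact wording is just a suggestion):

# NOTE: keep these JVM memory settings in sync with the -Dorg.gradle.jvmargs value
# passed by the CI script and with defaultMaxHeapSize / defaultJvmArgs in build.gradle.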
  -PcommitId=xxxxxxxxxxxxxxxx \
  $TEST_TASK
exitcode="$?"
find heap-dumps
Is this used for debugging? If so, maybe using ls would offer more useful output?
We have seen a few OOM errors on trunk lately. This patch adds the ability to capture a heap dump when this happens so we can better determine if the error was due to something in Gradle or within our own tests (like a memory leak).