-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Fix shutdown deadlock caused by unbuffered async error channel #14452
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Fix shutdown deadlock caused by unbuffered async error channel #14452
Conversation
|
|
|
Hi @bogdandrutu |
|
Hi @songy23 This PR fixes a shutdown deadlock caused by an unbuffered async error channel (can hang the Collector during SIGTERM in Kubernetes). |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #14452 +/- ##
==========================================
- Coverage 91.86% 91.84% -0.02%
==========================================
Files 677 677
Lines 42680 42680
==========================================
- Hits 39206 39200 -6
- Misses 2421 2425 +4
- Partials 1053 1055 +2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Hi @songy23 👋 |
|
A test to both confirm the issue and prevent regressions would be good. |
716303a to
820092a
Compare
Thanks for the suggestion! |
|
Did you run your test, or did you purely rely on an LLM's answer? |
Thanks for calling that out — I did run the test locally. You’re right that the initial version of the test did not clearly differentiate the buggy behavior from the fixed one. I’ve since updated the test to explicitly exercise the shutdown path where a fatal error is reported after the main loop exits, and verified that it completes without hanging. I ran the updated test locally using: With the buffered channel fix in place, the test passes reliably. Without the fix, shutdown can block indefinitely, which the test is now designed to catch via the timeout. Please let me know if you’d like the test structured differently or made more explicit. |
e734674 to
38708d3
Compare
Signed-off-by: WHOIM1205 <[email protected]>
fc6d320 to
f5d3086
Compare
Summary
The Collector can deadlock during shutdown if a component reports a
StatusFatalErrorwhile the shutdown sequence is already in progress.This happens because
asyncErrorChannelis unbuffered and its only receiver exits before shutdown begins, causing a blocking send and an indefinite hang.This PR fixes the issue by buffering
asyncErrorChannelwith capacity 1, preserving existing behavior while guaranteeing shutdown never blocks.Problem Description
Root cause
asyncErrorChannelis created as an unbuffered channelshutdown()startsStatusFatalErrorasyncErrorChannelResult
TerminatingReproduction Steps (Realistic)
kubectl rollout restart)StatusFatalErrorduring shutdownTerminatinguntil SIGKILLFix