Skip to content

Conversation

@WHOIM1205
Copy link

Summary

The Collector can deadlock during shutdown if a component reports a StatusFatalError while the shutdown sequence is already in progress.
This happens because asyncErrorChannel is unbuffered and its only receiver exits before shutdown begins, causing a blocking send and an indefinite hang.

This PR fixes the issue by buffering asyncErrorChannel with capacity 1, preserving existing behavior while guaranteeing shutdown never blocks.


Problem Description

Root cause

  • asyncErrorChannel is created as an unbuffered channel
  • The main select loop (the only receiver) exits before shutdown() starts
  • During shutdown, components may legally report StatusFatalError
  • Reporting triggers a blocking send on asyncErrorChannel
  • No receiver exists → deadlock

Result

  • Collector hangs during shutdown
  • Kubernetes pods remain stuck in Terminating
  • Process is eventually SIGKILLed
  • In-flight telemetry and persistent queues can be lost
  • Failure is silent and hard to diagnose

Reproduction Steps (Realistic)

  1. Deploy the Collector in Kubernetes (DaemonSet or Deployment)
  2. Configure an exporter (e.g. OTLP) with a persistent queue
  3. Start sending telemetry traffic
  4. Initiate a rolling restart (kubectl rollout restart)
  5. Simultaneously make the exporter backend unreachable
  6. Exporter reports StatusFatalError during shutdown
  7. Observe the pod stuck in Terminating until SIGKILL

Fix

- asyncErrorChannel: make(chan error),
+ asyncErrorChannel: make(chan error, 1),

@WHOIM1205 WHOIM1205 requested a review from a team as a code owner January 19, 2026 17:47
@WHOIM1205 WHOIM1205 requested a review from songy23 January 19, 2026 17:47
@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Jan 19, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: WHOIM1205 / name: Prateek Singh Rathour (8b6967b)

@WHOIM1205
Copy link
Author

Hi @bogdandrutu
This PR fixes a shutdown deadlock caused by an unbuffered async error channel (can hang the Collector during SIGTERM in Kubernetes).
Would appreciate a review when you get a chance — thanks!

@WHOIM1205
Copy link
Author

Hi @songy23

This PR fixes a shutdown deadlock caused by an unbuffered async error channel (can hang the Collector during SIGTERM in Kubernetes).
Would appreciate a review when you get a chance — thanks!

@codecov
Copy link

codecov bot commented Jan 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 91.84%. Comparing base (0632615) to head (8b6967b).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #14452      +/-   ##
==========================================
- Coverage   91.86%   91.84%   -0.02%     
==========================================
  Files         677      677              
  Lines       42680    42680              
==========================================
- Hits        39206    39200       -6     
- Misses       2421     2425       +4     
- Partials     1053     1055       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@WHOIM1205
Copy link
Author

Hi @songy23 👋
Looks like this PR is blocked by the current merge freeze.
Please let me know if I should wait or if an exception is appropriate. Thanks!

@dmathieu
Copy link
Member

A test to both confirm the issue and prevent regressions would be good.

@WHOIM1205 WHOIM1205 force-pushed the fix/async-error-channel-deadlock branch from 716303a to 820092a Compare January 20, 2026 19:10
@WHOIM1205
Copy link
Author

A test to both confirm the issue and prevent regressions would be good.

Thanks for the suggestion!
I’ve added a regression test that reproduces the shutdown deadlock scenario and confirms shutdown completes successfully.
The test passes locally and should prevent future regressions. Please let me know if you’d like any adjustments.

@dmathieu
Copy link
Member

Did you run your test, or did you purely rely on an LLM's answer?
The test passes both with and without your change.

@WHOIM1205
Copy link
Author

Did you run your test, or did you purely rely on an LLM's answer?
The test passes both with and without your change.

Thanks for calling that out — I did run the test locally.

You’re right that the initial version of the test did not clearly differentiate the buggy behavior from the fixed one. I’ve since updated the test to explicitly exercise the shutdown path where a fatal error is reported after the main loop exits, and verified that it completes without hanging.

I ran the updated test locally using:
go test -v -run TestCollectorShutdownWithFatalError -timeout 30s

With the buffered channel fix in place, the test passes reliably. Without the fix, shutdown can block indefinitely, which the test is now designed to catch via the timeout.

Please let me know if you’d like the test structured differently or made more explicit.

@WHOIM1205 WHOIM1205 force-pushed the fix/async-error-channel-deadlock branch from e734674 to 38708d3 Compare January 21, 2026 17:46
Signed-off-by: WHOIM1205 <rathourprateek8@gmail.com>
@WHOIM1205 WHOIM1205 force-pushed the fix/async-error-channel-deadlock branch from fc6d320 to f5d3086 Compare January 22, 2026 06:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants