Fix shutdown deadlock caused by unbuffered async error channel #14452

WHOIM1205 · 2026-01-19T17:47:02Z

Summary

The Collector can deadlock during shutdown if a component reports a StatusFatalError while the shutdown sequence is already in progress.
This happens because asyncErrorChannel is unbuffered and its only receiver exits before shutdown begins, causing a blocking send and an indefinite hang.

This PR fixes the issue by buffering asyncErrorChannel with capacity 1, preserving existing behavior while guaranteeing shutdown never blocks.

Problem Description

Root cause

asyncErrorChannel is created as an unbuffered channel
The main select loop (the only receiver) exits before shutdown() starts
During shutdown, components may legally report StatusFatalError
Reporting triggers a blocking send on asyncErrorChannel
No receiver exists → deadlock

Result

Collector hangs during shutdown
Kubernetes pods remain stuck in Terminating
Process is eventually SIGKILLed
In-flight telemetry and persistent queues can be lost
Failure is silent and hard to diagnose

Reproduction Steps (Realistic)

Deploy the Collector in Kubernetes (DaemonSet or Deployment)
Configure an exporter (e.g. OTLP) with a persistent queue
Start sending telemetry traffic
Initiate a rolling restart (kubectl rollout restart)
Simultaneously make the exporter backend unreachable
Exporter reports StatusFatalError during shutdown
Observe the pod stuck in Terminating until SIGKILL

Fix

- asyncErrorChannel: make(chan error),
+ asyncErrorChannel: make(chan error, 1),

linux-foundation-easycla · 2026-01-19T17:47:09Z

The committers listed above are authorized under a signed CLA.

✅ login: WHOIM1205 / name: Prateek Singh Rathour (8b6967b)

WHOIM1205 · 2026-01-19T17:48:07Z

Hi @bogdandrutu
This PR fixes a shutdown deadlock caused by an unbuffered async error channel (can hang the Collector during SIGTERM in Kubernetes).
Would appreciate a review when you get a chance — thanks!

WHOIM1205 · 2026-01-19T17:53:23Z

Hi @songy23

This PR fixes a shutdown deadlock caused by an unbuffered async error channel (can hang the Collector during SIGTERM in Kubernetes).
Would appreciate a review when you get a chance — thanks!

codecov · 2026-01-20T02:33:52Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 91.84%. Comparing base (0632615) to head (8b6967b).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #14452      +/-   ##
==========================================
- Coverage   91.86%   91.84%   -0.02%     
==========================================
  Files         677      677              
  Lines       42680    42680              
==========================================
- Hits        39206    39200       -6     
- Misses       2421     2425       +4     
- Partials     1053     1055       +2

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

WHOIM1205 · 2026-01-20T06:42:03Z

Hi @songy23 👋
Looks like this PR is blocked by the current merge freeze.
Please let me know if I should wait or if an exception is appropriate. Thanks!

dmathieu · 2026-01-20T16:09:48Z

A test to both confirm the issue and prevent regressions would be good.

WHOIM1205 · 2026-01-20T19:24:25Z

A test to both confirm the issue and prevent regressions would be good.

Thanks for the suggestion!
I’ve added a regression test that reproduces the shutdown deadlock scenario and confirms shutdown completes successfully.
The test passes locally and should prevent future regressions. Please let me know if you’d like any adjustments.

dmathieu · 2026-01-21T09:57:20Z

Did you run your test, or did you purely rely on an LLM's answer?
The test passes both with and without your change.

WHOIM1205 · 2026-01-21T10:49:14Z

Did you run your test, or did you purely rely on an LLM's answer?
The test passes both with and without your change.

Thanks for calling that out — I did run the test locally.

You’re right that the initial version of the test did not clearly differentiate the buggy behavior from the fixed one. I’ve since updated the test to explicitly exercise the shutdown path where a fatal error is reported after the main loop exits, and verified that it completes without hanging.

I ran the updated test locally using:
go test -v -run TestCollectorShutdownWithFatalError -timeout 30s

With the buffered channel fix in place, the test passes reliably. Without the fix, shutdown can block indefinitely, which the test is now designed to catch via the timeout.

Please let me know if you’d like the test structured differently or made more explicit.

Signed-off-by: WHOIM1205 <rathourprateek8@gmail.com>

WHOIM1205 requested a review from a team as a code owner January 19, 2026 17:47

WHOIM1205 requested a review from songy23 January 19, 2026 17:47

WHOIM1205 force-pushed the fix/async-error-channel-deadlock branch from 716303a to 820092a Compare January 20, 2026 19:10

WHOIM1205 force-pushed the fix/async-error-channel-deadlock branch from e734674 to 38708d3 Compare January 21, 2026 17:46

Fix asyncErrorChannel deadlock on shutdown

f5d3086

Signed-off-by: WHOIM1205 <rathourprateek8@gmail.com>

WHOIM1205 force-pushed the fix/async-error-channel-deadlock branch from fc6d320 to f5d3086 Compare January 22, 2026 06:01

WHOIM1205 added 2 commits January 22, 2026 11:33

Merge branch 'main' into fix/async-error-channel-deadlock

64f4d0a

Merge branch 'main' into fix/async-error-channel-deadlock

8b6967b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix shutdown deadlock caused by unbuffered async error channel #14452

Fix shutdown deadlock caused by unbuffered async error channel #14452

Uh oh!

WHOIM1205 commented Jan 19, 2026

Uh oh!

linux-foundation-easycla bot commented Jan 19, 2026 •

edited

Loading

Uh oh!

WHOIM1205 commented Jan 19, 2026

Uh oh!

WHOIM1205 commented Jan 19, 2026

Uh oh!

codecov bot commented Jan 20, 2026 •

edited

Loading

Uh oh!

WHOIM1205 commented Jan 20, 2026

Uh oh!

dmathieu commented Jan 20, 2026

Uh oh!

WHOIM1205 commented Jan 20, 2026

Uh oh!

dmathieu commented Jan 21, 2026

Uh oh!

WHOIM1205 commented Jan 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Fix shutdown deadlock caused by unbuffered async error channel #14452

Are you sure you want to change the base?

Fix shutdown deadlock caused by unbuffered async error channel #14452

Uh oh!

Conversation

WHOIM1205 commented Jan 19, 2026

Summary

Problem Description

Reproduction Steps (Realistic)

Fix

Uh oh!

linux-foundation-easycla bot commented Jan 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

WHOIM1205 commented Jan 19, 2026

Uh oh!

WHOIM1205 commented Jan 19, 2026

Uh oh!

codecov bot commented Jan 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

WHOIM1205 commented Jan 20, 2026

Uh oh!

dmathieu commented Jan 20, 2026

Uh oh!

WHOIM1205 commented Jan 20, 2026

Uh oh!

dmathieu commented Jan 21, 2026

Uh oh!

WHOIM1205 commented Jan 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

linux-foundation-easycla bot commented Jan 19, 2026 •

edited

Loading

codecov bot commented Jan 20, 2026 •

edited

Loading