Fix batch processor shutdown deadlock & goroutine leak under load #14463
base: main
Conversation
Signed-off-by: WHOIM1205 <rathourprateek8@gmail.com>
Force-pushed from 1c0b498 to d39b5b0
}()

// Wait for the shard to be blocked in export
time.Sleep(50 * time.Millisecond)
Avoid time.Sleep calls, please.
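For illustration, a minimal sketch of one way to drop the sleep by synchronizing on channels instead, assuming the blocking test exporter signals when export starts. The names (exportEntered, releaseExport) are illustrative and not taken from this PR.

```go
package batchprocessor_test

import (
	"testing"
	"time"
)

// Sketch only: wait for a signal from the blocking exporter instead of sleeping.
func TestShutdownWhileExportBlocked_sketch(t *testing.T) {
	exportEntered := make(chan struct{}) // closed when export starts
	releaseExport := make(chan struct{}) // closed by the test to unblock the export

	// Stand-in for the processor's export goroutine; in the real test this
	// signaling would live inside the blocking test consumer.
	go func() {
		close(exportEntered) // export has started
		<-releaseExport      // stay blocked until the test releases it
	}()

	// Instead of time.Sleep, wait deterministically for the export to block.
	select {
	case <-exportEntered:
	case <-time.After(5 * time.Second):
		t.Fatal("timed out waiting for export to start")
	}

	close(releaseExport) // unblock so shutdown can complete
}
```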
close(shutdownDone)
}()

// Unblock the consumer so shutdown can complete
It is not necessary to comment every line.
var errTooManyBatchers = consumererror.NewPermanent(errors.New("too many batcher metadata-value combinations"))

// errShuttingDown is returned when data is received while the processor is shutting down.
var errShuttingDown = errors.New("batch processor is shutting down")
Depending on the component sending the data, I think this can be retried if it is not marked as permanent.
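For context, a sketch of the distinction the reviewer is pointing at, using the upstream consumererror helpers. This paraphrases the diff above; it is not the PR's final code.

```go
package batchprocessor

import (
	"errors"

	"go.opentelemetry.io/collector/consumer/consumererror"
)

// Not retryable: hitting the metadata-cardinality limit will not resolve on
// retry, so it is wrapped with consumererror.NewPermanent
// (consumererror.IsPermanent reports true for it).
var errTooManyBatchers = consumererror.NewPermanent(errors.New("too many batcher metadata-value combinations"))

// Left as a plain error, so the component sending the data may treat it as
// retryable or apply backpressure. (This reflects the reviewer's suggestion,
// not necessarily what the PR implements.)
var errShuttingDown = errors.New("batch processor is shutting down")
```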
Summary
This PR fixes a critical shutdown deadlock and goroutine leak in the core batch processor caused by blocking channel sends that ignore context cancellation and shutdown signals.
Under high load, this bug can cause the OpenTelemetry Collector to hang indefinitely during shutdown; Kubernetes then SIGKILLs the pod and all buffered telemetry is lost.
Problem Description
The batch processor’s consume() path performs an unconditional blocking send to an internal buffered channel.

Key issues:
- consume() ignores context.Context, so request cancellation and deadlines have no effect.
- During shutdown, consume() never observes the shutdown signal and can keep blocking on a full channel.

This leads to deadlock, goroutine leaks, and silent telemetry loss.
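A simplified sketch of the problematic pattern (types are reduced to the essentials; this is a paraphrase, not the exact upstream code):

```go
package batchsketch

import "context"

// shard stands in for the processor's per-shard batcher; its goroutine is
// supposed to drain newItem, but while it is blocked in export the buffered
// channel fills up.
type shard struct {
	newItem chan any
}

// consume performs an unconditional blocking send: once newItem is full,
// every caller blocks forever, regardless of ctx cancellation or shutdown.
func (s *shard) consume(_ context.Context, data any) error {
	s.newItem <- data
	return nil
}
```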
Affected Code
processor/batchprocessor/batch_processor.go:
- singleShardBatcher.consume
- multiShardBatcher.consume

Root Cause
The batch processor assumes that the shard goroutine always drains the internal channel quickly, so sends to it never block for long.

These assumptions break under load: when the export path stalls, the buffered channel fills up and consume() blocks with no way to observe cancellation or shutdown.
Fix
Make consume() context- and shutdown-aware: wrap the channel send in a select that also watches
- ctx.Done() (request cancellation / deadlines)
- shutdownC (processor shutdown signal)

This preserves existing behavior while preventing deadlocks and leaks. A sketch of the pattern follows.
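The sketch below extends the simplified types from the earlier snippet. The names shutdownC and errShuttingDown mirror this PR's description, but the code is illustrative rather than the exact patch.

```go
package batchsketch

import (
	"context"
	"errors"
)

// errShuttingDown is returned instead of blocking once shutdown has started.
var errShuttingDown = errors.New("batch processor is shutting down")

type shard struct {
	newItem   chan any
	shutdownC chan struct{} // closed when the processor begins shutting down
}

// consume now gives every blocking send two escape hatches: the request's
// context and the processor-wide shutdown signal.
func (s *shard) consume(ctx context.Context, data any) error {
	select {
	case s.newItem <- data:
		return nil // enqueued, same behavior as before
	case <-ctx.Done():
		return ctx.Err() // request cancelled or deadline exceeded
	case <-s.shutdownC:
		return errShuttingDown // stop blocking once shutdown has started
	}
}
```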
Tests Added
- Shutdown safety: shutdown completes even while a shard is blocked in export, and no goroutines are leaked.
- Context cancellation: Consume* respects request timeouts and returns context.DeadlineExceeded instead of blocking (see the sketch below).
- Regression coverage
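For illustration, a sketch of what the context-cancellation test could look like against the select-based consume() sketched above (not the PR's actual test code):

```go
package batchsketch

import (
	"context"
	"errors"
	"testing"
	"time"
)

func TestConsumeRespectsDeadline_sketch(t *testing.T) {
	s := &shard{
		newItem:   make(chan any), // never drained, so a plain send would block forever
		shutdownC: make(chan struct{}),
	}

	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Millisecond)
	defer cancel()

	// With the select-based consume(), the call returns once the deadline
	// fires instead of hanging on the channel send.
	if err := s.consume(ctx, "payload"); !errors.Is(err, context.DeadlineExceeded) {
		t.Fatalf("expected context.DeadlineExceeded, got %v", err)
	}
}
```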
Reproduction (Before Fix)
- otlpreceiver
- batchprocessor

Impact
Before
- Shutdown can hang indefinitely under load; goroutines leak and buffered telemetry is lost.

After
- Shutdown completes promptly; consume() returns an error on cancellation or shutdown instead of blocking.
Risk Assessment