Conversation

@WHOIM1205

Summary

This PR fixes a critical shutdown deadlock and goroutine leak in the core batch processor caused by blocking channel sends that ignore context cancellation and shutdown signals.

Under high load, this bug can cause the OpenTelemetry Collector to hang indefinitely during shutdown, resulting in SIGKILL by Kubernetes and 100% loss of buffered telemetry.


Problem Description

The batch processor’s consume() path performs an unconditional blocking send to an internal buffered channel.

Key issues:

  • consume() ignores context.Context
  • Channel buffer is bounded
  • Blocked senders are never released during shutdown

During shutdown:

  1. Receivers shut down first (topological order)
  2. In-flight requests block in consume()
  3. Batch processor shutdown never completes
  4. Collector hangs until force-killed

This leads to deadlock, goroutine leaks, and silent telemetry loss.
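
To make the failure mode concrete, here is a minimal sketch of the pre-fix send path. The type and field names are simplified stand-ins, not an exact excerpt of batch_processor.go:

package sketch

import "context"

// shard is a simplified stand-in for the batch processor's internal shard type.
type shard struct {
	newItem chan any // bounded buffer feeding the shard's batching goroutine
}

// consume shows the pre-fix behavior: the send blocks until the batching
// goroutine reads from newItem. Once shutdown stops that goroutine, a sender
// parked here never returns, and ctx is never consulted.
func (b *shard) consume(ctx context.Context, data any) error {
	_ = ctx // ignored on this path before the fix
	b.newItem <- data
	return nil
}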


Affected Code

  • processor/batchprocessor/batch_processor.go
  • singleShardBatcher.consume
  • multiShardBatcher.consume

Root Cause

The batch processor assumes:

  • Channel sends will not block for long
  • Draining buffers during shutdown is sufficient

These assumptions break under load:

  • Producers can block indefinitely
  • Shutdown does not signal or unblock senders
  • Context cancellation is ignored

Fix

Make consume() context- and shutdown-aware:

  • Replace unconditional channel send with select
  • Respect:
    • ctx.Done() (request cancellation / deadlines)
    • shutdownC (processor shutdown signal)
  • Return an error instead of blocking forever

This preserves existing behavior while preventing deadlocks and leaks.
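
A minimal sketch of the fixed send path under the same simplified names; shutdownC stands in for the processor's actual shutdown signal, and errShuttingDown is the sentinel added by this PR:

package sketch

import (
	"context"
	"errors"
)

// errShuttingDown mirrors the sentinel introduced by this PR.
var errShuttingDown = errors.New("batch processor is shutting down")

// shard is the same simplified stand-in as above, extended with a hypothetical
// shutdownC channel that the processor closes when Shutdown begins.
type shard struct {
	newItem   chan any
	shutdownC chan struct{}
}

func (b *shard) consume(ctx context.Context, data any) error {
	select {
	case b.newItem <- data:
		return nil
	case <-ctx.Done():
		// Honor request cancellation and deadlines.
		return ctx.Err()
	case <-b.shutdownC:
		// Shutdown has begun: fail fast instead of blocking forever.
		return errShuttingDown
	}
}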


Tests Added

Shutdown Safety

  • Shutdown while sends are blocked
  • Single-shard and multi-shard batchers
  • Traces, Metrics, Logs

Context Cancellation

  • Consume* respects request timeouts
  • Returns context.DeadlineExceeded instead of blocking

Regression Coverage

  • Normal batch behavior remains unchanged
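
For illustration, a hedged sketch of the shape of the shutdown-safety test described above; newBlockedBatcher and testTraces are hypothetical helpers, not the helpers actually added in this PR:

package sketch

import (
	"context"
	"testing"
	"time"

	"github.com/stretchr/testify/require"
)

func TestShutdownUnblocksPendingConsume(t *testing.T) {
	b := newBlockedBatcher(t) // hypothetical helper: a batcher whose internal buffer is already full

	consumeReturned := make(chan error, 1)
	go func() {
		// Before the fix this send blocks forever.
		consumeReturned <- b.consume(context.Background(), testTraces()) // testTraces is hypothetical
	}()

	require.NoError(t, b.Shutdown(context.Background()))

	select {
	case <-consumeReturned:
		// Key assertion: consume returned (with an error such as errShuttingDown)
		// instead of staying blocked.
	case <-time.After(time.Second):
		t.Fatal("consume is still blocked after Shutdown returned")
	}
}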

Reproduction (Before Fix)

  1. Start Collector with:
    • otlp receiver
    • batch processor
    • slow exporter
  2. Send high-volume telemetry continuously
  3. Trigger shutdown (SIGTERM / pod eviction)
  4. Observe:
    • Shutdown hangs
    • Collector is SIGKILLed
    • Buffered telemetry is lost

Impact

Before

  • Shutdown deadlock
  • Goroutine leaks
  • SIGKILL during rolling updates
  • Silent loss of buffered telemetry

After

  • Clean and fast shutdown
  • No blocked goroutines
  • Proper cancellation handling
  • Telemetry integrity preserved

Risk Assessment

  • Low risk
  • Minimal, localized change
  • Comprehensive test coverage
  • No steady-state behavior change

@WHOIM1205 WHOIM1205 requested a review from a team as a code owner January 21, 2026 19:33
Signed-off-by: WHOIM1205 <rathourprateek8@gmail.com>
@WHOIM1205 WHOIM1205 force-pushed the fix/batch-processor-shutdown-deadlock branch from 1c0b498 to d39b5b0 on January 21, 2026 19:36
}()

// Wait for the shard to be blocked in export
time.Sleep(50 * time.Millisecond)
Contributor

Avoid time.Sleep calls, please.
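
One sleep-free alternative, sketched with a hypothetical exportEntered channel rather than the PR's actual test code: have the fake blocking exporter signal when it starts, and wait on that signal with a timeout.

package sketch

import (
	"testing"
	"time"
)

// waitForBlockedExport is a hypothetical helper: the fake blocking exporter
// closes exportEntered as its first action, giving the test a deterministic
// signal that the shard is parked in export, without a fixed sleep.
func waitForBlockedExport(t *testing.T, exportEntered <-chan struct{}) {
	t.Helper()
	select {
	case <-exportEntered:
		// The shard is provably inside the blocking export; safe to trigger Shutdown.
	case <-time.After(time.Second):
		t.Fatal("shard never reached the blocking export")
	}
}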

close(shutdownDone)
}()

// Unblock the consumer so shutdown can complete
Contributor

There is no need to comment every line.

var errTooManyBatchers = consumererror.NewPermanent(errors.New("too many batcher metadata-value combinations"))

// errShuttingDown is returned when data is received while the processor is shutting down.
var errShuttingDown = errors.New("batch processor is shutting down")
Contributor

Depending on the component sending the data, I think this can be retried if it is not marked as permanent.
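
For context, a sketch of the distinction being drawn here (the variable names are illustrative): whether the sentinel is wrapped with consumererror.NewPermanent determines how retry-capable senders treat it.

package sketch

import (
	"errors"

	"go.opentelemetry.io/collector/consumer/consumererror"
)

// Left unwrapped, the sentinel reports consumererror.IsPermanent(err) == false,
// so a component that supports retries may retry the data later instead of
// dropping it.
var errShuttingDownRetryable = errors.New("batch processor is shutting down")

// Wrapped, the same sentinel reports IsPermanent(err) == true and the data is
// dropped without retry, as happens today for errTooManyBatchers.
var errShuttingDownPermanent = consumererror.NewPermanent(errShuttingDownRetryable)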

@jmacd
Contributor

jmacd commented Jan 26, 2026

See also #14473

See also #13583

I do not think we should maintain this component.
