Conversation

@WHOIM1205

Summary

This PR fixes a critical shutdown deadlock and goroutine leak in the core batch processor caused by blocking channel sends that ignore context cancellation and shutdown signals.

Under high load, this bug can cause the OpenTelemetry Collector to hang indefinitely during shutdown, resulting in SIGKILL by Kubernetes and 100% loss of buffered telemetry.


Problem Description

The batch processor’s consume() path performs an unconditional blocking send to an internal buffered channel.

Key issues:

  • consume() ignores context.Context
  • Channel buffer is bounded
  • Blocked senders are never released during shutdown

During shutdown:

  1. Receivers shut down first (topological order)
  2. In-flight requests block in consume()
  3. Batch processor shutdown never completes
  4. Collector hangs until force-killed

This leads to deadlock, goroutine leaks, and silent telemetry loss.
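
To make the failure mode concrete, here is a minimal sketch of the pre-fix send path. The type and field names are simplified stand-ins, not an exact excerpt of batch_processor.go:

package sketch

import "context"

// shard is a simplified stand-in for the batch processor's internal shard type.
type shard struct {
	newItem chan any // bounded buffer feeding the shard's batching goroutine
}

// consume shows the pre-fix behavior: the send blocks until the batching
// goroutine reads from newItem. Once shutdown stops that goroutine, a sender
// parked here never returns, and ctx is never consulted.
func (b *shard) consume(ctx context.Context, data any) error {
	_ = ctx // ignored on this path before the fix
	b.newItem <- data
	return nil
}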


Affected Code

  • processor/batchprocessor/batch_processor.go
  • singleShardBatcher.consume
  • multiShardBatcher.consume

Root Cause

The batch processor assumes:

  • Channel sends will not block for long
  • Draining buffers during shutdown is sufficient

These assumptions break under load:

  • Producers can block indefinitely
  • Shutdown does not signal or unblock senders
  • Context cancellation is ignored

Fix

Make consume() context- and shutdown-aware:

  • Replace unconditional channel send with select
  • Respect:
    • ctx.Done() (request cancellation / deadlines)
    • shutdownC (processor shutdown signal)
  • Return an error instead of blocking forever

This preserves existing behavior while preventing deadlocks and leaks.
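
A minimal sketch of the fixed send path under the same simplified names; shutdownC stands in for the processor's actual shutdown signal, and errShuttingDown is the sentinel added by this PR:

package sketch

import (
	"context"
	"errors"
)

// errShuttingDown mirrors the sentinel introduced by this PR.
var errShuttingDown = errors.New("batch processor is shutting down")

// shard is the same simplified stand-in as above, extended with a hypothetical
// shutdownC channel that the processor closes when Shutdown begins.
type shard struct {
	newItem   chan any
	shutdownC chan struct{}
}

func (b *shard) consume(ctx context.Context, data any) error {
	select {
	case b.newItem <- data:
		return nil
	case <-ctx.Done():
		// Honor request cancellation and deadlines.
		return ctx.Err()
	case <-b.shutdownC:
		// Shutdown has begun: fail fast instead of blocking forever.
		return errShuttingDown
	}
}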


Tests Added

Shutdown Safety

  • Shutdown while sends are blocked
  • Single-shard and multi-shard batchers
  • Traces, Metrics, Logs

Context Cancellation

  • Consume* respects request timeouts
  • Returns context.DeadlineExceeded instead of blocking

Regression Coverage

  • Normal batch behavior remains unchanged
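
For illustration, a hedged sketch of the shape of the shutdown-safety test described above; newBlockedBatcher and testTraces are hypothetical helpers, not the helpers actually added in this PR:

package sketch

import (
	"context"
	"testing"
	"time"

	"github.com/stretchr/testify/require"
)

func TestShutdownUnblocksPendingConsume(t *testing.T) {
	b := newBlockedBatcher(t) // hypothetical helper: a batcher whose internal buffer is already full

	consumeReturned := make(chan error, 1)
	go func() {
		// Before the fix this send blocks forever.
		consumeReturned <- b.consume(context.Background(), testTraces()) // testTraces is hypothetical
	}()

	require.NoError(t, b.Shutdown(context.Background()))

	select {
	case <-consumeReturned:
		// Key assertion: consume returned (with an error such as errShuttingDown)
		// instead of staying blocked.
	case <-time.After(time.Second):
		t.Fatal("consume is still blocked after Shutdown returned")
	}
}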

Reproduction (Before Fix)

  1. Start Collector with:
    • otlp receiver
    • batch processor
    • slow exporter
  2. Send high-volume telemetry continuously
  3. Trigger shutdown (SIGTERM / pod eviction)
  4. Observe:
    • Shutdown hangs
    • Collector is SIGKILLed
    • Buffered telemetry is lost

Impact

Before

  • Shutdown deadlock
  • Goroutine leaks
  • SIGKILL during rolling updates
  • Silent loss of buffered telemetry

After

  • Clean and fast shutdown
  • No blocked goroutines
  • Proper cancellation handling
  • Telemetry integrity preserved

Risk Assessment

  • Low risk
  • Minimal, localized change
  • Comprehensive test coverage
  • No steady-state behavior change

@WHOIM1205 WHOIM1205 requested a review from a team as a code owner January 21, 2026 19:33
Signed-off-by: WHOIM1205 <rathourprateek8@gmail.com>
@WHOIM1205 WHOIM1205 force-pushed the fix/batch-processor-shutdown-deadlock branch from 1c0b498 to d39b5b0 on January 21, 2026 19:36
}()

// Wait for the shard to be blocked in export
time.Sleep(50 * time.Millisecond)
Contributor

Avoid time.Sleep calls, please.
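
One sleep-free alternative, sketched with a hypothetical exportEntered channel rather than the PR's actual test code: have the fake blocking exporter signal when it starts, and wait on that signal with a timeout.

package sketch

import (
	"testing"
	"time"
)

// waitForBlockedExport is a hypothetical helper: the fake blocking exporter
// closes exportEntered as its first action, giving the test a deterministic
// signal that the shard is parked in export, without a fixed sleep.
func waitForBlockedExport(t *testing.T, exportEntered <-chan struct{}) {
	t.Helper()
	select {
	case <-exportEntered:
		// The shard is provably inside the blocking export; safe to trigger Shutdown.
	case <-time.After(time.Second):
		t.Fatal("shard never reached the blocking export")
	}
}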

close(shutdownDone)
}()

// Unblock the consumer so shutdown can complete
Contributor

There is no need to comment every line.

var errTooManyBatchers = consumererror.NewPermanent(errors.New("too many batcher metadata-value combinations"))

// errShuttingDown is returned when data is received while the processor is shutting down.
var errShuttingDown = errors.New("batch processor is shutting down")
Contributor

Depending on the component sending the data, I think this can be retried if it is not marked as permanent.
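
For context, a sketch of the distinction being drawn here (the variable names are illustrative): whether the sentinel is wrapped with consumererror.NewPermanent determines how retry-capable senders treat it.

package sketch

import (
	"errors"

	"go.opentelemetry.io/collector/consumer/consumererror"
)

// Left unwrapped, the sentinel reports consumererror.IsPermanent(err) == false,
// so a component that supports retries may retry the data later instead of
// dropping it.
var errShuttingDownRetryable = errors.New("batch processor is shutting down")

// Wrapped, the same sentinel reports IsPermanent(err) == true and the data is
// dropped without retry, as happens today for errTooManyBatchers.
var errShuttingDownPermanent = consumererror.NewPermanent(errShuttingDownRetryable)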

@jmacd
Contributor

jmacd commented Jan 26, 2026

See also #14473

See also #13583

I do not think we should maintain this component.
