
Conversation

@Arunodoy18

Description

This PR adds connection pooling support to the OTLP exporter to resolve performance issues in high-throughput and high-latency environments.

Motivation

As reported in the issue, users experience unreliability in the OTLP exporter with:

  • High throughput scenarios (10K+ spans/sec)
  • High-latency network connections (e.g., cross-region deployments)
  • AWS ALB limiting HTTP/2 streams to 128

The single gRPC connection becomes a bottleneck, causing queue overflow and dropped spans.

Changes:

Core Implementation:

  • Added connection_pool_size configuration parameter to the Config struct

    • Default: 0 (uses 1 connection for backward compatibility)
    • Range: 0-256 connections
    • Validated in Config.Validate()
  • Implemented a connection pool in baseExporter

    • Maintains multiple gRPC connections in a slice
    • Round-robin load balancing using an atomic counter
    • All data types (traces, metrics, logs, profiles) use the connection pool
  • Thread-safe round-robin distribution (see the sketch after this list)

    • getNextExporterIndex() method uses atomic.Uint32
    • Optimized for the single-connection case (no atomic ops)
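
A minimal Go sketch of the round-robin selection described above. The type and field names are illustrative assumptions, not the actual exporter code; it only shows an atomic round-robin index with the single-connection fast path.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// connectionPool stands in for the exporter's pool of gRPC connections.
// The conns slice uses strings here purely for illustration.
type connectionPool struct {
	conns []string
	next  atomic.Uint32
}

// getNextIndex returns the index of the connection to use for the next export.
func (p *connectionPool) getNextIndex() int {
	if len(p.conns) == 1 {
		// Single-connection case: skip the atomic operation entirely.
		return 0
	}
	// Round-robin: atomically increment the counter and wrap around the pool size.
	return int(p.next.Add(1)-1) % len(p.conns)
}

func main() {
	pool := &connectionPool{conns: []string{"conn-0", "conn-1", "conn-2"}}
	for i := 0; i < 5; i++ {
		fmt.Println(pool.conns[pool.getNextIndex()]) // conn-0, conn-1, conn-2, conn-0, conn-1
	}
}
```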

Documentation

  • Updated README.md with configuration details and examples
  • Added changelog entry in .chloggen/
  • Included high-throughput configuration example

Bug Fix

  • Also fixes an unrelated service.go issue: process metrics are no longer registered when service.telemetry.metrics.level is set to none (see the commit messages below)

Testing

  • ✅ All existing tests pass
  • ✅ No compilation errors
  • ✅ Configuration validation works correctly
  • ✅ Backward compatible

Usage Example

```yaml
exporters:
  otlp/high-throughput:
    endpoint: otel-gateway:443
    connection_pool_size: 5 # Creates 5 gRPC connections
    compression: snappy
    timeout: 20s
    sending_queue:
      num_consumers: 100
      queue_size: 2000
```

When service.telemetry.metrics.level is set to 'none', the collector
should skip registering process metrics to avoid errors on platforms
where gopsutil is not supported (such as AIX).

This change conditionally registers process metrics only when the
metrics level is not LevelNone, preventing the 'failed to register
process metrics: not implemented yet' error on unsupported platforms.

Fixes regression introduced in v0.136.0 where the check for metrics
level was removed.
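
As a rough illustration of the guard described in this commit message, here is a minimal Go sketch. The wrapper function and the way the registration callback is passed are assumptions, not the actual service.go code; only the LevelNone check and the error message come from the text above.

```go
package main

import (
	"fmt"

	"go.opentelemetry.io/collector/config/configtelemetry"
)

// registerProcessMetricsIfEnabled is a hypothetical wrapper: process metrics
// are registered only when the configured metrics level is not LevelNone.
func registerProcessMetricsIfEnabled(level configtelemetry.Level, register func() error) error {
	if level == configtelemetry.LevelNone {
		// Skip registration entirely, e.g. on platforms where gopsutil is not supported (AIX).
		return nil
	}
	if err := register(); err != nil {
		return fmt.Errorf("failed to register process metrics: %w", err)
	}
	return nil
}

func main() {
	// With LevelNone, the (possibly unsupported) registration function is never called.
	err := registerProcessMetricsIfEnabled(configtelemetry.LevelNone, func() error {
		return fmt.Errorf("not implemented yet")
	})
	fmt.Println(err) // <nil>
}
```
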
Similar to the resolution for pcommon.Value in previous changes, this update
ensures consistent documentation across all pdata types by clarifying that
calling functions on zero-initialized instances is invalid usage.

Changes:
- Updated template files (one_of_field.go, one_of_message_value.go) to generate
  improved comment wording
- Updated pcommon/value.go comments manually
- Updated all generated pdata files to use consistent wording:
  'is invalid and will cause a panic' instead of 'will cause a panic'

This makes it clearer that using zero-initialized instances is not just
dangerous but explicitly invalid usage, improving API documentation clarity.
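
For concreteness, a hypothetical pdata-style accessor carrying the new comment wording might look like this; the type and method are made up for illustration, and only the wording mirrors the change described above.

```go
// Package pdataexample is illustrative only; the type and method below are
// hypothetical and exist solely to show the revised comment wording.
package pdataexample

// Value is a stand-in for a pdata wrapper type backed by an internal pointer.
type Value struct{ orig *int64 }

// SetInt replaces the int64 value.
//
// Calling this function on a zero-initialized Value is invalid and will cause a panic.
func (v Value) SetInt(i int64) {
	*v.orig = i // panics when orig is nil, i.e. on a zero-initialized Value
}
```
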
…onfig file endpoints

Fixes open-telemetry#14286

When both OTEL_EXPORTER_OTLP_TRACES_ENDPOINT environment variable and
a configured endpoint in the config file are present, the URL scheme
from the environment variable was incorrectly overriding the scheme
from the config file, resulting in mixed endpoints (e.g., http scheme
from env var + path from config file).

This fix ensures that environment variables do not override explicitly
configured endpoints by temporarily unsetting the OTEL_EXPORTER_OTLP_*_ENDPOINT
environment variables before creating the SDK, then restoring them afterward.

According to the OpenTelemetry specification, explicit configuration
should take precedence over environment variables.

Changes:
- Modified sdk.go to temporarily unset OTEL_EXPORTER_OTLP_*_ENDPOINT
  environment variables before calling config.NewSDK()
- Added helper functions unsetOTLPEndpointEnvVars() and restoreEnvVars() (sketched below)
- Added comprehensive tests to verify env vars don't override config
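
A minimal sketch of the temporary unset-and-restore approach, under the assumption that the real helpers in sdk.go behave roughly like this (the helper names come from the list above; the bodies are assumed):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// unsetOTLPEndpointEnvVars removes every OTEL_EXPORTER_OTLP_*ENDPOINT variable
// from the environment and returns the saved values for later restoration.
func unsetOTLPEndpointEnvVars() map[string]string {
	saved := map[string]string{}
	for _, kv := range os.Environ() {
		k, v, _ := strings.Cut(kv, "=")
		if strings.HasPrefix(k, "OTEL_EXPORTER_OTLP_") && strings.HasSuffix(k, "ENDPOINT") {
			saved[k] = v
			os.Unsetenv(k)
		}
	}
	return saved
}

// restoreEnvVars puts the previously saved variables back into the environment.
func restoreEnvVars(saved map[string]string) {
	for k, v := range saved {
		os.Setenv(k, v)
	}
}

func main() {
	os.Setenv("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT", "http://env-endpoint:4318/v1/traces")
	saved := unsetOTLPEndpointEnvVars()
	// ... create the SDK here, so config-file endpoints take precedence ...
	restoreEnvVars(saved)
	fmt.Println(os.Getenv("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT")) // restored afterwards
}
```
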
…cenarios

This enhancement adds a connection_pool_size configuration option to the OTLP
exporter, enabling multiple gRPC connections with round-robin load balancing.

Key changes:
- Add connection_pool_size config parameter (default: 0, uses 1 connection; see the validation sketch below)
- Implement round-robin load balancing across multiple connections
- Support for 1-256 concurrent gRPC connections
- Backward compatible: default behavior unchanged

This resolves performance issues in high-throughput environments (10K+ spans/sec)
and high-latency network scenarios where a single gRPC connection becomes a
bottleneck.
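
A pared-down sketch of how the range validation and the zero-means-one-connection default could look. The field and method names are assumptions; only the rule (0-256, with 0 defaulting to a single connection) comes from the text above.

```go
package main

import (
	"errors"
	"fmt"
)

// Config is a minimal stand-in for the exporter configuration, showing only
// the new connection_pool_size field and its validation rule.
type Config struct {
	ConnectionPoolSize int `mapstructure:"connection_pool_size"`
}

// Validate rejects values outside the documented 0-256 range.
func (c *Config) Validate() error {
	if c.ConnectionPoolSize < 0 || c.ConnectionPoolSize > 256 {
		return errors.New("connection_pool_size must be between 0 and 256")
	}
	return nil
}

// poolSize maps the configured value to the number of connections to open:
// 0 keeps the backward-compatible single connection.
func (c *Config) poolSize() int {
	if c.ConnectionPoolSize == 0 {
		return 1
	}
	return c.ConnectionPoolSize
}

func main() {
	cfg := &Config{ConnectionPoolSize: 5}
	fmt.Println(cfg.Validate(), cfg.poolSize()) // <nil> 5
}
```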

Also fixes unrelated service.go issue per contributor feedback on PR open-telemetry#14342.
@Arunodoy18 requested review from a team, bogdandrutu, and dmitryax as code owners on January 6, 2026, 07:23
@Arunodoy18
Author

I hope this works well, as was addressed in the issue. If any unrelated changes or anything wrong turns up, please do tell after the review.
Thank you

Member

@bogdandrutu left a comment

Can you show me some data that demonstrates this is needed? gRPC says that you don't need to do this and that it will automatically use multiple sockets, etc.

@tank-500m

It doesn’t seem general enough to justify inclusion in the core component.
Also, there appear to be viable alternatives (e.g., the loadbalancing exporter in opentelemetry-collector-contrib).
