Skip to content

OTel Memory Leak #6315

Open
Open
@tjsampson

Description

@tjsampson

Description

Memory Leak in otel library code.

Environment

  • OS: Linux
  • Architecture: x86
  • Go Version: 1.23.2
    - go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.56.0
    • go.opentelemetry.io/otel v1.31.0
    • go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc v1.31.0
    • go.opentelemetry.io/otel/exporters/otlp/otlptrace v1.31.0
    • go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc v1.31.0
    • go.opentelemetry.io/otel/sdk v1.31.0
    • go.opentelemetry.io/otel/trace v1.31.0

Steps To Reproduce

See Comment here: #5190 (comment)

I am pretty sure this is still an issue or something else in the golang otel ecosystem. I will get a pprof setup possibly tomorrow, but here's some anecdotal evidence I have:

Screenshot 2024-11-07 at 6 49 14 PM

Pretty easy to see when tracing was implemented from that graph. And yes. I have removed our tracing implementation and it's back to normal memory usage.

Here is a rough draft of our setup. Please let me know if I am doing anything egregiously dumb, but for the most part, it's all pretty standard stuff take from various docs:

go.mod

go 1.23.2

require (
	go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.56.0
	go.opentelemetry.io/otel v1.31.0
	go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc v1.31.0
	go.opentelemetry.io/otel/exporters/otlp/otlptrace v1.31.0
	go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc v1.31.0
	go.opentelemetry.io/otel/sdk v1.31.0
	go.opentelemetry.io/otel/trace v1.31.0
)

We are wrapping the otelhttp.NewHandler around the toplevel muxer, so everything is traced. Yes, I know this is expensive, but it shouldn't leak memory. Eventually we will change this to include/exclude/drop stuff, just so we aren't taking in so much volume (ping routes, health checks, etc.) and do more aggressive down sampling.

func NewApi(c config.Config) *Api {
	return &Api{
		c:           &c,
		controllers: []Controller{newControllers(c)},
		server: &http.Server{
			ReadTimeout:  c.ReadTimeout,
			WriteTimeout: c.WriteTimeout,
			IdleTimeout:  c.IdleTimeout,
			Addr:         fmt.Sprintf(":%d", c.Port),
			Handler:      otelhttp.NewHandler(chi.NewMux(), "INGRESS", otelhttp.WithFilter(traceFilter)),
		},
		done:       make(chan bool),
		sigChannel: make(chan os.Signal, 1024),
	}
}

Here is how we are initializing our trace and metrics providers once on boot:

// TracerProvider an OTLP exporter, and configures the corresponding trace provider.
func TracerProvider(ctx context.Context, res *resource.Resource) (func(context.Context) error, error) {
	// Set up a trace exporter
	traceExporter, err := otlptrace.New(ctx, otlptracegrpc.NewClient())
	if err != nil {
		return nil, errors.Wrap(err, "failed to create trace exporter")
	}

	// Register the trace exporter with a TracerProvider, using a batch
	// span processor to aggregate spans before export.
	tracerProvider := sdktrace.NewTracerProvider(
		sdktrace.WithSampler(sdktrace.AlwaysSample()),
		sdktrace.WithResource(res),
		sdktrace.WithBatcher(traceExporter),
	)
	otel.SetTracerProvider(tracerProvider)
	otel.SetTextMapPropagator(
		propagation.NewCompositeTextMapPropagator(
			propagation.TraceContext{},
			propagation.Baggage{},
		))

	// Shutdown will flush any remaining spans and shut down the exporter.
	return tracerProvider.Shutdown, nil
}

// MeterProvider an OTLP exporter, and configures the corresponding meter provider.
func MeterProvider(ctx context.Context, res *resource.Resource) (func(context.Context) error, error) {
	metricExporter, err := otlpmetricgrpc.New(ctx)
	if err != nil {
		return nil, errors.Wrap(err, "failed to create metric exporter")
	}

	meterProvider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(metricExporter)),
		sdkmetric.WithResource(res),
	)
	otel.SetMeterProvider(meterProvider)

	return meterProvider.Shutdown, nil
}

Then called and shutdown on main:

	shutDownTracer, err := traceinstrument.TracerProvider(ctx, traceRes)
	if err != nil {
		log.Logger.Fatal("failed to create trace provider", zap.Error(err))
	}
	
	defer func(onShutdown func(ctx context.Context) error) {
		if errr := onShutdown(ctx); errr != nil {
			log.Logger.Error("error shutting down trace provider", zap.Error(errr))
		}
	}(shutDownTracer)

	shutdownTraceMetrics, err := traceinstrument.MeterProvider(ctx, traceRes)
	if err != nil {
		log.Logger.Fatal("failed to create meter provider", zap.Error(err))
	}

	defer func(onShutdown func(ctx context.Context) error) {
		if errr := onShutdown(ctx); errr != nil {
			log.Logger.Error("error shutting down metrics provider", zap.Error(errr))
		}
	}(shutdownTraceMetrics)

Note. We are also using the otelhttp.NewTransport to wrap the default logging transport:

http.DefaultTransport = otelhttp.NewTransport(http.DefaultTransport)

If we remove tracing setup, memory usage goes back to normal. So the leak is definitely in our tracing setup.

Expected behavior

Memory does not continuously increase over time.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingresponse neededWaiting on user input before progress can be made

    Type

    No type

    Projects

    Status

    Triaging

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions