Skip to content

tracing: fix BatchSpanProcessor goroutine leak on config reload#7826

Closed
Dean2026 wants to merge 1 commit into
caddyserver:masterfrom
Dean2026:fix/tracing-bsp-goroutine-leak
Closed

tracing: fix BatchSpanProcessor goroutine leak on config reload#7826
Dean2026 wants to merge 1 commit into
caddyserver:masterfrom
Dean2026:fix/tracing-bsp-goroutine-leak

Conversation

@Dean2026

@Dean2026 Dean2026 commented Jun 17, 2026

Copy link
Copy Markdown

tracing: fix BatchSpanProcessor goroutine leak on config reload

Description

When the tracing directive is enabled, each config reload leaks one
go.opentelemetry.io/otel/sdk/trace.(*batchSpanProcessor).processQueue
goroutine. On a server that reloads frequently (e.g. polling a remote
config source every few seconds) this accumulates into tens of thousands
of leaked goroutines over time.

A goroutine dump shows many identical stacks:

goroutine ... [select]:
go.opentelemetry.io/otel/sdk/trace.(*batchSpanProcessor).processQueue(...)
	.../sdk/trace/batch_span_processor.go:327
go.opentelemetry.io/otel/sdk/trace.NewBatchSpanProcessor.func2()
	.../sdk/trace/batch_span_processor.go:129
created by go.opentelemetry.io/otel/sdk/trace.NewBatchSpanProcessor
	.../sdk/trace/batch_span_processor.go:127

Root cause

tracing keeps a global, reference-counted TracerProvider so it can be
reused across reloads (tracerProvider.getTracerProvider). Caddy reloads
provision the new config before cleaning up the old one, so the counter
never drops to 0 and the provider is correctly reused — Shutdown is
never called, by design.

The problem is on the caller side in newOpenTelemetryWrapper:

traceExporter, err := autoexport.NewSpanExporter(ctx)
...
tracerProvider := globalTracerProvider.getTracerProvider(
    sdktrace.WithBatcher(traceExporter),   // evaluated on every reload
    sdktrace.WithResource(res),
)

sdktrace.WithBatcher(e) is WithSpanProcessor(NewBatchSpanProcessor(e)),
and NewBatchSpanProcessor starts its processQueue goroutine eagerly at
construction time
— not when the option is applied. The option is built
on every Provision (every reload), but getTracerProvider only applies
it when it actually creates a new provider (t.tracerProvider == nil). On
the reuse path the option is silently discarded, so the just-started
BatchSpanProcessor goroutine is orphaned: it is never registered with any
provider and therefore never shut down. Result: one leaked goroutine (plus
a leaked exporter) per reload.

Fix

Defer construction of the exporter/batcher until a new provider is actually
needed. getTracerProvider now takes a buildOpts factory that is invoked
only on the create path, so nothing with a side effect is built on the
reuse path.

The reference counter is now incremented only after the provider is
successfully obtained, preserving the previous semantics where a failed
exporter creation did not affect the counter.

Testing

  • Test_tracersProvider_buildOptsOnlyOnCreate — asserts buildOpts runs
    exactly once across one create + five reuses (the regression guard).
  • Test_tracersProvider_buildOptsError — asserts that on a build error the
    provider stays nil and the counter is not incremented.
  • Existing tracing tests updated for the new signature and still pass.

Verified manually with a reload loop: before the fix, 50 reloads leaked 50
processQueue goroutines; after the fix, 0 are leaked while the provider is
still reused (counter stays at 1).


🤖 AI Disclosure: Claude Opus 4.8 was used to generate the patch, and verified at the staging deployment.

When the `tracing` directive is enabled, each config reload leaks one
`go.opentelemetry.io/otel/sdk/trace.(*batchSpanProcessor).processQueue`
goroutine. On a server that reloads frequently (e.g. polling a remote
config source every few seconds) this accumulates into tens of thousands
of leaked goroutines over time.

A goroutine dump shows many identical stacks:

```
goroutine ... [select]:
go.opentelemetry.io/otel/sdk/trace.(*batchSpanProcessor).processQueue(...)
	.../sdk/trace/batch_span_processor.go:327
go.opentelemetry.io/otel/sdk/trace.NewBatchSpanProcessor.func2()
	.../sdk/trace/batch_span_processor.go:129
created by go.opentelemetry.io/otel/sdk/trace.NewBatchSpanProcessor
	.../sdk/trace/batch_span_processor.go:127
```

`tracing` keeps a global, reference-counted `TracerProvider` so it can be
reused across reloads (`tracerProvider.getTracerProvider`). Caddy reloads
provision the new config *before* cleaning up the old one, so the counter
never drops to 0 and the provider is correctly reused — `Shutdown` is
never called, by design.

The problem is on the caller side in `newOpenTelemetryWrapper`:

```go
traceExporter, err := autoexport.NewSpanExporter(ctx)
...
tracerProvider := globalTracerProvider.getTracerProvider(
    sdktrace.WithBatcher(traceExporter),   // evaluated on every reload
    sdktrace.WithResource(res),
)
```

`sdktrace.WithBatcher(e)` is `WithSpanProcessor(NewBatchSpanProcessor(e))`,
and `NewBatchSpanProcessor` **starts its `processQueue` goroutine eagerly at
construction time** — not when the option is applied. The option is built
on every `Provision` (every reload), but `getTracerProvider` only applies
it when it actually creates a new provider (`t.tracerProvider == nil`). On
the reuse path the option is silently discarded, so the just-started
BatchSpanProcessor goroutine is orphaned: it is never registered with any
provider and therefore never shut down. Result: one leaked goroutine (plus
a leaked exporter) per reload.

Defer construction of the exporter/batcher until a new provider is actually
needed. `getTracerProvider` now takes a `buildOpts` factory that is invoked
only on the create path, so nothing with a side effect is built on the
reuse path.

The reference counter is now incremented only after the provider is
successfully obtained, preserving the previous semantics where a failed
exporter creation did not affect the counter.

- `Test_tracersProvider_buildOptsOnlyOnCreate` — asserts `buildOpts` runs
  exactly once across one create + five reuses (the regression guard).
- `Test_tracersProvider_buildOptsError` — asserts that on a build error the
  provider stays nil and the counter is not incremented.
- Existing tracing tests updated for the new signature and still pass.

Verified manually with a reload loop: before the fix, 50 reloads leaked 50
`processQueue` goroutines; after the fix, 0 are leaked while the provider is
still reused (counter stays at 1).
@CLAassistant

CLAassistant commented Jun 17, 2026

Copy link
Copy Markdown

CLA assistant check
All committers have signed the CLA.

@mholt

mholt commented Jun 17, 2026

Copy link
Copy Markdown
Member

Closing due to missing AI disclosure (PR template deleted/ignored). We can reopen it if this is remedied.

@mholt mholt closed this Jun 17, 2026
@Dean2026

Copy link
Copy Markdown
Author

Closing due to missing AI disclosure (PR template deleted/ignored). We can reopen it if this is remedied.

I have updated the AI disclosure section, please review it, thanks.

@steadytao

Copy link
Copy Markdown
Member

Code direction seems right from a glance, just deferring exporter/batcher construction until the provider is actually created. One small thing before I reopen is that the PR description mentions Test_tracersProvider_buildOptsError but I only see Test_tracersProvider_buildOptsOnlyOnCreate in the branch. Could you add that build test please?

@Dean2026

Dean2026 commented Jun 18, 2026

Copy link
Copy Markdown
Author

Thanks for the review! I've added Test_tracersProvider_buildOptsError in commit 5754aca.

Ready for you to reopen whenever you'd like. Thanks!

@steadytao

Copy link
Copy Markdown
Member

Please push that commit to your PR branch...

@Dean2026

Copy link
Copy Markdown
Author

The commit is already on the branch — you can see it here: https://github.com/Dean2026/caddy/commits/fix/tracing-bsp-goroutine-leak/ .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants