Component(s)
extension/healthcheckv2
What happened?
Description
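When the context passed to the collector's Run method is cancelled and a healthcheckv2 extension is configured, the collector's shutdown sequence blocks inside the extension (in ComponentStatusChanged, before the extension's Shutdown method is ever called) and the collector deadlocks.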
Steps to Reproduce
Here's a test that triggers the failure:
// Copyright The OpenTelemetry Authors
// SPDX-License-Identifier: Apache-2.0

package healthcheckv2extension_test

import (
	"context"
	"testing"
	"time"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
	"go.opentelemetry.io/collector/component"
	"go.opentelemetry.io/collector/confmap"
	"go.opentelemetry.io/collector/otelcol"
	"go.opentelemetry.io/collector/otelcol/otelcoltest"

	"github.com/open-telemetry/opentelemetry-collector-contrib/extension/healthcheckv2extension"
	"github.com/open-telemetry/opentelemetry-collector-contrib/internal/common/testutil"
)

// staticProvider is a trivial confmap.Provider that returns a fixed map.
type staticProvider struct {
	m map[string]any
}

func (p *staticProvider) Retrieve(_ context.Context, _ string, _ confmap.WatcherFunc) (*confmap.Retrieved, error) {
	return confmap.NewRetrieved(p.m)
}

func (*staticProvider) Scheme() string { return "static" }

func (*staticProvider) Shutdown(context.Context) error { return nil }

// TestCollectorContextCancelDeadlock demonstrates that cancelling the context
// passed to col.Run causes a deadlock inside the healthcheckv2 extension.
func TestCollectorContextCancelDeadlock(t *testing.T) {
	// Build factories: nop receiver + nop exporter + healthcheckv2 extension.
	factories, err := otelcoltest.NopFactories()
	require.NoError(t, err)
	hcFactory := healthcheckv2extension.NewFactory()
	factories.Extensions[hcFactory.Type()] = hcFactory

	// Pick a free port for the healthcheck HTTP server.
	endpoint := testutil.GetAvailableLocalAddress(t)

	cfgMap := map[string]any{
		"extensions": map[string]any{
			"healthcheckv2": map[string]any{
				"use_v2": true,
				"http":   map[string]any{"endpoint": endpoint},
			},
		},
		"receivers": map[string]any{"nop": nil},
		"exporters": map[string]any{"nop": nil},
		"service": map[string]any{
			"extensions": []any{"healthcheckv2"},
			"pipelines": map[string]any{
				"traces": map[string]any{
					"receivers": []any{"nop"},
					"exporters": []any{"nop"},
				},
			},
		},
	}

	providerFactory := confmap.NewProviderFactory(func(_ confmap.ProviderSettings) confmap.Provider {
		return &staticProvider{m: cfgMap}
	})

	col, err := otelcol.NewCollector(otelcol.CollectorSettings{
		BuildInfo: component.NewDefaultBuildInfo(),
		Factories: func() (otelcol.Factories, error) { return factories, nil },
		ConfigProviderSettings: otelcol.ConfigProviderSettings{
			ResolverSettings: confmap.ResolverSettings{
				URIs:              []string{"static:config"},
				ProviderFactories: []confmap.ProviderFactory{providerFactory},
			},
		},
	})
	require.NoError(t, err)

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	runDone := make(chan error, 1)
	go func() {
		runDone <- col.Run(ctx)
	}()

	require.Eventually(t, func() bool {
		return col.GetState() == otelcol.StateRunning
	}, 5*time.Second, 100*time.Millisecond)

	// Cancel the run context. In a real deployment this happens when the process
	// receives SIGTERM and it is plumbed into the collector's run context, or when
	// the service context passed at line 385 of otelcol/collector.go is cancelled.
	cancel()

	select {
	case err := <-runDone:
		assert.NoError(t, err)
	case <-time.After(500 * time.Millisecond):
		t.Fatal("col.Run deadlocked: cancelling the run context exited the eventLoop " +
			"goroutine before the pipeline finished shutting down, leaving " +
			"ComponentStatusChanged blocked on the unbuffered eventCh forever")
	}
}
Expected Result
The collector should exit cleanly.
Actual Result
Deadlock.
Collector version
v0.150.0
Additional context
Root cause:
- NewHealthCheckExtension immediately launches go hc.eventLoop(ctx), where ctx
is the context from the factory call inside service.Start — the same context
that col.Run received.
- eventLoop exits early when ctx is cancelled (case <-ctx.Done(): return).
- The collector's shutdown sequence then tears down the pipeline. Each component
reports StatusStopping/StatusStopped by calling ComponentStatusChanged on the
extension. ComponentStatusChanged sends to the unbuffered eventCh.
- Nobody is reading eventCh (eventLoop exited). The send blocks. The shutdown
goroutine is stuck. Shutdown() on the extension is never reached, so eventCh
is never closed. col.Run never returns → deadlock.
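The mechanism in the list above can be reproduced without any collector dependencies. The following is only a minimal sketch: `eventLoop`, `sendBlocks`, and `reproduce` are hypothetical stand-ins written for illustration, not the extension's actual code. It shows that once a reader goroutine exits on `ctx.Done()`, a later send on the unbuffered channel never completes.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// eventLoop is a hypothetical stand-in for the extension's loop: it stops
// reading from eventCh as soon as ctx is cancelled.
func eventLoop(ctx context.Context, eventCh <-chan string, exited chan<- struct{}) {
	defer close(exited)
	for {
		select {
		case <-eventCh:
			// A real loop would process the status event here.
		case <-ctx.Done():
			return
		}
	}
}

// sendBlocks reports whether a send on eventCh is still blocked after wait.
// The sending goroutine is intentionally leaked when nobody reads.
func sendBlocks(eventCh chan<- string, wait time.Duration) bool {
	sent := make(chan struct{})
	go func() {
		eventCh <- "StatusStopping" // blocks forever if there is no reader
		close(sent)
	}()
	select {
	case <-sent:
		return false
	case <-time.After(wait):
		return true
	}
}

// reproduce cancels the context first, waits for the reader to exit, and
// then attempts the send that the shutdown sequence would perform.
func reproduce() bool {
	ctx, cancel := context.WithCancel(context.Background())
	eventCh := make(chan string) // unbuffered, as in the report
	exited := make(chan struct{})
	go eventLoop(ctx, eventCh, exited)

	cancel()
	<-exited // the only reader is gone from here on

	return sendBlocks(eventCh, 100*time.Millisecond)
}

func main() {
	fmt.Println("send blocked:", reproduce())
}
```

In the collector the blocked sender is the shutdown goroutine itself, which is why col.Run never returns.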
The bug on the extension side is that the context passed to Start should only be used for cancelling Start itself, not stopping background tasks. We should use a done channel for that instead.
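One possible shape of the done-channel approach is sketched below. All names here (`healthExt`, `statusEvent`, `doneCh`, `runScenario`) are assumptions made for illustration and do not mirror the extension's real types or the fix that may eventually land. The idea is that Start ignores its context for background work, the loop stops only when Shutdown closes `doneCh`, and senders also select on `doneCh` so they cannot block once the loop is gone.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// statusEvent is a hypothetical stand-in for a component status event.
type statusEvent struct{ status string }

// healthExt is a hypothetical, heavily simplified extension.
type healthExt struct {
	eventCh chan statusEvent
	doneCh  chan struct{}
	once    sync.Once
	wg      sync.WaitGroup
	seen    []string // written only by eventLoop, read after Shutdown returns
}

func newHealthExt() *healthExt {
	return &healthExt{
		eventCh: make(chan statusEvent),
		doneCh:  make(chan struct{}),
	}
}

// Start would use ctx only to bound the work of starting; the background
// loop's lifetime is tied to doneCh, not to ctx.
func (h *healthExt) Start(_ context.Context) error {
	h.wg.Add(1)
	go h.eventLoop()
	return nil
}

func (h *healthExt) eventLoop() {
	defer h.wg.Done()
	for {
		select {
		case ev := <-h.eventCh:
			h.seen = append(h.seen, ev.status)
		case <-h.doneCh:
			return
		}
	}
}

// ComponentStatusChanged drops the event instead of blocking once the
// extension has been shut down.
func (h *healthExt) ComponentStatusChanged(ev statusEvent) {
	select {
	case h.eventCh <- ev:
	case <-h.doneCh:
	}
}

// Shutdown stops the loop and waits for it to exit; safe to call twice.
func (h *healthExt) Shutdown(context.Context) error {
	h.once.Do(func() { close(h.doneCh) })
	h.wg.Wait()
	return nil
}

// runScenario cancels the start context early, then runs a shutdown
// sequence that reports status before calling Shutdown.
func runScenario() ([]string, bool) {
	ext := newHealthExt()
	ctx, cancel := context.WithCancel(context.Background())
	_ = ext.Start(ctx)
	cancel() // simulates the cancelled run context

	finished := make(chan struct{})
	go func() {
		defer close(finished)
		ext.ComponentStatusChanged(statusEvent{"StatusStopping"})
		ext.ComponentStatusChanged(statusEvent{"StatusStopped"})
		_ = ext.Shutdown(context.Background())
	}()

	select {
	case <-finished:
		return ext.seen, true
	case <-time.After(time.Second):
		return nil, false
	}
}

func main() {
	events, ok := runScenario()
	fmt.Println(events, ok)
}
```

With this arrangement, cancelling the context passed to Start has no effect on event delivery, so the status updates emitted during teardown are still consumed and Shutdown is reached.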