[extension/healthcheckv2] Deadlock when the collector run context is cancelled #47591

@swiatekm

Component(s)

extension/healthcheckv2

What happened?

Description

When the context passed to the collector's Run method is cancelled and a healthcheck extension is present, the extension's Shutdown method blocks and deadlocks the collector.

Steps to Reproduce

Here's a test that triggers the failure:

// Copyright The OpenTelemetry Authors
// SPDX-License-Identifier: Apache-2.0

package healthcheckv2extension_test

import (
	"context"
	"testing"
	"time"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
	"go.opentelemetry.io/collector/component"
	"go.opentelemetry.io/collector/confmap"
	"go.opentelemetry.io/collector/otelcol"
	"go.opentelemetry.io/collector/otelcol/otelcoltest"

	"github.com/open-telemetry/opentelemetry-collector-contrib/extension/healthcheckv2extension"
	"github.com/open-telemetry/opentelemetry-collector-contrib/internal/common/testutil"
)

// staticProvider is a trivial confmap.Provider that returns a fixed map.
type staticProvider struct {
	m map[string]any
}

func (p *staticProvider) Retrieve(_ context.Context, _ string, _ confmap.WatcherFunc) (*confmap.Retrieved, error) {
	return confmap.NewRetrieved(p.m)
}

func (*staticProvider) Scheme() string { return "static" }

func (*staticProvider) Shutdown(context.Context) error { return nil }

// TestCollectorContextCancelDeadlock demonstrates that cancelling the context
// passed to col.Run causes a deadlock inside the healthcheckv2 extension.
func TestCollectorContextCancelDeadlock(t *testing.T) {
	// Build factories: nop receiver + nop exporter + healthcheckv2 extension.
	factories, err := otelcoltest.NopFactories()
	require.NoError(t, err)
	hcFactory := healthcheckv2extension.NewFactory()
	factories.Extensions[hcFactory.Type()] = hcFactory

	// Pick a free port for the healthcheck HTTP server.
	endpoint := testutil.GetAvailableLocalAddress(t)

	cfgMap := map[string]any{
		"extensions": map[string]any{
			"healthcheckv2": map[string]any{
				"use_v2": true,
				"http":   map[string]any{"endpoint": endpoint},
			},
		},
		"receivers": map[string]any{"nop": nil},
		"exporters": map[string]any{"nop": nil},
		"service": map[string]any{
			"extensions": []any{"healthcheckv2"},
			"pipelines": map[string]any{
				"traces": map[string]any{
					"receivers": []any{"nop"},
					"exporters": []any{"nop"},
				},
			},
		},
	}

	providerFactory := confmap.NewProviderFactory(func(_ confmap.ProviderSettings) confmap.Provider {
		return &staticProvider{m: cfgMap}
	})

	col, err := otelcol.NewCollector(otelcol.CollectorSettings{
		BuildInfo: component.NewDefaultBuildInfo(),
		Factories: func() (otelcol.Factories, error) { return factories, nil },
		ConfigProviderSettings: otelcol.ConfigProviderSettings{
			ResolverSettings: confmap.ResolverSettings{
				URIs:              []string{"static:config"},
				ProviderFactories: []confmap.ProviderFactory{providerFactory},
			},
		},
	})
	require.NoError(t, err)

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	runDone := make(chan error, 1)
	go func() {
		runDone <- col.Run(ctx)
	}()

	require.Eventually(t, func() bool {
		return col.GetState() == otelcol.StateRunning
	}, 5*time.Second, 100*time.Millisecond)

	// Cancel the run context. In a real deployment this happens when the process
	// receives SIGTERM and it is plumbed into the collector's run context, or when
	// the service context passed at line 385 of otelcol/collector.go is cancelled.
	cancel()

	select {
	case err := <-runDone:
		assert.NoError(t, err)
	case <-time.After(500 * time.Millisecond):
		t.Fatal("col.Run deadlocked: cancelling the run context exited the eventLoop " +
			"goroutine before the pipeline finished shutting down, leaving " +
			"ComponentStatusChanged blocked on the unbuffered eventCh forever")
	}
}

Expected Result

The collector should exit cleanly.

Actual Result

The collector deadlocks: col.Run never returns, and the test above fails on its 500 ms timeout.

Collector version

v0.150.0

Additional context

Root cause:

  • NewHealthCheckExtension immediately launches go hc.eventLoop(ctx), where ctx
    is the context from the factory call inside service.Start — the same context
    that col.Run received.
  • eventLoop exits early when ctx is cancelled (case <-ctx.Done(): return).
  • The collector's shutdown sequence then tears down the pipeline. Each component
    reports StatusStopping/StatusStopped by calling ComponentStatusChanged on the
    extension. ComponentStatusChanged sends to the unbuffered eventCh.
  • Nobody is reading eventCh (eventLoop exited). The send blocks. The shutdown
    goroutine is stuck. Shutdown() on the extension is never reached, so eventCh
    is never closed. col.Run never returns → deadlock.

On the extension side, the bug is that the context passed to Start should only be used to cancel Start itself, not to stop background tasks. Background tasks should be stopped through a done channel instead.
