
[BUG]: Multi-target mode (/probe endpoint) reuses collectors causing incorrect elasticsearch_version metric #1112

@whg517

Description

When using the /probe endpoint for multi-target scraping, the elasticsearch_version metric always shows information from the first target queried, regardless of which target is being scraped. This is because the cluster-info collector is cached globally and reused across all probe requests.

Steps to Reproduce

  1. Start the elasticsearch_exporter:

    ./elasticsearch_exporter
  2. Make a probe request to the first target (e.g., Elasticsearch 7.17.0):

    curl 'http://localhost:9114/probe?target=es-cluster-1:9200' | grep elasticsearch_version

    Output shows: elasticsearch_version{cluster="cluster-1",...,version="7.17.0"} 1

  3. Make a probe request to a different target (e.g., Elasticsearch 8.10.0):

    curl 'http://localhost:9114/probe?target=es-cluster-2:9200' | grep elasticsearch_version

    Expected: elasticsearch_version{cluster="cluster-2",...,version="8.10.0"} 1
    Actual: elasticsearch_version{cluster="cluster-1",...,version="7.17.0"} 1

Expected Behavior

Each /probe request should query its target's Elasticsearch cluster and return the correct version information for that specific target.

Actual Behavior

The elasticsearch_version metric always shows information from the first target that was queried, because all probe requests share the same cached ClusterInfoCollector instance.

Root Cause Analysis

Problem Location

The bug is in collector/collector.go at lines 42 and 117-133:

var (
    initiatedCollectorsMtx = sync.Mutex{}
    initiatedCollectors    = make(map[string]Collector)  // ⚠️ Global cache
    // ...
)

func NewElasticsearchCollector(...) (*ElasticsearchCollector, error) {
    // ...
    for key, enabled := range collectorState {
        // ...
        if collector, ok := initiatedCollectors[key]; ok {
            collectors[key] = collector  // ⚠️ Reuses cached collector
        } else {
            collector, err := factories[key](logger, e.esURL, e.httpClient)
            // ...
            initiatedCollectors[key] = collector  // ⚠️ Caches globally
        }
    }
    // ...
}

Why Only cluster-info is Affected

In main.go, the /probe endpoint creates collectors in two ways:

  1. Through ElasticsearchCollector (lines 335-342):

    exp, err := collector.NewElasticsearchCollector(
        logger, []string{},
        collector.WithElasticsearchURL(targetURL),
        collector.WithHTTPClient(probeClient),
    )

    This creates collectors registered via registerCollector(), including:

    • cluster-info ✅ (default enabled) - AFFECTED
    • data-stream (default disabled)
    • snapshots (default disabled)
    • Other optional collectors...
  2. Directly instantiated (lines 344-356):

    reg.MustRegister(collector.NewClusterHealth(logger, probeClient, targetURL))
    reg.MustRegister(collector.NewNodes(logger, probeClient, targetURL, *esAllNodes, *esNode))
    // ... etc

    These are created fresh on every probe request - NOT AFFECTED

The cluster-info collector is the only default-enabled collector that goes through the caching mechanism, which is why only elasticsearch_version shows incorrect data.

Technical Details

The ClusterInfoCollector struct stores the target URL and HTTP client:

// collector/cluster_info.go
type ClusterInfoCollector struct {
    logger *slog.Logger
    u      *url.URL        // ⚠️ Bound to first target
    hc     *http.Client    // ⚠️ Bound to first target's client
}

func (c *ClusterInfoCollector) Update(_ context.Context, ch chan<- prometheus.Metric) error {
    resp, err := c.hc.Get(c.u.String())  // ⚠️ Always queries first target
    // ...
}

When the first probe request creates a ClusterInfoCollector, it is initialized with es-cluster-1:9200 and cached in initiatedCollectors["cluster-info"]. Every subsequent probe request reuses that same instance, so they all query es-cluster-1:9200.
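
The stale-cache mechanism can be reduced to a minimal sketch (the names here are illustrative, not the exporter's actual types):

```go
package main

// collectorSketch stands in for ClusterInfoCollector: it captures
// its target at construction time and keeps it forever.
type collectorSketch struct {
	target string
}

// cache mimics the global initiatedCollectors map, keyed only by
// collector name - the requested target is not part of the key.
var cache = map[string]*collectorSketch{}

// getCollector mimics the lookup in NewElasticsearchCollector: on a
// cache hit, the first caller's target wins for every later caller.
func getCollector(name, target string) *collectorSketch {
	if c, ok := cache[name]; ok {
		return c // reused regardless of the requested target
	}
	c := &collectorSketch{target: target}
	cache[name] = c
	return c
}
```

Calling getCollector("cluster-info", "es-cluster-2:9200") after a first call for es-cluster-1:9200 returns the instance still bound to es-cluster-1:9200, which is exactly the behavior observed in the repro steps above.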

Impact

  • Severity: Medium - affects only the elasticsearch_version metric in multi-target mode
  • Scope: Only the /probe endpoint is affected; single-target mode (using --es.uri flag) works correctly
  • Users monitoring multiple Elasticsearch clusters via /probe will see incorrect version information
  • Other metrics (cluster health, nodes, indices, etc.) are NOT affected because they use directly instantiated collectors

Proposed Solutions

Solution 1: Add Option to Skip Caching (Recommended)

Add a flag to NewElasticsearchCollector to bypass the cache for probe requests:

// collector/collector.go
type ElasticsearchCollector struct {
    Collectors map[string]Collector
    logger     *slog.Logger
    esURL      *url.URL
    httpClient *http.Client
    skipCache  bool  // Add this field
}

func WithSkipCache(skip bool) Option {
    return func(e *ElasticsearchCollector) error {
        e.skipCache = skip
        return nil
    }
}

func NewElasticsearchCollector(...) (*ElasticsearchCollector, error) {
    // ...
    for key, enabled := range collectorState {
        // ...
        // Only use cache if not skipping
        if !e.skipCache {
            if collector, ok := initiatedCollectors[key]; ok {
                collectors[key] = collector
                continue
            }
        }
        // Cache miss, or cache deliberately skipped: create a new collector
        collector, err := factories[key](logger, e.esURL, e.httpClient)
        // ...
        collectors[key] = collector
        if !e.skipCache {
            initiatedCollectors[key] = collector
        }
    }
    // ...
}

Then in main.go for /probe:

exp, err := collector.NewElasticsearchCollector(
    logger, []string{},
    collector.WithElasticsearchURL(targetURL),
    collector.WithHTTPClient(probeClient),
    collector.WithSkipCache(true),  // Add this
)

Solution 2: Use Target URL as Part of Cache Key

Modify the cache key to include the target URL:

cacheKey := fmt.Sprintf("%s:%s", key, e.esURL.String())
if collector, ok := initiatedCollectors[cacheKey]; ok {
    collectors[key] = collector
}

However, this cache grows by one entry per distinct (collector, target) pair and nothing ever evicts entries, so a long-running exporter probing many targets could see unbounded memory growth.
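
A sketch of the per-target key (again with illustrative names, not the exporter's actual identifiers) shows both the fix and the growth concern - each distinct target adds a cache entry that is never removed:

```go
package main

import "fmt"

// probeCollector stands in for a per-target collector instance.
type probeCollector struct {
	target string
}

// perTargetCache keys entries by collector name plus target URL,
// so requests for different targets no longer collide.
var perTargetCache = map[string]*probeCollector{}

// getPerTarget sketches Solution 2: a cache hit is only possible for
// the same (name, target) pair, but every new target adds an entry.
func getPerTarget(name, target string) *probeCollector {
	key := fmt.Sprintf("%s:%s", name, target)
	if c, ok := perTargetCache[key]; ok {
		return c
	}
	c := &probeCollector{target: target}
	perTargetCache[key] = c
	return c
}
```

After probing N distinct targets, perTargetCache holds N entries per collector type, which is the growth concern noted above.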

Solution 3: Disable cluster-info for Probe Mode

Explicitly disable the cluster-info collector for probe requests, since it's redundant when other collectors provide the necessary metrics.

Environment

  • Version: master branch (as of December 2024)
  • Affected Feature: Multi-target monitoring using the /probe endpoint
  • Affected Metric: elasticsearch_version

Workaround

As a temporary workaround, users can:

  1. Restart the exporter between scraping different targets (not practical)
  2. Run multiple exporter instances, one per target (defeats the purpose of /probe)
  3. Use single-target mode with separate exporter instances

Additional Notes

This issue highlights a design assumption that collectors would only be used for a single target. The caching optimization works well for single-target mode but breaks multi-target functionality. The fix should maintain backward compatibility and performance for single-target mode while properly supporting probe mode.
