
[BUG]: Multi-target mode (/probe endpoint) reuses collectors causing incorrect elasticsearch_version metric #1112

@whg517

Description

When using the /probe endpoint for multi-target scraping, the elasticsearch_version metric always shows information from the first target queried, regardless of which target is being scraped. This is because the cluster-info collector is cached globally and reused across all probe requests.

Steps to Reproduce

  1. Start the elasticsearch_exporter:

    ./elasticsearch_exporter
  2. Make a probe request to the first target (e.g., Elasticsearch 7.17.0):

    curl 'http://localhost:9114/probe?target=es-cluster-1:9200' | grep elasticsearch_version

    Output shows: elasticsearch_version{cluster="cluster-1",...,version="7.17.0"} 1

  3. Make a probe request to a different target (e.g., Elasticsearch 8.10.0):

    curl 'http://localhost:9114/probe?target=es-cluster-2:9200' | grep elasticsearch_version

    Expected: elasticsearch_version{cluster="cluster-2",...,version="8.10.0"} 1
    Actual: elasticsearch_version{cluster="cluster-1",...,version="7.17.0"} 1

Expected Behavior

Each /probe request should query its target's Elasticsearch cluster and return the correct version information for that specific target.

Actual Behavior

The elasticsearch_version metric always shows information from the first target that was queried, because all probe requests share the same cached ClusterInfoCollector instance.

Root Cause Analysis

Problem Location

The bug is in collector/collector.go at lines 42 and 117-133:

var (
    initiatedCollectorsMtx = sync.Mutex{}
    initiatedCollectors    = make(map[string]Collector)  // ⚠️ Global cache
    // ...
)

func NewElasticsearchCollector(...) (*ElasticsearchCollector, error) {
    // ...
    for key, enabled := range collectorState {
        // ...
        if collector, ok := initiatedCollectors[key]; ok {
            collectors[key] = collector  // ⚠️ Reuses cached collector
        } else {
            collector, err := factories[key](logger, e.esURL, e.httpClient)
            // ...
            initiatedCollectors[key] = collector  // ⚠️ Caches globally
        }
    }
    // ...
}

Why Only cluster-info is Affected

In main.go, the /probe endpoint creates collectors in two ways:

  1. Through ElasticsearchCollector (lines 335-342):

    exp, err := collector.NewElasticsearchCollector(
        logger, []string{},
        collector.WithElasticsearchURL(targetURL),
        collector.WithHTTPClient(probeClient),
    )

    This creates collectors registered via registerCollector(), including:

    • cluster-info ✅ (default enabled) - AFFECTED
    • data-stream (default disabled)
    • snapshots (default disabled)
    • Other optional collectors...
  2. Directly instantiated (lines 344-356):

    reg.MustRegister(collector.NewClusterHealth(logger, probeClient, targetURL))
    reg.MustRegister(collector.NewNodes(logger, probeClient, targetURL, *esAllNodes, *esNode))
    // ... etc

    These are created fresh on every probe request - NOT AFFECTED

The cluster-info collector is the only default-enabled collector that goes through the caching mechanism, which is why only elasticsearch_version shows incorrect data.

Technical Details

The ClusterInfoCollector struct stores the target URL and HTTP client:

// collector/cluster_info.go
type ClusterInfoCollector struct {
    logger *slog.Logger
    u      *url.URL        // ⚠️ Bound to first target
    hc     *http.Client    // ⚠️ Bound to first target's client
}

func (c *ClusterInfoCollector) Update(_ context.Context, ch chan<- prometheus.Metric) error {
    resp, err := c.hc.Get(c.u.String())  // ⚠️ Always queries first target
    // ...
}

When the first probe request creates a ClusterInfoCollector, it is initialized with es-cluster-1:9200 and cached in initiatedCollectors["cluster-info"]. Every subsequent probe request reuses that same instance, so they all query es-cluster-1:9200.
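
The stale-cache mechanism can be reduced to a minimal sketch (the names here are illustrative, not the exporter's actual types):

```go
package main

// collectorSketch stands in for ClusterInfoCollector: it captures
// its target at construction time and keeps it forever.
type collectorSketch struct {
	target string
}

// cache mimics the global initiatedCollectors map, keyed only by
// collector name - the requested target is not part of the key.
var cache = map[string]*collectorSketch{}

// getCollector mimics the lookup in NewElasticsearchCollector: on a
// cache hit, the first caller's target wins for every later caller.
func getCollector(name, target string) *collectorSketch {
	if c, ok := cache[name]; ok {
		return c // reused regardless of the requested target
	}
	c := &collectorSketch{target: target}
	cache[name] = c
	return c
}
```

Calling getCollector("cluster-info", "es-cluster-2:9200") after a first call for es-cluster-1:9200 returns the instance still bound to es-cluster-1:9200, which is exactly the behavior observed in the repro steps above.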

Impact

  • Severity: Medium - affects only the elasticsearch_version metric in multi-target mode
  • Scope: Only the /probe endpoint is affected; single-target mode (using --es.uri flag) works correctly
  • Users monitoring multiple Elasticsearch clusters via /probe will see incorrect version information
  • Other metrics (cluster health, nodes, indices, etc.) are NOT affected because they use directly instantiated collectors

Proposed Solutions

Solution 1: Add Option to Skip Caching (Recommended)

Add a flag to NewElasticsearchCollector to bypass the cache for probe requests:

// collector/collector.go
type ElasticsearchCollector struct {
    Collectors map[string]Collector
    logger     *slog.Logger
    esURL      *url.URL
    httpClient *http.Client
    skipCache  bool  // Add this field
}

func WithSkipCache(skip bool) Option {
    return func(e *ElasticsearchCollector) error {
        e.skipCache = skip
        return nil
    }
}

func NewElasticsearchCollector(...) (*ElasticsearchCollector, error) {
    // ...
    for key, enabled := range collectorState {
        // ...
        // Only use cache if not skipping
        if !e.skipCache {
            if collector, ok := initiatedCollectors[key]; ok {
                collectors[key] = collector
                continue
            }
        }
        // Cache miss, or cache deliberately skipped: create a new collector
        collector, err := factories[key](logger, e.esURL, e.httpClient)
        // ...
        collectors[key] = collector
        if !e.skipCache {
            initiatedCollectors[key] = collector
        }
    }
    // ...
}

Then in main.go for /probe:

exp, err := collector.NewElasticsearchCollector(
    logger, []string{},
    collector.WithElasticsearchURL(targetURL),
    collector.WithHTTPClient(probeClient),
    collector.WithSkipCache(true),  // Add this
)

Solution 2: Use Target URL as Part of Cache Key

Modify the cache key to include the target URL:

cacheKey := fmt.Sprintf("%s:%s", key, e.esURL.String())
if collector, ok := initiatedCollectors[cacheKey]; ok {
    collectors[key] = collector
}

However, this cache grows by one entry per distinct (collector, target) pair and nothing ever evicts entries, so a long-running exporter probing many targets could see unbounded memory growth.
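
A sketch of the per-target key (again with illustrative names, not the exporter's actual identifiers) shows both the fix and the growth concern - each distinct target adds a cache entry that is never removed:

```go
package main

import "fmt"

// probeCollector stands in for a per-target collector instance.
type probeCollector struct {
	target string
}

// perTargetCache keys entries by collector name plus target URL,
// so requests for different targets no longer collide.
var perTargetCache = map[string]*probeCollector{}

// getPerTarget sketches Solution 2: a cache hit is only possible for
// the same (name, target) pair, but every new target adds an entry.
func getPerTarget(name, target string) *probeCollector {
	key := fmt.Sprintf("%s:%s", name, target)
	if c, ok := perTargetCache[key]; ok {
		return c
	}
	c := &probeCollector{target: target}
	perTargetCache[key] = c
	return c
}
```

After probing N distinct targets, perTargetCache holds N entries per collector type, which is the growth concern noted above.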

Solution 3: Disable cluster-info for Probe Mode

Explicitly disable the cluster-info collector for probe requests, since it's redundant when other collectors provide the necessary metrics.

Environment

  • Version: master branch (as of December 2024)
  • Affected Feature: Multi-target monitoring using the /probe endpoint
  • Affected Metric: elasticsearch_version

Workaround

As a temporary workaround, users can:

  1. Restart the exporter between scraping different targets (not practical)
  2. Run multiple exporter instances, one per target (defeats the purpose of /probe)
  3. Use single-target mode with separate exporter instances

Additional Notes

This issue highlights a design assumption that collectors would only be used for a single target. The caching optimization works well for single-target mode but breaks multi-target functionality. The fix should maintain backward compatibility and performance for single-target mode while properly supporting probe mode.
