Description
When using the /probe endpoint for multi-target scraping, the elasticsearch_version metric always shows information from the first target queried, regardless of which target is being scraped. This is because the cluster-info collector is cached globally and reused across all probe requests.
Steps to Reproduce
- Start the elasticsearch_exporter:
  ./elasticsearch_exporter

- Make a probe request to the first target (e.g., Elasticsearch 7.17.0):
  curl 'http://localhost:9114/probe?target=es-cluster-1:9200' | grep elasticsearch_version
  Output shows:
  elasticsearch_version{cluster="cluster-1",...,version="7.17.0"} 1

- Make a probe request to a different target (e.g., Elasticsearch 8.10.0):
  curl 'http://localhost:9114/probe?target=es-cluster-2:9200' | grep elasticsearch_version
  Expected:
  elasticsearch_version{cluster="cluster-2",...,version="8.10.0"} 1
  Actual:
  elasticsearch_version{cluster="cluster-1",...,version="7.17.0"} 1 ❌
Expected Behavior
Each /probe request should query its target's Elasticsearch cluster and return the correct version information for that specific target.
Actual Behavior
The elasticsearch_version metric always shows information from the first target that was queried, because all probe requests share the same cached ClusterInfoCollector instance.
Root Cause Analysis
Problem Location
The bug is in collector/collector.go at lines 42 and 117-133:
var (
    initiatedCollectorsMtx = sync.Mutex{}
    initiatedCollectors    = make(map[string]Collector) // ⚠️ Global cache
    // ...
)

func NewElasticsearchCollector(...) (*ElasticsearchCollector, error) {
    // ...
    for key, enabled := range collectorState {
        // ...
        if collector, ok := initiatedCollectors[key]; ok {
            collectors[key] = collector // ⚠️ Reuses cached collector
        } else {
            collector, err := factories[key](logger, e.esURL, e.httpClient)
            // ...
            initiatedCollectors[key] = collector // ⚠️ Caches globally
        }
    }
    // ...
}

Why Only cluster-info is Affected
In main.go, the /probe endpoint creates collectors in two ways:
- Through ElasticsearchCollector (lines 335-342):

  exp, err := collector.NewElasticsearchCollector(
      logger, []string{},
      collector.WithElasticsearchURL(targetURL),
      collector.WithHTTPClient(probeClient),
  )

  This creates collectors registered via registerCollector(), including:
  - cluster-info ✅ (default enabled) - AFFECTED
  - data-stream (default disabled)
  - snapshots (default disabled)
  - Other optional collectors...
- Directly instantiated (lines 344-356):

  reg.MustRegister(collector.NewClusterHealth(logger, probeClient, targetURL))
  reg.MustRegister(collector.NewNodes(logger, probeClient, targetURL, *esAllNodes, *esNode))
  // ... etc

  These are created fresh each time - NOT AFFECTED
The cluster-info collector is the only default-enabled collector that goes through the caching mechanism, which is why only elasticsearch_version shows incorrect data.
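To illustrate the asymmetry, here is a simplified sketch of the registration path described above (identifiers follow the snippets in this issue; the exporter's real registerCollector signature may differ). Only collectors registered through this path go through the global cache, and cluster-info is the only one of them that is enabled by default:

package collector

import (
    "context"
    "log/slog"
    "net/http"
    "net/url"

    "github.com/prometheus/client_golang/prometheus"
)

// Collector is the Update-style interface implemented by factory-created collectors.
type Collector interface {
    Update(ctx context.Context, ch chan<- prometheus.Metric) error
}

type factoryFunc func(logger *slog.Logger, u *url.URL, hc *http.Client) (Collector, error)

var (
    factories      = map[string]factoryFunc{}
    collectorState = map[string]bool{} // collector name -> enabled by default
)

// registerCollector is called from each collector's init(); cluster-info registers
// itself as enabled-by-default, so it is the only cached collector active in /probe mode.
func registerCollector(name string, enabledByDefault bool, factory factoryFunc) {
    collectorState[name] = enabledByDefault
    factories[name] = factory
}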
Technical Details
The ClusterInfoCollector struct stores the target URL and HTTP client:
// collector/cluster_info.go
type ClusterInfoCollector struct {
    logger *slog.Logger
    u      *url.URL     // ⚠️ Bound to first target
    hc     *http.Client // ⚠️ Bound to first target's client
}

func (c *ClusterInfoCollector) Update(_ context.Context, ch chan<- prometheus.Metric) error {
    resp, err := c.hc.Get(c.u.String()) // ⚠️ Always queries first target
    // ...
}

When the first probe request creates ClusterInfoCollector, it is initialized with target1:9200. This instance is cached in initiatedCollectors["cluster-info"]. All subsequent probe requests reuse this same instance, so they all query target1:9200.
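To make the failure mode concrete, here is a minimal standalone sketch (not the exporter's code) of the same pattern: the cached object keeps the target it was constructed with, so every later lookup silently returns the first target.

package main

import (
    "fmt"
    "sync"
)

// clusterInfo stands in for ClusterInfoCollector: the target is fixed at construction.
type clusterInfo struct {
    target string
}

var (
    mu    sync.Mutex
    cache = map[string]*clusterInfo{} // plays the role of initiatedCollectors
)

func getCollector(target string) *clusterInfo {
    mu.Lock()
    defer mu.Unlock()
    if c, ok := cache["cluster-info"]; ok {
        return c // cached instance still points at the first target
    }
    c := &clusterInfo{target: target}
    cache["cluster-info"] = c
    return c
}

func main() {
    fmt.Println(getCollector("es-cluster-1:9200").target) // es-cluster-1:9200
    fmt.Println(getCollector("es-cluster-2:9200").target) // still es-cluster-1:9200
}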
Impact
- Severity: Medium - affects only the elasticsearch_version metric in multi-target mode
- Scope: Only the /probe endpoint is affected; single-target mode (using the --es.uri flag) works correctly
- Users monitoring multiple Elasticsearch clusters via /probe will see incorrect version information
- Other metrics (cluster health, nodes, indices, etc.) are NOT affected because they use directly instantiated collectors
Proposed Solutions
Solution 1: Add Option to Skip Caching (Recommended)
Add a flag to NewElasticsearchCollector to bypass the cache for probe requests:
// collector/collector.go
type ElasticsearchCollector struct {
    Collectors map[string]Collector
    logger     *slog.Logger
    esURL      *url.URL
    httpClient *http.Client
    skipCache  bool // Add this field
}

func WithSkipCache(skip bool) Option {
    return func(e *ElasticsearchCollector) error {
        e.skipCache = skip
        return nil
    }
}

func NewElasticsearchCollector(...) (*ElasticsearchCollector, error) {
    // ...
    for key, enabled := range collectorState {
        // ...
        // Only use the cache if not skipping
        if !e.skipCache {
            if collector, ok := initiatedCollectors[key]; ok {
                collectors[key] = collector
                continue
            }
        }
        // Always create a new collector if skipCache is true
        collector, err := factories[key](logger, e.esURL, e.httpClient)
        // ...
        collectors[key] = collector
        if !e.skipCache {
            initiatedCollectors[key] = collector
        }
    }
    // ...
}

Then in main.go for /probe:
exp, err := collector.NewElasticsearchCollector(
    logger, []string{},
    collector.WithElasticsearchURL(targetURL),
    collector.WithHTTPClient(probeClient),
    collector.WithSkipCache(true), // Add this
)

Solution 2: Use Target URL as Part of Cache Key
Modify the cache key to include the target URL:
cacheKey := fmt.Sprintf("%s:%s", key, e.esURL.String())
if collector, ok := initiatedCollectors[cacheKey]; ok {
    collectors[key] = collector
}

However, this could lead to unbounded cache growth.
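For concreteness, a fuller sketch of how Solution 2 might look inside the NewElasticsearchCollector loop (hypothetical; identifiers follow the snippets above). Both the lookup and the store use the target-qualified key, and since nothing is ever evicted, each distinct target URL adds a permanent cache entry:

cacheKey := fmt.Sprintf("%s:%s", key, e.esURL.String())

initiatedCollectorsMtx.Lock()
if cached, ok := initiatedCollectors[cacheKey]; ok {
    // Reuse the instance built for this specific target.
    collectors[key] = cached
    initiatedCollectorsMtx.Unlock()
    continue
}
collector, err := factories[key](logger, e.esURL, e.httpClient)
if err != nil {
    initiatedCollectorsMtx.Unlock()
    return nil, err
}
collectors[key] = collector
initiatedCollectors[cacheKey] = collector // one entry per (collector, target) pair
initiatedCollectorsMtx.Unlock()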
Solution 3: Disable cluster-info for Probe Mode
Explicitly disable the cluster-info collector for probe requests, since it's redundant when other collectors provide the necessary metrics.
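A sketch of what Solution 3 could look like in the /probe handler, based on the main.go snippets quoted above (the exact set of registrations in the current code may differ): drop the NewElasticsearchCollector call so no factory-created, globally cached collector participates, and keep only the per-request collectors.

// Inside the /probe handler, after parsing targetURL and building probeClient:
reg := prometheus.NewRegistry()

// Directly instantiated collectors are created fresh per request and stay correct.
reg.MustRegister(collector.NewClusterHealth(logger, probeClient, targetURL))
reg.MustRegister(collector.NewNodes(logger, probeClient, targetURL, *esAllNodes, *esNode))
// ... the remaining directly instantiated collectors, unchanged ...

// NewElasticsearchCollector is intentionally not called here, so cluster-info
// (and elasticsearch_version) is simply not exported in probe mode.
promhttp.HandlerFor(reg, promhttp.HandlerOpts{}).ServeHTTP(w, r)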
Environment
- Version: master branch (as of December 2024)
- Affected Feature: Multi-target monitoring using the /probe endpoint
- Affected Metric: elasticsearch_version
Workaround
As a temporary workaround, users can:
- Restart the exporter between scraping different targets (not practical)
- Run multiple exporter instances, one per target (defeats the purpose of /probe)
- Use single-target mode with separate exporter instances
Additional Notes
This issue highlights a design assumption that collectors would only be used for a single target. The caching optimization works well for single-target mode but breaks multi-target functionality. The fix should maintain backward compatibility and performance for single-target mode while properly supporting probe mode.
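A hypothetical regression-test sketch for whichever fix lands (this assumes Solution 1's proposed WithSkipCache option and that the collector API matches the snippets above, e.g. that WithElasticsearchURL takes a *url.URL; it is not code from the repository). Two fake Elasticsearch backends report different versions, and each probe-style collection should export the version of its own target:

package collector_test

import (
    "fmt"
    "io"
    "log/slog"
    "net/http"
    "net/http/httptest"
    "net/url"
    "testing"

    "github.com/prometheus-community/elasticsearch_exporter/collector"
    "github.com/prometheus/client_golang/prometheus"
)

// fakeES serves a minimal cluster-info JSON body with the given version number.
func fakeES(version string) *httptest.Server {
    return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprintf(w, `{"cluster_name":"test","version":{"number":"%s"}}`, version)
    }))
}

func TestProbeVersionNotCachedAcrossTargets(t *testing.T) {
    logger := slog.New(slog.NewTextHandler(io.Discard, nil))

    for _, version := range []string{"7.17.0", "8.10.0"} {
        srv := fakeES(version)
        defer srv.Close()
        u, _ := url.Parse(srv.URL)

        exp, err := collector.NewElasticsearchCollector(
            logger, []string{},
            collector.WithElasticsearchURL(u),
            collector.WithHTTPClient(srv.Client()),
            collector.WithSkipCache(true), // proposed option from Solution 1
        )
        if err != nil {
            t.Fatal(err)
        }

        reg := prometheus.NewRegistry()
        reg.MustRegister(exp)
        families, err := reg.Gather()
        if err != nil {
            t.Fatal(err)
        }

        // The elasticsearch_version metric must carry this target's version label.
        found := false
        for _, mf := range families {
            if mf.GetName() != "elasticsearch_version" {
                continue
            }
            for _, m := range mf.GetMetric() {
                for _, l := range m.GetLabel() {
                    if l.GetName() == "version" && l.GetValue() == version {
                        found = true
                    }
                }
            }
        }
        if !found {
            t.Errorf("expected elasticsearch_version with version=%q for target %s", version, srv.URL)
        }
    }
}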