Summary
Add topology/availability zone (AZ) awareness to the Target Allocator's consistent-hashing and least-weighted allocation strategies, so that targets are preferentially assigned to collectors running in the same AZ. This minimizes cross-AZ network traffic and cloud data transfer costs during Prometheus scraping.
Motivation
In multi-AZ Kubernetes clusters on AWS/GCP/Azure:
- Cross-AZ data transfer is billed ($0.01-0.02/GB in both directions)
- The Target Allocator currently assigns targets without considering topology: a collector in us-east-1a may scrape targets in us-east-1c
- At scale (thousands of targets, 15-30s scrape intervals), cross-AZ traffic accumulates significantly
- The per-node strategy avoids this but requires DaemonSet mode and cannot balance load
Proposed Design
Approach: Extend existing strategies with built-in AZ awareness
Rather than creating new strategies or a decorator pattern, add zone-aware logic directly into consistent-hashing and least-weighted. Activated via a new topology config section:
    allocation_strategy: "least-weighted"
    topology:
      zone_aware: true
      zone_label: "topology.kubernetes.io/zone" # default
When zone_aware: false (default), behavior is identical to today. Zero breaking changes.
Algorithm
    GetCollectorForTarget(collectors, target):
      target_zone = target.Labels.Get(target_zone_label)
      IF target_zone != "" AND same_zone_collectors_exist(target_zone):
        return inner_strategy(same_zone_collectors_only, target)
      ELSE:
        return inner_strategy(all_collectors, target) # FAILOVER
- consistent-hashing: Maintains per-zone hash rings + one global ring. Same-zone → zone ring. Failover → global ring.
- least-weighted: Maintains collectorsPerZone index. Same-zone → pick least-loaded in zone. Failover → pick least-loaded globally.
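The least-weighted variant of the algorithm above could look roughly like the following Go sketch. The Collector type, its NumTargets field, and the helper names are hypothetical stand-ins for illustration, not the Target Allocator's actual types; ties are broken by name here only to keep the result deterministic.

```go
package main

// Collector is a simplified, hypothetical stand-in for the allocator's
// collector type. Zone corresponds to the proposed Collector.Zone field.
type Collector struct {
	Name       string
	Zone       string
	NumTargets int
}

// leastWeighted returns the collector with the fewest targets, breaking
// ties by name. It returns nil for an empty slice.
func leastWeighted(cs []*Collector) *Collector {
	var best *Collector
	for _, c := range cs {
		if best == nil || c.NumTargets < best.NumTargets ||
			(c.NumTargets == best.NumTargets && c.Name < best.Name) {
			best = c
		}
	}
	return best
}

// pickCollector prefers collectors in the target's zone and fails over to
// the full set when the zone is unknown or has no collectors.
func pickCollector(all []*Collector, targetZone string) *Collector {
	if targetZone != "" {
		var sameZone []*Collector
		for _, c := range all {
			if c.Zone == targetZone {
				sameZone = append(sameZone, c)
			}
		}
		if len(sameZone) > 0 {
			return leastWeighted(sameZone)
		}
	}
	return leastWeighted(all) // failover: least-loaded globally
}
```

This sketch filters the collector list on every call; the collectorsPerZone index described above would avoid that scan by keeping the per-zone lists up to date as collectors come and go.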
Zone Detection
Collectors: NodeZoneResolver watches K8s Node objects, maps pod.Spec.NodeName → topology.kubernetes.io/zone label. Populates new Collector.Zone field.
Targets: Read from __meta_kubernetes_endpointslice_endpoint_zone Prometheus SD label (available since Prometheus 2.48). Falls back to node-based resolution.
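As a minimal sketch of the target-side lookup, assuming target labels are available as a plain map and that the node-based fallback reads the pod's node name from the __meta_kubernetes_pod_node_name SD label (that choice of label, and the nodeZones map standing in for NodeZoneResolver output, are assumptions made for illustration):

```go
package main

const (
	// SD label named in this proposal for the endpoint's zone.
	endpointZoneLabel = "__meta_kubernetes_endpointslice_endpoint_zone"
	// Assumed label for the fallback path; the proposal does not say
	// which label node-based resolution would read.
	podNodeNameLabel = "__meta_kubernetes_pod_node_name"
)

// resolveTargetZone returns the target's zone, or "" when it cannot be
// determined. nodeZones maps node name to zone, as a NodeZoneResolver
// might maintain it from the topology.kubernetes.io/zone node label.
func resolveTargetZone(labels, nodeZones map[string]string) string {
	if z := labels[endpointZoneLabel]; z != "" {
		return z
	}
	if node := labels[podNodeNameLabel]; node != "" {
		return nodeZones[node] // "" if the node is unknown or unlabeled
	}
	return ""
}
```

An empty result corresponds to the "targets without zone info" case, which falls through to global distribution.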
Failover
- If target's AZ has no collectors → distribute across ALL collectors using base strategy
- Targets without zone info → global distribution (no zone preference)
- Clusters without zone labels → zone awareness is a no-op (graceful degradation)
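For consistent-hashing, the same failover rules could be expressed with one ring per zone plus a global ring. The sketch below is illustrative only: it uses FNV-1a with a single point per collector and no virtual nodes, which is not necessarily how the existing strategy hashes, and all type and function names are hypothetical.

```go
package main

import (
	"hash/fnv"
	"sort"
)

func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// ring is a minimal hash ring with one point per collector.
type ring struct {
	hashes []uint32          // sorted ring positions
	owners map[uint32]string // position -> collector name
}

func newRing(collectors []string) *ring {
	r := &ring{owners: map[uint32]string{}}
	for _, c := range collectors {
		h := hashKey(c)
		r.hashes = append(r.hashes, h)
		r.owners[h] = c
	}
	sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
	return r
}

// get returns the first collector at or after the key's position,
// wrapping around to the start of the ring.
func (r *ring) get(key string) string {
	if len(r.hashes) == 0 {
		return ""
	}
	h := hashKey(key)
	i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
	if i == len(r.hashes) {
		i = 0
	}
	return r.owners[r.hashes[i]]
}

// zoneRings holds one ring per zone and a global ring for failover.
type zoneRings struct {
	global *ring
	byZone map[string]*ring
}

// newZoneRings builds the rings from a collector name -> zone map.
func newZoneRings(collectorZones map[string]string) *zoneRings {
	var all []string
	perZone := map[string][]string{}
	for name, zone := range collectorZones {
		all = append(all, name)
		if zone != "" {
			perZone[zone] = append(perZone[zone], name)
		}
	}
	zr := &zoneRings{global: newRing(all), byZone: map[string]*ring{}}
	for zone, names := range perZone {
		zr.byZone[zone] = newRing(names)
	}
	return zr
}

// assign returns the chosen collector and whether the global ring was used
// (target zone unknown, or no collectors in that zone).
func (zr *zoneRings) assign(targetKey, targetZone string) (string, bool) {
	if r, ok := zr.byZone[targetZone]; ok {
		return r.get(targetKey), false
	}
	return zr.global.get(targetKey), true
}
```

When no collector carries a zone, byZone stays empty and every target goes through the global ring, which matches the no-op behavior described for clusters without zone labels.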
Observability
New metrics:
opentelemetry_allocator_zone_cross_zone_assignments — targets assigned to different AZ
opentelemetry_allocator_zone_uncovered_count — zones with targets but no collectors
opentelemetry_allocator_zone_collector_count{zone} — collectors per AZ
opentelemetry_allocator_zone_target_count{zone} — targets per AZ
New API endpoint: GET /zones — returns zone topology snapshot (collector/target distribution, uncovered zones, balance ratio)
Warnings are logged when zone coverage gaps are detected.
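One way the values behind these metrics and the /zones snapshot might be derived from the current assignments is sketched below. The assignment and zoneSnapshot types are hypothetical, and the sketch assumes that targets with no known zone are excluded from the cross-zone count, which the proposal does not specify.

```go
package main

import "sort"

// assignment records the zone of a target and of the collector it was
// assigned to. Both types in this sketch are illustrative only.
type assignment struct {
	TargetZone    string
	CollectorZone string
}

type zoneSnapshot struct {
	CrossZoneAssignments int            // targets scraped from another AZ
	UncoveredZones       []string       // zones with targets but no collectors, sorted
	CollectorsPerZone    map[string]int // collectors per AZ
	TargetsPerZone       map[string]int // targets per AZ
}

// buildSnapshot aggregates per-zone counts from the collectors' zones and
// the current assignments. Targets with an unknown zone are skipped.
func buildSnapshot(collectorZones []string, assignments []assignment) zoneSnapshot {
	s := zoneSnapshot{
		CollectorsPerZone: map[string]int{},
		TargetsPerZone:    map[string]int{},
	}
	for _, z := range collectorZones {
		if z != "" {
			s.CollectorsPerZone[z]++
		}
	}
	for _, a := range assignments {
		if a.TargetZone == "" {
			continue
		}
		s.TargetsPerZone[a.TargetZone]++
		if a.CollectorZone != a.TargetZone {
			s.CrossZoneAssignments++
		}
	}
	for z := range s.TargetsPerZone {
		if s.CollectorsPerZone[z] == 0 {
			s.UncoveredZones = append(s.UncoveredZones, z)
		}
	}
	sort.Strings(s.UncoveredZones)
	return s
}
```

In this sketch, len(UncoveredZones) would feed zone_uncovered_count and the two maps would feed the per-zone gauges.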
Example Scenarios
4 collectors, 3 AZs (1 in az-a, 1 in az-b, 2 in az-c):
- az-a targets → collector-0
- az-b targets → collector-1
- az-c targets → collector-2 + collector-3 (split 50/50 by inner strategy)
2 collectors, 3 AZs (1 in az-a, 1 in az-b, 0 in az-c):
- az-a targets → collector-0 (same-zone)
- az-b targets → collector-1 (same-zone)
- az-c targets → FAILOVER spread across collector-0 + collector-1
- Metric: zone_uncovered_count=1, log warning emitted
Implementation Plan
- Config + Collector.Zone field + NodeZoneResolver — adds zone field, no behavior change
- ZoneTopology map + metrics — builds topology, records metrics
- Zone-aware consistent-hashing — per-zone hash rings with failover
- Zone-aware least-weighted — per-zone collector index with failover
- API /zones endpoint + HTML views — read-only topology endpoints
- Integration tests + documentation
Kubernetes Alignment
- Uses standard topology.kubernetes.io/zone label (stable since K8s 1.17)
- Leverages EndpointSlice zone field (K8s 1.21+)
- Mirrors Kubernetes topology-aware routing hints pattern (KEP-2433)
- Complements topologySpreadConstraints on collector StatefulSet
RBAC
Requires nodes watch permission (cluster-scoped) for the zone resolver:
    - apiGroups: [""]
      resources: ["nodes"]
      verbs: ["get", "list", "watch"]
Alternatives Considered
- Decorator/wrapper pattern: rejected for complexity; integrating into the strategies is cleaner
- New strategy names (e.g., zone-consistent-hashing): rejected; every strategy/zone-mode pair would need its own name, a combinatorial explosion
- Prometheus relabel_configs: works for standalone Prometheus but doesn't solve TA-managed allocation
- Extend loadbalancingexporter: solves a different problem (export routing, not scrape assignment)
Open Questions
- Should zone info be configurable via config/annotations as an alternative to node label lookup (for restricted RBAC environments)?
/kind feature
/area target-allocator