Proposal: Availability Zone aware target allocation for consistent-hashing and least-weighted strategies #5156

@szibis

Summary

Add topology/availability zone (AZ) awareness to the Target Allocator's consistent-hashing and least-weighted allocation strategies, so that targets are preferentially assigned to collectors running in the same AZ. This minimizes cross-AZ network traffic and cloud data transfer costs during Prometheus scraping.

Motivation

In multi-AZ Kubernetes clusters on AWS/GCP/Azure:

  • Cross-AZ data transfer is billed ($0.01-0.02/GB in both directions)
  • The Target Allocator currently assigns targets without considering topology — a collector in us-east-1a may scrape targets in us-east-1c
  • At scale (thousands of targets, 15-30s scrape intervals), cross-AZ traffic accumulates significantly
  • The per-node strategy avoids this but requires DaemonSet mode and cannot balance load

Proposed Design

Approach: Extend existing strategies with built-in AZ awareness

Rather than creating new strategies or a decorator pattern, add zone-aware logic directly into consistent-hashing and least-weighted. Activated via a new topology config section:

allocation_strategy: "least-weighted"
topology:
  zone_aware: true
  zone_label: "topology.kubernetes.io/zone"  # default

When zone_aware: false (default), behavior is identical to today. Zero breaking changes.
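As a rough illustration of how the new section could be represented, here is a minimal Go sketch. The type name `TopologyConfig`, the `withDefaults` helper, and the struct tags are assumptions made for this example; they are not taken from the existing Target Allocator code.

```go
package main

import "fmt"

// TopologyConfig mirrors the proposed `topology` config section.
// The type and field names are hypothetical.
type TopologyConfig struct {
	ZoneAware bool   `yaml:"zone_aware"`
	ZoneLabel string `yaml:"zone_label"`
}

// defaultZoneLabel is the default stated in the proposal.
const defaultZoneLabel = "topology.kubernetes.io/zone"

// withDefaults returns a copy with the zone label filled in when unset.
// ZoneAware keeps its zero value (false), so the feature stays off by default.
func (c TopologyConfig) withDefaults() TopologyConfig {
	if c.ZoneLabel == "" {
		c.ZoneLabel = defaultZoneLabel
	}
	return c
}

func main() {
	cfg := TopologyConfig{ZoneAware: true}.withDefaults()
	fmt.Printf("zone_aware=%v zone_label=%s\n", cfg.ZoneAware, cfg.ZoneLabel)
}
```

An empty config therefore leaves zone awareness disabled, which matches the "identical to today" default.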

Algorithm

GetCollectorForTarget(collectors, target):
  target_zone = target.Labels.Get(target_zone_label)

  IF target_zone != "" AND same_zone_collectors_exist(target_zone):
    return inner_strategy(same_zone_collectors_only, target)
  ELSE:
    return inner_strategy(all_collectors, target)  # FAILOVER

  • consistent-hashing: Maintains per-zone hash rings + one global ring. Same-zone → zone ring. Failover → global ring.
  • least-weighted: Maintains collectorsPerZone index. Same-zone → pick least-loaded in zone. Failover → pick least-loaded globally.
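The pseudocode above could look roughly like this in Go for the least-weighted case. This is a simplified sketch: the `Collector` struct, `leastWeighted`, and `pickCollector` are stand-ins invented for the example rather than the allocator's real types, and ties are broken by name only to keep the output deterministic.

```go
package main

import "fmt"

// Collector is a simplified, hypothetical stand-in for the allocator's collector type.
type Collector struct {
	Name       string
	Zone       string
	NumTargets int
}

// leastWeighted returns the candidate with the fewest targets.
// Ties are broken by name so the result does not depend on slice order.
func leastWeighted(cands []*Collector) *Collector {
	var best *Collector
	for _, c := range cands {
		if best == nil || c.NumTargets < best.NumTargets ||
			(c.NumTargets == best.NumTargets && c.Name < best.Name) {
			best = c
		}
	}
	return best
}

// pickCollector prefers collectors in the target's zone and fails over
// to all collectors when the zone is unknown or has no collectors.
func pickCollector(all []*Collector, targetZone string) *Collector {
	if targetZone != "" {
		var sameZone []*Collector
		for _, c := range all {
			if c.Zone == targetZone {
				sameZone = append(sameZone, c)
			}
		}
		if len(sameZone) > 0 {
			return leastWeighted(sameZone)
		}
	}
	return leastWeighted(all) // failover: global selection
}

func main() {
	collectors := []*Collector{
		{Name: "collector-0", Zone: "az-a"},
		{Name: "collector-1", Zone: "az-b"},
	}
	for _, zone := range []string{"az-a", "az-b", "az-c", "az-c", ""} {
		c := pickCollector(collectors, zone)
		c.NumTargets++
		fmt.Printf("target zone %q -> %s\n", zone, c.Name)
	}
}
```

A real implementation would presumably keep the per-zone index up to date as collectors come and go instead of filtering on every call; the filter here just keeps the sketch short.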

Zone Detection

Collectors: NodeZoneResolver watches K8s Node objects, maps pod.Spec.NodeName → topology.kubernetes.io/zone label. Populates new Collector.Zone field.

Targets: Read from __meta_kubernetes_endpointslice_endpoint_zone Prometheus SD label (available since Prometheus 2.48). Falls back to node-based resolution.
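A minimal sketch of the two lookups, using plain maps in place of a real Node watcher. The helper names (`buildNodeZones`, `collectorZone`, `targetZone`) are hypothetical, and using `__meta_kubernetes_pod_node_name` for the node-based fallback is an assumption; the proposal does not say which label the fallback would read.

```go
package main

import "fmt"

const (
	// Target-side zone label named in the proposal.
	targetZoneLabel = "__meta_kubernetes_endpointslice_endpoint_zone"
	// Assumption: label used to find the target's node for the fallback.
	targetNodeLabel = "__meta_kubernetes_pod_node_name"
)

// NodeZones maps node name -> zone, as a resolver watching Node objects
// might maintain it. Hypothetical type.
type NodeZones map[string]string

// buildNodeZones extracts the zone label from each node's labels.
// Nodes without the label are left out of the map.
func buildNodeZones(nodeLabels map[string]map[string]string, zoneLabel string) NodeZones {
	zones := NodeZones{}
	for node, labels := range nodeLabels {
		if z := labels[zoneLabel]; z != "" {
			zones[node] = z
		}
	}
	return zones
}

// collectorZone resolves a collector's zone from the node its pod runs on.
func collectorZone(nodes NodeZones, nodeName string) string {
	return nodes[nodeName]
}

// targetZone prefers the EndpointSlice zone label and falls back to
// node-based resolution. An empty result means "no zone info".
func targetZone(nodes NodeZones, labels map[string]string) string {
	if z := labels[targetZoneLabel]; z != "" {
		return z
	}
	return nodes[labels[targetNodeLabel]]
}

func main() {
	nodes := buildNodeZones(map[string]map[string]string{
		"node-1": {"topology.kubernetes.io/zone": "az-a"},
		"node-2": {"topology.kubernetes.io/zone": "az-b"},
	}, "topology.kubernetes.io/zone")

	fmt.Println(collectorZone(nodes, "node-1"))
	fmt.Println(targetZone(nodes, map[string]string{targetNodeLabel: "node-2"}))
}
```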

Failover

  • If target's AZ has no collectors → distribute across ALL collectors using base strategy
  • Targets without zone info → global distribution (no zone preference)
  • Clusters without zone labels → zone awareness is a no-op (graceful degradation)

Observability

New metrics:

  • opentelemetry_allocator_zone_cross_zone_assignments — targets assigned to a collector in a different AZ
  • opentelemetry_allocator_zone_uncovered_count — zones with targets but no collectors
  • opentelemetry_allocator_zone_collector_count{zone} — collectors per AZ
  • opentelemetry_allocator_zone_target_count{zone} — targets per AZ

New API endpoint: GET /zones — returns zone topology snapshot (collector/target distribution, uncovered zones, balance ratio)

Log warnings when zone coverage gaps are detected.
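To make the metric definitions concrete, the sketch below derives the four values from a list of assignments. `Assignment`, `ZoneSnapshot`, and `summarize` are names invented for this example, and treating targets without zone info as neither cross-zone nor counted per zone is an assumption; the proposal does not specify that case.

```go
package main

import (
	"fmt"
	"sort"
)

// Assignment records the zone of a target and of the collector it was given to.
// Hypothetical type for this sketch.
type Assignment struct {
	TargetZone    string
	CollectorZone string
}

// ZoneSnapshot holds the values the proposed metrics would report.
type ZoneSnapshot struct {
	CollectorCount       map[string]int // collectors per zone
	TargetCount          map[string]int // targets per zone
	UncoveredZones       []string       // zones with targets but no collectors
	CrossZoneAssignments int            // targets scraped from another zone
}

// summarize computes the snapshot from collector zones and assignments.
// Assumption: targets with no zone info are skipped entirely.
func summarize(collectorZones []string, assignments []Assignment) ZoneSnapshot {
	s := ZoneSnapshot{CollectorCount: map[string]int{}, TargetCount: map[string]int{}}
	for _, z := range collectorZones {
		if z != "" {
			s.CollectorCount[z]++
		}
	}
	for _, a := range assignments {
		if a.TargetZone == "" {
			continue
		}
		s.TargetCount[a.TargetZone]++
		if a.CollectorZone != a.TargetZone {
			s.CrossZoneAssignments++
		}
	}
	for z := range s.TargetCount {
		if s.CollectorCount[z] == 0 {
			s.UncoveredZones = append(s.UncoveredZones, z)
		}
	}
	sort.Strings(s.UncoveredZones) // stable order for logs and API output
	return s
}

func main() {
	snap := summarize(
		[]string{"az-a", "az-b"},
		[]Assignment{
			{TargetZone: "az-a", CollectorZone: "az-a"},
			{TargetZone: "az-c", CollectorZone: "az-a"},
			{TargetZone: "az-c", CollectorZone: "az-b"},
		},
	)
	fmt.Printf("cross-zone=%d uncovered=%v\n", snap.CrossZoneAssignments, snap.UncoveredZones)
}
```

The length of `UncoveredZones` would correspond to `zone_uncovered_count`, and a non-empty list is the natural trigger for the log warning.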

Example Scenarios

4 collectors, 3 AZs (1 in az-a, 1 in az-b, 2 in az-c):

  • az-a targets → collector-0
  • az-b targets → collector-1
  • az-c targets → collector-2 + collector-3 (split 50/50 by inner strategy)

2 collectors, 3 AZs (1 in az-a, 1 in az-b, 0 in az-c):

  • az-a targets → collector-0 (same-zone)
  • az-b targets → collector-1 (same-zone)
  • az-c targets → FAILOVER spread across collector-0 + collector-1
  • Metric: zone_uncovered_count=1, log warning emitted
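The second scenario can be replayed against a toy version of the per-zone rings. This sketch uses a simple sorted-point ring with FNV-1a hashing and virtual nodes, which is not necessarily the hashing library the Target Allocator uses; `ring`, `zoneRings`, and their methods are hypothetical names.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// ring is a minimal consistent-hash ring with virtual nodes.
type ring struct {
	points []uint32
	owner  map[uint32]string
}

func newRing(collectors []string, replicas int) *ring {
	r := &ring{owner: map[uint32]string{}}
	for _, c := range collectors {
		for i := 0; i < replicas; i++ {
			p := hashKey(fmt.Sprintf("%s#%d", c, i))
			if _, taken := r.owner[p]; taken {
				continue // hash collision: keep the first owner
			}
			r.owner[p] = c
			r.points = append(r.points, p)
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// get returns the owner of the first point at or after the key's hash,
// wrapping around to the start of the ring.
func (r *ring) get(key string) string {
	if len(r.points) == 0 {
		return ""
	}
	h := hashKey(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0
	}
	return r.owner[r.points[i]]
}

// zoneRings holds one ring per zone plus a global ring for failover.
type zoneRings struct {
	global  *ring
	perZone map[string]*ring
}

func newZoneRings(zoneOf map[string]string, replicas int) *zoneRings {
	var all []string
	byZone := map[string][]string{}
	for name := range zoneOf {
		all = append(all, name)
	}
	sort.Strings(all) // deterministic build order
	for _, name := range all {
		if z := zoneOf[name]; z != "" {
			byZone[z] = append(byZone[z], name)
		}
	}
	zr := &zoneRings{global: newRing(all, replicas), perZone: map[string]*ring{}}
	for z, names := range byZone {
		zr.perZone[z] = newRing(names, replicas)
	}
	return zr
}

// get uses the zone ring when the target's zone has collectors and the
// global ring otherwise. usedGlobal reports which path was taken.
func (z *zoneRings) get(target, targetZone string) (collector string, usedGlobal bool) {
	if r, ok := z.perZone[targetZone]; ok && targetZone != "" {
		return r.get(target), false
	}
	return z.global.get(target), true
}

func main() {
	zr := newZoneRings(map[string]string{"collector-0": "az-a", "collector-1": "az-b"}, 100)
	for _, zone := range []string{"az-a", "az-b", "az-c"} {
		c, global := zr.get("10.0.0.1:9090", zone)
		fmt.Printf("zone %s -> %s (global ring: %v)\n", zone, c, global)
	}
}
```

Because az-c targets go through the global ring, their assignment stays stable across allocator restarts, and adding a collector in az-c later only moves az-c targets onto the new zone ring.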

Implementation Plan

  1. Config + Collector.Zone field + NodeZoneResolver — adds zone field, no behavior change
  2. ZoneTopology map + metrics — builds topology, records metrics
  3. Zone-aware consistent-hashing — per-zone hash rings with failover
  4. Zone-aware least-weighted — per-zone collector index with failover
  5. API /zones endpoint + HTML views — read-only topology endpoints
  6. Integration tests + documentation

Kubernetes Alignment

  • Uses standard topology.kubernetes.io/zone label (stable since K8s 1.17)
  • Leverages EndpointSlice zone field (K8s 1.21+)
  • Mirrors Kubernetes topology-aware routing hints pattern (KEP-2433)
  • Complements topologySpreadConstraints on collector StatefulSet

RBAC

Requires nodes watch permission (cluster-scoped) for the zone resolver:

- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list", "watch"]

Alternatives Considered

  1. Decorator/wrapper pattern — rejected for complexity; integrating into strategies is cleaner
  2. New strategy names (e.g., zone-consistent-hashing) — rejected; combinatorial explosion
  3. Prometheus relabel_configs — works for standalone but doesn't solve TA-managed allocation
  4. Extend loadbalancingexporter — solves a different problem (export routing, not scrape assignment)

Open Questions

  1. Should zone info be configurable via config/annotations as an alternative to node label lookup (for restricted RBAC environments)?

/kind feature
/area target-allocator
