[exporter/loadbalancing] - k8s resolver fails to start new exporter for endpoint #45716

@fcoeguiguren

Component(s)

exporter/loadbalancing

What happened?

Description

The load balancing exporter fails to start exporters for the endpoints returned by the k8s resolver.
We first saw this when upgrading from v0.138 to v0.143.0. I tested every version in between, and the failure first appears in v0.139.0.

Tested on:
kind version 0.30.0
Kind Kubernetes version v1.34.0
AWS v1.32.9-eks-3025e55
GCP v1.33.5-gke.1956000

Steps to Reproduce

You will need two OpenTelemetry Collectors to reproduce the issue: a producer, which runs the loadbalancing exporter, and a consumer, which is the push-gateway StatefulSet it routes to.

Producer

apiVersion: v1
data:
  relay: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
            keepalive:
              server_parameters:
                max_connection_age: 1m0s
                max_connection_age_grace: 5m0s
                max_connection_idle: 1m0s
                time: 2h
                timeout: 20s
            max_recv_msg_size_mib: 8
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        send_batch_max_size: 1000
        send_batch_size: 500
        timeout: 200ms
      memory_limiter:
        check_interval: 1s
        limit_percentage: 75
        spike_limit_percentage: 15
    extensions:
      pprof:
        endpoint: 0.0.0.0:1777
      zpages:
        endpoint: 0.0.0.0:55679
      health_check:
        endpoint: ${env:MY_POD_IP}:13133
      file_storage/pushgateway:
        create_directory: true
        compaction:
          directory: /etc/otel/storage/pushgateway
          on_rebound: true
        directory: /etc/otel/storage/pushgateway
    exporters:
      debug/detailed:
        verbosity: detailed
      loadbalancing/pushgateways:
        routing_key: "streamID"
        protocol:
          otlp:
            compression: gzip
            retry_on_failure:
              enabled: true
              initial_interval: 5s
              max_elapsed_time: 300s
              max_interval: 30s
            sending_queue:
              enabled: true
              num_consumers: 10
              queue_size: 5000
              storage: file_storage/pushgateway
            timeout: 30s
            tls:
              insecure: true
        resolver:
          k8s:
            service: push-gateway.otel
    service:
      extensions:
        - file_storage/pushgateway
        - health_check
      pipelines:
        metrics/allin:
          receivers:
            - otlp
          processors:
            - memory_limiter
            - batch
          exporters:
            - loadbalancing/pushgateways
      telemetry:
        logs:
          level: DEBUG
        metrics:
          readers:
            - pull:
                exporter:
                  prometheus:
                    host: 0.0.0.0
                    port: 8888
kind: ConfigMap
metadata:
  name: entrypoint-cm
  namespace: otel
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/component: entrypoint
    app.kubernetes.io/name: entrypoint-collector
  name: entrypoint-collector
  namespace: otel
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 2
  selector:
    matchLabels:
      app.kubernetes.io/component: entrypoint
      app.kubernetes.io/name: entrypoint-collector
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "8888"
        prometheus.io/scrape: "true"
      labels:
        app.kubernetes.io/component: entrypoint
        app.kubernetes.io/name: entrypoint-collector
    spec:
      automountServiceAccountToken: true
      containers:
      - args:
        - --config=/conf/relay.yaml
        env:
        - name: K8S_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: KUBE_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: MY_POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        image: otel/opentelemetry-collector-contrib:0.144.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: 13133
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: opentelemetry-collector
        ports:
        - containerPort: 4317
          name: otlp
          protocol: TCP
        - containerPort: 4318
          name: otlp-http
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: 13133
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: 800m
            memory: 1Gi
          requests:
            cpu: 400m
            memory: 500Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - all
          readOnlyRootFilesystem: true
          runAsGroup: 1000
          runAsNonRoot: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/otel/storage
          name: entrypoint-collector-storage
        - mountPath: /conf
          name: entrypoint-collector-configmap
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      serviceAccount: entrypoint
      serviceAccountName: entrypoint
      securityContext:
        seccompProfile:
          type: RuntimeDefault
      terminationGracePeriodSeconds: 500
      volumes:
      - emptyDir: {}
        name: entrypoint-collector-storage
      - configMap:
          defaultMode: 420
          items:
          - key: relay
            path: relay.yaml
          name: entrypoint-cm
        name: entrypoint-collector-configmap
---
apiVersion: v1
kind: ServiceAccount
automountServiceAccountToken: false
metadata:
  labels:
    app.kubernetes.io/name: entrypoint
  name: entrypoint
  namespace: otel
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: loadbalancer-role
  namespace: otel
rules:
- apiGroups:
  - ""
  resources:
  - endpoints
  verbs:
  - list
  - watch
  - get
- apiGroups:
  - discovery.k8s.io
  resources:
  - endpointslices
  verbs:
  - get
  - list
  - watch

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: loadbalancer-rolebinding
  namespace: otel
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: loadbalancer-role
subjects:
- kind: ServiceAccount
  name: entrypoint
  namespace: otel

Consumer

apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app.kubernetes.io/component: push-gateway
    app.kubernetes.io/name: push-gateway
  name: push-gateway
  namespace: otel
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/component: push-gateway
      app.kubernetes.io/name: push-gateway
  serviceName: push-gateway
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "8888"
        prometheus.io/scrape: "true"
      labels:
        app.kubernetes.io/component: push-gateway
        app.kubernetes.io/name: push-gateway
    spec:
      automountServiceAccountToken: true
      containers:
      - args:
        - --config=/conf/relay.yaml
        env:
        - name: K8S_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: KUBE_NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: MY_POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        image: otel/opentelemetry-collector-contrib:latest
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: 13133
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: opentelemetry-collector
        ports:
        - containerPort: 4317
          name: tcp-grpc
          protocol: TCP
        - containerPort: 4318
          name: tcp-http
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: 13133
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: "2"
            memory: 2560Mi
          requests:
            cpu: "1"
            memory: 1536Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - all
          readOnlyRootFilesystem: true
          runAsGroup: 1000
          runAsNonRoot: true
        volumeMounts:
        - mountPath: /conf
          name: push-gateway-configmap
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      terminationGracePeriodSeconds: 500
      volumes:
      - configMap:
          defaultMode: 420
          items:
          - key: relay
            path: relay.yaml
          name: push-gateway-cm
        name: push-gateway-configmap
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: push-gateway
    app.kubernetes.io/name: push-gateway
  name: push-gateway
  namespace: otel
spec:
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: tcp-grpc
    port: 4317
    protocol: TCP
    targetPort: 4317
  - name: tcp-http
    port: 4318
    protocol: TCP
    targetPort: 4318
  selector:
    app.kubernetes.io/component: push-gateway
    app.kubernetes.io/name: push-gateway
  sessionAffinity: None
  type: ClusterIP
---
apiVersion: v1
data:
  relay: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
            keepalive:
              server_parameters:
                max_connection_age: 1m0s
                max_connection_age_grace: 5m0s
                max_connection_idle: 1m0s
                time: 2h
                timeout: 20s
            max_recv_msg_size_mib: 4
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        send_batch_max_size: 1000
        send_batch_size: 500
        timeout: 200ms
      memory_limiter:
        check_interval: 1s
        limit_percentage: 75
        spike_limit_percentage: 15
    extensions:
      pprof:
        endpoint: 0.0.0.0:1777
      zpages:
        endpoint: 0.0.0.0:55679
      health_check:
        endpoint: ${env:MY_POD_IP}:13133
    exporters:
      debug:
        verbosity: detailed
    service:
      extensions:
        - health_check
      pipelines:
        metrics/otlpin:
          receivers:
            - otlp
          processors:
            - memory_limiter
            - batch
          exporters:
            - debug
      telemetry:
        logs:
          level: DEBUG
        metrics:
          readers:
            - pull:
                exporter:
                  prometheus:
                    host: 0.0.0.0
                    port: 8888
kind: ConfigMap
metadata:
  name: push-gateway-cm
  namespace: otel

Expected Result

The load balancing exporter resolves the endpoints, as it does on v0.138.0:

2026-01-29T04:32:21.920Z	debug	loadbalancingexporter@v0.138.0/resolver_k8s.go:148	creating and starting endpoints informer	{"resource": {"service.instance.id": "6941dd4a-8639-4c6f-a098-5d2f79830c46", "service.name": "otelcol-contrib", "service.version": "0.138.0"}, "otelcol.component.id": "loadbalancing/pushgateways", "otelcol.component.kind": "exporter", "otelcol.signal": "metrics", "resolver": "k8s service"}

Actual Results

2026-01-29T04:33:04.217Z	error	loadbalancingexporter@v0.144.0/loadbalancer.go:183	failed to start new exporter for endpoint	{"resource": {"service.instance.id": "967735dc-a76f-48a5-80ae-6b68a8f9c663", "service.name": "otelcol-contrib", "service.version": "0.144.0"}, "otelcol.component.id": "loadbalancing/pushgateways", "otelcol.component.kind": "exporter", "otelcol.signal": "metrics", "endpoint": "10.244.0.9:4317", "error": "timeout"}
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter.(*loadBalancer).addMissingExporters
	github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter@v0.144.0/loadbalancer.go:183
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter.(*loadBalancer).onBackendChanges
	github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter@v0.144.0/loadbalancer.go:166
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter.(*k8sResolver).resolve
	github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter@v0.144.0/resolver_k8s.go:247
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter.handler.OnAdd
	github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter@v0.144.0/resolver_k8s_handler.go:57
k8s.io/client-go/tools/cache.(*processorListener).run.func1
	k8s.io/client-go@v0.34.3/tools/cache/shared_informer.go:1067
k8s.io/client-go/tools/cache.(*processorListener).run
	k8s.io/client-go@v0.34.3/tools/cache/shared_informer.go:1077
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1
	k8s.io/apimachinery@v0.35.0-alpha.0/pkg/util/wait/wait.go:72
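
For anyone reproducing this, one way to rule out a plain network problem is to check that the endpoint named in the error accepts TCP connections from inside the cluster. The following is a minimal sketch, not part of the original report: the helper name `can_connect` is hypothetical, and it assumes you can run Python from a pod in the `otel` namespace (the collector image itself does not ship Python). It only tests TCP reachability and says nothing about the exporter's own startup path.

```python
import socket


def can_connect(host: str, port: int, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

For example, `can_connect("10.244.0.9", 4317)` probes the endpoint from the log above. Pod IPs change between runs, so substitute the address from your own error line.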

Collector version

v0.144.0

Environment information

Environment

Test 1
kind version 0.30.0
Kind Kubernetes version v1.34.0
Test 2
AWS v1.32.9-eks-3025e55
Test 3
GCP v1.33.5-gke.1956000

OpenTelemetry Collector configuration

Log output

2026-01-29T03:30:13.810Z	error	loadbalancingexporter@v0.144.0/loadbalancer.go:183	failed to start new exporter for endpoint	{"resource": {"service.instance.id": "3f5b9f59-c5c0-43d7-a5fe-110d451ccc82", "service.name": "otelcol-contrib", "service.version": "0.144.0"}, "otelcol.component.id": "loadbalancing/pushgateways", "otelcol.component.kind": "exporter", "otelcol.signal": "metrics", "endpoint": "10.244.0.6:4317", "error": "timeout"}
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter.(*loadBalancer).addMissingExporters
	github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter@v0.144.0/loadbalancer.go:183
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter.(*loadBalancer).onBackendChanges
	github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter@v0.144.0/loadbalancer.go:166
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter.(*k8sResolver).resolve
	github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter@v0.144.0/resolver_k8s.go:247
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter.handler.OnAdd
	github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter@v0.144.0/resolver_k8s_handler.go:57
k8s.io/client-go/tools/cache.(*processorListener).run.func1
	k8s.io/client-go@v0.34.3/tools/cache/shared_informer.go:1067
k8s.io/client-go/tools/cache.(*processorListener).run
	k8s.io/client-go@v0.34.3/tools/cache/shared_informer.go:1077
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1
	k8s.io/apimachinery@v0.35.0-alpha.0/pkg/util/wait/wait.go:72
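
When comparing versions, it can help to pull just the failing endpoints out of the producer's logs. The sketch below is an illustration only: the function name `parse_failed_endpoint` is hypothetical, and it assumes the tab-separated console format shown above (timestamp, level, caller, message, JSON attributes). It would need adjusting if the collector is configured for JSON log encoding.

```python
import json
from typing import Optional

# Message text copied from the error line in the log output above.
FAILED_MSG = "failed to start new exporter for endpoint"


def parse_failed_endpoint(line: str) -> Optional[dict]:
    """Extract timestamp, endpoint, and error from one collector log line.

    Returns None for any line that is not the "failed to start new exporter"
    error, such as the stack-trace lines that follow it.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5 or fields[3] != FAILED_MSG:
        return None
    try:
        attrs = json.loads(fields[4])
    except json.JSONDecodeError:
        return None
    return {
        "time": fields[0],
        "endpoint": attrs.get("endpoint"),
        "error": attrs.get("error"),
    }
```

Feeding it each line of the producer's log output and keeping the non-None results gives a short list of endpoint and error pairs that is easier to compare across collector versions than the full stack traces.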

Additional context

No response
