Component(s)
exporter/loadbalancing
What happened?
Description
The load balancing exporter fails to resolve endpoints when using the k8s resolver.
This started after we upgraded from v0.138.0 to v0.143.0. I then tested every version in between, and the regression first appears in v0.139.0 (a bisection sketch follows the environment list below).
Tested on:
kind v0.30.0 (Kubernetes v1.34.0)
AWS EKS v1.32.9-eks-3025e55
GCP GKE v1.33.5-gke.1956000
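For bisecting, only the producer's collector image needs to change between runs. A minimal sketch, not part of the original report (deployment and container names match the producer manifest below):

# Swap the producer image tag; v0.138.0 works, v0.139.0 and later time out
kubectl -n otel set image deployment/entrypoint-collector \
  opentelemetry-collector=otel/opentelemetry-collector-contrib:0.139.0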
Steps to Reproduce
You will need two OTel collectors to reproduce the issue: one producer and one consumer.
Producer
apiVersion: v1
data:
  relay: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
            keepalive:
              server_parameters:
                max_connection_age: 1m0s
                max_connection_age_grace: 5m0s
                max_connection_idle: 1m0s
                time: 2h
                timeout: 20s
            max_recv_msg_size_mib: 8
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        send_batch_max_size: 1000
        send_batch_size: 500
        timeout: 200ms
      memory_limiter:
        check_interval: 1s
        limit_percentage: 75
        spike_limit_percentage: 15
    extensions:
      pprof:
        endpoint: 0.0.0.0:1777
      zpages:
        endpoint: 0.0.0.0:55679
      health_check:
        endpoint: ${env:MY_POD_IP}:13133
      file_storage/pushgateway:
        create_directory: true
        compaction:
          directory: /etc/otel/storage/pushgateway
          on_rebound: true
        directory: /etc/otel/storage/pushgateway
    exporters:
      debug/detailed:
        verbosity: detailed
      loadbalancing/pushgateways:
        routing_key: "streamID"
        protocol:
          otlp:
            compression: gzip
            retry_on_failure:
              enabled: true
              initial_interval: 5s
              max_elapsed_time: 300s
              max_interval: 30s
            sending_queue:
              enabled: true
              num_consumers: 10
              queue_size: 5000
              storage: file_storage/pushgateway
            timeout: 30s
            tls:
              insecure: true
        resolver:
          k8s:
            service: push-gateway.otel
    service:
      extensions:
        - file_storage/pushgateway
        - health_check
      pipelines:
        metrics/allin:
          receivers:
            - otlp
          processors:
            - memory_limiter
            - batch
          exporters:
            - loadbalancing/pushgateways
      telemetry:
        logs:
          level: DEBUG
        metrics:
          readers:
            - pull:
                exporter:
                  prometheus:
                    host: 0.0.0.0
                    port: 8888
kind: ConfigMap
metadata:
  name: entrypoint-cm
  namespace: otel
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/component: entrypoint
    app.kubernetes.io/name: entrypoint-collector
  name: entrypoint-collector
  namespace: otel
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 2
  selector:
    matchLabels:
      app.kubernetes.io/component: entrypoint
      app.kubernetes.io/name: entrypoint-collector
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "8888"
        prometheus.io/scrape: "true"
      labels:
        app.kubernetes.io/component: entrypoint
        app.kubernetes.io/name: entrypoint-collector
    spec:
      automountServiceAccountToken: true
      containers:
        - args:
            - --config=/conf/relay.yaml
          env:
            - name: K8S_NODE_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
            - name: KUBE_NODE_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
            - name: MY_POD_IP
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: status.podIP
          image: otel/opentelemetry-collector-contrib:0.144.0
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /
              port: 13133
              scheme: HTTP
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          name: opentelemetry-collector
          ports:
            - containerPort: 4317
              name: otlp
              protocol: TCP
            - containerPort: 4318
              name: otlp-http
              protocol: TCP
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /
              port: 13133
              scheme: HTTP
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            limits:
              cpu: 800m
              memory: 1Gi
            requests:
              cpu: 400m
              memory: 500Mi
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - all
            readOnlyRootFilesystem: true
            runAsGroup: 1000
            runAsNonRoot: true
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /etc/otel/storage
              name: entrypoint-collector-storage
            - mountPath: /conf
              name: entrypoint-collector-configmap
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      serviceAccount: entrypoint
      serviceAccountName: entrypoint
      securityContext:
        seccompProfile:
          type: RuntimeDefault
      terminationGracePeriodSeconds: 500
      volumes:
        - emptyDir: {}
          name: entrypoint-collector-storage
        - configMap:
            defaultMode: 420
            items:
              - key: relay
                path: relay.yaml
            name: entrypoint-cm
          name: entrypoint-collector-configmap
---
apiVersion: v1
kind: ServiceAccount
automountServiceAccountToken: false
metadata:
  labels:
    app.kubernetes.io/name: entrypoint
  name: entrypoint
  namespace: otel
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: loadbalancer-role
  namespace: otel
rules:
  - apiGroups:
      - ""
    resources:
      - endpoints
    verbs:
      - list
      - watch
      - get
  - apiGroups:
      - discovery.k8s.io
    resources:
      - endpointslices
    verbs:
      - get
      - list
      - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: loadbalancer-rolebinding
  namespace: otel
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: loadbalancer-role
subjects:
  - kind: ServiceAccount
    name: entrypoint
    namespace: otel
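To rule out RBAC as a cause, the permissions granted above can be checked from outside the pod; this is a quick verification sketch (service-account name and namespace match the manifests above), covering both API groups the k8s resolver may read from:

# Should both print "yes" with the Role and RoleBinding above in place
kubectl -n otel auth can-i list endpoints \
  --as=system:serviceaccount:otel:entrypoint
kubectl -n otel auth can-i watch endpointslices.discovery.k8s.io \
  --as=system:serviceaccount:otel:entrypoint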
Consumer
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app.kubernetes.io/component: push-gateway
    app.kubernetes.io/name: push-gateway
  name: push-gateway
  namespace: otel
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/component: push-gateway
      app.kubernetes.io/name: push-gateway
  serviceName: push-gateway
  template:
    metadata:
      annotations:
        prometheus.io/path: /metrics
        prometheus.io/port: "8888"
        prometheus.io/scrape: "true"
      labels:
        app.kubernetes.io/component: push-gateway
        app.kubernetes.io/name: push-gateway
    spec:
      automountServiceAccountToken: true
      containers:
        - args:
            - --config=/conf/relay.yaml
          env:
            - name: K8S_NODE_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
            - name: KUBE_NODE_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
            - name: MY_POD_IP
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: status.podIP
          image: otel/opentelemetry-collector-contrib:latest
          imagePullPolicy: IfNotPresent
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /
              port: 13133
              scheme: HTTP
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          name: opentelemetry-collector
          ports:
            - containerPort: 4317
              name: tcp-grpc
              protocol: TCP
            - containerPort: 4318
              name: tcp-http
              protocol: TCP
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /
              port: 13133
              scheme: HTTP
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            limits:
              cpu: "2"
              memory: 2560Mi
            requests:
              cpu: "1"
              memory: 1536Mi
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - all
            readOnlyRootFilesystem: true
            runAsGroup: 1000
            runAsNonRoot: true
          volumeMounts:
            - mountPath: /conf
              name: push-gateway-configmap
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      terminationGracePeriodSeconds: 500
      volumes:
        - configMap:
            defaultMode: 420
            items:
              - key: relay
                path: relay.yaml
            name: push-gateway-cm
          name: push-gateway-configmap
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: push-gateway
    app.kubernetes.io/name: push-gateway
  name: push-gateway
  namespace: otel
spec:
  internalTrafficPolicy: Cluster
  ipFamilies:
    - IPv4
  ipFamilyPolicy: SingleStack
  ports:
    - name: tcp-grpc
      port: 4317
      protocol: TCP
      targetPort: 4317
    - name: tcp-http
      port: 4318
      protocol: TCP
      targetPort: 4318
  selector:
    app.kubernetes.io/component: push-gateway
    app.kubernetes.io/name: push-gateway
  sessionAffinity: None
  type: ClusterIP
---
apiVersion: v1
data:
  relay: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
            keepalive:
              server_parameters:
                max_connection_age: 1m0s
                max_connection_age_grace: 5m0s
                max_connection_idle: 1m0s
                time: 2h
                timeout: 20s
            max_recv_msg_size_mib: 4
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        send_batch_max_size: 1000
        send_batch_size: 500
        timeout: 200ms
      memory_limiter:
        check_interval: 1s
        limit_percentage: 75
        spike_limit_percentage: 15
    extensions:
      pprof:
        endpoint: 0.0.0.0:1777
      zpages:
        endpoint: 0.0.0.0:55679
      health_check:
        endpoint: ${env:MY_POD_IP}:13133
    exporters:
      debug:
        verbosity: detailed
    service:
      extensions:
        - health_check
      pipelines:
        metrics/otlpin:
          receivers:
            - otlp
          processors:
            - memory_limiter
            - batch
          exporters:
            - debug
      telemetry:
        logs:
          level: DEBUG
        metrics:
          readers:
            - pull:
                exporter:
                  prometheus:
                    host: 0.0.0.0
                    port: 8888
kind: ConfigMap
metadata:
  name: push-gateway-cm
  namespace: otel
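To drive the repro end to end, a sketch of the deploy-and-send steps (producer.yaml and consumer.yaml are hypothetical file names for the two manifest sets above; telemetrygen is the load generator from this repo, flags as of recent versions):

# Apply both manifest sets (hypothetical file names)
kubectl apply -f consumer.yaml -f producer.yaml

# Expose the producer's OTLP gRPC port locally and push a few test metrics
kubectl -n otel port-forward deploy/entrypoint-collector 4317:4317 &
telemetrygen metrics --otlp-insecure --otlp-endpoint 127.0.0.1:4317 --metrics 10

Sending data is not strictly required: per the stack trace below, the failure fires from the resolver's OnAdd handler as soon as it sees the backends.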
Expected Result
The load balancing exporter resolves the backend endpoints, as it does on v0.138.0:
2026-01-29T04:32:21.920Z debug loadbalancingexporter@v0.138.0/resolver_k8s.go:148 creating and starting endpoints informer {"resource": {"service.instance.id": "6941dd4a-8639-4c6f-a098-5d2f79830c46", "service.name": "otelcol-contrib", "service.version": "0.138.0"}, "otelcol.component.id": "loadbalancing/pushgateways", "otelcol.component.kind": "exporter", "otelcol.signal": "metrics", "resolver": "k8s service"}
Actual Results
2026-01-29T04:33:04.217Z error loadbalancingexporter@v0.144.0/loadbalancer.go:183 failed to start new exporter for endpoint {"resource": {"service.instance.id": "967735dc-a76f-48a5-80ae-6b68a8f9c663", "service.name": "otelcol-contrib", "service.version": "0.144.0"}, "otelcol.component.id": "loadbalancing/pushgateways", "otelcol.component.kind": "exporter", "otelcol.signal": "metrics", "endpoint": "10.244.0.9:4317", "error": "timeout"}
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter.(*loadBalancer).addMissingExporters
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter@v0.144.0/loadbalancer.go:183
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter.(*loadBalancer).onBackendChanges
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter@v0.144.0/loadbalancer.go:166
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter.(*k8sResolver).resolve
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter@v0.144.0/resolver_k8s.go:247
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter.handler.OnAdd
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter@v0.144.0/resolver_k8s_handler.go:57
k8s.io/client-go/tools/cache.(*processorListener).run.func1
k8s.io/client-go@v0.34.3/tools/cache/shared_informer.go:1067
k8s.io/client-go/tools/cache.(*processorListener).run
k8s.io/client-go@v0.34.3/tools/cache/shared_informer.go:1077
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1
k8s.io/apimachinery@v0.35.0-alpha.0/pkg/util/wait/wait.go:72
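Note that the endpoint in the error (10.244.0.9:4317) is a real backend pod address, so discovery itself appears to work up to the point where the per-endpoint exporter is started. The slice behind the service can be inspected with (kubernetes.io/service-name is the standard label set by the EndpointSlice controller):

# List the EndpointSlices owned by the push-gateway Service
kubectl -n otel get endpointslices \
  -l kubernetes.io/service-name=push-gateway -o wide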
Collector version
v0.144.0
Environment information
Environment
Test 1: kind v0.30.0 (Kubernetes v1.34.0)
Test 2: AWS EKS v1.32.9-eks-3025e55
Test 3: GCP GKE v1.33.5-gke.1956000
OpenTelemetry Collector configuration
See the producer and consumer ConfigMaps under Steps to Reproduce.
Log output
2026-01-29T03:30:13.810Z error loadbalancingexporter@v0.144.0/loadbalancer.go:183 failed to start new exporter for endpoint {"resource": {"service.instance.id": "3f5b9f59-c5c0-43d7-a5fe-110d451ccc82", "service.name": "otelcol-contrib", "service.version": "0.144.0"}, "otelcol.component.id": "loadbalancing/pushgateways", "otelcol.component.kind": "exporter", "otelcol.signal": "metrics", "endpoint": "10.244.0.6:4317", "error": "timeout"}
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter.(*loadBalancer).addMissingExporters
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter@v0.144.0/loadbalancer.go:183
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter.(*loadBalancer).onBackendChanges
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter@v0.144.0/loadbalancer.go:166
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter.(*k8sResolver).resolve
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter@v0.144.0/resolver_k8s.go:247
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter.handler.OnAdd
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter@v0.144.0/resolver_k8s_handler.go:57
k8s.io/client-go/tools/cache.(*processorListener).run.func1
k8s.io/client-go@v0.34.3/tools/cache/shared_informer.go:1067
k8s.io/client-go/tools/cache.(*processorListener).run
k8s.io/client-go@v0.34.3/tools/cache/shared_informer.go:1077
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1
k8s.io/apimachinery@v0.35.0-alpha.0/pkg/util/wait/wait.go:72
Additional context
No response