Is high error rate during rollouts of thanos receive expected? #4277

jmichalek132 · 2021-05-27T10:27:26Z

jmichalek132
May 27, 2021

Hi, I wanted to ask whether a high error rate during a rollout of thanos receive is expected.
When triggering a rollout of thanos receive, e.g. by running kubectl rollout restart statefulset thanos-receive-staging, we see a high error rate on all layers (HTTP error rate, replication error rate and forward request error rate).

Screenshot of metrics during a rollout:

Errors in log of thanos-receive-default:

thanos-receive-default-5 thanos-receive level=error ts=2021-05-27T10:01:29.986629432Z caller=handler.go:340 component=receive component=receive-handler err="context deadline exceeded" msg="internal server error"
thanos-receive-default-5 thanos-receive level=error ts=2021-05-27T10:01:29.993031403Z caller=handler.go:340 component=receive component=receive-handler err="context deadline exceeded" msg="internal server error"
thanos-receive-default-5 thanos-receive level=error ts=2021-05-27T10:01:30.002731645Z caller=handler.go:340 component=receive component=receive-handler err="context deadline exceeded" msg="internal server error"
thanos-receive-default-5 thanos-receive level=error ts=2021-05-27T10:01:30.016280441Z caller=handler.go:340 component=receive component=receive-handler err="context deadline exceeded" msg="internal server error"
thanos-receive-default-5 thanos-receive level=error ts=2021-05-27T10:01:30.034725505Z caller=handler.go:340 component=receive component=receive-handler err="context deadline exceeded" msg="internal server error"
thanos-receive-default-5 thanos-receive level=error ts=2021-05-27T10:01:30.157205235Z caller=handler.go:340 component=receive component=receive-handler err="context deadline exceeded" msg="internal server error"
thanos-receive-default-5 thanos-receive level=error ts=2021-05-27T10:01:30.19584754Z caller=handler.go:340 component=receive component=receive-handler err="context deadline exceeded" msg="internal server error"
thanos-receive-default-5 thanos-receive level=error ts=2021-05-27T10:01:30.540910187Z caller=handler.go:340 component=receive component=receive-handler err="context deadline exceeded" msg="internal server error"
thanos-receive-default-5 thanos-receive level=error ts=2021-05-27T10:01:30.726548734Z caller=handler.go:340 component=receive component=receive-handler err="context deadline exceeded" msg="internal server error"
thanos-receive-default-5 thanos-receive level=error ts=2021-05-27T10:01:30.782889058Z caller=handler.go:340 component=receive component=receive-handler err="context deadline exceeded" msg="internal server error"
thanos-receive-default-5 thanos-receive level=error ts=2021-05-27T10:01:30.833211713Z caller=handler.go:340 component=receive component=receive-handler err="context deadline exceeded" msg="internal server error"

Errors in log of thanos-receive-staging:

thanos-receive-staging-3 thanos-receive level=warn ts=2021-05-27T08:33:10.095619638Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=8
thanos-receive-staging-1 thanos-receive level=warn ts=2021-05-27T08:33:42.315769707Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=11
thanos-receive-staging-1 thanos-receive level=warn ts=2021-05-27T08:33:42.316131273Z caller=writer.go:92 component=receive component=receive-writer msg="Error on ingesting out-of-order samples" num_dropped=5
thanos-receive-staging-1 thanos-receive level=warn ts=2021-05-27T08:33:42.316153279Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=8
thanos-receive-staging-2 thanos-receive level=warn ts=2021-05-27T08:32:34.779210265Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=16
thanos-receive-staging-1 thanos-receive level=warn ts=2021-05-27T08:33:42.316557839Z caller=writer.go:92 component=receive component=receive-writer msg="Error on ingesting out-of-order samples" num_dropped=11
thanos-receive-staging-1 thanos-receive level=warn ts=2021-05-27T08:33:42.316578368Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=7
thanos-receive-staging-1 thanos-receive level=warn ts=2021-05-27T08:33:42.317364964Z caller=writer.go:92 component=receive component=receive-writer msg="Error on ingesting out-of-order samples" num_dropped=16
thanos-receive-staging-1 thanos-receive level=warn ts=2021-05-27T08:33:42.317387982Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=4
thanos-receive-staging-3 thanos-receive level=warn ts=2021-05-27T08:33:10.096461921Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=44
thanos-receive-staging-1 thanos-receive level=warn ts=2021-05-27T08:33:42.317424214Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=86
thanos-receive-staging-1 thanos-receive level=warn ts=2021-05-27T08:33:42.317769722Z caller=writer.go:92 component=receive component=receive-writer msg="Error on ingesting out-of-order samples" num_dropped=17
thanos-receive-staging-1 thanos-receive level=warn ts=2021-05-27T08:33:42.31778458Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=20
thanos-receive-staging-2 thanos-receive level=warn ts=2021-05-27T08:32:34.779338971Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=7
thanos-receive-staging-4 thanos-receive level=warn ts=2021-05-27T08:36:02.721166024Z caller=writer.go:92 component=receive component=receive-writer msg="Error on ingesting out-of-order samples" num_dropped=22
thanos-receive-staging-1 thanos-receive level=warn ts=2021-05-27T08:33:42.318748322Z caller=writer.go:92 component=receive component=receive-writer msg="Error on ingesting out-of-order samples" num_dropped=21
thanos-receive-staging-4 thanos-receive level=warn ts=2021-05-27T08:36:02.721225879Z caller=writer.go:92 component=receive component=receive-writer msg="Error on ingesting out-of-order samples" num_dropped=19
thanos-receive-staging-1 thanos-receive level=warn ts=2021-05-27T08:33:42.318766163Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=15
thanos-receive-staging-3 thanos-receive level=warn ts=2021-05-27T08:33:10.096521591Z caller=writer.go:92 component=receive component=receive-writer msg="Error on ingesting out-of-order samples" num_dropped=16
thanos-receive-staging-3 thanos-receive level=warn ts=2021-05-27T08:33:10.096539631Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=6
thanos-receive-staging-4 thanos-receive level=warn ts=2021-05-27T08:36:02.72183861Z caller=writer.go:92 component=receive component=receive-writer msg="Error on ingesting out-of-order samples" num_dropped=22
thanos-receive-staging-1 thanos-receive level=warn ts=2021-05-27T08:33:42.318764713Z caller=writer.go:92 component=receive component=receive-writer msg="Error on ingesting out-of-order samples" num_dropped=14
thanos-receive-staging-1 thanos-receive level=warn ts=2021-05-27T08:33:42.318779701Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samp^C
thanos-receive-staging-3 thanos-receive level=warn ts=2021-05-27T08:33:10.096859773Z caller=writer.go:92 component=receive component=receive-writer msg="Error on ingesting out-of-order samples" num_dropped=16
thanos-receive-staging-1 thanos-receive level=warn ts=2021-05-27T08:33:42.318843125Z caller=writer.go:92 component=receive component=receive-writer msg="Error on ingesting out-of-order samples" num_dropped=23
thanos-receive-staging-6 thanos-receive level=warn ts=2021-05-27T09:27:58.446297582Z caller=writer.go:100 component=receive component=receive-writer msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=4
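
To see which kind of rejection dominates in warnings like the ones above, the num_dropped values can be summed per message. This is a minimal sketch, not part of the original report: the function name dropped_by_message and the shortened sample lines are made up for illustration, and it assumes every relevant line carries msg="..." directly followed by num_dropped=N, as in the log excerpt.

```python
import re
from collections import Counter

# Matches the msg="..." field immediately followed by num_dropped=N,
# the shape of the receive-writer warnings in the excerpt.
DROPPED_RE = re.compile(r'msg="(?P<msg>[^"]*)"\s+num_dropped=(?P<count>\d+)')


def dropped_by_message(lines):
    """Sum num_dropped per distinct warning message; non-matching lines are skipped."""
    totals = Counter()
    for line in lines:
        match = DROPPED_RE.search(line)
        if match:
            totals[match.group("msg")] += int(match.group("count"))
    return totals


# Shortened sample lines (hypothetical, modelled on the excerpt).
sample = [
    'level=warn caller=writer.go:100 msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=8',
    'level=warn caller=writer.go:92 msg="Error on ingesting out-of-order samples" num_dropped=5',
    'level=warn caller=writer.go:92 msg="Error on ingesting out-of-order samples" num_dropped=11',
]

for message, count in dropped_by_message(sample).most_common():
    print(f"{count:6d}  {message}")
```

In practice the lines would come from the saved output of kubectl logs rather than from an inline list.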

Our deployment.

Configuration of thanos receive default:
The one for thanos-receive-staging is almost the same, except for necessary modifications such as the name of the StatefulSet.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app: thanos-receive
    controller.receive.thanos.io: thanos-receive-controller
  name: thanos-receive
  namespace: namespace_name
spec:
  replicas: 6 # TODO increase if ram usage over 50GB
  volumeClaimTemplates:
    - metadata:
        name: thanos-receive-default
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: "40.0Gi"
  selector:
    matchLabels:
      app: thanos-receive
      controller.receive.thanos.io: thanos-receive-controller
  serviceName: thanos-receive
  template:
    metadata:
      annotations:
      labels:
        app.kubernetes.io/part-of: "namespace_name"
        app.kubernetes.io/instance: "namespace_name-singleton"
        app.kubernetes.io/name: "thanos-receive"
        #app.kubernetes.io/version: ""
        app: thanos-receive
        controller.receive.thanos.io: thanos-receive-controller
    spec:
      affinity: {}
      containers:
        - name: "jaeger-agent"  # See also https://www.jaegertracing.io/docs/1.19/operator/. In the future, sidecar injection might be more appropriate.
          image: "jaegertracing/jaeger-agent:1.19.2"
          imagePullPolicy: "IfNotPresent"
          ports:  # See ports in https://www.jaegertracing.io/docs/1.19/deployment/#agent.
            - containerPort: 5775
              name: "zk-compact-trft"
              protocol: "UDP"
            - containerPort: 5778
              name: "config-rest"
              protocol: "TCP"
            - containerPort: 6831   # Incoming spans are pushed here.
              name: "jg-compact-trft"
              protocol: "UDP"
            - containerPort: 6832
              name: "jg-binary-trft"
              protocol: "UDP"
            - containerPort: 14271  # Health check at /, metrics at /metrics.
              name: "admin-http"
              protocol: "TCP"
          resources:
            requests:
              cpu: "0.1"
              memory: "256.0Mi"
              ephemeral-storage: "512.0Mi"
            limits:
              memory: "256.0Mi"
        - args:
            - receive
            - --receive.replication-factor=3
            - --tsdb.path=/var/thanos/receive
            - --label=receive_replica="$(NAME)"
            - --label=hashring="default"
            - --receive.local-endpoint=$(NAME).thanos-receive-default.$(NAMESPACE).svc:10901
            - --tsdb.retention=4h
            - --receive.hashrings-file=/var/lib/thanos-receive/hashrings.json
            - --log.level=info
            - --tracing.config-file=/tracing/thanos-tracing.yaml
            - --objstore.config-file=/object-store/thanos-receive-object-store.yaml
            - --receive.tenant-label-name=thanos_receive_tenant_id
          env:
            - name: NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          resources:
            requests:
              cpu: "2.0"
              memory: "8.0Gi"
              ephemeral-storage: "10.0Gi"
            limits:
              memory: "16.0Gi"
              cpu: "4.0"
          image: quay.io/thanos/thanos:v0.19.0
          livenessProbe:
            failureThreshold: 8
            httpGet:
              path: /-/healthy
              port: 10902
              scheme: HTTP
            periodSeconds: 30
          name: thanos-receive
          ports:
            - containerPort: 10901
              name: grpc
            - containerPort: 10902
              name: http
            - containerPort: 19291
              name: remote-write
          readinessProbe:
            failureThreshold: 20
            httpGet:
              path: /-/ready
              port: 10902
              scheme: HTTP
            periodSeconds: 5
          terminationMessagePolicy: FallbackToLogsOnError
          volumeMounts:
            - mountPath: /var/thanos/receive
              name: thanos-receive-default
              readOnly: false
            - mountPath: /var/lib/thanos-receive
              name: hashring-config
            - name: "thanos-receive-tracing-config"
              mountPath: "/tracing"
            - name: "thanos-object-store-config"
              mountPath: "/object-store"
              readOnly: true
      terminationGracePeriodSeconds: 900
      volumes:
        - configMap:
            name: thanos-receive-generated
          name: hashring-config
        - name: "thanos-receive-tracing-config"
          secret:
            secretName: "thanos-receive-default-tracing-config"
        - name: "thanos-object-store-config"
          secret:
            secretName: "thanos-receive-object-store-config"
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: thanos-receive
  name: thanos-receive
  namespace: namespace_name
spec:
  clusterIP: None
  ports:
    - name: grpc
      port: 10901
      targetPort: 10901
    - name: http
      port: 10902
      targetPort: 10902
    - name: remote-write
      port: 19291
      targetPort: 19291
  selector:
    app: thanos-receive
    controller.receive.thanos.io: thanos-receive-controller

Hashring json config:

[
  {
    "hashring": "staging",
    "tenants": [
      "tenantA",
      "tenantB",
      "tenantC",
      "tenantD"
    ],
    "endpoints": [
      "thanos-receive-staging-0.thanos-receive-staging.namespace.svc:10901",
      "thanos-receive-staging-1.thanos-receive-staging.namespace.svc:10901",
      "thanos-receive-staging-2.thanos-receive-staging.namespace.svc:10901",
      "thanos-receive-staging-3.thanos-receive-staging.namespace.svc:10901",
      "thanos-receive-staging-4.thanos-receive-staging.namespace.svc:10901",
      "thanos-receive-staging-5.thanos-receive-staging.namespace.svc:10901",
      "thanos-receive-staging-6.thanos-receive-staging.namespace.svc:10901",
      "thanos-receive-staging-7.thanos-receive-staging.namespace.svc:10901"
    ]
  },
  {
    "hashring": "default",
    "endpoints": [
      "thanos-receive-default-0.thanos-receive-default.namespace.svc:10901",
      "thanos-receive-default-1.thanos-receive-default.namespace.svc:10901",
      "thanos-receive-default-2.thanos-receive-default.namespace.svc:10901",
      "thanos-receive-default-3.thanos-receive-default.namespace.svc:10901",
      "thanos-receive-default-4.thanos-receive-default.namespace.svc:10901",
      "thanos-receive-default-5.thanos-receive-default.namespace.svc:10901"
    ]
  }
]
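
Since --receive.local-endpoint is what identifies a node within the hashring file, one cheap sanity check is that the rendered endpoint of each pod appears verbatim in the hashring it is supposed to belong to. The following is a minimal sketch of such a check; the helper name hashrings_containing and the shortened JSON are illustrative assumptions, not something taken from the deployment above.

```python
import json


def hashrings_containing(hashrings_json, local_endpoint):
    """Return the names of all hashrings whose endpoint list contains local_endpoint."""
    rings = json.loads(hashrings_json)
    return [
        ring.get("hashring", "")
        for ring in rings
        if local_endpoint in ring.get("endpoints", [])
    ]


# Shortened, hypothetical hashring file in the same shape as the one above.
config = """
[
  {"hashring": "staging", "tenants": ["tenantA"],
   "endpoints": ["thanos-receive-staging-0.thanos-receive-staging.namespace.svc:10901"]},
  {"hashring": "default",
   "endpoints": ["thanos-receive-default-0.thanos-receive-default.namespace.svc:10901"]}
]
"""

endpoint = "thanos-receive-default-0.thanos-receive-default.namespace.svc:10901"
print(hashrings_containing(config, endpoint))
```

An empty result for a pod's endpoint would point at a mismatch between the StatefulSet's service name or namespace and the generated hashring file.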

kakkoyun · 2021-05-31T13:19:02Z

kakkoyun
May 31, 2021
Maintainer

Replication errors during updates are expected; however, they shouldn't surface in the remote-write responses if you have enough healthy receive nodes. With your setup (6 nodes, replication factor 3), you should tolerate 2/6 down nodes. Do you have Pod Disruption Budgets set? https://github.com/thanos-io/kube-thanos/blob/f53ad9856c6f765989ea76ba8eff8dd1e77186b7/jsonnet/kube-thanos/kube-thanos-receive.libsonnet#L224
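
To make the tolerance argument concrete, here is a small sketch that works out which series positions would lose write quorum for a given set of down nodes. It rests on two assumptions that should be verified against the Thanos version in use: that a write succeeds once floor(RF/2)+1 replicas acknowledge it, and that the replicas of a series sit on consecutive hashring nodes. The function names are made up for this illustration.

```python
def write_quorum(replication_factor):
    """Majority quorum: floor(RF / 2) + 1 (assumed, see lead-in)."""
    return replication_factor // 2 + 1


def replica_nodes(slot, num_nodes, replication_factor):
    """Nodes holding a series whose primary node is `slot` (assumed consecutive placement)."""
    return [(slot + i) % num_nodes for i in range(replication_factor)]


def slots_losing_quorum(num_nodes, replication_factor, down):
    """Primary slots whose healthy replica count falls below the write quorum."""
    quorum = write_quorum(replication_factor)
    failed = []
    for slot in range(num_nodes):
        nodes = replica_nodes(slot, num_nodes, replication_factor)
        healthy = [n for n in nodes if n not in down]
        if len(healthy) < quorum:
            failed.append(slot)
    return failed


# 6 nodes with --receive.replication-factor=3, as in the StatefulSet args.
print(slots_losing_quorum(6, 3, down={5}))     # one pod restarting
print(slots_losing_quorum(6, 3, down={4, 5}))  # two neighbouring pods down
print(slots_losing_quorum(6, 3, down={0, 3}))  # two distant pods down
```

Under these assumptions a single restarting pod never breaks quorum, while two down pods do so only when they share a replica set, so whether two out of six can be tolerated depends on which two are down.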

3 replies

jmichalek132 Jun 1, 2021
Author

Thank you for the tip. We were indeed missing a Pod Disruption Budget, but adding one didn't help.
The Pod Disruption Budget:

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: thanos-receive
  namespace: giraffe
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: thanos-receive

Also, with the default behavior of a StatefulSet, even before adding it we only ever had one unready pod (its readiness check hadn't passed yet) at a time.
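
For reference, a matchLabels selector matches a pod when every key/value pair in the selector is present in the pod's labels, so the budget above covers any pod labelled app: thanos-receive in that namespace. The sketch below illustrates that rule; the function name selector_matches is made up, and the second label set is a hypothetical pod, not one from this deployment. As far as I know, a PodDisruptionBudget limits voluntary evictions made through the eviction API (for example node drains) rather than the pod replacement done by a StatefulSet rolling update, which would be consistent with the budget making no difference here.

```python
def selector_matches(match_labels, pod_labels):
    """True when every matchLabels pair is present in the pod's labels."""
    return all(pod_labels.get(key) == value for key, value in match_labels.items())


pdb_selector = {"app": "thanos-receive"}

# Labels from the pod template of the StatefulSet shown earlier.
receive_pod = {
    "app.kubernetes.io/part-of": "namespace_name",
    "app.kubernetes.io/instance": "namespace_name-singleton",
    "app.kubernetes.io/name": "thanos-receive",
    "app": "thanos-receive",
    "controller.receive.thanos.io": "thanos-receive-controller",
}

# Hypothetical pod with a different app label, for contrast.
other_pod = {"app": "thanos-query"}

print(selector_matches(pdb_selector, receive_pod))
print(selector_matches(pdb_selector, other_pod))
```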

kakkoyun Jun 1, 2021
Maintainer

🤔 I don't see why this'd happen from the config you've shared. We need further investigation on this.
FWIW I'm sharing our configuration which is pretty similar to yours. Maybe it'd help you to spot something I couldn't see https://github.com/rhobs/configuration/blob/2cb8cf57e2d7b94fc577288913cc9e77de49b8f2/resources/services/observatorium-metrics-template.yaml#L980-L1197

jmichalek132 Jun 3, 2021
Author

I haven't noticed any significant difference either.
Adding a screenshot from another rollout, along with a sample trace.

trace.zip

bwplotka · 2021-06-10T11:30:46Z

bwplotka
Jun 10, 2021
Maintainer

Thanks for this!

Some ideas during our Contributor Hours:

Known issue: receive + receive controller: Eliminate downtime when scaling up/down hashring replicas. observatorium/thanos-receive-controller#69 (might not be relevant, as it's about scale-out)
How to prevent more than one receiver replica from being down at a time? (investigate some solution)
Is readiness really reported only once we can ingest? (What if we report ready before write responses would be 200?)
Create e2e test around this topic, dynamically changing replicas with receive-controller
