buildx kubernetes driver sometimes returns ERROR: error dialing backend: remote error: tls: internal error #2668

Open
@dcherniv

Contributing guidelines

I've found a bug and checked that ...

  • ... the documentation does not mention anything about my problem
  • ... there are no open or closed issues that are related to my problem

Description

On EKS 1.29, specifically on ARM64 nodes, there appears to be a race condition in which a node is reported as ready in the cluster before its CSR has been signed.
The issue goes away after a number of seconds, once the CSR is approved and issued.
During that window, however, all calls to pods on that node return the error above.
The buildx kubernetes driver specifically returns this:

ERROR: error dialing backend: remote error: tls: internal error
ERROR: context deadline exceeded

NAME/NODE                             DRIVER/ENDPOINT                                                                                                               STATUS    BUILDKIT   PLATFORMS
buildx-php-release-8.2-2b0efca        kubernetes                                                                                                                                         
 \_ buildx-php-release-8.2-2b0efca0    \_ kubernetes:///buildx-php-release-8.2-2b0efca?deployment=buildkit-8e44b51e-ee72-4a03-873b-8fa232cf24d9-33fdb&kubeconfig=   running   v0.15.2    linux/amd64*
 \_ buildx-php-release-8.2-2b0efca1    \_ kubernetes:///buildx-php-release-8.2-2b0efca?deployment=buildkit-b4c6a857-d35e-4171-b4d3-0e38e1749975-05a58&kubeconfig=   error                linux/arm64*
Failed to get status for buildx-php-release-8.2-2b0efca (buildx-php-release-8.2-2b0efca1): listing workers: failed to list workers: DeadlineExceeded: context deadline exceeded
default*                              docker                                                                                                                                             
 \_ default                            \_ default                                                                                                                   running   v0.15.2    linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/386

I was browsing the buildx source code and found the call that lists the workers: https://github.com/docker/buildx/blob/master/vendor/github.com/moby/buildkit/client/workers.go#L31
However, it is not clear what the retry logic is here. It seems that when buildx gets the tls internal error above, the call simply fails and buildx exits.
Is it possible to handle this specific error? It is transient, and the call should succeed if buildx retries with some kind of exponential backoff.
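
Until buildx handles this itself, a caller-side stopgap could replace the fixed sleep 10 pauses in the startup script with a retry loop. The following is a minimal sketch, not anything buildx provides: retry_with_backoff is a hypothetical helper name, and the attempt count and starting delay are arbitrary choices.

```shell
#!/bin/sh
# retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or MAX_ATTEMPTS
# attempts have been made, doubling the delay after each failed attempt.
retry_with_backoff() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while true; do
    # Run the command; stop as soon as it succeeds.
    if "$@"; then
      return 0
    fi
    # Out of attempts: report and propagate failure.
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

For example, retry_with_backoff 6 5 docker buildx inspect --bootstrap "buildx-${DOCKERFILE_DIR_SANITIZED}-${VERSION}" would wait 5, 10, 20, 40 and 80 seconds between attempts. Whether docker buildx inspect exits non-zero while a node is still in the error state is an assumption that would need checking.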

buildx is started as follows:

          docker buildx create --bootstrap --name=buildx-${DOCKERFILE_DIR_SANITIZED}-${VERSION} --driver=kubernetes --platform=linux/amd64 \
                 --buildkitd-flags '--debug --trace' \
                 --driver-opt='"annotations=karpenter.sh/do-not-disrupt=true,karpenter.sh/do-not-evict=true","image=162166941288.dkr.ecr.us-east-1.amazonaws.com/moby/buildkit:v0.15.2","timeout=600s","requests.memory=28Gi","nodeselector=runners=dedicated,kubernetes.io/arch=amd64","tolerations=key=runners,value=dedicated"'
          sleep 10
          docker buildx ls
          docker buildx create --append --bootstrap --name=buildx-${DOCKERFILE_DIR_SANITIZED}-${VERSION} --driver=kubernetes --platform=linux/arm64 \
                 --buildkitd-flags '--debug --trace' \
                 --driver-opt='"annotations=karpenter.sh/do-not-disrupt=true,karpenter.sh/do-not-evict=true","image=162166941288.dkr.ecr.us-east-1.amazonaws.com/moby/buildkit:v0.15.2","timeout=600s","requests.memory=28Gi","nodeselector=runners=dedicated,kubernetes.io/arch=arm64","tolerations=key=runners,value=dedicated;key=arch,value=arm64"'
          sleep 10
          docker buildx ls

buildx and docker versions:
buildkit remote agent to be booted on the nodes: moby/buildkit:v0.15.2
docker version: docker:27.2-dind with its built-in buildkit, no modifications.
The whole setup runs on self-hosted github-actions runners using version 0.9.3 of oci://ghcr.io/actions/actions-runner-controller-charts

It seems to happen only under heavy load on the cluster. We have a repo where we build about 20-30 docker images in parallel (it's our base images repo). Each docker image requests 2 buildx kubernetes workers, one for amd64 and one for arm64, so a lot of nodes get spun up at the same time.

Expected behaviour

buildx should not exit when it encounters a transient error.

Actual behaviour

Failure log follows:

#1 [internal] booting buildkit
W0831 21:09:50.507151     230 warnings.go:70] metadata.name: this is used in Pod names and hostnames, which can result in surprising behavior; a DNS label is recommended: [must not contain dots]
#1 waiting for 1 pods to be ready, timeout: 10 minutes
#1 waiting for 1 pods to be ready, timeout: 10 minutes 66.8s done
#1 DONE 66.8s
buildx-php-release-8.2-2b0efca
NAME/NODE                             DRIVER/ENDPOINT                                                                                                               STATUS    BUILDKIT   PLATFORMS
buildx-php-release-8.2-2b0efca        kubernetes                                                                                                                                         
 \_ buildx-php-release-8.2-2b0efca0    \_ kubernetes:///buildx-php-release-8.2-2b0efca?deployment=buildkit-8e44b51e-ee72-4a03-873b-8fa232cf24d9-33fdb&kubeconfig=   running   v0.15.2    linux/amd64*
default*                              docker                                                                                                                                             
 \_ default                            \_ default                                                                                                                   running   v0.15.2    linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/386
#1 [internal] booting buildkit
W0831 21:11:07.543695     328 warnings.go:70] metadata.name: this is used in Pod names and hostnames, which can result in surprising behavior; a DNS label is recommended: [must not contain dots]
#1 waiting for 1 pods to be ready, timeout: 10 minutes
#1 waiting for 1 pods to be ready, timeout: 10 minutes 71.9s done
#1 DONE 71.9s
buildx-php-release-8.2-2b0efca
ERROR: error dialing backend: remote error: tls: internal error
ERROR: context deadline exceeded

Sometimes it is able to proceed past this error (I'm guessing due to the sleep 10 statement), but not always.

Buildx version

github.com/docker/buildx v0.16.2 99dea6d

Docker info

/ # docker info
Client:
 Version:    27.2.0
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.16.2
    Path:     /usr/local/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.29.2
    Path:     /usr/local/libexec/docker/cli-plugins/docker-compose

Builders list

NAME/NODE                             DRIVER/ENDPOINT                                                                                                               STATUS    BUILDKIT   PLATFORMS
buildx-php-release-8.2-2b0efca        kubernetes                                                                                                                                         
 \_ buildx-php-release-8.2-2b0efca0    \_ kubernetes:///buildx-php-release-8.2-2b0efca?deployment=buildkit-8e44b51e-ee72-4a03-873b-8fa232cf24d9-33fdb&kubeconfig=   running   v0.15.2    linux/amd64*
 \_ buildx-php-release-8.2-2b0efca1    \_ kubernetes:///buildx-php-release-8.2-2b0efca?deployment=buildkit-b4c6a857-d35e-4171-b4d3-0e38e1749975-05a58&kubeconfig=   error                linux/arm64*
Failed to get status for buildx-php-release-8.2-2b0efca (buildx-php-release-8.2-2b0efca1): listing workers: failed to list workers: DeadlineExceeded: context deadline exceeded
default*                              docker                                                                                                                                             
 \_ default                            \_ default                                                                                                                   running   v0.15.2    linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/386

Configuration

FROM public.ecr.aws/docker/library/php:8.2-apache

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get -y update &&\
    apt-get -y install gnupg git unzip


Build logs

No response

Additional info

No response
