buildx kubernetes driver sometimes returns ERROR: error dialing backend: remote error: tls: internal error
#2668
Contributing guidelines
- I've read the contributing guidelines and wholeheartedly agree
I've found a bug and checked that ...
- ... the documentation does not mention anything about my problem
- ... there are no open or closed issues that are related to my problem
Description
On EKS 1.29, specifically on ARM64 nodes, there appears to be a race condition in which a node is reported as ready in the cluster before its CSR has been signed.
The issue goes away after a number of seconds, once the CSR is approved and issued.
However, during that window all calls to pods running on the node return the above-mentioned error.
Buildx kubernetes driver specifically returns this:
ERROR: error dialing backend: remote error: tls: internal error
ERROR: context deadline exceeded
NAME/NODE DRIVER/ENDPOINT STATUS BUILDKIT PLATFORMS
buildx-php-release-8.2-2b0efca kubernetes
\_ buildx-php-release-8.2-2b0efca0 \_ kubernetes:///buildx-php-release-8.2-2b0efca?deployment=buildkit-8e44b51e-ee72-4a03-873b-8fa232cf24d9-33fdb&kubeconfig= running v0.15.2 linux/amd64*
\_ buildx-php-release-8.2-2b0efca1 \_ kubernetes:///buildx-php-release-8.2-2b0efca?deployment=buildkit-b4c6a857-d35e-4171-b4d3-0e38e1749975-05a58&kubeconfig= error linux/arm64*
Failed to get status for buildx-php-release-8.2-2b0efca (buildx-php-release-8.2-2b0efca1): listing workers: failed to list workers: DeadlineExceeded: context deadline exceeded
default* docker
\_ default \_ default running v0.15.2 linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/386
I was browsing the buildx source code and found the call to list the workers: https://github.com/docker/buildx/blob/master/vendor/github.com/moby/buildkit/client/workers.go#L31
But it's not clear what the retry logic is here. It seems that when the above tls internal error occurs, the call fails immediately and buildx exits.
Is it possible to handle this specific error? It is transient, and the call should succeed if buildx retries with some kind of exponential backoff.
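As a caller-side workaround in the meantime, the bootstrap step could be wrapped in a retry loop. The following is only a minimal sketch of exponential backoff in POSIX shell: retry_backoff and BUILDER_NAME are hypothetical names, not part of buildx, and the attempt count and delays are guesses that would need tuning.

```shell
#!/bin/sh
# retry_backoff MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper (not part of buildx): runs COMMAND until it succeeds
# or MAX_ATTEMPTS is reached, doubling the delay after each failure.
retry_backoff() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    # Success: stop retrying.
    if "$@"; then
      return 0
    fi
    # Out of attempts: report and fail.
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed, retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Hypothetical usage; BUILDER_NAME is a placeholder for the builder name:
# retry_backoff 6 5 docker buildx inspect --bootstrap "$BUILDER_NAME"
```

This retries on any non-zero exit, not only on the tls internal error, so it would also mask genuine failures for the duration of the backoff.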
buildx is started as follows:
docker buildx create --bootstrap --name=buildx-${DOCKERFILE_DIR_SANITIZED}-${VERSION} --driver=kubernetes --platform=linux/amd64 \
--buildkitd-flags '--debug --trace' \
--driver-opt='"annotations=karpenter.sh/do-not-disrupt=true,karpenter.sh/do-not-evict=true","image=162166941288.dkr.ecr.us-east-1.amazonaws.com/moby/buildkit:v0.15.2","timeout=600s","requests.memory=28Gi","nodeselector=runners=dedicated,kubernetes.io/arch=amd64","tolerations=key=runners,value=dedicated"'
sleep 10
docker buildx ls
docker buildx create --append --bootstrap --name=buildx-${DOCKERFILE_DIR_SANITIZED}-${VERSION} --driver=kubernetes --platform=linux/arm64 \
--buildkitd-flags '--debug --trace' \
--driver-opt='"annotations=karpenter.sh/do-not-disrupt=true,karpenter.sh/do-not-evict=true","image=162166941288.dkr.ecr.us-east-1.amazonaws.com/moby/buildkit:v0.15.2","timeout=600s","requests.memory=28Gi","nodeselector=runners=dedicated,kubernetes.io/arch=arm64","tolerations=key=runners,value=dedicated;key=arch,value=arm64"'
sleep 10
docker buildx ls
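Instead of relying on the fixed sleep 10, the output of docker buildx ls could be checked for a node in the error state. This is a rough sketch only: has_error_node is a hypothetical helper, and it assumes node rows keep the layout shown in this report (\_ NAME \_ ENDPOINT STATUS ...), which may differ between buildx versions.

```shell
#!/bin/sh
# has_error_node: hypothetical helper that reads `docker buildx ls` output
# on stdin and exits 0 if any node row reports the status "error".
# Assumed row layout (taken from the output in this report):
#   \_ NAME \_ ENDPOINT STATUS [BUILDKIT] PLATFORMS
has_error_node() {
  awk '$1 == "\\_" && $3 == "\\_" && $5 == "error" { found = 1 }
       END { exit !found }'
}

# Hypothetical usage:
# if docker buildx ls | has_error_node; then
#   echo "at least one builder node is not ready yet" >&2
# fi
```

A check like this only detects the failed state after the fact; it does not make the underlying dial succeed any sooner.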
buildx and docker versions:
buildkit remote agent to be booted on the nodes: moby/buildkit:v0.15.2
docker version: docker:27.2-dind, with its built-in buildkit, no modifications.
The whole setup runs on self-hosted GitHub Actions runners using version 0.9.3 of oci://ghcr.io/actions/actions-runner-controller-charts
It seems to happen only under heavy load on the cluster. We have a repo where we build about 20-30 Docker images in parallel (it's our base images repo). Each image build requests 2 buildx kubernetes workers, one for amd64 and one for arm64, so a lot of nodes get spun up at the same time.
Expected behaviour
buildx to not die when it encounters a transient error.
Actual behaviour
Failure log follows:
#1 [internal] booting buildkit
W0831 21:09:50.507151 230 warnings.go:70] metadata.name: this is used in Pod names and hostnames, which can result in surprising behavior; a DNS label is recommended: [must not contain dots]
#1 waiting for 1 pods to be ready, timeout: 10 minutes
#1 waiting for 1 pods to be ready, timeout: 10 minutes 66.8s done
#1 DONE 66.8s
buildx-php-release-8.2-2b0efca
NAME/NODE DRIVER/ENDPOINT STATUS BUILDKIT PLATFORMS
buildx-php-release-8.2-2b0efca kubernetes
\_ buildx-php-release-8.2-2b0efca0 \_ kubernetes:///buildx-php-release-8.2-2b0efca?deployment=buildkit-8e44b51e-ee72-4a03-873b-8fa232cf24d9-33fdb&kubeconfig= running v0.15.2 linux/amd64*
default* docker
\_ default \_ default running v0.15.2 linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/386
#1 [internal] booting buildkit
W0831 21:11:07.543695 328 warnings.go:70] metadata.name: this is used in Pod names and hostnames, which can result in surprising behavior; a DNS label is recommended: [must not contain dots]
#1 waiting for 1 pods to be ready, timeout: 10 minutes
#1 waiting for 1 pods to be ready, timeout: 10 minutes 71.9s done
#1 DONE 71.9s
buildx-php-release-8.2-2b0efca
ERROR: error dialing backend: remote error: tls: internal error
ERROR: context deadline exceeded
Sometimes it is able to proceed past this error (I'm guessing due to the sleep 10 statement), but not always.
Buildx version
github.com/docker/buildx v0.16.2 99dea6d
Docker info
/ # docker info
Client:
Version: 27.2.0
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.16.2
Path: /usr/local/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.29.2
Path: /usr/local/libexec/docker/cli-plugins/docker-compose
Builders list
NAME/NODE DRIVER/ENDPOINT STATUS BUILDKIT PLATFORMS
buildx-php-release-8.2-2b0efca kubernetes
\_ buildx-php-release-8.2-2b0efca0 \_ kubernetes:///buildx-php-release-8.2-2b0efca?deployment=buildkit-8e44b51e-ee72-4a03-873b-8fa232cf24d9-33fdb&kubeconfig= running v0.15.2 linux/amd64*
\_ buildx-php-release-8.2-2b0efca1 \_ kubernetes:///buildx-php-release-8.2-2b0efca?deployment=buildkit-b4c6a857-d35e-4171-b4d3-0e38e1749975-05a58&kubeconfig= error linux/arm64*
Failed to get status for buildx-php-release-8.2-2b0efca (buildx-php-release-8.2-2b0efca1): listing workers: failed to list workers: DeadlineExceeded: context deadline exceeded
default* docker
\_ default \_ default running v0.15.2 linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/386
Configuration
FROM public.ecr.aws/docker/library/php:8.2-apache
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get -y update &&\
apt-get -y install gnupg git unzip
Build logs
No response
Additional info
No response