"Cannot connect to the Docker daemon" errors appear more frequently as more runners we deploy #3828
Checks
- I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- I am using charts that are officially provided
Controller Version
0.9.3
Deployment Method
ArgoCD
Checks
- This isn't a question or user support case (For Q&A and community support, go to Discussions).
- I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
1. Set the following variables on the runner container in an attempt to prevent the error from appearing:

   ```yaml
   - name: STARTUP_DELAY_IN_SECONDS
     value: "10"
   - name: DISABLE_WAIT_FOR_DOCKER
     value: "false"
   - name: DOCKER_ENABLED
     value: "true"
   - name: WAIT_FOR_DOCKER_SECONDS
     value: "180"
   ```

2. Scale the total number of runners to more than roughly 200.
3. Run any pipeline that executes any kind of `docker pull/build/etc` command (a minimal reproducer is sketched after this list).
   - 3.1. With fewer than roughly 200 runners, only 1 or 2 runners fail per week because of Docker.
   - 3.2. With more than roughly 200 runners, around a third of the runners fail and their jobs have to be re-run.
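A minimal workflow of the kind described in step 3 would look roughly like the sketch below. This is only an illustration: the workflow name, the `runs-on` label (assumed from the failing runner name `build-s-99pst-runner-bnhsl` reported further down) and the pulled image are placeholders, not the exact pipelines we run.

```yaml
# Hypothetical reproducer: any job that talks to the Docker daemon hits the error.
# "build-s" is an assumption based on the failing runner name in this report;
# substitute the name of any scale set deployed by the HelmRelease below.
name: docker-daemon-repro
on: workflow_dispatch
jobs:
  docker-check:
    runs-on: build-s
    steps:
      - name: Talk to the Docker daemon
        run: |
          docker info
          docker pull alpine:3.20
```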
Describe the bug
Pipeline steps fail with a "Cannot connect to the Docker daemon" error, i.e. the job complains that Docker is not running inside the runner pod.
Describe the expected behavior
One of these two things should happen:
- The job runs without issues.
- The runner is auto-killed by the wait-for-Docker check (`DISABLE_WAIT_FOR_DOCKER` is set to `"false"`), instead of picking up a job without a working Docker daemon (see the sketch after this list).
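For illustration only, behaviour equivalent to the second option could also be enforced at the pod level with a `startupProbe` on the runner container, as sketched below. This is an assumption-laden sketch, not something we currently deploy: it assumes the `docker` CLI in the runner image can reach the dind sidecar's daemon, and the threshold values simply mirror `WAIT_FOR_DOCKER_SECONDS`.

```yaml
# Hypothetical addition to the runner container in template.spec of the
# HelmRelease values below: kill the runner container if dockerd never
# becomes reachable, instead of letting it pick up jobs that will fail.
template:
  spec:
    containers:
      - name: runner
        startupProbe:
          exec:
            command: ["docker", "info"]  # succeeds only once the daemon answers
          periodSeconds: 10
          failureThreshold: 18           # ~180s, mirroring WAIT_FOR_DOCKER_SECONDS
```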
Additional Context
```yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta2
kind: HelmRelease
metadata:
  name: gha-runner-scale-set-${name}
  namespace: flux-system
  labels:
    app.kubernetes.io/component: runner-scale-set
spec:
  targetNamespace: ${namespace}
  releaseName: ${name}
  chart:
    spec:
      chart: gha-runner-scale-set
      version: ${arc_version:=0.9.3}
      sourceRef:
        kind: HelmRepository
        name: gha-runner-scale-set
        namespace: flux-system
  interval: 30m
  install:
    crds: CreateReplace
  upgrade:
    crds: CreateReplace
  values:
    minRunners: ${min_runners:=1}
    maxRunners: ${max_runners:=2}
    githubConfigUrl: https://github.com/factorialco
    githubConfigSecret: actions-runner-secrets
    runnerGroup: ${runner_group}
    containerMode:
      type: dind
    listenerTemplate:
      metadata:
        annotations:
          prometheus.io/scrape: "true"
          prometheus.io/port: "8080"
          prometheus.io/path: "/metrics"
      spec:
        containers:
          - name: listener
    template:
      spec:
        shareProcessNamespace: true
        releaseName: runner-scale-set-${name}
        restartPolicy: Never
        initContainers:
          - name: clone-factorial-repository
            image: ${image:=mirror.gcr.io/factorialdx/actions-runner:2.320.0-runner-setv2}
            volumeMounts:
              - mountPath: /home/runner/_work
                name: work
              - mountPath: /home/runner/cache
                name: cache
              - mountPath: /scripts
                name: clone-factorial-repository-script
            command: ["/scripts/clone-factorial-repository.sh"]
            envFrom:
              - secretRef:
                  name: actions-runner-secrets
        containers:
          - name: runner
            securityContext:
              privileged: true
            imagePullPolicy: IfNotPresent
            image: ${image:=mirror.gcr.io/factorialdx/actions-runner:2.320.0-runner-setv2}
            command: ["/home/runner/run.sh"]
            env:
              - name: STARTUP_DELAY_IN_SECONDS
                value: "10"
              - name: DISABLE_WAIT_FOR_DOCKER
                value: "false"
              - name: DOCKER_ENABLED
                value: "true"
              - name: WAIT_FOR_DOCKER_SECONDS
                value: "180"
            resources:
              requests:
                cpu: "1"
                memory: "8Gi"
              limits:
                memory: "16Gi"
            volumeMounts:
              - name: work
                mountPath: /home/runner/_work
              - mountPath: /tmp
                name: tmp
              - mountPath: /home/runner/cache
                name: cache
        volumes:
          - name: work
            emptyDir: {}
          - name: tmp
            emptyDir:
              medium: Memory
          - name: cache
            hostPath:
              path: /cache
          - name: clone-factorial-repository-script
            configMap:
              name: clone-factorial-repository
              defaultMode: 0777
```
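For completeness, the `${...}` placeholders in the HelmRelease above are filled in by Flux variable substitution. The sketch below shows roughly how such values might be supplied; every name and number in it is a placeholder, and our real Kustomization is not relevant to this report.

```yaml
# Hypothetical Flux Kustomization fragment supplying the ${...} variables
# used in the HelmRelease above. All values shown here are placeholders.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: runner-scale-sets
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: infrastructure
  path: ./runners
  prune: true
  postBuild:
    substitute:
      name: build-s
      namespace: actions-runners
      runner_group: default
      min_runners: "10"
      max_runners: "300"
```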
Controller Logs
Runner name to search for in the logs: build-s-99pst-runner-bnhsl
https://gist.github.com/snavarro-factorial/ee965f37114d0ac4589169012cc098a6
Runner Pod Logs
Exported in CSV format:
https://gist.github.com/snavarro-factorial/796fba24ba5c7f854d3b95f04b636021