
"Cannot connect to the Docker daemon" errors appear more frequently as we deploy more runners #3828

Open
@snavarro-factorial

Description

Controller Version

0.9.3

Deployment Method

ArgoCD

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

  1. Set the following variables on the runner (intended to prevent the error from appearing):
- name: STARTUP_DELAY_IN_SECONDS
  value: "10"
- name: DISABLE_WAIT_FOR_DOCKER
  value: "false"
- name: DOCKER_ENABLED
  value: "true"
- name: WAIT_FOR_DOCKER_SECONDS
  value: "180"
  2. Scale the total number of runners to more than around 200.
  3. Run any pipeline that executes a "docker pull/build/etc." command.
    3.1. With fewer than around 200 runners, only 1 or 2 runners fail per week because of Docker.
    3.2. With more than around 200 runners, around a third of the runners fail and jobs have to be re-run.

Describe the bug

Pipeline step complains that Docker is not running:
[screenshot: timestamped_error]

Describe the expected behavior

It should do one of these two things:

  1. Run without issues.
  2. Auto-kill the runner via the wait-for-Docker check (since DISABLE_WAIT_FOR_DOCKER is "false").
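The behaviour expected in option 2 can be sketched as a small poll loop. This is a hedged illustration only, not the actual actions-runner implementation: `wait_for_cmd` is a hypothetical helper, and using "docker info" as the probe command is an assumption.

```shell
#!/bin/sh
# Hypothetical sketch of the wait-for-Docker check the runner is expected
# to perform before accepting a job (NOT the actual actions-runner code).
# Polls a probe command until it succeeds or a timeout expires.
wait_for_cmd() {
  cmd="$1"                # command to poll, e.g. "docker info" (assumption)
  timeout="${2:-180}"     # seconds to wait, cf. WAIT_FOR_DOCKER_SECONDS
  elapsed=0
  until $cmd >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "daemon not ready after ${timeout}s; giving up" >&2
      return 1            # expected: the runner is killed here instead of taking a job
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  echo "daemon ready after ${elapsed}s"
}

# Hypothetical usage in a runner entrypoint:
# wait_for_cmd "docker info" "${WAIT_FOR_DOCKER_SECONDS:-180}" || exit 1
```

With a loop like this, a runner whose dind sidecar never comes up would exit within the timeout instead of picking up a job and failing it mid-pipeline.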

Additional Context

apiVersion: helm.toolkit.fluxcd.io/v2beta2
kind: HelmRelease
metadata:
  name: gha-runner-scale-set-${name}
  namespace: flux-system
  labels:
    app.kubernetes.io/component: runner-scale-set
spec:
  targetNamespace: ${namespace}
  releaseName: ${name}
  chart:
    spec:
      chart: gha-runner-scale-set
      version: ${arc_version:=0.9.3}
      sourceRef:
        kind: HelmRepository
        name: gha-runner-scale-set
        namespace: flux-system
  interval: 30m
  install:
    crds: CreateReplace
  upgrade:
    crds: CreateReplace
  values:
    minRunners: ${min_runners:=1}
    maxRunners: ${max_runners:=2}
    githubConfigUrl: https://github.com/factorialco
    githubConfigSecret: actions-runner-secrets
    runnerGroup: ${runner_group}
    containerMode:
      type: dind
    listenerTemplate:
      metadata:
        annotations:
          prometheus.io/scrape: "true"
          prometheus.io/port: "8080"
          prometheus.io/path: "/metrics"
      spec:
        containers:
          - name: listener
    template:
      spec:
        shareProcessNamespace: true
        releaseName: runner-scale-set-${name}
        restartPolicy: Never
        initContainers:
          - name: clone-factorial-repository
            image: ${image:=mirror.gcr.io/factorialdx/actions-runner:2.320.0-runner-setv2}
            volumeMounts:
              - mountPath: /home/runner/_work
                name: work
              - mountPath: /home/runner/cache
                name: cache
              - mountPath: /scripts
                name: clone-factorial-repository-script
            command: ["/scripts/clone-factorial-repository.sh"]
            envFrom:
              - secretRef:
                  name: actions-runner-secrets
        containers:
          - name: runner
            securityContext:
              privileged: true
            imagePullPolicy: IfNotPresent
            image: ${image:=mirror.gcr.io/factorialdx/actions-runner:2.320.0-runner-setv2}
            command: ["/home/runner/run.sh"]
            env:
              - name: STARTUP_DELAY_IN_SECONDS
                value: "10"
              - name: DISABLE_WAIT_FOR_DOCKER
                value: "false"
              - name: DOCKER_ENABLED
                value: "true"
              - name: WAIT_FOR_DOCKER_SECONDS
                value: "180"
            resources:
              requests:
                cpu: "1"
                memory: "8Gi"
              limits:
                memory: "16Gi"
            volumeMounts:
              - name: work
                mountPath: /home/runner/_work
              - mountPath: /tmp
                name: tmp
              - mountPath: /home/runner/cache
                name: cache
        volumes:
          - name: work
            emptyDir: {}
          - name: tmp
            emptyDir:
              medium: Memory
          - name: cache
            hostPath:
              path: /cache
          - name: clone-factorial-repository-script
            configMap:
              name: clone-factorial-repository
              defaultMode: 0777

Controller Logs

Runner name to copy and search for in the logs: build-s-99pst-runner-bnhsl
https://gist.github.com/snavarro-factorial/ee965f37114d0ac4589169012cc098a6

Runner Pod Logs

Exported in CSV format:
https://gist.github.com/snavarro-factorial/796fba24ba5c7f854d3b95f04b636021

Metadata

Assignees

No one assigned

Labels

bug: Something isn't working
gha-runner-scale-set: Related to the gha-runner-scale-set mode
needs triage: Requires review from the maintainers
