Runners on EKS with an EFS volume in K8s-mode can't start a job pod. #3885

Open
@sierrasoleil

Description

Controller Version

0.10.1

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Deploy a kubernetes-mode runner in EKS using CONTAINER_HOOKS, with EFS as an RWX storage volume (see the StorageClass sketch below).
2. Run a job that only uses the runner container; everything works fine.
3. Run the same job but add the `container:` key to the workflow; the runner pod never gets past `Pending`.
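
For context, the RWX volume in step 1 is provided by the AWS EFS CSI driver; a `gh-efs-sc` StorageClass of roughly this shape is assumed (illustrative sketch only, with a placeholder fileSystemId):

```yaml
# Sketch of the StorageClass backing "gh-efs-sc": aws-efs-csi-driver with
# dynamic provisioning via EFS access points. The fileSystemId is a placeholder.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gh-efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap
  fileSystemId: fs-xxxxxxxxxxxxxxxxx   # placeholder
  directoryPerms: "700"
```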

Describe the bug

First and foremost, has anyone successfully used EFS for the `_work` volume in kubernetes-mode runners? I can't find any examples, so maybe that approach is just wrong. I don't know of any other readily available CSI driver for EKS that supports RWX, which I understand is required for kubernetes mode.
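
As I understand it, the chart's documented way to request the work volume in kubernetes mode is the `containerMode` block (the one I have commented out in my values below); with an EFS-backed class it would look roughly like this, illustrative values only:

```yaml
containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteMany"]
    storageClassName: "gh-efs-sc"
    resources:
      requests:
        storage: 10Gi
```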

All runners, successful or not, show a few error events while waiting for the EFS volume to become available.

Warning  FailedScheduling  33s   default-scheduler  0/8 nodes are available: waiting for ephemeral volume controller to create the persistentvolumeclaim "arc-amd-8jt4l-runner-kll9r-work". preemption: 0/8 nodes are available: 8 Preemption is not helpful for scheduling.

I guess EFS is just slow, but I don't know why that would prevent the runner from starting at all.

Describe the expected behavior

I expected a runner with the `container:` key to create a job pod using that container.
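
Concretely, my mental model is that the hook should spin up a separate workflow pod that mounts the same work claim as the runner (which is why RWX matters). Roughly like the sketch below; the pod name, container name, and mount path are approximations, not taken from a real run:

```yaml
# Approximate shape of the job pod I expected the k8s container hook to create.
apiVersion: v1
kind: Pod
metadata:
  name: arc-amd-8jt4l-runner-kll9r-workflow   # "<runner pod name>-workflow"
spec:
  serviceAccountName: github-runner           # injected by the hook extension
  containers:
    - name: job                               # runs the workflow's `container:` image
      image: alpine:latest
      volumeMounts:
        - name: work
          mountPath: /__w                     # work directory inside the job container
  volumes:
    - name: work
      persistentVolumeClaim:
        claimName: arc-amd-8jt4l-runner-kll9r-work   # same RWX claim the runner mounts
```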

Additional Context

My Runner definition:

githubConfigSecret: github-auth
githubConfigUrl: <url>

controllerServiceAccount:
  namespace: gh-controller
  name: github-arc

# containerMode:
#   kubernetesModeWorkVolumeClaim:
#     accessModes: ["ReadWriteOnce"]

template:
  spec:
    nodeSelector:
      beta.kubernetes.io/arch: amd64
    serviceAccountName: github-runner
    # securityContext:
    #   fsGroup: 1001
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        #image: 823996030995.dkr.ecr.us-west-2.amazonaws.com/github-runner-robust:amd64
        command: ["/home/runner/run.sh"]
        env:
          - name: ACTIONS_RUNNER_CONTAINER_HOOKS
            value: /home/runner/k8s/index.js
          - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
            value: /home/runner/config/hook-extension.yaml
          - name: ACTIONS_RUNNER_POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
            value: "false"
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
          - name: hook-extension
            mountPath: /home/runner/config/hook-extension.yaml
            subPath: hook-extension.yaml
    volumes:
      - name: work
        ephemeral:
          volumeClaimTemplate:
            spec:
              accessModes: ["ReadWriteMany"]
              storageClassName: "gh-efs-sc"
              resources:
                requests:
                  storage: 10Gi
      - name: hook-extension
        configMap:
          name: hook-extension
          items:
            - key: content
              path: hook-extension.yaml


The hook extension only adds a serviceAccountName to the job pod:

apiVersion: v1
kind: ConfigMap
metadata:
  name: hook-extension
data:
  content: |
    spec:
      serviceAccountName: github-runner


The following job will work:

name: Actions Runner Controller
on:
  workflow_dispatch:
jobs:
  Base-Runner:
    runs-on: arc-amd
    #container:
    #  image: alpine:latest
    steps:
      - run: echo "hooray!"


However, if I uncomment `container:` and `image:` (shown below), the runner pod gets stuck at `Pending` and never even creates the job pod.
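
For completeness, the failing variant is just the same workflow with the job container enabled:

```yaml
name: Actions Runner Controller
on:
  workflow_dispatch:
jobs:
  Base-Runner:
    runs-on: arc-amd
    container:
      image: alpine:latest
    steps:
      - run: echo "hooray!"
```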

It's worth noting the commented-out `fsGroup:` key: that setting previously got the runner to work, but after some CSI updates it became a problem.
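
If permissions on the EFS volume turn out to be the underlying problem, the same hook extension could presumably carry a pod-level securityContext for the job pod as well. This is an untested sketch that just reuses the fsGroup value from the commented-out runner securityContext:

```yaml
# Untested sketch: extend the hook template so the job pod (not the runner pod)
# gets the fsGroup from the commented-out securityContext above.
apiVersion: v1
kind: ConfigMap
metadata:
  name: hook-extension
data:
  content: |
    spec:
      serviceAccountName: github-runner
      securityContext:
        fsGroup: 1001
```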

Controller Logs

The controller logs just show this every minute or so while the pod is pending:

2025-01-14T20:52:05Z    INFO    EphemeralRunnerSet      Ephemeral runner counts {"version": "0.10.1", "ephemeralrunnerset": {"name":"arc-amd-8jt4l","namespace":"gh-runners"}, "pending": 1, "running": 0, "finished": 0, "failed": 0, "deleting": 0}
2025-01-14T20:52:05Z    INFO    EphemeralRunnerSet      Scaling comparison      {"version": "0.10.1", "ephemeralrunnerset": {"name":"arc-amd-8jt4l","namespace":"gh-runners"}, "current": 1, "desired": 1}
2025-01-14T20:52:05Z    INFO    AutoscalingRunnerSet    Find existing ephemeral runner set      {"version": "0.10.1", "autoscalingrunnerset": {"name":"arc-amd","namespace":"gh-runners"}, "name": "arc-amd-8jt4l", "specHash": "76b6bcbfbb"}

Runner Pod Logs

The runner pod never reaches a point where it can produce logs.

Metadata

    Labels

    bug (Something isn't working) · gha-runner-scale-set (Related to the gha-runner-scale-set mode) · needs triage (Requires review from the maintainers)
