Runners on EKS with an EFS volume in K8s-mode can't start a job pod. #3885
Open
Description
Checks
- I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- I am using charts that are officially provided
Controller Version
0.10.1
Deployment Method
Helm
Checks
- This isn't a question or user support case (For Q&A and community support, go to Discussions).
- I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
1. Deploy a kubernetes-mode runner in EKS using CONTAINER_HOOKS, with EFS as an RWX storage volume.
2. Run a job that uses only the runner container; everything works fine.
3. Run the same job with the `container:` key added to the workflow; the runner pod never gets past "Pending".
Describe the bug
First and foremost, has anyone successfully used EFS for the `_work` volume in kubernetes-mode runners? I can't find any examples, so maybe this approach is simply wrong. I don't know of any other readily available CSI driver for EKS that supports RWX, which I gather is required for kubernetes-mode.
All runners, successful or not, show a few error events while waiting for the EFS volume to become available:

```text
Warning FailedScheduling 33s default-scheduler 0/8 nodes are available: waiting for ephemeral volume controller to create the persistentvolumeclaim "arc-amd-8jt4l-runner-kll9r-work". preemption: 0/8 nodes are available: 8 Preemption is not helpful for scheduling.
```
I assume EFS is just slow to provision, but I don't know why that would prevent the runner from starting at all.
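For anyone trying to reproduce the diagnosis, these are the commands I use to inspect the stuck resources (a sketch against my live cluster; the PVC and pod names come from the events above, and the `efs-csi-controller` label selector assumes a standard AWS EFS CSI driver install in `kube-system`):

```shell
# Inspect the ephemeral PVC the scheduler is waiting on
kubectl -n gh-runners get pvc
kubectl -n gh-runners describe pvc arc-amd-8jt4l-runner-kll9r-work

# Check the pending runner pod's scheduling events
kubectl -n gh-runners describe pod arc-amd-8jt4l-runner-kll9r

# Confirm the EFS CSI controller itself is healthy
kubectl -n kube-system logs -l app=efs-csi-controller --tail=50
```

In my case the PVC stays `Pending` only when the job has a `container:` key.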
Describe the expected behavior
I expected a runner with the `container:` key to create a job pod using that container.
Additional Context
My Runner definition:
```yaml
githubConfigSecret: github-auth
githubConfigUrl: <url>
controllerServiceAccount:
  namespace: gh-controller
  name: github-arc
# containerMode:
#   kubernetesModeWorkVolumeClaim:
#     accessModes: ["ReadWriteOnce"]
template:
  spec:
    nodeSelector:
      beta.kubernetes.io/arch: amd64
    serviceAccountName: github-runner
    # securityContext:
    #   fsGroup: 1001
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        #image: 823996030995.dkr.ecr.us-west-2.amazonaws.com/github-runner-robust:amd64
        command: ["/home/runner/run.sh"]
        env:
          - name: ACTIONS_RUNNER_CONTAINER_HOOKS
            value: /home/runner/k8s/index.js
          - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
            value: /home/runner/config/hook-extension.yaml
          - name: ACTIONS_RUNNER_POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
            value: "false"
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
          - name: hook-extension
            mountPath: /home/runner/config/hook-extension.yaml
            subPath: hook-extension.yaml
    volumes:
      - name: work
        ephemeral:
          volumeClaimTemplate:
            spec:
              accessModes: ["ReadWriteMany"]
              storageClassName: "gh-efs-sc"
              resources:
                requests:
                  storage: 10Gi
      - name: hook-extension
        configMap:
          name: hook-extension
          items:
            - key: content
              path: hook-extension.yaml
```
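For completeness, the `gh-efs-sc` StorageClass was created roughly along these lines (a sketch only; the `fileSystemId` and parameter values here are placeholders, not my actual configuration):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gh-efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap   # dynamic provisioning via EFS access points
  fileSystemId: fs-xxxxxxxx  # placeholder
  directoryPerms: "700"
```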
The hook extension only adds a `serviceAccountName` to the worker pod:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: hook-extension
data:
  content: |
    spec:
      serviceAccountName: github-runner
```
The following job works:
```yaml
name: Actions Runner Controller
on:
  workflow_dispatch:
jobs:
  Base-Runner:
    runs-on: arc-amd
    #container:
    #  image: alpine:latest
    steps:
      - run: echo "hooray!"
```
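For clarity, this is the failing variant, identical except that the two commented lines are active:

```yaml
name: Actions Runner Controller
on:
  workflow_dispatch:
jobs:
  Base-Runner:
    runs-on: arc-amd
    container:
      image: alpine:latest
    steps:
      - run: echo "hooray!"
```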
However, if I uncomment `container:` and `image:`, the runner pod gets stuck in `Pending` and never even creates the job pod.
The commented-out `fsGroup:` key is worth noting: it previously got the runner working, but after some CSI driver updates it became a problem.
Controller Logs
The controller logs just show this every minute or so while the pod is pending:
```text
2025-01-14T20:52:05Z INFO EphemeralRunnerSet Ephemeral runner counts {"version": "0.10.1", "ephemeralrunnerset": {"name":"arc-amd-8jt4l","namespace":"gh-runners"}, "pending": 1, "running": 0, "finished": 0, "failed": 0, "deleting": 0}
2025-01-14T20:52:05Z INFO EphemeralRunnerSet Scaling comparison {"version": "0.10.1", "ephemeralrunnerset": {"name":"arc-amd-8jt4l","namespace":"gh-runners"}, "current": 1, "desired": 1}
2025-01-14T20:52:05Z INFO AutoscalingRunnerSet Find existing ephemeral runner set {"version": "0.10.1", "autoscalingrunnerset": {"name":"arc-amd","namespace":"gh-runners"}, "name": "arc-amd-8jt4l", "specHash": "76b6bcbfbb"}
```
Runner Pod Logs
The runner pod never reaches a point where it can produce logs.