Workflow pods take ~3 minutes to start after the runner pod on RWX & containerMode: kubernetes #3834
Description
Checks
- I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- I am using charts that are officially provided
Controller Version
0.9.3
Deployment Method
Helm
Checks
- This isn't a question or user support case (For Q&A and community support, go to Discussions).
- I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
Set up an ARC runner scale set with containerMode: kubernetes.
Use an NFS-based StorageClass (Azure Files) to back the runner work volumes.
Build a Docker image via GitHub Actions using kaniko.
Describe the bug
After the runner pod initializes (which is fairly immediate), the GitHub Actions jobs (6 of them) get stuck polling for 2-3 minutes before the workflow pod spins up and the job continues.
The runner pod logs show each job polling every 5-10 seconds for 2-3 minutes before the container hook is invoked and the workflow pod is created.
See lines 6-52 in the scale set logs gist below; this line is logged every few seconds:
[WORKER 2024-12-03 19:21:58Z INFO HostContext] Well known directory 'Root': '/home/runner'
This bug started occurring when we switched to RWX with a new storage class backed by NFS-based Azure Files. I suspect the cause is the slower PVC provisioning with Azure Files compared to our previous disk-based RWO setup.
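To narrow the delay down to PVC provisioning, something like the following should show where the time goes (a sketch; the arc-runners namespace is a placeholder for the scale set namespace):

# Watch the ephemeral work PVC go from Pending to Bound as a job starts
kubectl get pvc -n arc-runners -w

# Event timestamps separate provisioning time from attach/mount time
kubectl get events -n arc-runners --sort-by=.lastTimestamp | grep -iE 'provision|attach|mount'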
Describe the expected behavior
After the runner pod initializes for a new GitHub Actions job, the workflow pod should spin up near-immediately to process the Docker build for each job.
Additional Context
Here is the ARC runner scale set configuration (excerpt from our Terraform-managed Helm values):
initContainers:
  - name: kube-init
    image: ghcr.io/actions/actions-runner:latest
    command: ["/bin/sh", "-c"]
    args:
      - |
        sudo chown -R ${local.github_runner_user_gid}:123 /home/runner/_work
    volumeMounts:
      - name: work
        mountPath: /home/runner/_work
securityContext:
  fsGroup: 123 ## needed to resolve permission issues with the mounted volume: https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors#error-access-to-the-path-homerunner_work_tool-is-denied
containers:
  - name: runner
    image: ghcr.io/actions/actions-runner:latest
    command: ["/home/runner/run.sh"]
    env:
      - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
        value: /home/runner/pod-templates/default.yml
      - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
        value: "false" ## To allow jobs without a job container to run, set ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER to false on your runner container. This instructs the runner to disable this check.
      - name: ACTIONS_RUNNER_USE_KUBE_SCHEDULER # Flag enables separate scheduling for worker pods
        value: "true"
    volumeMounts:
      - name: pod-templates
        mountPath: /home/runner/pod-templates
        readOnly: true
volumes:
  - name: work
    ephemeral:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteMany"]
          storageClassName: ${local.storage_class_name}
          resources:
            requests:
              storage: ${local.volume_claim_size}
  - name: pod-templates
    configMap:
      name: "runner-pod-template"
containerMode:
  type: "kubernetes" ## type can be set to dind or kubernetes
  ## the following is required when containerMode.type=kubernetes
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteMany"]
    storageClassName: ${local.storage_class_name}
    resources:
      requests:
        storage: ${local.volume_claim_size}
EOF
  ]
}
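For context on ACTIONS_RUNNER_USE_KUBE_SCHEDULER: when set to "true", the container hook no longer pins the workflow pod to the runner's node, which is why the work volume must be ReadWriteMany in the first place. A quick way to confirm where the runner pod and its workflow pod (typically suffixed -workflow) get scheduled (namespace again a placeholder):

kubectl get pods -n arc-runners -o wide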
locals {
  job_template_name = "runner-pod-template"
}

resource "kubernetes_config_map" "job_template" {
  metadata {
    name      = local.job_template_name
    namespace = local.gha_runner_namespace
  }

  data = {
    "default.yml" = yamlencode({
      apiVersion = "v1"
      kind       = "PodTemplate"
      metadata = {
        name = local.job_template_name
      }
      spec = {
        containers = [
          {
            name = "$job"
            resources = {
              requests = {
                cpu = "3000m"
              }
              limits = {
                cpu = "3000m"
              }
            }
          }
        ]
      }
    })
  }
}
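For readability, the yamlencode above should render roughly this hook pod template (illustrative output, not captured from the cluster):

apiVersion: v1
kind: PodTemplate
metadata:
  name: runner-pod-template
spec:
  containers:
    - name: $job
      resources:
        requests:
          cpu: 3000m
        limits:
          cpu: 3000m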
# GHA job
/kaniko/executor --dockerfile=".Dockerfilehere" \
  --context="${{ github.repositoryUrl }}#${{ github.ref }}#${{ github.sha }}" \
  --destination="randomcontainerregistry:taghere" \
  --use-new-run \
  --snapshot-mode=redo \
  --compressed-caching=false \
  --registry-mirror=mirror.gcr.io \
  --cache=true --cache-copy-layers=false --cache-ttl=500h \
  --push-retry 5
# Storage class
resource "kubernetes_manifest" "csi_storage_class" {
  manifest = {
    apiVersion = "storage.k8s.io/v1"
    kind       = "StorageClass"
    metadata = {
      name = "storageclassawesome"
    }
    provisioner          = "file.csi.azure.com"
    allowVolumeExpansion = true
    parameters = {
      resourceGroup  = "yup"
      storageAccount = "yup"
      skuName        = "Premium_LRS"
      location       = "sdfsf"
      server         = "test.net"
    }
    reclaimPolicy     = "Delete"
    volumeBindingMode = "Immediate"
    mountOptions = [
      "dir_mode=0777",
      "file_mode=0777",
      "uid=1000",
      "gid=1000",
      "mfsymlinks",
      "cache=strict",
      "nosharesock",
      "actimeo=30",
    ]
  }
}
Controller Logs
ARC Controller & Scaleset Logs: https://gist.github.com/jonathan-fileread/fd0978bef66784e20d6b50bce50cd3b9
Runner Pod Logs
ARC Controller & Scaleset Logs: https://gist.github.com/jonathan-fileread/fd0978bef66784e20d6b50bce50cd3b9