Checks
- I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- I am using charts that are officially provided
Controller Version
0.10.1
Deployment Method
Helm
Checks
- This isn't a question or user support case (For Q&A and community support, go to Discussions).
- I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
Start jobs on the runner scale set and wait for them to complete.
Describe the bug
Sometimes, the runner pods continue running in zombie mode after completing their jobs.
Describe the expected behavior
Runner pods should be terminated after job completion.
Additional Context
gha-runner-scale-set-controller:
  enabled: true
  flags:
    logLevel: "warn"
  podLabels:
    finops.company.net/cloud_provider: gcp
    finops.company.net/cost_center: compute
    finops.company.net/product: tools
    finops.company.net/service: actions-runner-controller
    finops.company.net/region: europe-west1
  replicaCount: 3
  podAnnotations:
    ad.datadoghq.com/manager.checks: |
      {
        "openmetrics": {
          "instances": [
            {
              "openmetrics_endpoint": "http://%%host%%:8080/metrics",
              "histogram_buckets_as_distributions": true,
              "namespace": "actions-runner-system",
              "metrics": [".*"]
            }
          ]
        }
      }
  metrics:
    controllerManagerAddr: ":8080"
    listenerAddr: ":8080"
    listenerEndpoint: "/metrics"

gha-runner-scale-set:
  enabled: true
  githubConfigUrl: https://github.com/company
  githubConfigSecret:
    github_token: <path:secret/github_token/actions_runner_controller#token>
  maxRunners: 100
  minRunners: 1
  containerMode:
    type: "dind"  ## type can be set to dind or kubernetes
  listenerTemplate:
    metadata:
      labels:
        finops.company.net/cloud_provider: gcp
        finops.company.net/cost_center: compute
        finops.company.net/product: tools
        finops.company.net/service: actions-runner-controller
        finops.company.net/region: europe-west1
      annotations:
        ad.datadoghq.com/listener.checks: |
          {
            "openmetrics": {
              "instances": [
                {
                  "openmetrics_endpoint": "http://%%host%%:8080/metrics",
                  "histogram_buckets_as_distributions": true,
                  "namespace": "actions-runner-system",
                  "max_returned_metrics": 6000,
                  "metrics": [".*"],
                  "exclude_metrics": [
                    "gha_job_startup_duration_seconds",
                    "gha_job_execution_duration_seconds"
                  ],
                  "exclude_labels": [
                    "enterprise",
                    "event_name",
                    "job_name",
                    "job_result",
                    "job_workflow_ref",
                    "organization",
                    "repository",
                    "runner_name"
                  ]
                }
              ]
            }
          }
    spec:
      containers:
        - name: listener
          securityContext:
            runAsUser: 1000
  template:
    metadata:
      labels:
        finops.company.net/cloud_provider: gcp
        finops.company.net/cost_center: compute
        finops.company.net/product: tools
        finops.company.net/service: actions-runner-controller
        finops.company.net/region: europe-west1
    spec:
      restartPolicy: OnFailure
      imagePullSecrets:
        - name: company-prod-registry
      containers:
        - name: runner
          image: eu.gcr.io/company-production/devex/gha-runners:v1.0.0-snapshot5
          command: ["/home/runner/run.sh"]
  controllerServiceAccount:
    namespace: actions-runner-system
    name: actions-runner-controller-gha-rs-controller
Controller Logs
Date,Host,Service,Message
"2025-01-29T15:16:06.017Z","""node_name""","""manager""","Ephemeral runner container is still running"
"2025-01-29T15:15:52.677Z","""node_name""","""manager""","Ephemeral runner container is still running"
"2025-01-29T15:15:52.671Z","""node_name""","""manager""","Updated ephemeral runner status with pod phase"
"2025-01-29T15:15:52.657Z","""node_name""","""manager""","Updating ephemeral runner status with pod phase"
"2025-01-29T15:15:52.657Z","""node_name""","""manager""","Ephemeral runner container is still running"
"2025-01-29T15:15:51.652Z","""node_name""","""manager""","Ephemeral runner container is still running"
"2025-01-29T15:15:49.690Z","""node_name""","""manager""","Ephemeral runner container is still running"
"2025-01-29T15:15:48.461Z","""node_name""","""manager""","Ephemeral runner container is still running"
"2025-01-29T15:15:48.456Z","""node_name""","""manager""","Updated ephemeral runner status with pod phase"
"2025-01-29T15:15:48.440Z","""node_name""","""manager""","Updating ephemeral runner status with pod phase"
"2025-01-29T15:15:48.440Z","""node_name""","""manager""","Ephemeral runner container is still running"
"2025-01-29T15:15:48.424Z","""node_name""","""manager""","Waiting for runner container status to be available"
"2025-01-29T15:15:48.399Z","""node_name""","""manager""","Created ephemeral runner pod"
"2025-01-29T15:15:48.367Z","""node_name""","""manager""","Created new pod spec for ephemeral runner"
"2025-01-29T15:15:48.366Z","""node_name""","""manager""","Creating new pod for ephemeral runner"
"2025-01-29T15:15:48.366Z","""node_name""","""manager""","Creating new EphemeralRunner pod."
"2025-01-29T15:15:48.361Z","""node_name""","""manager""","Created ephemeral runner secret"
"2025-01-29T15:15:48.313Z","""node_name""","""manager""","Created new secret spec for ephemeral runner"
"2025-01-29T15:15:48.313Z","""node_name""","""manager""","Creating new secret for ephemeral runner"
"2025-01-29T15:15:48.313Z","""node_name""","""manager""","Creating new ephemeral runner secret for jitconfig."
"2025-01-29T15:15:48.308Z","""node_name""","""manager""","Updated ephemeral runner status with runnerId and runnerJITConfig"
"2025-01-29T15:15:48.294Z","""node_name""","""manager""","Updating ephemeral runner status with runnerId and runnerJITConfig"
"2025-01-29T15:15:48.294Z","""node_name""","""manager""","Created ephemeral runner JIT config"
"2025-01-29T15:15:48.093Z","""node_name""","""manager""","Creating ephemeral runner JIT config"
"2025-01-29T15:15:48.093Z","""node_name""","""manager""","Creating new ephemeral runner registration and updating status with runner config"
"2025-01-29T15:15:48.093Z","""node_name""","""manager""","Successfully added runner registration finalizer"
"2025-01-29T15:15:48.076Z","""node_name""","""manager""","Adding runner registration finalizer"
"2025-01-29T15:15:48.076Z","""node_name""","""manager""","Successfully added finalizer"
"2025-01-29T15:15:48.059Z","""node_name""","""manager""","Adding finalizer"
Runner Pod Logs
https://gist.github.com/julien-michaud/ce2a1e5c5d494d89e09453f0b270a26f
Activity
AblionGE commented on Jan 30, 2025
Hi @julien-michaud ,
I experience the same behaviour (I'm using the containerMode kubernetes).
I checked the processes of one of these instances and the steps were done (no more processes from the workflow) but the container stays there doing... nothing.
I encounter this issue especially when I have long-running commands that don't write to the output (terraform plan with a resource that is "huge" to compute (4-5 min locally to plan this specific resource without writing anything to the output) or terraform/terragrunt validate on a "big" repository).
prizov commented on Jan 31, 2025
Hi @julien-michaud 👋
We discovered an issue in our environment. It has exactly the same symptoms as yours, although we have a different version of the runner and the controller in our setup. It turned out that some processes (node applications) inside the runner container got stuck in the D state (uninterruptible sleep), and thus the runner pods weren't terminated properly.
We reached out to GCP support, and they confirmed a regression introduced with the Container-Optimized OS (COS) versions between cos-113-18244-236-26 and cos-113-18244-236-70.
Here is what they suggested:
Based on the configuration you shared, I assume you're also running the runners on GKE. I hope this helps!
julien-michaud commented on Feb 3, 2025
Thanks a lot for the info @prizov!
We just upgraded to cos-113-18244-236-77 🤞
FabioGentile commented on Feb 7, 2025
Thanks for the comment, we have been facing the same issue and it helped us with the troubleshooting.
A note: we have been facing the issue also on other COS versions, namely cos-113-18244-236-77 and cos-117-18613-75-66. I've asked GCP to confirm this and provide a list of affected COS versions.
We were able to (temporarily) solve the issue by:
- downgrading to 1.30.5-gke.1713000 / cos-113-18244-236-5
OR
- setting the UV_USE_IO_URING=0 env var when running node install commands (a sketch of this is shown below)
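A minimal sketch of what the env-var workaround can look like in a GitHub Actions workflow; the workflow, job, and step names are placeholders, and only the UV_USE_IO_URING=0 variable and the idea of applying it to node install commands come from this thread:

# Hypothetical workflow snippet illustrating the workaround: the job-level
# env makes Node/libuv-based tooling skip io_uring on the affected COS images.
name: build
on: push

jobs:
  install:
    runs-on: my-runner-scale-set   # placeholder scale set name
    env:
      UV_USE_IO_URING: "0"         # workaround discussed in this thread
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - name: Install dependencies
        run: npm ci                # the kind of node install command that was hanging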
The root cause seems to be a kernel bug in the io_uring call, and it hit GKE clusters on the STABLE channel only now due to the (I guess) slower release cycle.
arouxAgicap commented on Feb 7, 2025
Thanks @FabioGentile for the precision concerning node 🙏.
We encountered the same issue with our runners and followed what @prizov recommended, but with no success (currently on GKE version 1.30.8-gke.1261000 with COS version cos-113-18244-236-90).
We had to split the scale set to try to isolate the problematic workflows, and the minute we ran the workflows from our node repos the issue arose again.
We will test what you suggest (the env variable).
EDIT:
The variable worked well. We then upgraded the cluster to version 1.30.9-gke.1009000 (Regular channel) with COS version cos-113-18244-291-3, removed the vars from our workflows, and everything kept working.
mcmarkj commented on Feb 10, 2025
I'm so relieved to see this issue ticket.
I raised it with GCP about 3 weeks ago, and we probably are the customer they're referencing when they say 1.30.6-gke.1596000 fixed it for them, as that's exactly what I did a few weeks ago.
But a huge thanks to @FabioGentile for UV_USE_IO_URING=0, which fixed the issue for us; we're on the very latest GKE version v1.32.1-gke.1357001, which still has the issue, but the env var fixes it.
Scalahansolo commented on Feb 12, 2025
I'm a little confused about where to add this environment variable. Is it just setting this env variable in the Docker container that the ARC runner uses?
arouxAgicap commented on Feb 13, 2025
@Scalahansolo you have to set it in your workflows when using Node.js. In our composite action, it looked like this:
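A minimal sketch of such a composite action, assuming a simple npm install step; the action and step names are placeholders, and only the UV_USE_IO_URING=0 variable itself comes from the discussion:

# action.yml — hypothetical composite action that runs the install with
# io_uring disabled, to avoid node processes hanging in the D state.
name: "Install Node dependencies"
description: "Runs npm ci with the io_uring workaround"
runs:
  using: "composite"
  steps:
    - name: Install dependencies
      shell: bash
      env:
        UV_USE_IO_URING: "0"   # workaround for the COS io_uring kernel bug
      run: npm ci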
byterider commented on Feb 19, 2025
Thanks for the workaround, it fixed our issue with Jenkins jobs getting stuck.