Checks
Controller Version
0.14.1 (gha-runner-scale-set-controller)
Deployment Method
Helm
Checks
To Reproduce
1. Run ARC with replicaCount=2 on a multi-node cluster
2. Power off the node running the listener pod
3. Observe listener pod stuck Terminating forever
4. Check AutoscalingListener CR - no deletionTimestamp is ever set
5. Check controller logs - zero activity on AutoscalingListener after startup
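The steps above can be verified with commands like these (namespace and resource names are from our setup; substitute your own):

```shell
# Watch where the listener pod lands before and after the node power-off
kubectl get pods -n arc-systems -o wide -w

# Confirm the AutoscalingListener CR was never marked for deletion
kubectl get autoscalinglistener -n arc-systems \
  -o custom-columns=NAME:.metadata.name,DELETION:.metadata.deletionTimestamp

# Controller activity for the listener (expect: nothing after the node failure)
kubectl logs deploy/arc-gha-rs-controller -n arc-systems | grep -i autoscalinglistener
```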
Describe the bug
ARC Version
0.14.1 (gha-runner-scale-set-controller)
Kubernetes Version
v1.28.5+k3s1
Description
When a worker node running a listener pod is powered off or crashes, the
AutoscalingListener CR never gets a deletionTimestamp set. The ARC controller
does not initiate cleanup or recreation of the listener, resulting in permanent
loss of job pickup for that scale set until manual intervention.
Even after 5 minutes, the listener pod is not rescheduled onto another worker node; it remains stuck in Terminating.
Describe the expected behavior
Expected Behavior
The ARC controller should detect that the listener pod is gone due to node failure and either:
- Sets deletionTimestamp on the AutoscalingListener CR and recreates it, OR
- Has a health check timeout that triggers listener recreation
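The second option could be as simple as the decision sketched below. This is purely illustrative (the function name, inputs, and 300-second threshold are my assumptions, not ARC's actual code): recreate the listener once its node has been NotReady past a timeout.

```shell
#!/bin/sh
# Hypothetical sketch of the health-check timeout requested above.
# Threshold and inputs are illustrative assumptions, not ARC's real logic.
TIMEOUT_SECONDS=300

# needs_recreation NODE_READY ELAPSED_SECONDS
# Prints "recreate" once the listener's node has reported Ready="False"
# for at least TIMEOUT_SECONDS; otherwise prints "wait".
needs_recreation() {
  node_ready="$1"
  elapsed="$2"
  if [ "$node_ready" = "False" ] && [ "$elapsed" -ge "$TIMEOUT_SECONDS" ]; then
    echo "recreate"
  else
    echo "wait"
  fi
}
```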
Additional Context
## Evidence
kubectl get autoscalinglistener <name> -n arc-systems \
-o jsonpath='{.metadata.deletionTimestamp}'
# Returns empty — no deletion ever requested
Controller logs show only EphemeralRunnerSet activity, zero AutoscalingListener
reconciliation after the node failure.
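The only mitigation we found is manual, along these lines (standard Kubernetes handling for pods stuck on a dead node; the deployment name is taken from the pod listing below, and whether ARC then recreates the listener is exactly the behavior this issue asks for):

```shell
# Force-remove the listener pod stuck Terminating on the powered-off node
kubectl delete pod xcp-baremetal-runner-ha-764b4bd4-listener \
  -n arc-systems --force --grace-period=0

# If the controller still does not recreate the listener, restart it to
# force a fresh reconcile (heavy-handed, but unblocks job pickup)
kubectl rollout restart deploy/arc-gha-rs-controller -n arc-systems
```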
Controller Logs
https://gist.github.com/vikasraut499/83b0d07133cf4ddd7d612206b08312fc
Runner Pod Logs
root@control-plane-ha-2:~/tetrate/cloud/baremetal/terraform# kubectl get pods -n arc-systems -o wide -w
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
arc-gha-rs-controller-69697bf97-nnl8m 1/1 Running 0 69m 10.42.5.81 worker-2-ha-setup <none> <none>
arc-gha-rs-controller-69697bf97-r2h5m 1/1 Terminating 0 62m 10.42.3.222 worker-1-ha-setup <none> <none>
arc-gha-rs-controller-69697bf97-tqvcm 1/1 Running 0 57m 10.42.5.82 worker-2-ha-setup <none> <none>
monorepo-baremetal-runner-ha-764b4bd4-listener 1/1 Running 0 70m 10.42.5.80 worker-2-ha-setup <none> <none>
xcp-baremetal-runner-ha-764b4bd4-listener 1/1 Terminating 0 70m 10.42.3.217 worker-1-ha-setup <none> <none>
xcp-baremetal-runner-ha-rlr6j-runner-47vps 2/2 Terminating 0 70m 10.42.3.218 worker-1-ha-setup <none> <none>
xcp-baremetal-runner-ha-rlr6j-runner-fxqz2 2/2 Terminating 0 70m 10.42.3.220 worker-1-ha-setup <none> <none>
xcp-baremetal-runner-ha-rlr6j-runner-j4rvd 2/2 Terminating 0 70m 10.42.3.219 worker-1-ha-setup <none> <none>
xcp-baremetal-runner-ha-rlr6j-runner-vzn95 2/2 Running 0 38m 10.42.5.83 worker-2-ha-setup <none> <none>
^Croot@control-plane-ha-2:~/tetrate/cloud/baremetal/terraform#
^Croot@control-plane-ha-2:~/tetrate/cloud/baremetal/terraform# kubectl logs xcp-baremetal-runner-ha-rlr6j-runner-j4rvd -n arc-systems
Defaulted container "runner" out of: runner, dind, init-dind-externals (init), init-harbor-creds (init)
Error from server: Get "https://172.16.0.89:10250/containerLogs/arc-systems/xcp-baremetal-runner-ha-rlr6j-runner-j4rvd/runner": proxy error from 127.0.0.1:6443 while dialing 172.16.0.89:10250, code 502: 502 Bad Gateway