Skip to content

AutoscalingListener not recreated after node failure - CR never gets deletionTimestamp #4469

@vikasraut499

Description

@vikasraut499

Checks

Controller Version

0.14.1 (gha-runner-scale-set-controller)

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

## Steps to Reproduce
  1. Run ARC with replicaCount=2 on a multi-node cluster
  2. Power off the node running the listener pod
  3. Observe listener pod stuck Terminating forever
  4. Check AutoscalingListener CR - no deletionTimestamp is ever set
  5. Check controller logs - zero activity on AutoscalingListener after startup

Describe the bug

ARC Version

0.14.1 (gha-runner-scale-set-controller)

Kubernetes Version

v1.28.5+k3s1

Description

When a worker node running a listener pod is powered off or crashes, the
AutoscalingListener CR never gets a deletionTimestamp set. The ARC controller
does not initiate cleanup or recreation of the listener, resulting in permanent
loss of job pickup for that scale set until manual intervention.

Even after 5 minutes as well, listener pod is not scheduled on another worker node. It was in still terminating mode.

Describe the expected behavior

Expected Behavior

ARC controller detects listener pod is gone due to node failure and either:

  • Sets deletionTimestamp on the AutoscalingListener CR and recreates it, OR
  • Has a health check timeout that triggers listener recreation

Additional Context

## Evidence
  kubectl get autoscalinglistener <name> -n arc-systems \
    -o jsonpath='{.metadata.deletionTimestamp}'
  # Returns empty — no deletion ever requested

  Controller logs show only EphemeralRunnerSet activity, zero AutoscalingListener
  reconciliation after the node failure.

Controller Logs

https://gist.github.com/vikasraut499/83b0d07133cf4ddd7d612206b08312fc

Runner Pod Logs

root@control-plane-ha-2:~/tetrate/cloud/baremetal/terraform# kubectl get pods -n arc-systems -o wide -w
NAME                                             READY   STATUS        RESTARTS   AGE   IP            NODE                NOMINATED NODE   READINESS GATES
arc-gha-rs-controller-69697bf97-nnl8m            1/1     Running       0          69m   10.42.5.81    worker-2-ha-setup   <none>           <none>
arc-gha-rs-controller-69697bf97-r2h5m            1/1     Terminating   0          62m   10.42.3.222   worker-1-ha-setup   <none>           <none>
arc-gha-rs-controller-69697bf97-tqvcm            1/1     Running       0          57m   10.42.5.82    worker-2-ha-setup   <none>           <none>
monorepo-baremetal-runner-ha-764b4bd4-listener   1/1     Running       0          70m   10.42.5.80    worker-2-ha-setup   <none>           <none>
xcp-baremetal-runner-ha-764b4bd4-listener        1/1     Terminating   0          70m   10.42.3.217   worker-1-ha-setup   <none>           <none>
xcp-baremetal-runner-ha-rlr6j-runner-47vps       2/2     Terminating   0          70m   10.42.3.218   worker-1-ha-setup   <none>           <none>
xcp-baremetal-runner-ha-rlr6j-runner-fxqz2       2/2     Terminating   0          70m   10.42.3.220   worker-1-ha-setup   <none>           <none>
xcp-baremetal-runner-ha-rlr6j-runner-j4rvd       2/2     Terminating   0          70m   10.42.3.219   worker-1-ha-setup   <none>           <none>
xcp-baremetal-runner-ha-rlr6j-runner-vzn95       2/2     Running       0          38m   10.42.5.83    worker-2-ha-setup   <none>           <none>
^Croot@control-plane-ha-2:~/tetrate/cloud/baremetal/terraform#

^Croot@control-plane-ha-2:~/tetrate/cloud/baremetal/terraform# kubectl logs xcp-baremetal-runner-ha-rlr6j-runner-j4rvd -n arc-systems
Defaulted container "runner" out of: runner, dind, init-dind-externals (init), init-harbor-creds (init)
Error from server: Get "https://172.16.0.89:10250/containerLogs/arc-systems/xcp-baremetal-runner-ha-rlr6j-runner-j4rvd/runner": proxy error from 127.0.0.1:6443 while dialing 172.16.0.89:10250, code 502: 502 Bad Gateway
root@control-plane-ha-2:~/tetrate/cloud/baremetal/terraform# kubectl logs xcp-baremetal-runner-ha-rlr6j-runner-j4rvd -n arc-systems
Defaulted container "runner" out of: runner, dind, init-dind-externals (init), init-harbor-creds (init)
Error from server: Get "https://172.16.0.89:10250/containerLogs/arc-systems/xcp-baremetal-runner-ha-rlr6j-runner-j4rvd/runner": proxy error from 127.0.0.1:6443 while dialing 172.16.0.89:10250, code 502: 502 Bad Gateway
r

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workinggha-runner-scale-setRelated to the gha-runner-scale-set modeneeds triageRequires review from the maintainers

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions