
Leader election lost #2112

@rkotech

Description

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that the AWX Operator is open source software provided for free and that I might not receive a timely response.

Bug Summary

Basically the same as #808, but that issue never got an answer, so I would like to bring it up again.

With a normal awx-operator deployment, after a random amount of time (a few hours), the operator seems to lose its grip on its own monitoring: it keeps restarting itself, then gives up and stays at 1/2 in CrashLoopBackOff.

```
E0409 16:16:22.208428 7 leaderelection.go:332] error retrieving resource lock awx/awx-operator: Get "https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/awx/leases/awx-operator": context deadline exceeded
I0409 16:16:22.217943 7 leaderelection.go:285] failed to renew lease awx/awx-operator: timed out waiting for the condition
{"level":"info","ts":"2026-04-09T16:16:22Z","msg":"Stopping and waiting for non leader election runnables"}
{"level":"info","ts":"2026-04-09T16:16:22Z","msg":"Stopping and waiting for leader election runnables"}
{"level":"info","ts":"2026-04-09T16:16:22Z","msg":"Stopping and waiting for caches"}
{"level":"info","ts":"2026-04-09T16:16:22Z","msg":"Stopping and waiting for webhooks"}
{"level":"info","ts":"2026-04-09T16:16:22Z","msg":"Stopping and waiting for HTTP servers"}
{"level":"info","ts":"2026-04-09T16:16:22Z","msg":"Wait completed, proceeding to shutdown the manager"}
{"level":"info","ts":"2026-04-09T16:16:22Z","msg":"Shutdown signal received, waiting for all workers to finish","controller":"awx-controller"}
{"level":"info","ts":"2026-04-09T16:16:22Z","msg":"Shutdown signal received, waiting for all workers to finish","controller":"awxrestore-controller"}
{"level":"info","ts":"2026-04-09T16:16:22Z","msg":"Shutdown signal received, waiting for all workers to finish","controller":"awxmeshingress-controller"}
{"level":"info","ts":"2026-04-09T16:16:22Z","msg":"Shutdown signal received, waiting for all workers to finish","controller":"awxbackup-controller"}
{"level":"error","ts":"2026-04-09T16:16:22Z","logger":"cmd","msg":"Proxy or operator exited with error.","error":"leader election lost","stacktrace":"github.com/operator-framework/ansible-operator-plugins/internal/cmd/ansible-operator/run.run\n\tansible-operator-plugins/internal/cmd/ansible-operator/run/cmd.go:261\ngithub.com/operator-framework/ansible-operator-plugins/internal/cmd/ansible-operator/run.NewCmd.func1\n\tansible-operator-plugins/internal/cmd/ansible-operator/run/cmd.go:81\ngithub.com/spf13/cobra.(*Command).execute\n\t/home/runner/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:987\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/home/runner/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1115\ngithub.com/spf13/cobra.(*Command).Execute\n\t/home/runner/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1039\nmain.main\n\tansible-operator-plugins/cmd/ansible-operator/main.go:40\nruntime.main\n\t/opt/hostedtoolcache/go/1.20.12/x64/src/runtime/proc.go:250"}
```
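
For what it's worth, the Lease object the operator fails to renew here can be inspected directly. A minimal diagnostic sketch, assuming the namespace and lease name from the log above, using the same `kubernetes.core` collection as the reproduce steps below:

```yaml
- name: Inspect the operator's leader-election Lease
  kubernetes.core.k8s_info:
    api_version: coordination.k8s.io/v1
    kind: Lease
    name: awx-operator
    namespace: awx
  register: lease_info

- name: Show holder identity and last renew time
  ansible.builtin.debug:
    var: lease_info.resources
```

If `spec.renewTime` stops advancing right before the restarts, that confirms the operator really is losing the lease rather than being killed for some other reason.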

AWX Operator version

2.19.1

AWX version

24.6.1

Kubernetes platform

minikube

Kubernetes/Platform version

1.38.1

Modifications

no

Steps to reproduce

Simple deployment with the Ansible Helm chart:

```yaml
- name: Deploy AWX Operator, patience pls.
  kubernetes.core.helm:
    name: awx-operator
    chart_ref: awx-operator
    chart_repo_url: https://ansible-community.github.io/awx-operator-helm/
    release_namespace: "{{ awx_namespace }}"
    create_namespace: true
    wait: true

...

- name: deploy AWX pods
  kubernetes.core.k8s:
    state: present
    wait: true
    definition: "{{ lookup('template', 'awx-manifest.yml.j2') | from_yaml }}"
```

Expected results

I would expect the operator to keep running and to be able to serve its monitoring (health) endpoints continuously.

Actual results

Instead, after some time it fails.

Pod describe:

```
Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  20m                  default-scheduler  Successfully assigned awx/awx-operator-controller-manager-5f468697-r5trg to awx-mini
  Normal   Pulled     20m                  kubelet            Container image "quay.io/brancz/kube-rbac-proxy:v0.15.0" already present on machine and can be accessed by the pod
  Normal   Created    20m                  kubelet            Container created
  Normal   Started    20m                  kubelet            Container started
  Warning  Unhealthy  14m                  kubelet            Liveness probe failed: Get "http://10.244.0.22:6789/healthz": EOF
  Warning  Unhealthy  12m (x3 over 19m)    kubelet            Readiness probe failed: Get "http://10.244.0.22:6789/readyz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  12m (x3 over 19m)    kubelet            Liveness probe failed: Get "http://10.244.0.22:6789/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  3m25s (x2 over 12m)  kubelet            Readiness probe failed: Get "http://10.244.0.22:6789/readyz": EOF
  Warning  BackOff    111s (x16 over 18m)  kubelet            Back-off restarting failed container awx-manager in pod awx-operator-controller-manager-5f468697-r5trg_awx(b3c159ee-057a-49c5-8128-776564327dcc)
  Normal   Pulled     39s (x7 over 20m)    kubelet            Container image "quay.io/ansible/awx-operator:2.19.1" already present on machine and can be accessed by the pod
  Normal   Created    38s (x7 over 20m)    kubelet            Container created
  Normal   Started    38s (x7 over 20m)    kubelet            Container started
```

get pods:

```
awx   awx-migration-24.6.1-r95kc                       0/1   Completed          0             16h
awx   awx-operator-controller-manager-5f468697-r5trg   1/2   CrashLoopBackOff   3 (35s ago)   9m5s
awx   awx-task-7ccf57f75c-gfw4z                        4/4   Running            0             16h
awx   awx-web-5d78fb7c79-2wg57                         3/3   Running            0             16h
```

Additional information

Could it be that the liveness probe and readiness probe timeouts are too low (1 second)?
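
Kubernetes does default a probe's `timeoutSeconds` to 1 when it is unset, so as a workaround the timeouts could be raised without forking the chart by patching the deployment. A hedged sketch, assuming the deployment and container names from the events above (`awx-operator-controller-manager` / `awx-manager`); note that a Helm upgrade may revert such a patch:

```yaml
- name: Raise manager probe timeouts (workaround sketch)
  kubernetes.core.k8s:
    state: patched            # strategic merge patch on the existing object
    api_version: apps/v1
    kind: Deployment
    name: awx-operator-controller-manager
    namespace: "{{ awx_namespace }}"
    definition:
      spec:
        template:
          spec:
            containers:
              - name: awx-manager  # containers are merged by name
                livenessProbe:
                  timeoutSeconds: 5
                readinessProbe:
                  timeoutSeconds: 5
```

If raising the timeouts only delays the crash, the underlying API latency (which also breaks the lease renewal) is probably the real culprit.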

Operator Logs

```
...

{"level":"info","ts":"2026-04-10T08:37:56Z","logger":"proxy","msg":"Read object from cache","resource":{"IsResourceRequest":true,"Path":"/apis/apps/v1/namespaces/awx/deployments/awx-task","Verb":"get","APIPrefix":"apis","APIGroup":"apps","APIVersion":"v1","Namespace":"awx","Resource":"deployments","Subresource":"","Name":"awx-task","Parts":["deployments","awx-task"]}}
{"level":"info","ts":"2026-04-10T08:37:58Z","logger":"proxy","msg":"Read object from cache","resource":{"IsResourceRequest":true,"Path":"/apis/apps/v1/namespaces/awx/deployments/awx-task","Verb":"get","APIPrefix":"apis","APIGroup":"apps","APIVersion":"v1","Namespace":"awx","Resource":"deployments","Subresource":"","Name":"awx-task","Parts":["deployments","awx-task"]}}
E0410 08:38:13.190444 7 leaderelection.go:369] Failed to update lock: client rate limiter Wait returned an error: context deadline exceeded
I0410 08:38:13.306497 7 leaderelection.go:285] failed to renew lease awx/awx-operator: timed out waiting for the condition
{"level":"info","ts":"2026-04-10T08:38:13Z","msg":"Stopping and waiting for non leader election runnables"}
{"level":"error","ts":"2026-04-10T08:38:13Z","logger":"cmd","msg":"Proxy or operator exited with error.","error":"leader election lost","stacktrace":"github.com/operator-framework/ansible-operator-plugins/internal/cmd/ansible-operator/run.run\n\tansible-operator-plugins/internal/cmd/ansible-operator/run/cmd.go:261\ngithub.com/operator-framework/ansible-operator-plugins/internal/cmd/ansible-operator/run.NewCmd.func1\n\tansible-operator-plugins/internal/cmd/ansible-operator/run/cmd.go:81\ngithub.com/spf13/cobra.(*Command).execute\n\t/home/runner/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:987\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/home/runner/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1115\ngithub.com/spf13/cobra.(*Command).Execute\n\t/home/runner/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1039\nmain.main\n\tansible-operator-plugins/cmd/ansible-operator/main.go:40\nruntime.main\n\t/opt/hostedtoolcache/go/1.20.12/x64/src/runtime/proc.go:250"}
```
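
Since the renewal failure is the client timing out against the API server ("client rate limiter Wait returned an error: context deadline exceeded"), it may be worth ruling out resource pressure on the single minikube node before blaming the operator itself. A small sketch along the same lines as the tasks above:

```yaml
- name: Fetch node status
  kubernetes.core.k8s_info:
    kind: Node
  register: node_info

- name: Show pressure-related node conditions
  ansible.builtin.debug:
    msg: >-
      {{ node_info.resources
         | map(attribute='status.conditions')
         | flatten
         | selectattr('type', 'in', ['MemoryPressure', 'DiskPressure', 'PIDPressure'])
         | list }}
```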
