Description
Making a new issue after seeing this comment in the last issue on this topic: #282 (comment)
Specs
On Flux v2.2, with the image-update-automation pod running the image image-automation-controller:v0.37.0, for image.toolkit.fluxcd.io/v1beta1 automation resources.
Description
The issue we saw was that the image-automation-controller stopped reconciling all ImageUpdateAutomation resources in the cluster for about 5 hours. When it resumed, it then stopped reconciling just 1 ImageUpdateAutomation resource out of the many in the cluster.
To resolve it, we had to delete the image-automation-controller pod; after the restart, everything worked fine.
The following telemetry screenshots show CPU usage dropping off entirely, log volume dropping off entirely, and no errors produced in logs. There were no pod restarts / crashloops or any k8s events for the pods.
![Screenshot 2024-05-29 at 2 53 47 PM](https://private-user-images.githubusercontent.com/17187192/334927471-32cc1380-d862-4f2a-9fdb-a80e5ce0ec51.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzg5MzY2MDUsIm5iZiI6MTczODkzNjMwNSwicGF0aCI6Ii8xNzE4NzE5Mi8zMzQ5Mjc0NzEtMzJjYzEzODAtZDg2Mi00ZjJhLTlmZGItYTgwZTVjZTBlYzUxLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMDclMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjA3VDEzNTE0NVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTY5NWIyM2FlY2QyNDZlMGM3ZmNjZjNiODkzYmI3MGQ0M2YyZTY0ZGQ4Y2U0ZjNjMjA0NTg4NDYxZGE5YTI4MGMmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.cVWIXeIpydyb6Urwn9RjEbEL8j7BWSHfbGpo4HvXDsc)
![Screenshot 2024-05-29 at 2 54 30 PM](https://private-user-images.githubusercontent.com/17187192/334927636-41e6661d-4056-4cdc-9db0-56ba4de0d5a9.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzg5MzY2MDUsIm5iZiI6MTczODkzNjMwNSwicGF0aCI6Ii8xNzE4NzE5Mi8zMzQ5Mjc2MzYtNDFlNjY2MWQtNDA1Ni00Y2RjLTlkYjAtNTZiYTRkZTBkNWE5LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMDclMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjA3VDEzNTE0NVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWYzYTUxNGNkZWNiM2UwMjhmZmIwZTNhNmI0NTE4N2NjNTdkNjMxNDk4MmM0ZjA0OWY2ZmM1MDlhN2EzMDgxYmUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.bvoLC_kE9SDsOCZ3G4Jw8_X3IMFs3JawD6z-iRwNivs)
I've attached a debug/pprof/goroutine?debug=2 profile dump to this issue as well.
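In case it helps anyone triaging a similar dump, here is a minimal sketch for summarizing a `goroutine?debug=2` text dump by goroutine state and listing goroutines that have been blocked for a long time. It assumes the usual header format of such dumps (for example `goroutine 42 [select, 301 minutes]:`). The function name `summarize` and the 60-minute threshold are hypothetical choices for illustration, not part of the controller or of Flux.

```python
import re
from collections import Counter

# Assumed header format of a debug=2 goroutine dump, e.g.:
#   goroutine 1 [running]:
#   goroutine 42 [select, 301 minutes]:
HEADER = re.compile(r"^goroutine (\d+) \[([^\],]+)(?:, (\d+) minutes)?[^\]]*\]:")


def summarize(dump, min_minutes=60):
    """Return (count of goroutines per state, list of long-blocked goroutines).

    min_minutes is an illustrative threshold: goroutines whose header reports
    at least this many minutes in their current state are listed as blocked.
    """
    states = Counter()
    blocked = []
    for line in dump.splitlines():
        match = HEADER.match(line)
        if match is None:
            continue  # stack frame or blank line, not a goroutine header
        goroutine_id = int(match.group(1))
        state = match.group(2)
        minutes = int(match.group(3) or 0)  # no duration means under a minute
        states[state] += 1
        if minutes >= min_minutes:
            blocked.append((goroutine_id, state, minutes))
    return states, blocked
```

Calling `summarize(open("goroutine.txt").read())` on a saved dump (the filename is an assumption) returns the per-state counts and the `(id, state, minutes)` tuples for goroutines blocked past the threshold, which could narrow down where a stalled reconcile loop is waiting.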
Please let me know if there are any further details I can provide. Thank you!