Description
Describe the bug
Deployment on an EKS (Kubernetes) cluster fails intermittently with the error:
failed to serve and listen","error":"listen tcp :4311: bind: address already in use
This port is used by the Fluent Bit Kubernetes filter (it is the default value of the aws_pod_association_port field).
The error comes from here - https://github.com/aws/amazon-cloudwatch-agent/blob/main/extension/server/extension.go#L127
It is most likely caused by this line - https://github.com/aws/amazon-cloudwatch-agent/blob/main/extension/server/extension.go#L110 This does not seem to wait for the port to be freed.
It should be replaced by server.Shutdown(...), although the whole approach seems heavyweight for reloading certs (shutting down and restarting the server).
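To illustrate the suggestion, here is a minimal sketch of a reload that calls Shutdown, waits for Serve to return, and only then binds the port again. It is not the agent's code: the restartable type, its fields, and the demo function are hypothetical names made up for this example, and the handler is a placeholder.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"net/http"
	"time"
)

// restartable is a hypothetical wrapper, not the agent's actual type.
type restartable struct {
	addr string
	srv  *http.Server
	done chan error // receives the result of Serve for the current server
}

// start binds the port synchronously so a bind failure reaches the caller.
func (r *restartable) start() error {
	ln, err := net.Listen("tcp", r.addr)
	if err != nil {
		return fmt.Errorf("listen on %s: %w", r.addr, err)
	}
	r.addr = ln.Addr().String() // keep the resolved port when addr ended in ":0"
	r.srv = &http.Server{Handler: http.NotFoundHandler()}
	r.done = make(chan error, 1)
	go func(srv *http.Server, done chan<- error) {
		done <- srv.Serve(ln)
	}(r.srv, r.done)
	return nil
}

// reload stops the current server, waits until Serve has returned
// (at which point the listener is closed), and then binds the port again.
func (r *restartable) reload(timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	if err := r.srv.Shutdown(ctx); err != nil {
		return fmt.Errorf("shutdown: %w", err)
	}
	if err := <-r.done; err != nil && !errors.Is(err, http.ErrServerClosed) {
		return fmt.Errorf("serve: %w", err)
	}
	return r.start()
}

// demo starts a server on a free loopback port and reloads it on the same port.
func demo() error {
	r := &restartable{addr: "127.0.0.1:0"}
	if err := r.start(); err != nil {
		return err
	}
	before := r.addr
	if err := r.reload(2 * time.Second); err != nil {
		return err
	}
	if r.addr != before {
		return fmt.Errorf("address changed from %s to %s", before, r.addr)
	}
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	return r.srv.Shutdown(ctx)
}

func main() {
	if err := demo(); err != nil {
		fmt.Println("reload failed:", err)
		return
	}
	fmt.Println("reload ok")
}
```

The point of the sketch is the ordering: the new listener is created only after the old Serve call has returned, and a failed bind is returned from reload rather than logged in the background.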
Steps to reproduce
Run an EKS cluster with the CloudWatch agent. A similar issue (with the errors coming from AWS Fluent Bit) is reported here - aws/amazon-cloudwatch-agent-operator#269
[filter:kubernetes:kubernetes.1] no upstream connections available to cloudwatch-agent.amazon-cloudwatch:4311
What did you expect to see?
No errors. If the server cannot be reloaded, the failure should not just be logged inside a goroutine (the method returns no error) while the caller carries on as if everything is fine - https://github.com/aws/amazon-cloudwatch-agent/blob/main/extension/server/extension.go#L127
At a minimum, there should be a retry that checks whether the port is available before the server is started in the goroutine.
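A rough sketch of such a retry, assuming the bind is done synchronously before the serving goroutine is started, could look like the following. The helper name listenWithRetry, its parameters, and the demo function are hypothetical and are not part of the agent.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// listenWithRetry is a hypothetical helper: it tries to bind addr up to
// `attempts` times, sleeping `delay` between tries, and returns the last
// error instead of only logging it.
func listenWithRetry(addr string, attempts int, delay time.Duration) (net.Listener, error) {
	if attempts < 1 {
		attempts = 1
	}
	var lastErr error
	for i := 0; i < attempts; i++ {
		ln, err := net.Listen("tcp", addr)
		if err == nil {
			return ln, nil
		}
		lastErr = err
		if i < attempts-1 {
			time.Sleep(delay)
		}
	}
	return nil, fmt.Errorf("listen on %s failed after %d attempts: %w", addr, attempts, lastErr)
}

// demo holds a port, checks that the helper reports the failure, then
// releases the port in the background and checks that a retry succeeds.
func demo() error {
	held, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	addr := held.Addr().String()

	// While the port is held, the helper must return an error.
	if ln, err := listenWithRetry(addr, 2, 10*time.Millisecond); err == nil {
		ln.Close()
		held.Close()
		return fmt.Errorf("expected a bind error while %s is held", addr)
	}

	// Release the port shortly afterwards; a later attempt should succeed.
	go func() {
		time.Sleep(50 * time.Millisecond)
		held.Close()
	}()
	ln, err := listenWithRetry(addr, 20, 25*time.Millisecond)
	if err != nil {
		return err
	}
	return ln.Close()
}

func main() {
	if err := demo(); err != nil {
		fmt.Println("retry demo failed:", err)
		return
	}
	fmt.Println("retry demo ok")
}
```

The returned listener would then be handed to the server's Serve call in the goroutine, so the only errors left to log there are ones that happen after a successful bind.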
What did you see instead?
The server should either be restarted when reloadServer is called, or an error should be returned. Neither currently happens.
What version did you use?
latest
What config did you use?
default
Environment
OS: linux