
Fix various leaks #115

Merged
merged 2 commits into Comcast:main from fix-leaks on Feb 11, 2025
Conversation

@Arno500 (Contributor) commented Feb 6, 2025

When running fishymetrics in our own infrastructure we noticed that the memory usage was greatly increasing (at a steady pace)
as you can see here:
[Memory usage graph] Left: deployed as-is; middle: after SystemD memory accounting and limiting; right: with the fix.

The specificity of our configuration is that we are scraping around 6000 servers on this instance (there are actually two load-balanced instances, so let's say 3000 on average) and that a non-negligible portion is in error (either because of broken BMC firmware that cannot be properly parsed by Fishymetrics, or simply because they are unreachable). There were two issues:

  1. A channel is awaited during the scraping, but if, for whatever reason, there was a fetching/parsing error while handling the task, we were left with a leaking goroutine, repeated at every scrape (see the sketch below).
  2. Some response bodies were not closed after requests (due to the structure of the code), and in some cases the body was not fully consumed, preventing connections from being reused (see the associated commit; Prometheus had a similar issue in the past, sketched further down).

With those two fixes, we get the memory usage under 3GB for those 3000 servers, which seems much more reasonable to me :)
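To illustrate issue 1, here is a minimal, hypothetical sketch of the leak pattern and one way to close it (illustrative names only, not the actual fishymetrics code): the goroutine communicating over the channel also selects on the request context, so neither side blocks forever when the other bails out on an error.

```go
// Hypothetical illustration of the goroutine-leak fix, not the fishymetrics code.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

type result struct {
	value string
	err   error
}

// scrapeTarget runs the fetch/parse in a goroutine and waits for the result.
// The buffered channel plus the select on ctx.Done() on both sides is what
// prevents the leak: if either the worker or the caller gives up early, the
// other side is not left blocked on the channel forever.
func scrapeTarget(ctx context.Context) (string, error) {
	ch := make(chan result, 1) // buffered: the send never blocks even if nobody reads

	go func() {
		v, err := fetchAndParse(ctx)
		select {
		case ch <- result{value: v, err: err}:
		case <-ctx.Done(): // caller already gave up; exit instead of leaking
		}
	}()

	select {
	case r := <-ch:
		return r.value, r.err
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// fetchAndParse stands in for the real BMC fetch/parse; it fails here to
// simulate a broken firmware response.
func fetchAndParse(ctx context.Context) (string, error) {
	return "", errors.New("unparsable BMC response")
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if _, err := scrapeTarget(ctx); err != nil {
		fmt.Println("scrape failed:", err)
	}
}
```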

@CLAassistant commented Feb 6, 2025

CLA assistant check
All committers have signed the CLA.

@Arno500 (Contributor, Author) commented Feb 6, 2025

(I'm currently trying to sort out my pro account so I can sign the CLA.)

@Arno500 force-pushed the fix-leaks branch 2 times, most recently from efec6d9 to 53cc995 on February 6, 2025, 14:28
@ibrahimkk-moideen (Collaborator) left a comment

@Arno500 pls look into the changes requested.

The two commit messages:

Similar to this: prometheus/prometheus@f2ca36e
It should increase connection reuse and heavily reduce CPU usage in case of errors, as well as fix some memory leaks.

When there was an error during the scraping, a chan could be left waiting for a response forever if the export of metrics failed. That would cause a pretty big memory leak.
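For context, here is a rough sketch of the drain-and-close pattern that the referenced Prometheus commit applies (hypothetical helper and URL, not the actual fishymetrics diff): the response body is always drained and closed, even on error paths, so the underlying connection can go back into the keep-alive pool instead of a new one being dialed on every scrape.

```go
// Hypothetical sketch of the drain-and-close pattern, not the actual diff.
package main

import (
	"fmt"
	"io"
	"net/http"
)

func fetchJSON(client *http.Client, url string) ([]byte, error) {
	resp, err := client.Get(url)
	if err != nil {
		return nil, err
	}
	defer func() {
		// Drain any unread bytes, then close. Without the drain, the
		// connection cannot be reused for the next request.
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
	}()

	if resp.StatusCode != http.StatusOK {
		// Returning early here is now safe: the deferred drain/close still runs.
		return nil, fmt.Errorf("unexpected status %d from %s", resp.StatusCode, url)
	}
	return io.ReadAll(resp.Body)
}

func main() {
	if _, err := fetchJSON(http.DefaultClient, "http://example.invalid/redfish/v1"); err != nil {
		fmt.Println("request failed:", err)
	}
}
```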
@ibrahimkk-moideen (Collaborator) left a comment

LGTM.

@ibrahimkk-moideen ibrahimkk-moideen merged commit ebe969b into Comcast:main Feb 11, 2025
3 checks passed