
Conversation

@SeanKilleen

SeanKilleen commented Nov 19, 2025

ℹ️ NOTE: This fix was generated in concert with Claude AI, but with me in the driver's seat, starting from my own hunch, working through analysis, and then looking at the possible fixes. I do not have deep background in node/TS, but I have 20 years of experience and I also hate AI slop. 😄 Blunt feedback on the PR is welcome.

❗ Important Announcements


⚠️ Please Note: We do not accept all types of pull requests, and we want to ensure we don’t waste your time. Before submitting, make sure you have read our pull request guidelines: Pull Request Rules

🚫 Please Avoid Unnecessary Pinging of Maintainers

We kindly ask you to refrain from pinging maintainers unless absolutely necessary. Pings are for critical/urgent pull requests that require immediate attention.

📋 Overview

🛠️ Type of change

  • 🐛 Bugfix (a non-breaking change that resolves an issue)
  • ✨ New feature (a non-breaking change that adds new functionality)
  • ⚠️ Breaking change (a fix or feature that alters existing functionality in a way that could cause issues)
  • 🎨 User Interface (UI) updates
  • 📄 New Documentation (addition of new documentation)
  • 📄 Documentation Update (modification of existing documentation)
  • 📄 Documentation Update Required (the change requires updates to related documentation)
  • 🔧 Other (please specify):
    • Provide additional details here.

📄 Checklist

  • 🔍 My code adheres to the style guidelines of this project.
  • 🦿 I have indicated where (if any) I used an LLM for the contributions
  • ✅ I ran ESLint and other code linters for modified files.
  • 🛠️ I have reviewed and tested my code.
  • 📝 I have commented my code, especially in hard-to-understand areas (e.g., using JSDoc for methods).
  • ⚠️ My changes generate no new warnings.
  • 🤖 My code needed automated testing, and I have added tests (this is an optional task).
  • 📄 Documentation updates are included (if applicable).
  • 🔒 I have considered potential security impacts and mitigated risks.
  • 🧰 Dependency updates are listed and explained.
  • 📚 I have read and understood the Pull Request guidelines.

📷 Screenshots or Visual Changes

N/A

Copilot AI review requested due to automatic review settings November 19, 2025 20:35
Contributor

Copilot AI left a comment


Pull Request Overview

This PR addresses HTTP agent resource leaks that cause timeouts after prolonged operation by implementing agent reuse and proper cleanup. Instead of creating new HTTP/HTTPS agents for every request, the code now creates reusable agents per monitor instance with keepAlive disabled to prevent connection pooling issues.

Key Changes:

  • Added reusable HTTP/HTTPS agent instance variables to the Monitor class
  • Modified HTTP/HTTPS/Steam/Docker monitor types to reuse agents instead of creating new ones per request
  • Implemented agent cleanup in the stop() method to prevent resource leaks
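
A rough sketch of the pattern described above (class and field names here are illustrative, not the exact diff; the axios call stands in for whatever HTTP client the monitor type actually uses):

```js
// Hypothetical sketch: one agent pair per monitor instance, reused across
// checks, destroyed when the monitor stops. keepAlive stays disabled.
const http = require("http");
const https = require("https");
const axios = require("axios");

class Monitor {
    start() {
        // Created once per monitor instead of once per request.
        this.httpAgent = new http.Agent({ keepAlive: false });
        this.httpsAgent = new https.Agent({ keepAlive: false });
    }

    async check(url) {
        // Reuse the instance agents rather than constructing new ones per request.
        return axios.get(url, {
            httpAgent: this.httpAgent,
            httpsAgent: this.httpsAgent,
        });
    }

    stop() {
        // Release any sockets the agents still hold when the monitor stops.
        if (this.httpAgent) {
            this.httpAgent.destroy();
            this.httpAgent = null;
        }
        if (this.httpsAgent) {
            this.httpsAgent.destroy();
            this.httpsAgent = null;
        }
    }
}
```

The important part is that the agents live exactly as long as the monitor does and are destroyed once, in stop().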

@SeanKilleen
Author

I am in the process of looking to test this locally -- I am brand new to the project so it may take a minute. Feedback is welcome!

SeanKilleen changed the title from "Resolve HTTP agent leak" to "Fix: Resolve HTTP agent leak via agent disposal / management" on Nov 19, 2025
@CommanderStorm
Collaborator

So what was the result of the testing?

@SeanKilleen
Author

SeanKilleen commented Nov 20, 2025

@CommanderStorm I'm doing this as a side-side-quest for my day job and have never touched this repo before, so it's going to take a little while to build, push, and use the Docker container in our existing setup amongst my other priorities.

Hopefully a maintainer or others can weigh in on this PR in the interim.

@louislam
Owner

louislam commented Nov 20, 2025

Just read #275 again. I have the same concern that @steveiliop56 raised in #275 (comment).

I have also checked the Node.js Doc. https://nodejs.org/api/http.html#new-agentoptions

agent.destroy()
It is usually not necessary to do this. However, if using an agent with keepAlive enabled, then it is best to explicitly shut down the agent when it is no longer needed.

keepAlive Default: false.

  1. We have never enabled keepAlive, so why is destroy() still needed?
  2. I think you have to prove it, maybe using a minimal JavaScript script to compare behaviour both with and without reusing the http agent object.

And actually I am not so familiar with http.Agent either. Recreating http.Agent looks more stateless to me. When reusing the same http.Agent, I have no idea whether it would have any state changes inside the agent that affect upcoming requests.
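
To make it concrete, the two patterns I am comparing look roughly like this (just an illustration, with keepAlive left at its default of false in both):

```js
const https = require("https");

// Pattern A (current code): a fresh agent per request, never destroyed.
// With keepAlive: false its sockets close after the response, and the agent
// object itself is eventually garbage-collected.
function checkWithNewAgent(url) {
    const agent = new https.Agent({ keepAlive: false });
    https.get(url, { agent }, (res) => res.resume());
}

// Pattern B (this PR): one long-lived agent shared by every request,
// destroyed only when the monitor stops.
const sharedAgent = new https.Agent({ keepAlive: false });
function checkWithSharedAgent(url) {
    https.get(url, { agent: sharedAgent }, (res) => res.resume());
}
```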

@louislam
Owner

louislam commented Nov 20, 2025

And since you mentioned Claude AI: did the idea of reusing the agent and calling agent.destroy() come purely from Claude, rather than from your own testing?

If yes, I think we better stop here.

louislam added the "pr:please address review comments" label (this PR needs a bit more work to be mergeable) on Nov 20, 2025
SeanKilleen changed the title from "Fix: Resolve HTTP agent leak via agent disposal / management" to "Fix: Decrease pressure via agent disposal / management" on Nov 20, 2025
@SeanKilleen
Author

SeanKilleen commented Nov 20, 2025

@louislam

So we have never enabled keepAlive, why is destroy() still needed?

Great catch. I was taking a belt-and-suspenders approach; that one's on me, I hadn't read the docs thoroughly enough. I was coming at it with a .NET mindset of disposing any IDisposable.

But from those same docs, right after the part you quoted:

Otherwise, sockets might stay open for quite a long time before the server terminates them.

So I'm thinking enough sockets could be kept open longer than they should be, causing an issue. This is a strange situation in that I'm attempting to resolve a performance-related symptom where the root cause is difficult to diagnose, so I'm looking for a non-harmful practice that could also plausibly fix things, and this stuck out.

To be clear, I'm also shifting my thinking to agree that this is less about memory (and perhaps less about sockets) and more about pressure from object accumulation over a long time. I'm not seeing memory pressure on our container at all.

My new theory is that enough of these agents are being created that we hit socket exhaustion faster than they're cleaned up, or that GC pauses get longer due to object accumulation. And since it's efficient and fine to re-use them, nothing seems to be hurt by this change.

I did work with a colleague to create a small test harness (I don't think it makes sense to include it in the larger codebase, but let me know if you want it). It shows that we definitely don't accumulate agents anymore with this PR, though I think that much is obvious from the code change.
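
For reference, a rough sketch of the kind of harness I mean (not the exact script; the local server, iteration count, and output format are illustrative):

```js
const http = require("http");

// Local server so the harness is self-contained.
const server = http.createServer((req, res) => res.end("ok"));

function get(url, agent) {
    return new Promise((resolve, reject) => {
        http.get(url, { agent }, (res) => {
            res.resume();
            res.on("end", resolve);
        }).on("error", reject);
    });
}

async function run(pattern, iterations, url) {
    let agentsCreated = 0;
    const shared = new http.Agent({ keepAlive: false });

    for (let i = 0; i < iterations; i++) {
        let agent;
        if (pattern === "per-request") {
            // Current behaviour: a new agent for every check.
            agent = new http.Agent({ keepAlive: false });
            agentsCreated++;
        } else {
            // PR behaviour: one agent reused for every check.
            agent = shared;
            agentsCreated = 1;
        }
        await get(url, agent);
    }

    const heapMb = (process.memoryUsage().heapUsed / 1024 / 1024).toFixed(1);
    console.log(`${pattern}: ${agentsCreated} agent(s) created, heapUsed ~ ${heapMb} MB`);
    shared.destroy();
}

server.listen(0, async () => {
    const url = `http://127.0.0.1:${server.address().port}/`;
    await run("per-request", 5000, url);
    await run("reused", 5000, url);
    server.close();
});
```

The agent counter makes the accumulation difference obvious; the heap number is there only as a rough secondary signal.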


If you'd like, I can work to deploy this PR into our production environment and let it run for some time. That might be the ultimate proof -- putting it in a production scenario and seeing if the symptom is resolved.

@SeanKilleen
Author

SeanKilleen commented Nov 20, 2025

I've deployed this to an experimental container under my dockerhub and we are now using it in our pre-prod and prod environments. I'll report back on findings. This can stay open in the interim.

Based on past experience, it takes about 18 hours for the symptom to show up. I'll let this run for a few days, and if symptoms don't show up, I think we'll have a strong signal on this.

@SeanKilleen
Author

Observations from production so far:

  • We've received far fewer of the 48000ms timeout errors, but the ones we did receive followed an interesting pattern. Rather than the continuous rolling issue we previously saw, a one-time issue happened at 12:41am where 12 monitors hit it at once and then recovered 1 minute later. I haven't restarted anything yet; I'm watching to see whether it happens again or whether that was a one-off fluke.
    • The end-user experience is definitely preferable to the current continual rolling false-positives.
  • I've noticed a side effect of my change: in one instance, an agent that was being re-used "went bad" (I'm looking to diagnose this further). The agent was monitoring a website in front of an Azure Application Gateway. It received a 403 and was marked as down. The 403 is surprising, which makes me think something about how the agent initially established the HTTP connection with the Application Gateway went stale (possibly related to keepAlive = false). When I paused the monitor and then resumed it, the problem cleared (presumably because the HTTP agent was recreated). So I'll want to look into that as part of this PR, even if we proceed.

@SeanKilleen
Author

Additional observations from production:

  • No other 48000ms timeout errors except for that one small batch that resolved itself yesterday
  • No other issues with monitors "going bad" or being unable to recover

So I do think we have a strong signal that this is, at a minimum, helping with the symptom. Going to keep monitoring / reporting for a few more days.


Labels

pr:please address review comments (this PR needs a bit more work to be mergeable)


Development

Successfully merging this pull request may close these issues.

Lots of 'timeout of 48000ms exceeded' during the day, but the site is still up?
