
Conversation

@SeanKilleen

SeanKilleen commented Nov 19, 2025

ℹ️ NOTE: This fix was generated in concert with Claude AI, but with me in the driver's seat, starting from my own hunch, working through analysis, and then looking at the possible fixes. I do not have deep background in node/TS, but I have 20 years of experience and I also hate AI slop. 😄 Blunt feedback on the PR is welcome.

❗ Important Announcements


⚠️ Please Note: We do not accept all types of pull requests, and we want to ensure we don’t waste your time. Before submitting, make sure you have read our pull request guidelines: Pull Request Rules

🚫 Please Avoid Unnecessary Pinging of Maintainers

We kindly ask you to refrain from pinging maintainers unless absolutely necessary. Pings are for critical/urgent pull requests that require immediate attention.

📋 Overview

🛠️ Type of change

  • 🐛 Bugfix (a non-breaking change that resolves an issue)
  • ✨ New feature (a non-breaking change that adds new functionality)
  • ⚠️ Breaking change (a fix or feature that alters existing functionality in a way that could cause issues)
  • 🎨 User Interface (UI) updates
  • 📄 New Documentation (addition of new documentation)
  • 📄 Documentation Update (modification of existing documentation)
  • 📄 Documentation Update Required (the change requires updates to related documentation)
  • 🔧 Other (please specify):
    • Provide additional details here.

📄 Checklist

  • 🔍 My code adheres to the style guidelines of this project.
  • 🦿 I have indicated where (if any) I used an LLM for the contributions
  • ✅ I ran ESLint and other code linters for modified files.
  • 🛠️ I have reviewed and tested my code.
  • 📝 I have commented my code, especially in hard-to-understand areas (e.g., using JSDoc for methods).
  • ⚠️ My changes generate no new warnings.
  • 🤖 My code needed automated testing, and I have added tests (this is an optional task).
  • 📄 Documentation updates are included (if applicable).
  • 🔒 I have considered potential security impacts and mitigated risks.
  • 🧰 Dependency updates are listed and explained.
  • 📚 I have read and understood the Pull Request guidelines.

📷 Screenshots or Visual Changes

N/A

Copilot AI review requested due to automatic review settings November 19, 2025 20:35
Contributor

Copilot AI left a comment


Pull Request Overview

This PR addresses HTTP agent resource leaks that cause timeouts after prolonged operation by implementing agent reuse and proper cleanup. Instead of creating new HTTP/HTTPS agents for every request, the code now creates reusable agents per monitor instance with keepAlive disabled to prevent connection pooling issues.

Key Changes:

  • Added reusable HTTP/HTTPS agent instance variables to the Monitor class
  • Modified HTTP/HTTPS/Steam/Docker monitor types to reuse agents instead of creating new ones per request
  • Implemented agent cleanup in the stop() method to prevent resource leaks
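
A rough sketch of the pattern described above (class and field names here are illustrative, not the exact diff; the axios call stands in for whatever HTTP client the monitor type actually uses):

```js
// Hypothetical sketch: one agent pair per monitor instance, reused across
// checks, destroyed when the monitor stops. keepAlive stays disabled.
const http = require("http");
const https = require("https");
const axios = require("axios");

class Monitor {
    start() {
        // Created once per monitor instead of once per request.
        this.httpAgent = new http.Agent({ keepAlive: false });
        this.httpsAgent = new https.Agent({ keepAlive: false });
    }

    async check(url) {
        // Reuse the instance agents rather than constructing new ones per request.
        return axios.get(url, {
            httpAgent: this.httpAgent,
            httpsAgent: this.httpsAgent,
        });
    }

    stop() {
        // Release any sockets the agents still hold when the monitor stops.
        if (this.httpAgent) {
            this.httpAgent.destroy();
            this.httpAgent = null;
        }
        if (this.httpsAgent) {
            this.httpsAgent.destroy();
            this.httpsAgent = null;
        }
    }
}
```

The important part is that the agents live exactly as long as the monitor does and are destroyed once, in stop().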

@SeanKilleen
Author

I am in the process of looking to test this locally -- I am brand new to the project so it may take a minute. Feedback is welcome!

SeanKilleen changed the title from "Resolve HTTP agent leak" to "Fix: Resolve HTTP agent leak via agent disposal / management" on Nov 19, 2025
@CommanderStorm
Collaborator

So what was the result of the testing?

@SeanKilleen
Author

SeanKilleen commented Nov 20, 2025

@CommanderStorm I'm doing this as a side-side-quest for my day job and have never touched this repo before, so it's going to take a little while to build, push, and use the Docker container in our existing setup amongst my other priorities.

Hopefully a maintainer or others can weigh in on this PR in the interim.

@louislam
Owner

louislam commented Nov 20, 2025

Just read #275 again. I have the same concern that @steveiliop56 raised in #275 (comment).

I have also checked the Node.js Doc. https://nodejs.org/api/http.html#new-agentoptions

agent.destroy()
It is usually not necessary to do this. However, if using an agent with keepAlive enabled, then it is best to explicitly shut down the agent when it is no longer needed.

keepAlive Default: false.

  1. We have never enabled keepAlive, so why is destroy() still needed?
  2. I think you have to prove it, maybe using a minimal JavaScript script to compare behaviour both with and without reusing the http agent object.

And actually I am not so familiar with http.Agent either. Recreating http.Agent looks more stateless to me. When reusing the same http.Agent, I have no idea whether it would have any state changes inside the agent that affect upcoming requests.
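
To make it concrete, the two patterns I am comparing look roughly like this (just an illustration, with keepAlive left at its default of false in both):

```js
const https = require("https");

// Pattern A (current code): a fresh agent per request, never destroyed.
// With keepAlive: false its sockets close after the response, and the agent
// object itself is eventually garbage-collected.
function checkWithNewAgent(url) {
    const agent = new https.Agent({ keepAlive: false });
    https.get(url, { agent }, (res) => res.resume());
}

// Pattern B (this PR): one long-lived agent shared by every request,
// destroyed only when the monitor stops.
const sharedAgent = new https.Agent({ keepAlive: false });
function checkWithSharedAgent(url) {
    https.get(url, { agent: sharedAgent }, (res) => res.resume());
}
```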

@louislam
Owner

louislam commented Nov 20, 2025

And since you mentioned Claude AI: did the idea of reusing the agent and calling agent.destroy() come purely from Claude, rather than from your own testing?

If yes, I think we better stop here.

louislam added the "pr:please address review comments" label (this PR needs a bit more work to be mergeable) on Nov 20, 2025
SeanKilleen changed the title from "Fix: Resolve HTTP agent leak via agent disposal / management" to "Fix: Decrease pressure via agent disposal / management" on Nov 20, 2025
@SeanKilleen
Author

SeanKilleen commented Nov 20, 2025

@louislam

So we have never enabled keepAlive, why is destroy() still needed?

Great catch. I was taking a belt-and-suspenders approach; that one's on me, I hadn't read the docs thoroughly enough. I was coming at it with a .NET mindset of disposing any IDisposable.

But from those same docs, right after the part you quoted:

Otherwise, sockets might stay open for quite a long time before the server terminates them.

So I'm thinking enough sockets could be kept open longer than they should be, causing an issue. This is a strange situation in that I'm attempting to resolve a performance-related symptom where the root cause is difficult to diagnose, so I'm looking for a non-harmful practice that could also plausibly fix things, and this stuck out.

To be clear, I'm also shifting my thinking to agree that this is less about memory (and perhaps less about sockets) and more about pressure from object accumulation over a long time. I'm not seeing memory pressure on our container at all.

My new theory is that enough of these agents are being created that we hit socket exhaustion faster than they're cleaned up, or that GC pauses get longer due to object accumulation. And since it's efficient and fine to re-use them, nothing seems to be hurt by this change.

I did work with a colleague to create a small test harness (I don't think it makes sense to include it in the larger codebase, but let me know if you want it). It shows that we definitely don't accumulate agents anymore with this PR, though I think that much is obvious from the code change.
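
For reference, a rough sketch of the kind of harness I mean (not the exact script; the local server, iteration count, and output format are illustrative):

```js
const http = require("http");

// Local server so the harness is self-contained.
const server = http.createServer((req, res) => res.end("ok"));

function get(url, agent) {
    return new Promise((resolve, reject) => {
        http.get(url, { agent }, (res) => {
            res.resume();
            res.on("end", resolve);
        }).on("error", reject);
    });
}

async function run(pattern, iterations, url) {
    let agentsCreated = 0;
    const shared = new http.Agent({ keepAlive: false });

    for (let i = 0; i < iterations; i++) {
        let agent;
        if (pattern === "per-request") {
            // Current behaviour: a new agent for every check.
            agent = new http.Agent({ keepAlive: false });
            agentsCreated++;
        } else {
            // PR behaviour: one agent reused for every check.
            agent = shared;
            agentsCreated = 1;
        }
        await get(url, agent);
    }

    const heapMb = (process.memoryUsage().heapUsed / 1024 / 1024).toFixed(1);
    console.log(`${pattern}: ${agentsCreated} agent(s) created, heapUsed ~ ${heapMb} MB`);
    shared.destroy();
}

server.listen(0, async () => {
    const url = `http://127.0.0.1:${server.address().port}/`;
    await run("per-request", 5000, url);
    await run("reused", 5000, url);
    server.close();
});
```

The agent counter makes the accumulation difference obvious; the heap number is there only as a rough secondary signal.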


If you'd like, I can work to deploy this PR into our production environment and let it run for some time. That might be the ultimate proof -- putting it in a production scenario and seeing if the symptom is resolved.

@SeanKilleen
Author

SeanKilleen commented Nov 20, 2025

I've deployed this to an experimental container under my dockerhub and we are now using it in our pre-prod and prod environments. I'll report back on findings. This can stay open in the interim.

Based on past experience, it takes about 18 hours for the symptom to show up. I'll let this run for a few days, and if symptoms don't show up, I think we'll have a strong signal on this.

@SeanKilleen
Author

Observations from production so far:

  • We've received far fewer of the 48000ms timeout errors, but the ones we did receive followed an interesting pattern. Rather than the continuous rolling issue we previously saw, a one-time issue happened at 12:41am where 12 monitors hit it at once and then recovered 1 minute later. I haven't restarted anything yet; I'm watching to see whether it happens again or whether that was a one-off fluke.
    • The end-user experience is definitely preferable to the current continual rolling false-positives.
  • I've noticed a side effect of my change: in one instance, an agent that was being re-used "went bad" (I'm looking to diagnose this further). The agent was monitoring a website in front of an Azure Application Gateway. It received a 403 and was marked as down. The 403 is surprising, which makes me think something about how the agent initially established the HTTP connection with the Application Gateway went stale (possibly related to keepAlive = false). When I paused the monitor and then resumed it, the problem cleared (presumably because the HTTP agent was recreated). So I'll want to look into that as part of this PR, even if we proceed.

@SeanKilleen
Author

Additional observations from production:

  • No other 48000ms timeout errors except for that one small batch that resolved itself yesterday
  • No other issues with monitors "going bad" or being unable to recover

So I do think we have a strong signal that this is, at a minimum, helping with the symptom. Going to keep monitoring / reporting for a few more days.


Labels

pr:please address review comments (this PR needs a bit more work to be mergeable)


Development

Successfully merging this pull request may close these issues.

Lots of 'timeout of 48000ms exceeded' during the day, but the site is still up?
