Minimize memory leaks when server is unreachable #350

Open
mgr-inz-rafal wants to merge 6 commits into Ylianst:master from mgr-inz-rafal:stab_at_memleaks

@mgr-inz-rafal

This PR is related to #110 and is a best effort to clear some memory leaks.

Recently in our project we've also noticed high memory usage in cases where the server is unreachable. I did a small review of the reconnection loop and implemented a couple of fixes and optimizations.

For testing, I bypassed the exponential backoff mechanism and forced reconnections every 10 ms. Even after a couple of hours the memory footprint stayed relatively low, and much lower than when I tested the code from master to establish a baseline. There could be some more memory leaks lurking there, though.
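To illustrate the test setup described above, here is a minimal sketch of an exponential backoff with a test-only override. It is not the agent's actual reconnection code: the function name `next_reconnect_delay_ms`, the base delay, and the cap are all assumptions chosen for illustration.

```c
#include <stdint.h>

/* Assumed values for illustration only; the agent's real constants may differ. */
#define BACKOFF_BASE_MS 1000u
#define BACKOFF_MAX_MS  60000u

/* Hypothetical helper: returns the delay before the next reconnection attempt.
 * A non-zero forced_ms bypasses the backoff entirely, which is how a stress
 * test could force a reconnection every 10 ms. */
uint32_t next_reconnect_delay_ms(uint32_t attempt, uint32_t forced_ms)
{
    if (forced_ms != 0) {
        return forced_ms;
    }

    uint32_t delay = BACKOFF_BASE_MS;
    /* Double the delay once per failed attempt, stopping at the cap. */
    while (attempt-- > 0 && delay < BACKOFF_MAX_MS) {
        delay *= 2;
    }
    return delay > BACKOFF_MAX_MS ? BACKOFF_MAX_MS : delay;
}
```

Calling `next_reconnect_delay_ms(n, 10)` in the loop exercises the reconnect path far more often than normal operation would, so a leak per attempt becomes visible within hours rather than days.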

Quick summary of changes (details are in separate commits):

  1. Closed some dangling handles
  2. Added clean-up code in some early-exit paths
  3. Removed one redundant call to ILibMemory_Free
  4. Added one explicit invocation of duk_gc
  5. Probably the biggest win: caching the authenticode check result. I assumed that the binary itself will never be modified while running, so the result can be cached. My guess is that some code in the duk_* functions, now extracted to MeshServer_CheckAuthenticode, was leaking memory.
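The caching idea in item 5 can be sketched roughly as follows. This is not the code from the commits: `expensive_signature_check` is a hypothetical stand-in for the real verification, and the sketch assumes the check is only ever called from one thread.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the costly verification. It counts its calls so
 * the effect of caching can be observed. */
static int g_check_calls = 0;

static bool expensive_signature_check(void)
{
    g_check_calls++;
    return true; /* pretend the binary verified successfully */
}

/* Tri-state cache: UNKNOWN until the first call, then fixed for the lifetime
 * of the process (assumes the binary is not modified while running). */
typedef enum { CHECK_UNKNOWN, CHECK_PASSED, CHECK_FAILED } check_state_t;
static check_state_t g_cached_state = CHECK_UNKNOWN;

bool cached_signature_check(void)
{
    if (g_cached_state == CHECK_UNKNOWN) {
        g_cached_state = expensive_signature_check() ? CHECK_PASSED
                                                     : CHECK_FAILED;
    }
    return g_cached_state == CHECK_PASSED;
}
```

With this shape, every reconnection attempt after the first reuses the stored result, so whatever the underlying check allocates is paid for at most once per process.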

Hope this helps.

@Tesla2k

Tesla2k commented May 30, 2026

Tested on Linux x86-64 (Flatcar headless VM, no KVM, ~zero workload — just heartbeat to server). Compared the stock 2026-05-22 build against a build of this PR (HEAD 48b9aa6, make linux ARCHID=6 KVM=0).

Stock binary on this host showed the classic pattern:

| Run | Duration | RSS peak | Swap peak | Outcome |
| --- | --- | --- | --- | --- |
| 1 | 24 h | 202 M | 3.7 G | SIGKILL (OOM) |
| 2 | 93 h | 205 M | 3.8 G | core-dump |

(MemoryMax=300M was set in the unit — it caps RSS but the kernel pushes the rest into swap until the process either OOMs or crashes.)

Replaced just the binary, no other changes:

| Build | Duration | RSS | RSS peak | Swap | Restarts |
| --- | --- | --- | --- | --- | --- |
| PR #350 | 41 h | 213 M | 216 M | 0 B | 0 |

RSS peak hasn't moved in the last ~38 h — clear plateau. Task count plateaus too (started at 2, jumped once to 36 around the 7 h mark, since then 36–37). CPU is ~41 min/day, same as the stock binary. Same workload, same MeshCentral server, same systemd config.
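The monitor used for these numbers is not shown here. As one possible way to sample similar figures on Linux, the sketch below extracts a field such as `VmRSS` or `VmSwap` from the text of `/proc/<pid>/status`. The function name `status_field_kb` is an assumption, not part of any existing tool.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: given the full text of /proc/<pid>/status, return the
 * value in kB of a field such as "VmRSS" or "VmSwap", or -1 if the field is
 * missing or cannot be parsed. */
long status_field_kb(const char *status_text, const char *field)
{
    size_t field_len = strlen(field);
    const char *line = status_text;

    while (line != NULL && *line != '\0') {
        /* Match "<field>:" at the start of a line. */
        if (strncmp(line, field, field_len) == 0 && line[field_len] == ':') {
            long value_kb;
            /* %ld skips the tabs and spaces that pad the number. */
            if (sscanf(line + field_len + 1, "%ld", &value_kb) == 1) {
                return value_kb;
            }
            return -1;
        }
        /* Advance to the character after the next newline, if any. */
        line = strchr(line, '\n');
        if (line != NULL) {
            line++;
        }
    }
    return -1;
}
```

A monitor could read the status file into a buffer at a fixed interval, call this for both fields, and log the pair, which would be enough to tell an RSS plateau from growth that is being pushed into swap.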

The fix at agentcore.c:3554 (dropping the redundant ILibMemory_Free after ILibLifeTime_Remove) and the extra duk_gc() in MeshServer_Connect are likely doing the heavy lifting on Linux: the Windows authenticode cache changes don't apply here, yet we still see this clean a change in behaviour. Will report back at the 7-day mark; happy to share the monitor log if useful.
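Why a free after a remove call can be redundant is easiest to see with a toy ownership model. The sketch below is purely illustrative: `timer_slot_t`, `timer_remove`, and `cancel_reconnect` are hypothetical names, and whether ILibLifeTime_Remove itself releases the object is an assumption that this thread does not confirm.

```c
#include <stdlib.h>

/* Counts how many times a payload was actually released. */
static int g_release_count = 0;

/* Hypothetical timer slot; the registry owns payload while it is registered. */
typedef struct {
    void *payload;
} timer_slot_t;

/* Hypothetical removal routine that also releases the payload it owns. */
void timer_remove(timer_slot_t *slot)
{
    if (slot->payload != NULL) {
        free(slot->payload);
        slot->payload = NULL; /* makes a repeated remove harmless */
        g_release_count++;
    }
}

/* Caller side: removal is the only release. Adding a second free of the
 * payload here would be the redundant call in this model. */
void cancel_reconnect(timer_slot_t *slot)
{
    timer_remove(slot);
}
```

In this model the payload is released exactly once no matter how often the reconnect is cancelled, and the caller never touches the pointer after handing it to the remove routine.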
