Potential memory leak on servers since 1.10.0 #25659

Open
@EtienneBruines

Description

Hi! The investigation on this issue is still ongoing, but I wanted to report issues with the latest release as soon as I got them.

Nomad version

Nomad v1.10.0
BuildDate 2025-04-09T16:40:54Z
Revision e26a2bd

Operating system and Environment details

Ubuntu 24.04.2 LTS

Issue

Memory usage is steadily increasing on the Nomad server instances.

Context to this screenshot:

  • The upgrade to v1.10.0 was yesterday between 12:00 and 16:00.
  • Regular SIGHUP-reloads are happening due to certificate renewals by consul-template.
  • The cluster has been electing a new leader very often since the upgrade, easily once every 10 minutes.

All three server nodes seem affected, with regular spikes up to 3 GB followed by drops to 300 MB.

Image

Reproduction steps

  • Have a 3-node Server cluster
    • Probably unrelated: we have 3 clients, plus another 3 server nodes in a second, federated region. The problematic region is the authoritative one.
    • Memory usage in the federated (non-authoritative) region seems unaffected: stable and low.
  • Wait a day
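
One way to make the "wait a day" step measurable is to sample server memory at intervals and fit a slope to the samples. The sketch below assumes the samples have already been collected as (unix_seconds, bytes) pairs from whatever source is at hand (process RSS, a metrics dashboard export, and so on). `growth_bytes_per_hour` is a hypothetical helper, not part of Nomad, and a least-squares slope is only a rough indicator given the spiky pattern described above.

```python
def growth_bytes_per_hour(samples):
    """Least-squares slope of memory usage over time, in bytes per hour.

    `samples` is a list of (unix_seconds, bytes) pairs. How they are
    collected is left open; this helper only does the arithmetic.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")

    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n

    # Variance of the timestamps and covariance of time vs. memory.
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("all samples share the same timestamp")
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)

    bytes_per_second = cov / var_t
    return bytes_per_second * 3600


# Illustrative numbers only, not measurements from this cluster.
example = [(0, 300_000_000), (3600, 400_000_000), (7200, 500_000_000)]
slope = growth_bytes_per_hour(example)
```

A slope that stays clearly positive across several hours on the affected region, and near zero on the federated region, would back up the comparison made above with numbers.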

Expected Result

Memory usage to stay roughly the same.

Actual Result

Memory usage increasing.

Job file (if appropriate)

Not applicable.

Nomad Server logs (if appropriate)

The logs show that the cluster has been very unstable since the 1.10 upgrade.

Also worth noting: network latency between nodes is less than 1 ms (same rack), but the nomad.raft.leader.lastContact metric showed values between 200 and 600 ms (± 200 ms), which is not great. In the non-authoritative region this metric is about 1-100 ms (± 26 ms), which is great despite the slower hardware. Round-trip time between the nodes, according to Consul:

# roundtrip time
Minimum 0.43ms
Median 0.7ms
Maximum 0.8ms

{"@level":"info","@message":"added peer, starting replication","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:05.332364Z","peer":"69c7c3cc-256f-e12e-e82c-68bb3f29e6a3"}
{"@level":"warn","@message":"appendEntries rejected, sending older logs","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:05.338291Z","next":154265,"peer":{"Suffrage":1,"ID":"69c7c3cc-256f-e12e-e82c-68bb3f29e6a3","Address":"172.16.1.236:4647"}}
{"@level":"info","@message":"pipelining replication","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:05.400207Z","peer":{"Suffrage":1,"ID":"69c7c3cc-256f-e12e-e82c-68bb3f29e6a3","Address":"172.16.1.236:4647"}}
{"@level":"warn","@message":"failed to contact","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:09.187027Z","server-id":"607347df-65b5-95b8-2ab3-6e0138a30db7","time":508388073}
{"@level":"warn","@message":"failed to contact quorum of nodes, stepping down","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:09.187102Z"}
{"@level":"info","@message":"entering follower state","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:09.187137Z","follower":{},"leader-address":"","leader-id":""}
{"@level":"info","@message":"aborting pipeline replication","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:09.249755Z","peer":{"Suffrage":1,"ID":"607347df-65b5-95b8-2ab3-6e0138a30db7","Address":"172.16.1.235:4647"}}
{"@level":"info","@message":"cluster leadership lost","@module":"nomad","@timestamp":"2025-04-11T13:19:09.323127Z"}
{"@level":"info","@message":"aborting pipeline replication","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:09.560093Z","peer":{"Suffrage":1,"ID":"69c7c3cc-256f-e12e-e82c-68bb3f29e6a3","Address":"172.16.1.236:4647"}}
{"@level":"warn","@message":"heartbeat timeout reached, starting election","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:10.787542Z","last-leader-addr":"","last-leader-id":""}
{"@level":"info","@message":"entering candidate state","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:10.787590Z","node":{},"term":495}
{"@level":"info","@message":"pre-vote successful, starting election","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:10.852815Z","refused":0,"tally":2,"term":495,"votesNeeded":2}
{"@level":"info","@message":"election won","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:10.906849Z","tally":2,"term":495}
{"@level":"info","@message":"entering leader state","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:10.906884Z","leader":{}}
{"@level":"info","@message":"added peer, starting replication","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:10.906904Z","peer":"607347df-65b5-95b8-2ab3-6e0138a30db7"}
{"@level":"info","@message":"added peer, starting replication","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:10.906916Z","peer":"69c7c3cc-256f-e12e-e82c-68bb3f29e6a3"}
{"@level":"info","@message":"cluster leadership acquired","@module":"nomad","@timestamp":"2025-04-11T13:19:10.907793Z"}
{"@level":"info","@message":"pipelining replication","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:10.933525Z","peer":{"Suffrage":0,"ID":"607347df-65b5-95b8-2ab3-6e0138a30db7","Address":"172.16.1.235:4647"}}
{"@level":"info","@message":"pipelining replication","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:10.933853Z","peer":{"Suffrage":1,"ID":"69c7c3cc-256f-e12e-e82c-68bb3f29e6a3","Address":"172.16.1.236:4647"}}
{"@level":"info","@message":"eval broker status modified","@module":"nomad","@timestamp":"2025-04-11T13:19:10.956148Z","paused":false}
{"@level":"info","@message":"blocked evals status modified","@module":"nomad","@timestamp":"2025-04-11T13:19:10.956327Z","paused":false}
{"@level":"info","@message":"Promoting server","@module":"nomad.autopilot","@timestamp":"2025-04-11T13:19:20.958465Z","address":"172.16.1.236:4647","id":"69c7c3cc-256f-e12e-e82c-68bb3f29e6a3","name":"nomad03-bre.q-mex.net.bremen"}
{"@level":"info","@message":"updating configuration","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:20.958561Z","command":0,"server-addr":"172.16.1.236:4647","server-id":"69c7c3cc-256f-e12e-e82c-68bb3f29e6a3","servers":"[{Suffrage:Voter ID:84f3b5a8-d979-5d41-c659-badc9ecce162 Address:172.16.1.234:4647} {Suffrage:Voter ID:607347df-65b5-95b8-2ab3-6e0138a30db7 Address:172.16.1.235:4647} {Suffrage:Voter ID:69c7c3cc-256f-e12e-e82c-68bb3f29e6a3 Address:172.16.1.236:4647}]"}
{"@level":"warn","@message":"failed to contact","@module":"nomad.raft","@timestamp":"2025-04-11T13:20:05.373354Z","server-id":"69c7c3cc-256f-e12e-e82c-68bb3f29e6a3","time":505378093}
{"@level":"warn","@message":"failed retrieving server health","@module":"nomad.stats_fetcher","@timestamp":"2025-04-11T13:20:05.963132Z","error":"context deadline exceeded","server":"nomad01-bre.q-mex.net.bremen"}
{"@level":"warn","@message":"failed to contact","@module":"nomad.raft","@timestamp":"2025-04-11T13:21:40.059996Z","server-id":"69c7c3cc-256f-e12e-e82c-68bb3f29e6a3","time":501376848}
{"@level":"warn","@message":"failed to contact","@module":"nomad.raft","@timestamp":"2025-04-11T13:22:09.619104Z","server-id":"69c7c3cc-256f-e12e-e82c-68bb3f29e6a3","time":519836348}
{"@level":"warn","@message":"failed to contact","@module":"nomad.raft","@timestamp":"2025-04-11T13:22:10.196529Z","server-id":"69c7c3cc-256f-e12e-e82c-68bb3f29e6a3","time":506603638}
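
The `time` values on the `failed to contact` lines look like durations serialized as nanoseconds. That is an assumption based on their magnitude and is not confirmed in this report. Read that way, each value is just over 500 ms. Below is a minimal Python sketch for tallying raft messages and extracting those durations from a JSON log; `summarize_raft_log` is a hypothetical helper, not part of Nomad.

```python
import json
from collections import Counter


def summarize_raft_log(lines):
    """Count nomad.raft messages and collect 'failed to contact' durations.

    Assumes one JSON object per line, and that the "time" field is a
    duration in nanoseconds (an assumption, not confirmed by the report).
    Returns (Counter of messages, list of durations in milliseconds).
    """
    counts = Counter()
    failed_ms = []
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip blank lines and non-JSON output
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed lines
        if entry.get("@module") != "nomad.raft":
            continue
        message = entry.get("@message", "")
        counts[message] += 1
        if message == "failed to contact" and "time" in entry:
            failed_ms.append(entry["time"] / 1e6)  # ns -> ms
    return counts, failed_ms


# Two lines copied from the log above.
sample = [
    '{"@level":"warn","@message":"failed to contact","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:09.187027Z","server-id":"607347df-65b5-95b8-2ab3-6e0138a30db7","time":508388073}',
    '{"@level":"info","@message":"election won","@module":"nomad.raft","@timestamp":"2025-04-11T13:19:10.906849Z","tally":2,"term":495}',
]
counts, failed_ms = summarize_raft_log(sample)
```

Running this over a full day of server logs would give a count of elections and step-downs to set against the "once every 10 minutes" estimate above.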

Nomad Client logs (if appropriate)

Not applicable.
