Description
Nomad version
Nomad v1.6.0
BuildDate 2023-07-18T18:51:11Z
Revision 87d411f
Operating system and Environment details
Fedora release 35 (Thirty Five)
Issue
We have a cluster of 3 server ("master") nodes with 2 CPUs and 4 GB RAM each, and 30 client nodes.
These resources were sufficient before, but we started to see a memory leak: over the course of a week, RAM is exhausted on each of the server nodes.
We first saw this on version 1.5.6; we did not face it before 1.5.6.
Recently we upgraded from 1.5.6 to 1.6.0, hoping the problem would disappear.
You can see that 07/20 is when we upgraded to 1.6.0:
But a week passed, and all 3 servers were again almost out of RAM.
We did not find anything related in the logs except some raft error messages (log files are attached), so we decided to run "nomad system gc" on server-1, hoping that some resources would be freed.
But after that, we lost the connection to server-1, some client nodes started to fail heartbeats, and the allocations on them were restarted. We did not see this behaviour before version 1.5.6. It's unacceptable.
Reproduction steps
"nomad system gc" in exausted resoures situation.
Expected Result
Some RAM is freed; the cluster does not restart allocations.
Actual Result
While RAM is being freed, the cluster fails client node heartbeats and restarts their allocations.
We think all of this is related to:
#17973
#17974
because server-1, the node on which we ran "nomad system gc", was out of resources and stopped responding to the cluster.
But we can't find the cause of the memory leak.
Please advise what we can do to avoid it.
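If it helps with diagnosing the leak, we can collect heap and goroutine profiles from the servers. A minimal sketch, assuming enable_debug = true (or a management ACL token) and the default API address; the output path is just a placeholder:

    # collect a debug bundle (pprof profiles, metrics, logs) from all servers
    nomad operator debug -duration=2m -interval=30s -server-id=all -output=/tmp

    # or grab a single heap profile from the local agent's debug endpoint
    curl -s -o heap.prof "http://127.0.0.1:4646/v1/agent/pprof/heap"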
Nomad Server logs for last 24h
"nomad system gc" was run at was run at 2023-07-26T04:34:00 on server-1. After that we lost connection to the server for a while.
server1.log
server2-leader.log
server3.log
Nomad Client logs for last 24h
Multiple clients disconnected, here's log of one of them:
client1.log