Description
Nomad version
Nomad v1.5.6
BuildDate 2023-05-19T18:26:13Z
Revision 8af7088
Cluster structure
3 master nodes:
10.1.15.21 - leader
10.1.15.22
10.1.15.23
2 client nodes:
10.1.15.31
10.1.15.32
3 consul cluster nodes:
10.1.15.11
10.1.15.12
10.1.15.13
Operating system and Environment details
Fedora release 35 (Thirty Five)
Issue
We had an issue with our Nomad cluster: the master nodes showed high RAM and CPU usage, and the only errors in the logs were these CSI-related ones:
[ERROR] nomad.csi_plugin: csi raft apply failed: error="plugin in use" method=delete
Then the leader node hung completely, the cluster failed, and the client nodes went down. The logs showed that no new leader was elected.
Reproduction steps
We had to re-create the failed Nomad master from scratch and remove CSI from our cluster.
However, we managed to reproduce the issue another way, simply by blocking port 4647 on the leader master node:
iptables -A INPUT -p tcp --destination-port 4647 -j DROP
We assume this imitates the issue we had with CSI, because it blocks only part of the master node's functionality rather than all of it, which is what we think happened when CSI hung our leader node.
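For anyone repeating the reproduction, a small shell sketch may help to apply and undo the rule and to confirm which ports still answer. It assumes the default Nomad ports (4647 for RPC, 4648 for Serf gossip), bash and coreutils timeout on the probing host, and root access on the leader. The names block_rpc, unblock_rpc and port_open are hypothetical helpers, not part of Nomad.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the reproduction; run block_rpc/unblock_rpc as root on the leader.

# Drop inbound TCP traffic to the RPC port (same rule as in the report).
block_rpc() {
  iptables -A INPUT -p tcp --destination-port 4647 -j DROP
}

# Delete that exact rule again to restore the node.
unblock_rpc() {
  iptables -D INPUT -p tcp --destination-port 4647 -j DROP
}

# Return 0 if a TCP connection to host:port succeeds within 2 seconds.
port_open() {
  timeout 2 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}
```

From another server, port_open 10.1.15.21 4647 would be expected to fail while the rule is in place, whereas port_open 10.1.15.21 4648 should still succeed, which would be consistent with the leader still being reported as alive.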
Expected Result
We expected that a new leader would be elected and the cluster would not fail.
Actual Result
No new leader was elected, and the client nodes went down.
Output of nomad server members on the leader node (where port 4647 is blocked):
nomad-server-1.global 10.1.15.21 4648 alive true 3 1.5.6 dc1 global
nomad-server-2.global 10.1.15.22 4648 alive false 3 1.5.6 dc1 global
nomad-server-3.global 10.1.15.23 4648 alive false 3 1.5.6 dc1 global
Output of nomad server members on a non-leader node:
nomad-server-1.global 10.1.15.21 4648 alive false 3 1.5.6 dc1 global
nomad-server-2.global 10.1.15.22 4648 alive false 3 1.5.6 dc1 global
nomad-server-3.global 10.1.15.23 4648 alive false 3 1.5.6 dc1 global
Error determining leaders: 1 error occurred:
* Region "global": Unexpected response code: 500 (rpc error: failed to get conn: rpc error: lead thread didn't get connection)
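The two outputs above disagree about who the leader is, and that can be checked mechanically on each server. A minimal sketch, assuming the column layout of the rows quoted above (status in column 4, leader flag in column 5); count_leaders is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Count the rows that a server reports as both alive and leader.
# Header and error lines are ignored because their 4th field is not "alive".
count_leaders() {
  awk '$4 == "alive" && $5 == "true" { n++ } END { print n + 0 }'
}

# Usage (on each server node):
#   nomad server members | count_leaders
```

Fed the leader's output above it prints 1, and fed the non-leader's output it prints 0, so comparing the value across servers shows the split view directly.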
Nomad logs
server1-leader.log
server2.log
server3.log
client1.log
client2.log