Description
Nomad version
Nomad v1.5.6
BuildDate 2023-05-19T18:26:13Z
Revision 8af7088
Cluster structure
3 server nodes:
10.1.15.21 - leader
10.1.15.22
10.1.15.23
2 client nodes:
10.1.15.31
10.1.15.32
3 consul cluster nodes:
10.1.15.11
10.1.15.12
10.1.15.13
Operating system and Environment details
Fedora release 35 (Thirty Five)
Issue
Issue is related to #17973.
In #17973, our leader Node1 had CSI/CPU/memory problems, so we initially rebooted it.
The cluster then lost leadership and Node2 became the new leader. The cluster worked fine.
When Node1 came back online it rejoined the cluster, but its CSI state was corrupted and it hung right after joining.
The cluster then lost leadership again, and no new leader was elected.
This time the client nodes were not down, yet all allocations across the whole cluster were restarted.
Reproduction steps
After we removed CSI, the only way to reproduce the issue quickly is to block port 4647 on a non-leader node:
iptables -A INPUT -p tcp --destination-port 4647 -j DROP
We assume this imitates the issue we had with CSI, because it blocks only part of a node's functionality rather than all of it.
In this case we block nomad-server-2 (a non-leader).
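The reproduction steps above can be sketched as a short session. This is a sketch of our setup (host names and the cleanup step are illustrative; run the iptables commands as root on the server being partitioned):

```shell
# On nomad-server-2 (non-leader): drop inbound Nomad server RPC traffic (port 4647).
iptables -A INPUT -p tcp --destination-port 4647 -j DROP

# From any server: check member status and which node believes it is the leader.
nomad server members

# Afterwards, undo the partition by deleting the rule.
iptables -D INPUT -p tcp --destination-port 4647 -j DROP
```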
Expected Result
We expected the cluster not to fail: the two remaining server nodes are healthy, and one of them is the leader. Allocations on client nodes should not be restarted.
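This expectation follows from Raft quorum arithmetic: a cluster of n voting servers needs floor(n/2) + 1 of them to elect and sustain a leader, so with 3 servers, losing (or partitioning) one should still leave a working quorum of 2. A quick check of that arithmetic:

```shell
# Raft needs a quorum of floor(n/2) + 1 voting servers to hold leadership.
n=3                    # server nodes in our cluster
quorum=$(( n / 2 + 1 ))
echo "quorum=$quorum"  # with 3 servers, one lost server still leaves 2 >= quorum... wait, quorum itself is 2
```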
Actual Result
No new leader was elected, and all allocations on client nodes were restarted.
nomad server members output on the leader node:
nomad-server-1.global 10.1.15.21 4648 alive true 3 1.5.6 dc1 global
nomad-server-2.global 10.1.15.22 4648 alive false 3 1.5.6 dc1 global
nomad-server-3.global 10.1.15.23 4648 alive false 3 1.5.6 dc1 global
nomad server members output on the non-leader node nomad-server-2 (where port 4647 is blocked):
nomad-server-1.global 10.1.15.21 4648 alive false 3 1.5.6 dc1 global
nomad-server-2.global 10.1.15.22 4648 alive false 3 1.5.6 dc1 global
nomad-server-3.global 10.1.15.23 4648 alive false 3 1.5.6 dc1 global
Error determining leaders: 1 error occurred:
* Region "global": Unexpected response code: 500 (rpc error: failed to get conn: rpc error: lead thread didn't get connection)
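When nomad server members itself fails with an RPC error like the one above, Nomad's HTTP API leader endpoint is an independent way to check whether any server currently claims leadership. A minimal check (the address is one of our servers; 4646 is the default HTTP port):

```shell
# Ask the cluster which server is the current Raft leader.
# The response is the leader's RPC address, e.g. "10.1.15.21:4647";
# an empty string or a 500 error means no leader is currently known.
curl -s http://10.1.15.21:4646/v1/status/leader
```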
Nomad logs
client1.log
client2.log
server1-leader.log
server2.log
server3.log