
Problems with connection to a Nomad master (non-leader) cause all allocations to restart #17974


Description

@beninghton

Nomad version

Nomad v1.5.6
BuildDate 2023-05-19T18:26:13Z
Revision 8af7088

Cluster structure

3 master nodes:
10.1.15.21 - leader
10.1.15.22
10.1.15.23
2 client nodes:
10.1.15.31
10.1.15.32
3 consul cluster nodes:
10.1.15.11
10.1.15.12
10.1.15.13

Operating system and Environment details

Fedora release 35 (Thirty Five)

Issue

This issue is related to #17973.
In #17973, our leader Node1 had CSI/cpu/mem problems, so we initially rebooted it.
The cluster then lost leadership and Node2 became the new leader. The cluster worked fine.
Node1 later came back online and rejoined the cluster, but its CSI state was corrupted and it hung right after joining.
The cluster then lost leadership again, and no new leader was elected.
This time the client nodes were not down, yet all allocations across the whole cluster were restarted.

Reproduction steps

After we removed CSI, the only way to reproduce the issue quickly is to block the server RPC port 4647 on a non-leader node:
iptables -A INPUT -p tcp --destination-port 4647 -j DROP
We assume this imitates the issue we had with CSI because it blocks part of the node's functionality rather than all of it.
In this case we block nomad-server-2 (the non-leader). A fuller sketch of the reproduction is shown below.
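
For reference, a minimal sketch of the reproduction and cleanup, assuming the default ports (4647 for server RPC, 4646 for the HTTP API) and that the rule is applied on nomad-server-2; /v1/status/leader is the standard Nomad API endpoint for the current leader:

# On nomad-server-2 (non-leader): drop inbound server RPC traffic.
# Serf on 4648 is untouched, so the node still shows as "alive".
iptables -A INPUT -p tcp --destination-port 4647 -j DROP

# From any server: watch leadership while the rule is active;
# /v1/status/leader returns the address of the current leader.
watch -n 5 'curl -s http://127.0.0.1:4646/v1/status/leader'

# Cleanup: delete the rule to restore RPC connectivity.
iptables -D INPUT -p tcp --destination-port 4647 -j DROP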

Expected Result

We expected the cluster not to fail: the two remaining server nodes are healthy and one of them is the leader, so allocations on the client nodes should not be restarted.

Actual Result

No new leader was elected, and all allocations on the client nodes were restarted.
nomad server members output on the leader node:

nomad-server-1.global  10.1.15.21  4648  alive   true    3             1.5.6  dc1         global
nomad-server-2.global  10.1.15.22  4648  alive   false   3             1.5.6  dc1         global
nomad-server-3.global  10.1.15.23  4648  alive   false   3             1.5.6  dc1         global

nomad server members output on the non-leader node nomad-server-2 (where port 4647 is blocked):

nomad-server-1.global  10.1.15.21  4648  alive   false   3             1.5.6  dc1         global
nomad-server-2.global  10.1.15.22  4648  alive   false   3             1.5.6  dc1         global
nomad-server-3.global  10.1.15.23  4648  alive   false   3             1.5.6  dc1         global

Error determining leaders: 1 error occurred:
        * Region "global": Unexpected response code: 500 (rpc error: failed to get conn: rpc error: lead thread didn't get connection)

Nomad logs

client1.log
client2.log
server1-leader.log
server2.log
server3.log
