
Problems with connection to Nomad master (leader) cause whole cluster to fail #17973

Description

@beninghton

Nomad version

Nomad v1.5.6
BuildDate 2023-05-19T18:26:13Z
Revision 8af7088

Cluster structure

3 master nodes:
10.1.15.21 - leader
10.1.15.22
10.1.15.23
2 client nodes:
10.1.15.31
10.1.15.32
3 consul cluster nodes:
10.1.15.11
10.1.15.12
10.1.15.13

Operating system and Environment details

Fedora release 35 (Thirty Five)

Issue

We had an issue with our Nomad cluster: the master nodes showed high RAM/CPU consumption, and in the logs we saw only these CSI-related errors:
[ERROR] nomad.csi_plugin: csi raft apply failed: error="plugin in use" method=delete
Then the leader node hung completely and the cluster failed: the client nodes went down. We saw in the logs that no new leader was elected.
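
For reference, the CSI plugin and volume state around this error can be inspected with the standard Nomad CLI; these are generic diagnostic commands, not output from our cluster:

# List registered CSI plugins and their health.
nomad plugin status -type=csi

# Show volumes and which allocations still claim them; volumes that still
# reference a plugin will block its deletion ("plugin in use").
nomad volume status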

Reproduction steps

We had to re-create the failed Nomad master from scratch and remove CSI from our cluster.
However, we managed to reproduce this issue another way, simply by closing port 4647 on the master leader node:
iptables -A INPUT -p tcp --destination-port 4647 -j DROP
We assume this imitates the issue we had with CSI, because it blocks only part of the master node's functionality rather than all of it, which we believe is also what happened when CSI hung our leader node. The reproduction and cleanup steps are sketched below.
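
For completeness, the reproduction and cleanup on the leader look roughly like this (assuming default ports and root privileges; the delete rule simply reverses the one above):

# On the current leader: drop incoming server RPC traffic on port 4647 only.
iptables -A INPUT -p tcp --destination-port 4647 -j DROP

# From any server: watch membership and leadership status while the port is blocked.
nomad server members

# Undo the block afterwards by deleting the same rule.
iptables -D INPUT -p tcp --destination-port 4647 -j DROP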

Expected Result

We expected that a new leader would be elected and the cluster would not fail.

Actual Result

No new leader was elected, and the client nodes went down.
nomad server members output on the leader node (where port 4647 is blocked):

nomad-server-1.global  10.1.15.21  4648  alive   true    3             1.5.6  dc1         global
nomad-server-2.global  10.1.15.22  4648  alive   false   3             1.5.6  dc1         global
nomad-server-3.global  10.1.15.23  4648  alive   false   3             1.5.6  dc1         global

nomad server members output on a non-leader node:

nomad-server-1.global  10.1.15.21  4648  alive   false   3             1.5.6  dc1         global
nomad-server-2.global  10.1.15.22  4648  alive   false   3             1.5.6  dc1         global
nomad-server-3.global  10.1.15.23  4648  alive   false   3             1.5.6  dc1         global

Error determining leaders: 1 error occurred:
        * Region "global": Unexpected response code: 500 (rpc error: failed to get conn: rpc error: lead thread didn't get connection)
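
For anyone triaging this, leadership can also be checked directly against each server (the address below is illustrative and assumes the default HTTP port 4646):

# Raft peer set as seen by the server answering the query, including which peer it considers the leader.
nomad operator raft list-peers

# HTTP API: returns the advertised RPC address of the current leader, or an error if there is none.
curl -s http://10.1.15.21:4646/v1/status/leader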

Nomad logs

server1-leader.log
server2.log
server3.log
client1.log
client2.log
