
Problems with connection to Nomad master (leader) cause whole cluster to fail #17973

Description

@beninghton

Nomad version

Nomad v1.5.6
BuildDate 2023-05-19T18:26:13Z
Revision 8af7088

Cluster structure

3 master nodes:
10.1.15.21 - leader
10.1.15.22
10.1.15.23
2 client nodes:
10.1.15.31
10.1.15.32
3 consul cluster nodes:
10.1.15.11
10.1.15.12
10.1.15.13

Operating system and Environment details

Fedora release 35 (Thirty Five)

Issue

We had an issue with our Nomad cluster: the master nodes showed high RAM/CPU consumption, and in the logs we saw only these CSI-related errors:
[ERROR] nomad.csi_plugin: csi raft apply failed: error="plugin in use" method=delete
Then the leader node hung completely and the cluster failed: the client nodes went down. We saw in the logs that no new leader was elected.
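
For reference, the CSI plugin and volume state around this error can be inspected with the standard Nomad CLI; these are generic diagnostic commands, not output from our cluster:

# List registered CSI plugins and their health.
nomad plugin status -type=csi

# Show volumes and which allocations still claim them; volumes that still
# reference a plugin will block its deletion ("plugin in use").
nomad volume status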

Reproduction steps

We had to re-create the failed Nomad master from scratch and remove CSI from our cluster.
However, we managed to reproduce this issue another way, simply by closing port 4647 on the master leader node:
iptables -A INPUT -p tcp --destination-port 4647 -j DROP
We assume this imitates the issue we had with CSI, because it blocks only part of the master node's functionality rather than all of it, which we believe is also what happened when CSI hung our leader node. The reproduction and cleanup steps are sketched below.
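
For completeness, the reproduction and cleanup on the leader look roughly like this (assuming default ports and root privileges; the delete rule simply reverses the one above):

# On the current leader: drop incoming server RPC traffic on port 4647 only.
iptables -A INPUT -p tcp --destination-port 4647 -j DROP

# From any server: watch membership and leadership status while the port is blocked.
nomad server members

# Undo the block afterwards by deleting the same rule.
iptables -D INPUT -p tcp --destination-port 4647 -j DROP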

Expected Result

We expected that a new leader would be elected and the cluster would not fail.

Actual Result

No new leader was elected, and the client nodes went down.
nomad server members output on the leader node (where port 4647 is blocked):

nomad-server-1.global  10.1.15.21  4648  alive   true    3             1.5.6  dc1         global
nomad-server-2.global  10.1.15.22  4648  alive   false   3             1.5.6  dc1         global
nomad-server-3.global  10.1.15.23  4648  alive   false   3             1.5.6  dc1         global

nomad server members output on a non-leader node:

nomad-server-1.global  10.1.15.21  4648  alive   false   3             1.5.6  dc1         global
nomad-server-2.global  10.1.15.22  4648  alive   false   3             1.5.6  dc1         global
nomad-server-3.global  10.1.15.23  4648  alive   false   3             1.5.6  dc1         global

Error determining leaders: 1 error occurred:
        * Region "global": Unexpected response code: 500 (rpc error: failed to get conn: rpc error: lead thread didn't get connection)
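
For anyone triaging this, leadership can also be checked directly against each server (the address below is illustrative and assumes the default HTTP port 4646):

# Raft peer set as seen by the server answering the query, including which peer it considers the leader.
nomad operator raft list-peers

# HTTP API: returns the advertised RPC address of the current leader, or an error if there is none.
curl -s http://10.1.15.21:4646/v1/status/leader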

Nomad logs

server1-leader.log
server2.log
server3.log
client1.log
client2.log
