Add timeout to server_join #10986

Open
@mr-karan

Proposal

It would be nice if the server_join stanza had a timeout field.

Use-cases

I purposefully gave a wrong config to a Nomad agent running as a server. From the logs:

Aug 02 12:38:23 nomad-node-0 nomad[50839]: ==> Newer Nomad version available: 1.1.3 (currently running: 1.1.2)
Aug 02 12:38:29 nomad-node-0 nomad[50839]:     2021-08-02T12:38:29.914+0530 [INFO]  client: node registration complete
Aug 02 12:39:20 nomad-node-0 nomad[50839]:     2021-08-02T12:39:20.239+0530 [WARN]  agent.joiner: join failed: error="2 errors occurred:
Aug 02 12:39:20 nomad-node-0 nomad[50839]:         * Failed to join 1.1.1.1: dial tcp 1.1.1.1:4648: i/o timeout
Aug 02 12:39:20 nomad-node-0 nomad[50839]:         * Failed to join 2.2.2.2: dial tcp 2.2.2.2:4648: i/o timeout
Aug 02 12:39:20 nomad-node-0 nomad[50839]: " retry=15s


Aug 02 12:40:35 nomad-node-0 nomad[50839]:     2021-08-02T12:40:35.243+0530 [WARN]  agent.joiner: join failed: error="2 errors occurred:
Aug 02 12:40:35 nomad-node-0 nomad[50839]:         * Failed to join 1.1.1.1: dial tcp 1.1.1.1:4648: i/o timeout
Aug 02 12:40:35 nomad-node-0 nomad[50839]:         * Failed to join 2.2.2.2: dial tcp 2.2.2.2:4648: i/o timeout
Aug 02 12:40:35 nomad-node-0 nomad[50839]: " retry=15s
Aug 02 12:41:50 nomad-node-0 nomad[50839]:     2021-08-02T12:41:50.248+0530 [WARN]  agent.joiner: join failed: error="2 errors occurred:
Aug 02 12:41:50 nomad-node-0 nomad[50839]:         * Failed to join 1.1.1.1: dial tcp 1.1.1.1:4648: i/o timeout
Aug 02 12:41:50 nomad-node-0 nomad[50839]:         * Failed to join 2.2.2.2: dial tcp 2.2.2.2:4648: i/o timeout
Aug 02 12:41:50 nomad-node-0 nomad[50839]: " retry=15s
Aug 02 12:43:05 nomad-node-0 nomad[50839]:     2021-08-02T12:43:05.252+0530 [ERROR] agent.joiner: max join retry exhausted, exiting
Aug 02 12:43:05 nomad-node-0 nomad[50839]:     2021-08-02T12:43:05.253+0530 [INFO]  agent: requesting shutdown
Aug 02 12:43:05 nomad-node-0 nomad[50839]:     2021-08-02T12:43:05.253+0530 [INFO]  client: shutting down
Aug 02 12:43:05 nomad-node-0 nomad[50839]:     2021-08-02T12:43:05.253+0530 [INFO]  client.plugin: shutting down plugin manager: plugin-type=device
Aug 02 12:43:05 nomad-node-0 nomad[50839]:     2021-08-02T12:43:05.256+0530 [INFO]  client.plugin: plugin manager finished: plugin-type=device
Aug 02 12:43:05 nomad-node-0 nomad[50839]:     2021-08-02T12:43:05.256+0530 [INFO]  client.plugin: shutting down plugin manager: plugin-type=driver
Aug 02 12:43:05 nomad-node-0 nomad[50839]:     2021-08-02T12:43:05.259+0530 [INFO]  client.plugin: plugin manager finished: plugin-type=driver
Aug 02 12:43:05 nomad-node-0 nomad[50839]:     2021-08-02T12:43:05.259+0530 [INFO]  client.plugin: shutting down plugin manager: plugin-type=csi
Aug 02 12:43:05 nomad-node-0 nomad[50839]:     2021-08-02T12:43:05.261+0530 [INFO]  client.plugin: plugin manager finished: plugin-type=csi
Aug 02 12:43:05 nomad-node-0 nomad[50839]:     2021-08-02T12:43:05.262+0530 [INFO]  nomad: shutting down server
Aug 02 12:43:05 nomad-node-0 nomad[50839]:     2021-08-02T12:43:05.262+0530 [WARN]  nomad: serf: Shutdown without a Leave
Aug 02 12:43:05 nomad-node-0 nomad[50839]:     2021-08-02T12:43:05.263+0530 [INFO]  nomad: cluster leadership lost
Aug 02 12:43:05 nomad-node-0 nomad[50839]:     2021-08-02T12:43:05.263+0530 [INFO]  agent: shutdown complete

You can see that Nomad took almost 5 minutes to determine that the server was unable to join, after which the service exited.

Since there's no timeout defined, I am guessing each join attempt waits for a default of 60s or something higher. There's no way to configure that, which also makes retry_interval largely useless: the next retry happens only after the previous attempt has failed, so consecutive failures are 75s apart in the logs I shared instead of the configured 15s.

So maybe we can add a timeout field and give it a sane default such as 5s (it should be less than retry_interval).
