Proposal
It would be nice if the server_join stanza could have a timeout field.
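For illustration, a rough sketch of how the stanza might look with such a field. The timeout attribute is the proposed (hypothetical) addition; the addresses and other values are only examples:

```hcl
server_join {
  retry_join     = ["1.1.1.1", "2.2.2.2"]
  retry_max      = 3
  retry_interval = "15s"

  # Proposed addition (does not exist today): give up on a single
  # join attempt after this long instead of waiting for the default
  # TCP dial timeout.
  timeout = "5s"
}
```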
Use-cases
I purposefully gave a wrong config to a Nomad agent running as a server, and from the logs:
Aug 02 12:38:23 nomad-node-0 nomad[50839]: ==> Newer Nomad version available: 1.1.3 (currently running: 1.1.2)
Aug 02 12:38:29 nomad-node-0 nomad[50839]: 2021-08-02T12:38:29.914+0530 [INFO] client: node registration complete
Aug 02 12:39:20 nomad-node-0 nomad[50839]: 2021-08-02T12:39:20.239+0530 [WARN] agent.joiner: join failed: error="2 errors occurred:
Aug 02 12:39:20 nomad-node-0 nomad[50839]: * Failed to join 1.1.1.1: dial tcp 1.1.1.1:4648: i/o timeout
Aug 02 12:39:20 nomad-node-0 nomad[50839]: * Failed to join 2.2.2.2: dial tcp 2.2.2.2:4648: i/o timeout
Aug 02 12:39:20 nomad-node-0 nomad[50839]: " retry=15s
Aug 02 12:40:35 nomad-node-0 nomad[50839]: 2021-08-02T12:40:35.243+0530 [WARN] agent.joiner: join failed: error="2 errors occurred:
Aug 02 12:40:35 nomad-node-0 nomad[50839]: * Failed to join 1.1.1.1: dial tcp 1.1.1.1:4648: i/o timeout
Aug 02 12:40:35 nomad-node-0 nomad[50839]: * Failed to join 2.2.2.2: dial tcp 2.2.2.2:4648: i/o timeout
Aug 02 12:40:35 nomad-node-0 nomad[50839]: " retry=15s
Aug 02 12:41:50 nomad-node-0 nomad[50839]: 2021-08-02T12:41:50.248+0530 [WARN] agent.joiner: join failed: error="2 errors occurred:
Aug 02 12:41:50 nomad-node-0 nomad[50839]: * Failed to join 1.1.1.1: dial tcp 1.1.1.1:4648: i/o timeout
Aug 02 12:41:50 nomad-node-0 nomad[50839]: * Failed to join 2.2.2.2: dial tcp 2.2.2.2:4648: i/o timeout
Aug 02 12:41:50 nomad-node-0 nomad[50839]: " retry=15s
Aug 02 12:43:05 nomad-node-0 nomad[50839]: 2021-08-02T12:43:05.252+0530 [ERROR] agent.joiner: max join retry exhausted, exiting
Aug 02 12:43:05 nomad-node-0 nomad[50839]: 2021-08-02T12:43:05.253+0530 [INFO] agent: requesting shutdown
Aug 02 12:43:05 nomad-node-0 nomad[50839]: 2021-08-02T12:43:05.253+0530 [INFO] client: shutting down
Aug 02 12:43:05 nomad-node-0 nomad[50839]: 2021-08-02T12:43:05.253+0530 [INFO] client.plugin: shutting down plugin manager: plugin-type=device
Aug 02 12:43:05 nomad-node-0 nomad[50839]: 2021-08-02T12:43:05.256+0530 [INFO] client.plugin: plugin manager finished: plugin-type=device
Aug 02 12:43:05 nomad-node-0 nomad[50839]: 2021-08-02T12:43:05.256+0530 [INFO] client.plugin: shutting down plugin manager: plugin-type=driver
Aug 02 12:43:05 nomad-node-0 nomad[50839]: 2021-08-02T12:43:05.259+0530 [INFO] client.plugin: plugin manager finished: plugin-type=driver
Aug 02 12:43:05 nomad-node-0 nomad[50839]: 2021-08-02T12:43:05.259+0530 [INFO] client.plugin: shutting down plugin manager: plugin-type=csi
Aug 02 12:43:05 nomad-node-0 nomad[50839]: 2021-08-02T12:43:05.261+0530 [INFO] client.plugin: plugin manager finished: plugin-type=csi
Aug 02 12:43:05 nomad-node-0 nomad[50839]: 2021-08-02T12:43:05.262+0530 [INFO] nomad: shutting down server
Aug 02 12:43:05 nomad-node-0 nomad[50839]: 2021-08-02T12:43:05.262+0530 [WARN] nomad: serf: Shutdown without a Leave
Aug 02 12:43:05 nomad-node-0 nomad[50839]: 2021-08-02T12:43:05.263+0530 [INFO] nomad: cluster leadership lost
Aug 02 12:43:05 nomad-node-0 nomad[50839]: 2021-08-02T12:43:05.263+0530 [INFO] agent: shutdown complete
You can see that Nomad took almost 5 minutes to realize the server was unable to join, and then the service exited.
Since there's no timeout defined, I am guessing each join attempt waits for a default of 60s or something higher. There's no way to configure that, which also makes retry_interval much less useful, since the next retry only happens once the previous attempt has failed (which takes about 75s according to the logs I shared). So maybe we can add a timeout field and give it a sane default like 5s (it should be less than retry_interval).
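Some rough arithmetic from the logs above, assuming the default per-attempt timeout really is about 60s: each cycle is roughly 60s (dial timeout) + 15s (retry_interval) = 75s, which matches the 75s spacing between the "join failed" lines and explains why three attempts stretch from 12:38 to 12:43, close to 5 minutes. With a 5s timeout, each cycle would be about 5s + 15s = 20s, so the same three attempts would exhaust in roughly a minute.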