Description
Nomad version
Observed on Nomad v1.7.7 and v1.9.6 (Linux, Debian / Ubuntu)
Nomad v1.9.6
BuildDate 2025-02-11T18:55:10Z
Revision 7f8b44963d36d025520348d7f24735774d26f13b+CHANGES
Operating system and Environment details
Debian / Ubuntu, running bare metal and also in a virtual (Vagrant) environment, on libvirt
Linux nomad 6.5.0-14-generic #14-Ubuntu SMP PREEMPT_DYNAMIC Tue Nov 14 14:59:49 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Issue
During task restarts, the shutdown_delay
of a task (or group) is not taken into account.
Effect
When a task restarts (triggered manually or through e.g. a template
re-render), the task shuts down. However the service should be deregistered first, and then give the task time to finish the existing requests, and then send the shut down signal to restart the task. After the task has been restarted, it becomes healthy and traffic can be sent to the service / task again.
When the service still receives traffic during a task restart, the task is not ready / healthy to receive traffic, and subsequently client requests will fail.
Concretely this causes most issues when template
blocks re-render, and restart a single task, causing service interruption for a few seconds.
More details
From docs (https://developer.hashicorp.com/nomad/docs/job-specification/service#service-lifecycle):
If shutdown_delay is set, wait the configured duration before proceeding to step 3. Setting a shutdown_delay can be useful if the application itself doesn't handle graceful shutdowns based on the kill_signal. The configured delay will provide a period of time in which the service is no longer registered in the provider, and thus is not receiving additional requests, but hasn't been signalled to shutdown. This allows the application time to complete the requests and become idle.
For task-specific services, I would expect the task service to be deregistered before a task is restarted, then waiting for a shutdown_delay
, and finally sending the shutdown signal to the task.
Triggering the restarts through CLI and UI both have exactly the same effect.
Related issues
- coordinate restarts across clients (template rerender, check restart, etc.) #10920
- ShutdownDelay should apply on shutdowns and "healthy" restarts #5980
- Restarting allocations does not seems to respect lifecycle and shudown_delay constraints #10578
- v0.10.4 - Shutdown Delay not working as expected #7251 - Run task shutdown_delay regardless of service registration #7663
Reproduction steps
Reproduction job:
job "task-restart-1" {
group "group" {
count = 2
shutdown_delay = "5s"
service {
name = "group"
task = "task"
}
task "task" {
driver = "raw_exec"
config {
command = "sleep"
args = ["inf"]
}
shutdown_delay = "5s"
resources {
memory = 10
}
}
}
}
Scenario 1: restart task
Actions:
- Deploy the above job
- Navigate to the task
- Click the Restart Task button
Expected result:
- Event User requested task to restart
- Event Waiting for shutdown_delay of 5s before killing the task.
- Consul shows service deregistered
- (after 5 seconds) Event Sent interrupt. Waiting 5s before force killing
- ... restart
- Consul shows service re-registered
Actual result:
- Event User requested task to restart
- (instantly) Event Sent interrupt. Waiting 5s before force killing
- ... restart
Logs:
Mar 05 13:55:25 nomad nomad[4556]: 2025-03-05T13:55:25.511+0100 [INFO] client.alloc_runner.task_runner: Task event: alloc_id=3109bc23-92b7-84f2-3985-5731d023dce7 task=task type="Restart Signaled" msg="User requested task to restart" failed=false
Mar 05 13:55:25 nomad nomad[4556]: 2025-03-05T13:55:25.513+0100 [INFO] client.alloc_runner.task_runner: Task event: alloc_id=3109bc23-92b7-84f2-3985-5731d023dce7 task=task type=Terminated msg="Exit Code: 130, Signal: 2" failed=false
Mar 05 13:55:25 nomad nomad[4556]: 2025-03-05T13:55:25.514+0100 [ERROR] client.driver_mgr.raw_exec: destroying executor failed: driver=raw_exec error="rpc error: code = Unavailable desc = error reading from server: EOF"
Mar 05 13:55:25 nomad nomad[4556]: 2025-03-05T13:55:25.514+0100 [WARN] client.driver_mgr.raw_exec.executor: error closing client during Kill: alloc_id=3109bc23-92b7-84f2-3985-5731d023dce7 driver=raw_exec task_name=task err="rpc error: code = Canceled desc = grpc: the client connection is closing"
Mar 05 13:55:25 nomad nomad[4556]: 2025-03-05T13:55:25.514+0100 [WARN] client.driver_mgr.raw_exec.executor: plugin failed to exit gracefully: alloc_id=3109bc23-92b7-84f2-3985-5731d023dce7 driver=raw_exec task_name=task
Mar 05 13:55:25 nomad nomad[4556]: 2025-03-05T13:55:25.518+0100 [INFO] client.driver_mgr.raw_exec.executor: plugin process exited: alloc_id=3109bc23-92b7-84f2-3985-5731d023dce7 driver=raw_exec task_name=task plugin=/usr/bin/nomad id=10503
Mar 05 13:55:25 nomad nomad[4556]: 2025-03-05T13:55:25.518+0100 [INFO] client.alloc_runner.task_runner: restarting task: alloc_id=3109bc23-92b7-84f2-3985-5731d023dce7 task=task reason="" delay=0s
Mar 05 13:55:25 nomad nomad[4556]: 2025-03-05T13:55:25.518+0100 [INFO] client.alloc_runner.task_runner: Task event: alloc_id=3109bc23-92b7-84f2-3985-5731d023dce7 task=task type=Restarting msg="Task restarting in 0s" failed=false
Mar 05 13:55:25 nomad nomad[4556]: 2025-03-05T13:55:25.521+0100 [INFO] client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:sleep Args:[inf] OverrideCgroupV2: OverrideCgroupV1:map[] OOMScoreAdj:0 WorkDir:}"
Mar 05 13:55:25 nomad nomad[4556]: 2025-03-05T13:55:25.551+0100 [INFO] client.alloc_runner.task_runner: Task event: alloc_id=3109bc23-92b7-84f2-3985-5731d023dce7 task=task type=Started msg="Task started by client" failed=false;
Note there are no delays logged in either the restart logs or the actual task restarts.
Scenario 2: restart allocation
Actions:
- Deploy the above job
- Navigate to one the allocation
- Click the Restart Alloc button
Expected result (on the task):
- Event User requested task to restart
- Event Waiting for shutdown_delay of 5s before killing the task.
- Consul shows service deregistered
- (after 5 seconds) Event Sent interrupt. Waiting 5s before force killing
- ... restart
- Consul shows service re-registered
Actual result:
- Event User requested task to restart
- (instantly) Event Sent interrupt. Waiting 5s before force killing
- Consul shows service deregistered
- ... restart
- Consul shows service re-registered
Logs:
Mar 05 13:57:12 nomad nomad[4556]: 2025-03-05T13:57:12.648+0100 [INFO] client.alloc_runner.task_runner: Task event: alloc_id=3109bc23-92b7-84f2-3985-5731d023dce7 task=task type="Restart Signaled" msg="User requested running tasks to restart" failed=false
Mar 05 13:57:12 nomad nomad[4556]: 2025-03-05T13:57:12.650+0100 [INFO] client.alloc_runner.task_runner: Task event: alloc_id=3109bc23-92b7-84f2-3985-5731d023dce7 task=task type=Terminated msg="Exit Code: 130, Signal: 2" failed=false
Mar 05 13:57:12 nomad nomad[4556]: 2025-03-05T13:57:12.651+0100 [ERROR] client.driver_mgr.raw_exec: destroying executor failed: driver=raw_exec error="rpc error: code = Unavailable desc = error reading from server: EOF"
Mar 05 13:57:12 nomad nomad[4556]: 2025-03-05T13:57:12.651+0100 [WARN] client.driver_mgr.raw_exec.executor: error closing client during Kill: alloc_id=3109bc23-92b7-84f2-3985-5731d023dce7 driver=raw_exec task_name=task err="rpc error: code = Canceled desc = grpc: the client connection is closing"
Mar 05 13:57:12 nomad nomad[4556]: 2025-03-05T13:57:12.652+0100 [WARN] client.driver_mgr.raw_exec.executor: plugin failed to exit gracefully: alloc_id=3109bc23-92b7-84f2-3985-5731d023dce7 driver=raw_exec task_name=task
Mar 05 13:57:12 nomad nomad[4556]: 2025-03-05T13:57:12.657+0100 [INFO] client.driver_mgr.raw_exec.executor: plugin process exited: alloc_id=3109bc23-92b7-84f2-3985-5731d023dce7 driver=raw_exec task_name=task plugin=/usr/bin/nomad id=10591
Mar 05 13:57:12 nomad nomad[4556]: 2025-03-05T13:57:12.657+0100 [INFO] client.alloc_runner.task_runner: restarting task: alloc_id=3109bc23-92b7-84f2-3985-5731d023dce7 task=task reason="" delay=0s
Mar 05 13:57:12 nomad nomad[4556]: 2025-03-05T13:57:12.657+0100 [INFO] client.alloc_runner.task_runner: Task event: alloc_id=3109bc23-92b7-84f2-3985-5731d023dce7 task=task type=Restarting msg="Task restarting in 0s" failed=false
Mar 05 13:57:12 nomad nomad[4556]: 2025-03-05T13:57:12.662+0100 [INFO] client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:sleep Args:[inf] OverrideCgroupV2: OverrideCgroupV1:map[] OOMScoreAdj:0 WorkDir:}"
Mar 05 13:57:12 nomad nomad[4556]: 2025-03-05T13:57:12.692+0100 [INFO] client.alloc_runner.task_runner: Task event: alloc_id=3109bc23-92b7-84f2-3985-5731d023dce7 task=task type=Started msg="Task started by client" failed=false
Scenario 3: stop allocation
(this works as expected ✅)
Actions:
- Deploy the above job
- Navigate to one the allocation
- Click the Stop Alloc button
Expected result (on the task):
- Event User requested task to restart
- Event Waiting for shutdown_delay of 5s before killing the task.
- Consul shows service deregistered
- (after 5 seconds) Event Sent interrupt. Waiting 5s before force killing
- ... task stops
- ... allocation is rescheduled
- Consul shows service re-registered
Actual result:
- Event User requested task to restart
- Event Waiting for shutdown_delay of 5s before killing the task.
- Consul shows service deregistered
- (after 5 seconds) Event Sent interrupt. Waiting 5s before force killing
- ... task stops
- ... allocation is rescheduled
- Consul shows service re-registered
Logs:
Mar 05 13:57:47 nomad nomad[4556]: 2025-03-05T13:57:47.696+0100 [INFO] client.alloc_runner.task_runner: Task event: alloc_id=b6ddd34b-b55d-a0ba-a314-858e52e4b9c5 task=task type=Received msg="Task received by client" failed=false
Mar 05 13:57:47 nomad nomad[4556]: 2025-03-05T13:57:47.697+0100 [INFO] client.alloc_migrator: waiting for previous alloc to terminate: alloc_id=b6ddd34b-b55d-a0ba-a314-858e52e4b9c5 previous_alloc=3109bc23-92b7-84f2-3985-5731d023dce7
Mar 05 13:57:52 nomad nomad[4556]: 2025-03-05T13:57:52.696+0100 [INFO] client.alloc_runner.task_runner: Task event: alloc_id=3109bc23-92b7-84f2-3985-5731d023dce7 task=task type=Killing msg="Sent interrupt. Waiting 5s before force killing" failed=false
Mar 05 13:57:52 nomad nomad[4556]: 2025-03-05T13:57:52.696+0100 [INFO] client.alloc_runner.task_runner: Task event: alloc_id=3109bc23-92b7-84f2-3985-5731d023dce7 task=task type="Waiting for shutdown delay" msg="Waiting for shutdown_delay of 5s before killing the task." failed=false
Mar 05 13:57:57 nomad nomad[4556]: 2025-03-05T13:57:57.703+0100 [INFO] client.driver_mgr.raw_exec.executor: plugin process exited: alloc_id=3109bc23-92b7-84f2-3985-5731d023dce7 driver=raw_exec task_name=task plugin=/usr/bin/nomad id=10652
Mar 05 13:57:57 nomad nomad[4556]: 2025-03-05T13:57:57.703+0100 [INFO] client.alloc_runner.task_runner: Task event: alloc_id=3109bc23-92b7-84f2-3985-5731d023dce7 task=task type=Terminated msg="Exit Code: 130, Signal: 2" failed=false
Mar 05 13:57:57 nomad nomad[4556]: 2025-03-05T13:57:57.703+0100 [INFO] client.alloc_runner.task_runner: Task event: alloc_id=3109bc23-92b7-84f2-3985-5731d023dce7 task=task type=Killed msg="Task successfully killed" failed=false
Mar 05 13:57:57 nomad nomad[4556]: 2025-03-05T13:57:57.707+0100 [INFO] client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=3109bc23-92b7-84f2-3985-5731d023dce7 task=task plugin=/usr/bin/nomad id=10494
Mar 05 13:57:57 nomad nomad[4556]: 2025-03-05T13:57:57.708+0100 [INFO] client.gc: marking allocation for GC: alloc_id=3109bc23-92b7-84f2-3985-5731d023dce7
Mar 05 13:57:57 nomad nomad[4556]: 2025-03-05T13:57:57.708+0100 [INFO] client.alloc_migrator: waiting for previous alloc to terminate: alloc_id=b6ddd34b-b55d-a0ba-a314-858e52e4b9c5 previous_alloc=3109bc23-92b7-84f2-3985-5731d023dce7
Mar 05 13:57:57 nomad nomad[4556]: 2025-03-05T13:57:57.708+0100 [INFO] client.alloc_runner.task_runner: Task event: alloc_id=b6ddd34b-b55d-a0ba-a314-858e52e4b9c5 task=task type="Task Setup" msg="Building Task Directory" failed=false
Mar 05 13:57:57 nomad nomad[4556]: 2025-03-05T13:57:57.746+0100 [INFO] client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:sleep Args:[inf] OverrideCgroupV2: OverrideCgroupV1:map[] OOMScoreAdj:0 WorkDir:}"
Mar 05 13:57:57 nomad nomad[4556]: 2025-03-05T13:57:57.775+0100 [INFO] client.alloc_runner.task_runner: Task event: alloc_id=b6ddd34b-b55d-a0ba-a314-858e52e4b9c5 task=task type=Started msg="Task started by client" failed=false
Metadata
Metadata
Assignees
Type
Projects
Status