### Nomad version

1.5.5, but also reproducible at the tip of `main`.
### Issue

Due to the garbage collection logic applied to periodic sysbatch jobs (and sysbatch jobs in general), sysbatch jobs run much more frequently than the job spec expresses. In particular, consider the following:
- Job GC period of X
- Eval GC period of Y
If X < Y, then every periodic run of the sysbatch job will run on every node multiple times, as long as the allocations for a given sysbatch job do not all end at exactly the same time. This can lead to unbounded accumulation of periodic job instances and an unbounded number of allocations for each of them on every node. Please see the repro for details.
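To see why GC'ing terminal allocations makes a sysbatch job run again, recall that the system/sysbatch scheduler places the job on every node that has no allocation for it. A minimal sketch of that idea follows; this is illustrative only, not Nomad's actual scheduler code:

```go
package main

import "fmt"

// nodesNeedingPlacement is an illustrative stand-in for the scheduler's
// node/alloc diff: any node without a recorded allocation for the job
// gets a fresh placement.
func nodesNeedingPlacement(nodes []string, hasAlloc map[string]bool) []string {
	var place []string
	for _, node := range nodes {
		// Once eval/alloc GC deletes a completed allocation from state,
		// the node looks unscheduled again and is re-placed.
		if !hasAlloc[node] {
			place = append(place, node)
		}
	}
	return place
}

func main() {
	nodes := []string{"node-1", "node-2"}

	// Before GC: both nodes have a terminal (complete) allocation recorded.
	fmt.Println(nodesNeedingPlacement(nodes, map[string]bool{"node-1": true, "node-2": true})) // []

	// After GC removes node-1's terminal allocation, node-1 runs the job again.
	fmt.Println(nodesNeedingPlacement(nodes, map[string]bool{"node-2": true})) // [node-1]
}
```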
### Reproduction steps

Start a server and two client nodes.
```
# Local server
$ cat server_config.hcl
data_dir  = "/tmp/nomad/server"
log_level = "TRACE"

advertise {
  http = "127.0.0.1"
  rpc  = "127.0.0.1"
  serf = "127.0.0.1"
}

server {
  enabled           = true
  bootstrap_expect  = 1
  job_gc_interval   = "1m"
  job_gc_threshold  = "24h"
  eval_gc_threshold = "1m"
}

$ ./nomad agent -config server_config.hcl
```
```
# Local client no. 1
$ cat client_config.hcl
data_dir  = "/tmp/nomad/client-1"
log_level = "debug"

advertise {
  http = "127.0.0.1"
  rpc  = "127.0.0.1"
  serf = "127.0.0.1"
}

ports {
  http = "9876"
  rpc  = "9875"
  serf = "9874"
}

client {
  enabled       = true
  servers       = ["127.0.0.1"]
  gc_max_allocs = 1
}

plugin "raw_exec" {
  config {
    enabled = true
  }
}

$ ./nomad agent -config client_config.hcl
```
```
# Local client no. 2
$ cat client_config_2.hcl
data_dir  = "/tmp/nomad/client-2"
log_level = "debug"

advertise {
  http = "127.0.0.1"
  rpc  = "127.0.0.1"
  serf = "127.0.0.1"
}

ports {
  http = "8876"
  rpc  = "8875"
  serf = "8874"
}

client {
  enabled       = true
  servers       = ["127.0.0.1"]
  gc_max_allocs = 1
}

plugin "raw_exec" {
  config {
    enabled = true
  }
}
```
Please note that we start client node no. 2 in a way that naturally keeps all of its allocations alive forever. This relies on #16381, and we use it only because it is convenient. In production, the same effect is most easily reproduced by sysbatch jobs whose runtimes are non-uniform and simply splay across a large-ish period (say, 10 minutes). All we need from this node is that it retains its allocations and never lets them be GCed.

```
$ while true; do timeout 45 ./nomad agent -config client_config_2.hcl; done
```
Now, let us start a periodic job:
```
# Job
$ cat job.hcl
job "example" {
  datacenters = ["dc1"]
  type        = "sysbatch"

  periodic {
    cron = "*/10 * * * * *"
  }

  group "test-group" {
    task "test-task" {
      driver = "raw_exec"

      config {
        command = "/usr/bin/echo"
        args    = ["I ran!"]
      }
    }
  }
}

$ ./nomad job run job.hcl
Job registration successful
Approximate next launch time: 2023-06-01T21:20:00Z (9m25s from now)
```
Now, all we need to do is wait. Evaluations are only ever GCed every 5 minutes, and the GC cutoff is approximate, based on the Raft index:

```
[DEBUG] core.sched: eval GC found eligibile objects: evals=6 allocs=3
```
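The "approximate" part comes from translating the time-based threshold into a Raft-index cutoff. A simplified sketch of the idea, with illustrative names (the real mechanism in this version is Nomad's TimeTable, which records index/timestamp pairs):

```go
package main

import (
	"fmt"
	"time"
)

// entry pairs a Raft index with the wall-clock time it was observed.
type entry struct {
	index uint64
	when  time.Time
}

// nearestIndex returns the largest recorded index whose timestamp is at or
// before t. This is how a time threshold (e.g. eval_gc_threshold) becomes
// an approximate Raft-index cutoff.
func nearestIndex(table []entry, t time.Time) uint64 {
	var best uint64
	for _, e := range table {
		if !e.when.After(t) && e.index > best {
			best = e.index
		}
	}
	return best
}

func main() {
	now := time.Now()
	table := []entry{
		{index: 100, when: now.Add(-10 * time.Minute)},
		{index: 200, when: now.Add(-5 * time.Minute)},
		{index: 300, when: now.Add(-1 * time.Minute)},
	}

	cutoff := now.Add(-2 * time.Minute) // "eval_gc_threshold ago"
	// Objects whose modify index is <= this index count as old enough to GC.
	fmt.Println(nearestIndex(table, cutoff)) // 200
}
```

This is also why the index needs to keep moving forward for the GC to make progress, which motivates the next step.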
I left this running overnight, but realistically one could also just add artificial activity to the cluster so that the Raft index moves forward. If we now look at some of the periodic job runs, they have a lot of complete allocations (many more than there are nodes; in fact, we can make them have an arbitrary number!):
```
$ ./nomad job status example/periodic-1685712600
ID            = example/periodic-1685712600
Name          = example/periodic-1685712600
Submit Date   = 2023-06-02T09:30:00-04:00
Type          = sysbatch
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = dead
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
test-group  0       0         0        0       8         0     0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
7f95450b  274e6daa  test-group  0        run      complete  41s ago     41s ago
4b36a17c  75560bfc  test-group  0        run      complete  42m21s ago  42s ago
```
The record-holding job had >1000 completed allocs in a cluster with just 2 nodes, and on average one allocation was running every second for a sysbatch job configured to run every 10 minutes.

I expect a periodic sysbatch run to only ever reach `number_of_nodes` completed allocations. Since this happens for every periodic run, on top of new runs being created, we end up with an unbounded number of sysbatch periodic runs on every node (these jobs are never garbage collected, so their number grows without bound).
### Expected Result

Each periodic sysbatch job instance runs on every node in the system only once.
### Actual Result

Each periodic sysbatch job instance runs a large number of times on every node.
### Root cause

The root cause is a combination of the issue in #17395 and the fact that garbage collection for sysbatch jobs differs from that for batch jobs.

Batch jobs retain at least one allocation per task group that has run, so that task groups that exited with code 0 are not rescheduled (that is, they are expected not to run again):

https://github.com/hashicorp/nomad/blob/v1.5.5/nomad/core_sched.go#L306-L313

However, this logic does not exist for sysbatch jobs, which causes the behavior above.
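For illustration, here is a simplified sketch of that decision and how extending the batch-only guard to sysbatch would address the issue. This is not the actual upstream function; the types and names are invented for the sketch:

```go
package main

import "fmt"

// Job is a stripped-down stand-in for Nomad's job struct, with only the
// fields this sketch needs.
type Job struct {
	Type    string // "batch", "sysbatch", "service", ...
	Stopped bool
	Dead    bool
}

// keepTerminalAllocs sketches the guard from the linked core_sched.go code.
// Upstream v1.5.5 only applies it to "batch"; also applying it to
// "sysbatch" (the hypothetical fix) would prevent the re-runs in this issue.
func keepTerminalAllocs(jobType string) bool {
	return jobType == "batch" || jobType == "sysbatch"
}

// canGC returns whether an eval and its allocations may be collected.
func canGC(job *Job) bool {
	if job == nil {
		return true // job no longer exists
	}
	if keepTerminalAllocs(job.Type) {
		// Keep terminal allocs of live (sys)batch jobs so completed task
		// groups are not rescheduled; collect only once the job itself is
		// stopped and dead.
		return job.Stopped && job.Dead
	}
	return true
}

func main() {
	running := &Job{Type: "sysbatch"}
	fmt.Println(canGC(running)) // false with the fix; upstream v1.5.5 answers true
}
```

Either way, sysbatch jobs need the same re-run protection that batch jobs already have.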