Nomad CLI Monitor Dispatched Jobs #27541
Conversation
Force-pushed from 71312a8 to 20af021, then from 20af021 to be99d7d.
```go
// Running with desired=run is stable
if alloc.DesiredStatus == api.AllocDesiredStatusRun &&
	alloc.ClientStatus == api.AllocClientStatusRunning {
	return true
}
```
This behaves differently than a deployment, which has an `update.min_healthy_time` parameter. Here, if a task doesn't fail immediately, it passes through `running` until it exits, before getting restarted and going back to `pending`, then `running`, and so on.
Example job `sleepy-fail.nomad.hcl`:

```hcl
job "sleepy-fail" {
  type = "batch"

  parameterized {}

  group "g" {
    task "t" {
      driver = "raw_exec"

      config {
        command = "bash"
        args    = ["-xc", "sleep 5; exit 1"]
      }
    }

    restart {
      attempts = 2
      delay    = "2s"
    }

    reschedule {
      attempts = 0
    }
  }
}
```

and this script `dispatch.sh` watches the status change over time:
```bash
#!/usr/bin/env bash
job="${1:-sleepy-fail}"
eval="$(nomad operator api -X PUT /v1/job/$job/dispatch/payload | jq -r '.EvalID')"
[ -z "$eval" ] && { echo 'no eval...'; exit 1; }
echo "eval: $eval"
idx=0
s='na'
while true; do
  eval "$(
    nomad operator api "/v1/evaluation/$eval/allocations?index=$idx" \
      | jq -r '.[0] | "idx=\(.ModifyIndex); s=\(.ClientStatus)"'
  )"
  echo "$(date +'%H:%M:%S.%N') idx=$idx status=$s"
  [ "$s" == 'failed' ] && break
  [ -z "$s" ] && { echo 'no status...'; exit 1; }
done
```

Here you can see it show `running` for the 5 seconds that the task sleeps without exiting, then get restarted a couple of times before finally failing.
```console
$ nomad run sleepy-fail.nomad.hcl
Job registration successful
$ ./dispatch.sh
eval: 601f0f65-4293-1157-f567-ffc2ee8a9ef0
13:33:40.830825835 idx=838 status=pending
13:33:41.701519927 idx=840 status=running
13:33:45.968495758 idx=841 status=pending
13:33:48.630084597 idx=842 status=running
13:33:53.497547555 idx=843 status=pending
13:33:55.955814207 idx=844 status=running
13:34:01.027398738 idx=845 status=failed
```
I'm not sure what to do about it... Maybe since these are batch jobs, we should expect that the alloc must be `complete`, `failed`, `lost`, `stopped`, or `evicted`, and ignore `running` altogether?
If we do that, then I think this new behavior should be gated behind a flag, because people may be dispatching stuff that takes a long time (or possibly even never exits, like a service job!), so we can't really default to waiting, potentially forever...
After talking to some people, the idea was to monitor the job, not the "deployment". In that case, as Daniel mentioned, the change is too big to leave as the default, so it is placed behind the `-wait` flag: it will wait until all the allocs in all the task groups are in a final state.
gulducat left a comment
-wait is exactly what I was hoping for, thanks!
I have a couple comments being picky about words, but also another potential problem. Do we also want to monitor reschedule attempts? At present, it exits after the first alloc fails.
```go
// Count healthy (running) and unhealthy (failed) allocations
switch alloc.ClientStatus {
case api.AllocClientStatusRunning:
	state.HealthyAllocs++
case api.AllocClientStatusFailed:
	state.UnhealthyAllocs++
}
```
Most of my problems are caused by me fixating on words 😅
Here, while my job's task is running but before it exits, it shows as "Healthy", but in truth Nomad doesn't know anything about its health. Service deployments are different because they actually inspect health checks.
With my sleepy-fail job, it's really not healthy, it just hasn't failed yet. This is what it shows until the task completes (fails), then it switches to Unhealthy=1 and exits.
```
⠴ Monitoring allocations for job "sleepy-f"...
Deployed
Task Group  Desired  Placed  Healthy  Unhealthy
g           1        1       1        0
```
Personally I would prefer not to translate between concepts, and instead just say "Running" or "Failed", because that's all we actually know about the alloc.
```go
// Task group is complete when all desired allocations are terminal
if terminalCount < state.DesiredTotal {
	allTaskGroupsComplete = false
}
```
I noticed something about my test job; it's not caused only by these lines, but it is related.
If I set `reschedule.attempts = 2`, then the command exits after the first allocation fails. Subsequent allocs are not monitored.
```go
if allTaskGroupsComplete {
	d.Close()
	if hasFailures {
		return 2 // Scheduling failure
```
Nit about the comment: my test job is scheduled successfully, but then I get this if my task exits non-zero. I don't really think that's a "scheduling failure".
It is not; it's more of a "we don't know what happened, but your job didn't finish".
Force-pushed from c062f0c to 1640f12.
This PR fails because of a bug addressed in #27852; once that one is merged, this will work.
Force-pushed from 05ccc75 to d8e2c9b.
Description
This PR adds more complete monitoring and follow-up to the `nomad job dispatch` command, making its output more similar to that of `nomad job run`.
Links
This is a request coming from the community:
NMD-396
Contributor Checklist

- Changelog entry using the `make cl` command.
- Tests that ensure regressions will be caught.
- If this change impacts user-facing behavior and job configuration, please update the Nomad product documentation, which is stored in the `web-unified-docs` repo. Refer to the `web-unified-docs` contributor guide for docs guidelines. Please also consider whether the change requires notes within the upgrade guide. If you would like help with the docs, tag the `nomad-docs` team in this PR.

Reviewer Checklist

- ... per the backporting document.
- Squash-and-merge in the majority of situations. The main exceptions are long-lived feature branches or merges where history should be preserved.
- ... within the public repository.
Changes to Security Controls
Are there any changes to security controls (access controls, encryption, logging) in this pull request? If so, explain.