
Nomad CLI Monitor Dispatched Jobs #27541

Open
Juanadelacuesta wants to merge 8 commits into main from NMD-396-job-dispatch

Conversation

Member

@Juanadelacuesta commented Feb 19, 2026

Description

This PR adds more complete monitoring and follow-up to the 'nomad job dispatch' command, making its output more similar to that of the 'nomad job run' command.

Links

This request comes from the community:
NMD-396

Contributor Checklist

  • Changelog Entry If this PR changes user-facing behavior, please generate and add a
    changelog entry using the make cl command.
  • Testing Please add tests to cover any new functionality or to demonstrate bug fixes and
    ensure regressions will be caught.
  • Documentation If the change impacts user-facing functionality such as the CLI, API, UI,
    and job configuration, please update the Nomad product documentation, which is stored in the
    web-unified-docs repo. Refer to the web-unified-docs contributor guide for docs guidelines.
    Please also consider whether the change requires notes within the upgrade guide.
    If you would like help with the docs, tag the nomad-docs team in this PR.

Reviewer Checklist

  • Backport Labels Please add the correct backport labels as described by the internal
    backporting document.
  • Commit Type Ensure the correct merge method is selected which should be "squash and merge"
    in the majority of situations. The main exceptions are long-lived feature branches or merges where
    history should be preserved.
  • Enterprise PRs If this is an enterprise only PR, please add any required changelog entry
    within the public repository.
  • If a change needs to be reverted, we will roll out an update to the code within 7 days.

Changes to Security Controls

Are there any changes to security controls (access controls, encryption, logging) in this pull request? If so, explain.

Member

@gulducat gulducat left a comment


This looks really great! I do have one sticky issue for us to iron out though. Maybe worth discussing a bit as a team?

Comment thread command/job_dispatch.go Outdated
Comment on lines +280 to +284
// Running with desired=run is stable
if alloc.DesiredStatus == api.AllocDesiredStatusRun &&
alloc.ClientStatus == api.AllocClientStatusRunning {
return true
}
Member


This behaves differently than a deployment, which has an update.min_healthy_time parameter. Here, if a task doesn't fail immediately, it passes through running until it exits, before getting restarted and going back to pending, then running, and so on.

Example job sleepy-fail.nomad.hcl:

job "sleepy-fail" {
  type = "batch"
  parameterized {}
  group "g" {
    task "t" {
      driver = "raw_exec"
      config {
        command = "bash"
        args    = ["-xc", "sleep 5; exit 1"]
      }
    }
    restart {
      attempts = 2
      delay    = "2s"
    }
    reschedule {
      attempts = 0
    }
  }
}

and this script dispatch.sh watches the status change over time:

#!/usr/bin/env bash

job="${1:-sleepy-fail}"

eval="$(nomad operator api -X PUT /v1/job/$job/dispatch/payload | jq -r '.EvalID')"
[ -z "$eval" ] && { echo 'no eval...'; exit 1; }
echo "eval: $eval"

idx=0
s='na'
while true; do
  eval "$(
    nomad operator api "/v1/evaluation/$eval/allocations?index=$idx" \
    | jq -r '.[0] | "idx=\(.ModifyIndex); s=\(.ClientStatus)"'
  )"
  echo "$(date +'%H:%M:%S.%N') idx=$idx status=$s"
  [ "$s" == 'failed' ] && break
  [ -z "$s" ] && { echo 'no status...'; exit 1; }
done

Here you can see it show running for the 5 seconds that the task sleeps without exiting, then get restarted a couple times before finally failing.

$ nomad run sleepy-fail.nomad.hcl
Job registration successful

$ ./dispatch.sh
eval: 601f0f65-4293-1157-f567-ffc2ee8a9ef0
13:33:40.830825835 idx=838 status=pending
13:33:41.701519927 idx=840 status=running
13:33:45.968495758 idx=841 status=pending
13:33:48.630084597 idx=842 status=running
13:33:53.497547555 idx=843 status=pending
13:33:55.955814207 idx=844 status=running
13:34:01.027398738 idx=845 status=failed

Member


I'm not sure what to do about it... Maybe since these are batch jobs, we should expect that the alloc must be complete, failed, lost, stopped, or evicted, and ignore running altogether?

If we do that, then I think this new behavior should be gated behind a flag, because people may be dispatching stuff that takes a long time (or possibly even never exits, like a service job!), so we can't really default to waiting potentially forever...
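A minimal sketch of the terminal-state check being suggested here. The status strings are redeclared locally so the example is self-contained (in the real code these would be the api.AllocClientStatus* constants), and the set of statuses treated as terminal is illustrative, not the PR's final list:

```go
package main

import "fmt"

// Local stand-ins for Nomad's allocation client status values,
// redeclared here so this sketch compiles on its own.
const (
	statusComplete = "complete"
	statusFailed   = "failed"
	statusLost     = "lost"
)

// isTerminal reports whether a batch allocation's client status is
// final, ignoring "running" entirely as suggested above.
func isTerminal(clientStatus string) bool {
	switch clientStatus {
	case statusComplete, statusFailed, statusLost:
		return true
	}
	return false
}

func main() {
	fmt.Println(isTerminal("running")) // false
	fmt.Println(isTerminal("failed"))  // true
}
```

With a check like this, a sleeping task that later fails would never be counted as done just because it is currently running.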

Member Author


After talking to some people, the idea was to monitor the job, not the "deployment". In that case, as Daniel mentioned, the change is too big to leave as the default, so it is placed behind the '-wait' flag: it will wait until all the allocs in all the task groups are in a final state.

Member

@gulducat gulducat left a comment


-wait is exactly what I was hoping for, thanks!

I have a couple comments being picky about words, but also another potential problem. Do we also want to monitor reschedule attempts? At present, it exits after the first alloc fails.

Comment thread command/job_dispatch.go
Comment on lines +309 to +315
// Count healthy (running) and unhealthy (failed) allocations
switch alloc.ClientStatus {
case api.AllocClientStatusRunning:
state.HealthyAllocs++
case api.AllocClientStatusFailed:
state.UnhealthyAllocs++
}
Member


Most of my problems are caused by me fixating on words 😅

Here, while my job's task is running but before it exits, it shows as "Healthy", even though Nomad doesn't actually know anything about its health. Service deployments are different because they actually inspect health checks.

With my sleepy-fail job, it's really not healthy, it just hasn't failed yet. This is what it shows until the task completes (fails), then it switches to Unhealthy=1 and exits.

⠴ Monitoring allocations for job "sleepy-f"...

    Deployed
    Task Group  Desired  Placed  Healthy  Unhealthy
    g           1        1       1        0

Personally I would prefer not to translate between concepts, and instead just say "Running" or "Failed", because that's all we actually know about the alloc.
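A minimal sketch of counting by raw client status instead of translating to health, using a hypothetical groupState struct in place of the PR's actual state type:

```go
package main

import "fmt"

// groupState is a hypothetical stand-in for the PR's per-task-group
// counters, labeled by what Nomad actually knows: the client status.
type groupState struct {
	Running int
	Failed  int
}

// tally counts allocations by client status, with no translation
// into "healthy"/"unhealthy".
func tally(statuses []string) groupState {
	var g groupState
	for _, s := range statuses {
		switch s {
		case "running":
			g.Running++
		case "failed":
			g.Failed++
		}
	}
	return g
}

func main() {
	g := tally([]string{"running", "failed", "pending"})
	fmt.Printf("Running=%d Failed=%d\n", g.Running, g.Failed)
}
```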

Comment thread command/job_dispatch.go Outdated
Comment on lines +442 to +445
// Task group is complete when all desired allocations are terminal
if terminalCount < state.DesiredTotal {
allTaskGroupsComplete = false
}
Member


I noticed something about my test job, not caused by only these lines, but related.

If I set reschedule.attempts = 2, then the command exits after the first allocation fails. Subsequent allocs are not monitored.
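One possible way to keep monitoring across reschedules is to follow each allocation's NextAllocation pointer to its replacement. A self-contained sketch, with the chain modeled as a plain map of IDs rather than real api.Allocation values:

```go
package main

import "fmt"

// latestAlloc follows a chain of replacement allocation IDs to the
// most recent one. In the real API, each allocation carries a
// NextAllocation field naming the alloc that replaced it after a
// reschedule; here that chain is a plain map for illustration.
func latestAlloc(start string, next map[string]string) string {
	id := start
	for next[id] != "" {
		id = next[id]
	}
	return id
}

func main() {
	// Hypothetical chain: a1 was rescheduled as a2, then a2 as a3.
	chain := map[string]string{"a1": "a2", "a2": "a3"}
	fmt.Println(latestAlloc("a1", chain)) // a3
}
```

The monitor could then keep watching the latest allocation instead of exiting when the first one fails.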

Comment thread command/job_dispatch.go Outdated
if allTaskGroupsComplete {
d.Close()
if hasFailures {
return 2 // Scheduling failure
Member


nit about the comment: my test job is scheduled successfully, but then I get this if my task exits non-0. I don't really think that's a "scheduling failure"

Member Author


It is not; it's more of a "we don't know what happened, but your job didn't finish".

Member Author

This PR fails because of a bug addressed in #27852; once that one is merged, this will work.


Labels

backport/1.11.x (backport to 1.11.x release line), theme/cli
