
feat(cli): nomad job run: add --retry flag (default 3) with configurable backoff #27887

Draft

resmo wants to merge 2 commits into hashicorp:main from resmo:feature/job-run-retry

Conversation

@resmo
Contributor

@resmo resmo commented Apr 29, 2026

Description

Adds retry handling, with configurable backoff, for errors encountered while monitoring job status after a job run. Closes #12062
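For context, a minimal sketch of the behavior this flag is meant to add. The helper name monitorWithRetry, the doubling backoff, and the simulated monitor are illustrative assumptions, not the actual CLI code:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// monitorWithRetry runs fn once and retries up to `retries` more times on
// error, doubling the wait between attempts (backoff, 2*backoff, 4*backoff...).
func monitorWithRetry(fn func() error, retries int, backoff time.Duration) error {
	var err error
	for attempt := 0; attempt <= retries; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		if attempt < retries {
			wait := backoff << attempt // backoff * 2^attempt
			fmt.Printf("monitor error (attempt %d/%d): %v; retrying in %s\n",
				attempt+1, retries+1, err, wait)
			time.Sleep(wait)
		}
	}
	return fmt.Errorf("monitoring failed after %d retries: %w", retries, err)
}

func main() {
	// Simulated monitor that fails twice before succeeding, standing in
	// for the deployment polling that runs after job registration.
	calls := 0
	monitor := func() error {
		calls++
		if calls < 3 {
			return errors.New("Error fetching deployment")
		}
		return nil
	}
	if err := monitorWithRetry(monitor, 3, time.Second); err != nil {
		fmt.Println(err)
	}
}
```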

Testing & Reproduction steps

Unit tests added:
TBD

Links

N/A

Contributor Checklist

  • Changelog Entry If this PR changes user-facing behavior, please generate and add a
    changelog entry using the make cl command.
  • Testing Please add tests to cover any new functionality or to demonstrate bug fixes and
    ensure regressions will be caught.
  • Documentation If the change impacts user-facing functionality such as the CLI, API, UI,
    and job configuration, please update the Nomad product documentation, which is stored in the
    web-unified-docs repo. Refer to the web-unified-docs contributor guide for docs guidelines.
    Please also consider whether the change requires notes within the upgrade guide. If you
    would like help with the docs, tag the nomad-docs team in this PR.

Reviewer Checklist

  • Backport Labels Please add the correct backport labels as described by the internal
    backporting document.
  • Commit Type Ensure the correct merge method is selected which should be "squash and merge"
    in the majority of situations. The main exceptions are long-lived feature branches or merges where
    history should be preserved.
  • Enterprise PRs If this is an enterprise only PR, please add any required changelog entry
    within the public repository.
  • If a change needs to be reverted, we will roll out an update to the code within 7 days.

Changes to Security Controls

No changes to security controls.

@hashicorp-cla-app

hashicorp-cla-app Bot commented Apr 29, 2026

CLA assistant check
All committers have signed the CLA.

@resmo resmo changed the title from "nomad job run: add --retry flag (default 3) with configurable backoff" to "feature: nomad job run: add --retry flag (default 3) with configurable backoff" Apr 29, 2026
@resmo resmo marked this pull request as ready for review April 29, 2026 14:11
@resmo resmo requested review from a team as code owners April 29, 2026 14:11
@resmo resmo changed the title from "feature: nomad job run: add --retry flag (default 3) with configurable backoff" to "feat(cli): nomad job run: add --retry flag (default 3) with configurable backoff" Apr 29, 2026
@resmo resmo force-pushed the feature/job-run-retry branch from 1aad5a6 to 45741be Compare April 29, 2026 14:22
@resmo
Contributor Author

resmo commented Apr 29, 2026

> Thank you for your submission! We require that all contributors sign our Contributor License Agreement ("CLA") before we can accept the contribution. Read and sign the agreement
>
> Learn more about why HashiCorp requires a CLA and what the CLA includes
>
> Have you signed the CLA already but the status is still pending? Recheck it.

I signed the CLA, but the status is still pending even though I clicked recheck. Hmm...

@tgross
Member

tgross commented Apr 29, 2026

> I signed the CLA, but the status is still pending even though I clicked recheck. Hmm...

That's because you're not the author of the PR, Copilot is (see the patch). Copilot can't sign the CLA; only humans can. Read through the AI policy and rebase your contribution once you've edited it.

@resmo resmo force-pushed the feature/job-run-retry branch from 45741be to 3593a65 Compare April 29, 2026 16:11
@resmo
Contributor Author

resmo commented Apr 29, 2026

> > I signed the CLA, but the status is still pending even though I clicked recheck. Hmm...
>
> That's because you're not the author of the PR, Copilot is (see the patch). Copilot can't sign the CLA; only humans can. Read through the AI policy and rebase your contribution once you've edited it.

I see. I didn't want to "hide" it, which is why I kept it and mentioned it in the PR. Anyway, recommitted and pushed.

Member

@tgross tgross left a comment


Hi @resmo! Unfortunately, I don't think this PR actually resolves the intended issue. There are a few high-level problems:

  • The #12062 report is about the monitoring of the job. The RegisterOpts call gets an
    evaluation ID back on success, and then the monitor spins up to poll for the deployment to
    appear (which is async) and to track that deployment's progress. This change does nothing to
    improve handling of errors that happen during monitoring.
  • The PR retries not only on transient errors but also on non-transient ones.
  • The notion of retrying job registration is problematic, as this is not an idempotent operation. You could hypothetically write the new job to Raft and then have RPC forwarding fail. This would end up causing multiple Raft writes for the same job. But if you're using -check-index in that case, none of the retries will succeed.

I think there's value here but I think we'd want to avoid retrying the non-idempotent operations and focus on the monitoring that happens afterwards.
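For illustration, a rough sketch of that direction: register exactly once (no retry around the non-idempotent Raft write), then wrap only the monitoring step in a retry loop that gives up on non-transient errors. The names isTransient, monitorDeployment, and runAndMonitor, and the matched error strings, are assumptions for this sketch, not the real command/monitor.go logic:

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// isTransient reports whether an error looks like a temporary RPC or
// lookup failure worth retrying; anything else aborts immediately.
// The matched strings here are assumptions for illustration.
func isTransient(err error) bool {
	msg := err.Error()
	return strings.Contains(msg, "Error fetching deployment") ||
		strings.Contains(msg, "rpc error") ||
		strings.Contains(msg, "connection refused")
}

// monitorDeployment stands in for the deployment polling the CLI performs
// with the evaluation ID returned on successful registration.
func monitorDeployment(evalID string) error {
	fmt.Println("monitoring deployment for eval", evalID)
	return nil
}

// runAndMonitor registers the job exactly once (re-submitting a register
// could write the same job to Raft multiple times), then retries only the
// monitoring step, and only on transient errors.
func runAndMonitor(register func() (string, error), retries int, backoff time.Duration) error {
	evalID, err := register()
	if err != nil {
		return err // registration errors are not retried
	}
	for attempt := 0; ; attempt++ {
		err = monitorDeployment(evalID)
		if err == nil || !isTransient(err) || attempt >= retries {
			return err
		}
		time.Sleep(backoff << attempt) // exponential backoff between retries
	}
}

func main() {
	register := func() (string, error) { return "eval-1234", nil }
	if err := runAndMonitor(register, 3, time.Second); err != nil {
		fmt.Println("monitoring failed:", err)
	}
}
```

Whether to classify errors by message or by typed error values is a design choice; typed errors from the API client would be more robust than string matching, if they're available.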

@resmo resmo marked this pull request as draft April 30, 2026 06:52

Development

Successfully merging this pull request may close these issues.

[feature]: cli job run retry on "Error fetching deployment" #12062
