feat(cli): nomad job run: add --retry flag (default 3) with configurable backoff #27887
resmo wants to merge 2 commits into hashicorp:main
Force-pushed from 1aad5a6 to 45741be.
I signed the CLA, but the status stays pending even though I clicked recheck. Hmm...
Force-pushed from 45741be to 3593a65.
I see. I didn't want to "hide" it; that is why I kept it and mentioned it in the PR. Anyway, I recommitted and pushed.
tgross left a comment
Hi @resmo! Unfortunately, I don't think this PR actually resolves the intended issue. There are a few high-level problems:
- The #12062 report is about the monitoring of the job. The `RegisterOpts` call gets an evaluation ID back on success, and then the monitor spins up to start polling for the deployment to appear (which happens asynchronously), and then monitors the progress of that deployment. This change does nothing to improve handling of errors that happen during monitoring.
- The PR retries not only on transient errors but also on non-transient errors.
- The notion of retrying job registration is problematic, as this is not an idempotent operation. You could hypothetically write the new job to Raft and then have RPC forwarding fail. This would end up causing multiple Raft writes for the same job. But if you're using `-check-index` in that case, none of the retries will succeed.
I think there's value here, but we'd want to avoid retrying the non-idempotent operations and focus on the monitoring that happens afterwards.
Description
Adds a retry for errors when monitoring job status after a job run. Closes #12062
Testing & Reproduction steps
Unit tests added:
TBD
Links
N/A
Contributor Checklist
- changelog entry using the `make cl` command.
- ensure regressions will be caught.
- and job configuration, please update the Nomad product documentation, which is stored in the `web-unified-docs` repo. Refer to the `web-unified-docs` contributor guide for docs guidelines. Please also consider whether the change requires notes within the upgrade guide. If you would like help with the docs, tag the `nomad-docs` team in this PR.

Reviewer Checklist
- backporting document.
- in the majority of situations. The main exceptions are long-lived feature branches or merges where history should be preserved.
- within the public repository.
Changes to Security Controls
No changes to security controls.