This repository was archived by the owner on Sep 2, 2022. It is now read-only.

Fix job recovery and savepoint bug #401

Merged 1 commit on Feb 9, 2021


elanv
Contributor

@elanv elanv commented Feb 2, 2021

  • Job recovery bug
    When the job manager fails and the job state falls to Lost, the job is not properly recovered.
    The job is resubmitted, but the previous job ID is not cleared, so the resubmitted job is treated as an unexpected job and canceled.

  • Savepoint bug
    Savepoint is not triggered properly.

Resolves #398
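
The job recovery bug described above can be illustrated with a minimal Go sketch. It is not the operator's actual code: the `JobStatus` type and the `isUnexpected` and `clearStaleID` helpers are hypothetical names introduced only for illustration, and the sketch assumes that a job counts as "unexpected" when its ID differs from a recorded, non-empty job ID.

```go
package main

import "fmt"

// JobStatus is a hypothetical, simplified stand-in for the recorded job
// status. It is an assumption for illustration, not the operator's real type.
type JobStatus struct {
	ID    string // ID of the job that was last recorded
	State string // for example "Running" or "Lost"
}

// isUnexpected models the assumed check: a job observed on the cluster is
// treated as unexpected when an ID is recorded and the observed ID differs.
func isUnexpected(recorded JobStatus, observedID string) bool {
	return recorded.ID != "" && recorded.ID != observedID
}

// clearStaleID models the described remedy: when the job is Lost, drop the
// previous job ID so that a resubmitted job is not compared against it.
func clearStaleID(s JobStatus) JobStatus {
	if s.State == "Lost" {
		s.ID = ""
	}
	return s
}

func main() {
	lost := JobStatus{ID: "old-job-id", State: "Lost"}

	// Without clearing, the resubmitted job looks unexpected (and would be canceled).
	fmt.Println(isUnexpected(lost, "new-job-id"))

	// After clearing the stale ID, the resubmitted job is accepted.
	fmt.Println(isUnexpected(clearStaleID(lost), "new-job-id"))
}
```

In this sketch the first check reports the resubmitted job as unexpected because the stale ID is still recorded; once the ID is cleared, the same job passes the check.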

@elanv
Contributor Author

elanv commented Feb 2, 2021

@shashken @functicons
I have also fixed the issue raised in #392 (comment), because working savepoints are required for this PR.

@shashken
Contributor

shashken commented Feb 3, 2021

@elanv good job. @functicons, this needs to be approved quickly: the version on master does not take savepoints.
I didn't run a complete test after I pushed the CR changes, and that line, error != nil, is broken.

@morelina

morelina commented Feb 8, 2021

I am using the commit from this PR and am having issues when updating the job: #408

@functicons functicons self-requested a review February 9, 2021 05:41
@functicons
Collaborator

/gcbrun

Collaborator

@functicons functicons left a comment


Thanks for the fix!

@functicons functicons merged commit 7c37f85 into GoogleCloudPlatform:master Feb 9, 2021
@pgandhijr

I am experiencing this bug: when the job manager fails, the task is submitted, but it fails because it is using an old job ID. I am using tag v1beta-9. Is there an older stable version that is not affected by this bug that we could use?

Development

Successfully merging this pull request may close these issues.

Job is lost after JobManager restart
5 participants