Savepoints flow changes #404
Description
Hey guys, following the comments from @functicons and @elanv on my PR #392, and elanv's intermediate fix in #401 for Savepoints (where more than one was being triggered), I would like to have a discussion here before moving on to solving this issue.
I have a few things I want to bring up:
- We need to decide how to make sure that no more than one SP is triggered at a given time. Right now I have implemented a `TriggerTime` mechanism that does not allow more than X triggers in a given period. Maybe we want a different approach: check the JobManager to see whether an SP/CP is in progress, and refuse to trigger an SP if one is active. I would like your opinion on what to do and where to implement this, as you both know the operator better than I do, @functicons @elanv.
- I think it's very important to add an optional flag that makes the operator refuse to submit a job without a savepoint. I'll explain: the configuration for updating a job (restarting from an old SP, or triggering a new one and restarting) is the same as the configuration for submitting a job when the CRD does not exist yet. For some jobs, starting with no savepoint (no state) is devastating and will cause corrupted results. With the flag set, the job would simply finish in an error if no savepoint is retrieved. This case will become pretty common when using the operator with a CD solution, which makes the install and upgrade configurations basically the same.
- I want there to be a way to trigger a job restart and SP with a CRD update. Right now, when I change an external ConfigMap, I also have to change parallelism for the operator to notice the change and trigger SP + restart. Any suggestions here?
Those might not be best solved in a single PR, but the biggest gap I felt when using the operator was this part regarding Savepoints; everything else works very well for me :)
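For the third point, one idea (only a sketch, nothing the operator does today as far as I know) is to compute a digest of the external ConfigMap's data and write it into some field of the custom resource, so that a ConfigMap change also changes the resource. The function name `configHash` is hypothetical, and whether the operator would react depends on which fields it actually watches.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// configHash returns a deterministic SHA-256 digest of ConfigMap-style
// key/value data. Keys are sorted so map iteration order does not matter.
func configHash(data map[string]string) string {
	keys := make([]string, 0, len(data))
	for k := range data {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha256.New()
	for _, k := range keys {
		// Length-prefix each part so that ("ab", "c") and ("a", "bc")
		// cannot produce the same byte stream.
		fmt.Fprintf(h, "%d:%s=%d:%s;", len(k), k, len(data[k]), data[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	before := configHash(map[string]string{"flink-conf.yaml": "taskmanager.numberOfTaskSlots: 1"})
	after := configHash(map[string]string{"flink-conf.yaml": "taskmanager.numberOfTaskSlots: 2"})
	fmt.Println("config changed:", before != after)
}
```

With something like this, the CD pipeline would not need to touch parallelism just to get the operator to notice a change.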