Commit 2541b1c
fix: wrong terminal counts calculated during migration check (#5400)
The migration check that runs every 30 seconds calculated a wrong count.
We fetch the count of terminal statuses in the status table. However, at startup we "clean up" old jobs simply by appending another aborted status irrespective of the job's current state, so that query can count more than one terminal status per job.
As a result, when the jobs are actually migrated, more jobs are moved than the check expected.
The issue surfaced now because archival tables have a default retention of 24 hours, so on successive restarts more and more statuses were appended for the same job. We expect the following number of jobs to be migrated:
expected number of migrated jobs (e) = number of jobs (a) - number of terminal statuses in the status table (b)
Because of the cleanup at startup, even when a stays the same, b grows with the retention duration, which effectively decreases e. The server panics when it actually migrates more jobs than e.
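The drift in b can be reproduced with a small sketch. It is not the project's code: the table names `jobs` and `job_status`, the column names, the state names, and the helper functions are all hypothetical, and an in-memory SQLite database stands in for the real store. It compares counting terminal status rows with counting distinct job IDs that have a terminal status.

```python
import sqlite3

# Hypothetical terminal state names, chosen for illustration only.
TERMINAL_STATES = ("succeeded", "aborted")


def expected_migrations(conn, distinct):
    """Return e = a - b, where b is counted per row or per distinct job ID."""
    total_jobs = conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
    counted = "COUNT(DISTINCT job_id)" if distinct else "COUNT(*)"
    terminal = conn.execute(
        f"SELECT {counted} FROM job_status WHERE job_state IN (?, ?)",
        TERMINAL_STATES,
    ).fetchone()[0]
    return total_jobs - terminal


def simulate(restarts):
    """Build a toy dataset and return (e by row count, e by distinct job ID)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE jobs (job_id INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE job_status (job_id INTEGER, job_state TEXT)")
    # a = 10 jobs; jobs 1 and 2 already have a terminal status.
    conn.executemany("INSERT INTO jobs VALUES (?)", [(i,) for i in range(1, 11)])
    conn.executemany(
        "INSERT INTO job_status VALUES (?, 'succeeded')", [(1,), (2,)]
    )
    # Assumed model of the startup cleanup: every restart appends another
    # aborted status to the old jobs, regardless of their current state.
    for _ in range(restarts):
        conn.executemany(
            "INSERT INTO job_status VALUES (?, 'aborted')", [(1,), (2,)]
        )
    return expected_migrations(conn, False), expected_migrations(conn, True)


if __name__ == "__main__":
    for restarts in (0, 1, 2):
        print(restarts, simulate(restarts))
```

In this toy model the row-based value of e shrinks with every restart, while the value based on distinct job IDs stays at 8, which matches the number of jobs that would actually be moved.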
With this fix, b becomes the number of job IDs with a terminal status in the status table, which stays the same even if more statuses are appended for the same job.
1 parent 3258e9d commit 2541b1c
1 file changed, +1 -1 lines changed

| Original file line number | Diff line number | Diff line change |
|---|---|---|
| 628 | 628 | |
| 629 | 629 | |
| 630 | 630 | |
| 631 | | - |
| | 631 | + |
| 632 | 632 | |
| 633 | 633 | |
| 634 | 634 | |