Commit 2541b1c

fix: wrong terminal counts calculated during migration check (#5400)
The migration check, which runs every 30 seconds, calculated a wrong count. We fetch the count of terminal statuses in the status table, but at startup we "clean up" old jobs by appending another aborted status irrespective of the job's current state, so that query can count more than one terminal status per job. As a result, when the jobs are actually migrated, more jobs are moved than expected, because the expected number was too low.

This issue surfaced now because archival tables have a default retention of 24 hours, so on successive restarts more and more statuses were appended for the same job.

We expect the following number of jobs to be migrated:

numExpectedNumberOfMigratedJobs (e) = number of jobs (a) - number of terminal statuses in status table (b)

Due to the cleanup at startup, even when a remains the same, b increases with the retention duration, effectively decreasing e. The server panics when it actually migrates more jobs than e.

With this fix, b becomes the number of job IDs with a terminal status in the status table, which is bound to remain the same even if more statuses are appended for the same job.
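The arithmetic above can be illustrated with a small in-memory sketch. It is not the jobsdb code: the `status` type, the state names, and the sample rows are hypothetical, and it assumes that "job IDs with a terminal status" means jobs whose most recent status row is terminal.

```go
package main

import "fmt"

// status is a hypothetical, simplified row of the job status table.
type status struct {
	ID       int64 // insertion order of the status row
	JobID    int64
	JobState string
}

// countTerminalStatuses mirrors the old behaviour: every terminal row is
// counted, so a job with several terminal rows is counted several times.
func countTerminalStatuses(rows []status, terminal map[string]bool) int {
	n := 0
	for _, r := range rows {
		if terminal[r.JobState] {
			n++
		}
	}
	return n
}

// countJobsWithTerminalLastStatus counts each job at most once, looking
// only at its most recent status row (assumed semantics of the fix).
func countJobsWithTerminalLastStatus(rows []status, terminal map[string]bool) int {
	type latest struct {
		id    int64
		state string
	}
	last := map[int64]latest{}
	for _, r := range rows {
		if cur, ok := last[r.JobID]; !ok || r.ID > cur.id {
			last[r.JobID] = latest{id: r.ID, state: r.JobState}
		}
	}
	n := 0
	for _, l := range last {
		if terminal[l.state] {
			n++
		}
	}
	return n
}

func main() {
	terminal := map[string]bool{"succeeded": true, "aborted": true}
	rows := []status{
		{ID: 1, JobID: 1, JobState: "succeeded"},
		{ID: 2, JobID: 2, JobState: "executing"},
		{ID: 3, JobID: 1, JobState: "aborted"}, // appended by cleanup on restart 1
		{ID: 4, JobID: 2, JobState: "aborted"}, // appended by cleanup on restart 1
		{ID: 5, JobID: 1, JobState: "aborted"}, // appended by cleanup on restart 2
		{ID: 6, JobID: 2, JobState: "aborted"}, // appended by cleanup on restart 2
	}
	totalJobs := 3 // a: job 3 has no status row yet

	// e = a - b with the old b (terminal rows): 3 - 5
	fmt.Println(totalJobs - countTerminalStatuses(rows, terminal)) // prints -2
	// e = a - b with the new b (jobs with terminal last status): 3 - 2
	fmt.Println(totalJobs - countJobsWithTerminalLastStatus(rows, terminal)) // prints 1
}
```

With the row-based count, every restart pushes e further down while the number of jobs stays fixed. With the per-job count, additional restarts leave e unchanged.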
1 parent 3258e9d commit 2541b1c

1 file changed: +1 -1 lines changed

jobsdb/migration.go

Lines changed: 1 addition & 1 deletion
@@ -628,7 +628,7 @@ func (jd *Handle) checkIfMigrateDS(ds dataSetT) (
 	`with combinedResult as (
 		select
 		(select count(*) from %[1]q) as totalJobCount,
-		(select count(*) from %[2]q where job_state = ANY($1)) as terminalJobCount,
+		(select count(*) from "v_last_%[2]s" where job_state = ANY($1)) as terminalJobCount,
 		(select created_at from %[1]q order by job_id desc limit 1) as maxCreatedAt,
 		COALESCE((select exec_time < $2 from %[2]q where job_state = ANY($1) order by id asc limit 1), false) as retentionExpired
 	)
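
The changed line swaps `%[2]q` for `"v_last_%[2]s"`. A minimal sketch of how the two verbs expand with `fmt.Sprintf` follows. The helper name and the table names `jobs_1` and `job_status_1` are made up for illustration, and the query is shortened to the two count columns.

```go
package main

import "fmt"

// terminalCountQueries is a hypothetical helper that renders the count part
// of the query before and after the change, to show the formatting only.
func terminalCountQueries(jobTable, statusTable string) (before, after string) {
	// %[2]q wraps the second argument in double quotes, which yields a
	// quoted identifier for a plain table name.
	before = fmt.Sprintf(
		`select (select count(*) from %[1]q) as totalJobCount, (select count(*) from %[2]q where job_state = ANY($1)) as terminalJobCount`,
		jobTable, statusTable)
	// To prefix the name with v_last_ inside the same quoted identifier,
	// the quotes are written literally and %[2]s inserts the bare name.
	after = fmt.Sprintf(
		`select (select count(*) from %[1]q) as totalJobCount, (select count(*) from "v_last_%[2]s" where job_state = ANY($1)) as terminalJobCount`,
		jobTable, statusTable)
	return before, after
}

func main() {
	before, after := terminalCountQueries("jobs_1", "job_status_1")
	fmt.Println(before) // ... from "job_status_1" where ...
	fmt.Println(after)  // ... from "v_last_job_status_1" where ...
}
```

Using `%q` together with a prefix would put the prefix outside the quotes, so the literal quotes around `v_last_%[2]s` keep the whole name as one identifier. The `retentionExpired` subquery in the diff still reads from the status table itself via `%[2]q`.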
