[Fix][Zeta] Guard state cleanup races after node failure#10687
zhangshenghang wants to merge 27 commits into apache:dev from
Conversation
DanielLeens
left a comment
I agree with the tombstone approach for late callbacks, but I think the delayed cleanup currently loses its cleanup owner across a second master failover.
scheduleRemoveJobStateMaps() removes runningJobInfoIMap immediately and then schedules removeJobStateMaps() only in the local monitorService. If that master dies during the delay window, the scheduled task disappears with it. When the next master restores, there is no runningJobInfoIMap entry left to rediscover this job, and restoreJobFromMasterActiveSwitch() just returns for terminal states.
That leaves the terminal entries in runningJobStateIMap / runningJobStateTimestampsIMap orphaned permanently. I think the delayed-cleanup intent needs to be persisted in distributed state (or another recoverable cleanup record), otherwise this closes the race only as long as the same master survives until the timer fires.
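For illustration only, the recoverable-cleanup shape I have in mind is roughly the following; the map name, record type, and helper methods here are hypothetical, not existing APIs:

```java
// Hypothetical sketch: persist the cleanup intent in a distributed map before the delay,
// so a newly elected master can rediscover and finish the cleanup after failover.
IMap<Long, JobCleanupRecord> pendingJobCleanupIMap =
        hazelcastInstance.getMap("engine_pendingJobCleanup");

JobCleanupRecord record =
        new JobCleanupRecord(jobId, jobInitializationTimestamp, stateKeysToRemove);
pendingJobCleanupIMap.put(jobId, record);

// Only drop the record after the delayed cleanup actually executed.
monitorService.schedule(
        () -> {
            removeJobStateMaps(record);
            pendingJobCleanupIMap.remove(jobId);
        },
        cleanupDelayMillis,
        TimeUnit.MILLISECONDS);
```

On master switch, the new master could then scan this map in addition to runningJobInfoIMap and reschedule any pending records.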
DanielLeens
left a comment
Thanks for the update. I re-reviewed the latest HEAD locally, and I still see the same failover hole during the delayed-cleanup window.
cleanJob() still calls scheduleRemoveJobStateMaps() (JobMaster.java:778-782), and that method still removes runningJobInfoIMap immediately (JobMaster.java:644-647) before scheduling the delayed cleanup on the local monitorService (JobMaster.java:658-672). But master-switch restore only scans runningJobInfoIMap.entrySet() (CoordinatorService.java:636-642, 665-681).
So if the active master dies before the delayed task fires, the next master has no distributed record left to rediscover this terminal job, and the remaining state maps are still orphaned. The new end-state guard in restoreJobFromMasterActiveSwitch() does not close that gap, because it only runs for jobs that still have a runningJobInfoIMap entry.
The new JobStateCleanupDelayTest currently asserts that runningJobInfoIMap is already null immediately after terminal completion, which effectively codifies the same gap instead of covering the second-master-failover case.
I think the delayed-cleanup intent still needs to be persisted in recoverable distributed state, or runningJobInfoIMap needs to remain until delayed cleanup actually executes, before this can merge.
DanielLeens
left a comment
Thanks for the update. I pulled the latest HEAD locally and re-reviewed the delayed-cleanup path.
I still see the same blocking failover hole during the cleanup-delay window. cleanJob() still calls scheduleRemoveJobStateMaps(), and that method still removes runningJobInfoIMap immediately before scheduling the delayed cleanup only on the local monitorService. But master-switch restore still discovers jobs only by scanning runningJobInfoIMap. So if the active master dies before the delayed task fires, the next master has no distributed record left to rediscover this terminal job, and the remaining state maps are still orphaned.
The new end-state guard in restoreJobFromMasterActiveSwitch() does not close that gap because it only runs for jobs that still have a runningJobInfoIMap entry. The new JobStateCleanupDelayTest currently asserts that runningJobInfoIMap is already null during the delay window, which codifies the same hole instead of covering the second-master-failover case.
I think the delayed cleanup intent still needs to be persisted in recoverable distributed state, or runningJobInfoIMap needs to stay until the delayed cleanup actually executes. After that, this will be much closer.
DanielLeens
left a comment
I pulled the latest HEAD locally again and I still see the same blocking failover hole during the cleanup-delay window.
JobMaster.scheduleRemoveJobStateMaps() still removes runningJobInfoIMap immediately before scheduling the delayed cleanup only on the local monitorService. But master-switch restore still discovers jobs by scanning runningJobInfoIMap in CoordinatorService.restoreAllRunningJobFromMasterNodeSwitch(). So if the active master dies before the delayed task fires, the next master still has no distributed record left to rediscover this terminal job, and the remaining state maps can still be orphaned.
The new stateCleanupDelayMillis=0 test config and the late-checkpoint guard do not close that gap, because they do not persist the delayed-cleanup intent across a second master failover.
I still think this needs one of these two directions before merge:
- keep `runningJobInfoIMap` until the delayed cleanup actually executes, or
- persist the delayed-cleanup intent in recoverable distributed state.
After that, I am happy to re-review.
DanielLeens
left a comment
I pulled the latest HEAD locally again and re-checked the terminal-cleanup / master-switch path.
The previous failover hole looks closed now: JobMaster.scheduleRemoveJobStateMaps() persists a JobCleanupRecord in IMAP_PENDING_JOB_CLEANUP, CoordinatorService.restoreJobFromMasterActiveSwitch() reschedules terminal cleanup instead of dropping the job blindly, and the REST / overview paths filter delayed-cleanup tombstones so finished jobs are not shown as running. With the new unit / E2E coverage around delayed cleanup and no-restore cluster failure, I do not see the previous blocker in the current revision.
DanielLeens
left a comment
Thanks for the latest update. I re-reviewed the current head locally as seatunnel-review-10687 at commit 17750abb9, comparing it with upstream/dev.
What This PR Fixes
- User pain: after terminal job/pipeline state cleanup, late asynchronous callbacks or active-master switch recovery can observe missing runtime state and either report misleading errors or leave stale job metadata behind.
- Fix approach: the PR delays terminal job-state cleanup, records pending cleanup metadata in Hazelcast, reschedules cleanup after master failover, and prevents a new non-savepoint submission from reusing a job id while the old terminal state is still waiting for cleanup.
- One-line value: terminal state becomes a short-lived tombstone instead of disappearing immediately, which makes late callbacks and master failover safer.
Core Logic Review
Key changed files and methods:
- `CoordinatorService.java`: `schedulePendingJobCleanup(...)`, `processPendingJobCleanup(...)`, `restoreAllRunningJobFromMasterNodeSwitch(...)`, and the `submitJob(...)` pending-cleanup guard.
- `JobMaster.java`: `createJobCleanupRecord()` and `scheduleRemoveJobStateMaps()`.
- `JobCleanupRecord.java`: distributed cleanup metadata for job-level state keys.
- `JobInfoService.java`: hides terminal jobs that are retained only as cleanup tombstones.
Important before/after point:
```java
// Before: terminal state could be removed immediately, so late callbacks saw null state.
removeJobStateMaps();
```

```java
// After: terminal state is retained and removed later by a cleanup record.
pendingJobCleanupIMap.put(jobId, cleanupRecord);
coordinatorService.schedulePendingJobCleanup(jobId, cleanupRecord);
```

The normal Zeta lifecycle does hit this change:
Job reaches terminal state
-> PhysicalPlan.addPipelineEndCallback()
-> JobMaster.initStateFuture() completion handler
-> JobMaster.cleanJob()
-> createJobCleanupRecord()
-> pendingJobCleanupIMap.put(jobId, record)
-> CoordinatorService.schedulePendingJobCleanup(jobId, record)
Delayed cleanup
-> CoordinatorService.processPendingJobCleanup(jobId, record)
-> verifies initializationTimestamp still matches current JobInfo
-> verifies current job state is terminal
-> removes runningJobInfoIMap
-> removes recorded state/timestamp keys
Active-master switch before cleanup fires
-> CoordinatorService.restoreAllRunningJobFromMasterNodeSwitch()
-> terminal-state zombie jobs are pre-filtered
-> restoreJobFromMasterActiveSwitch()
-> reschedules pending cleanup if a cleanup record exists
-> otherwise removes stale runningJobInfoIMap
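For reference, the ownership and terminal checks in the delayed path should behave conceptually like this sketch (my paraphrase; field and helper names are approximate, not verbatim from the patch):

```java
// Paraphrased sketch of the delayed cleanup: only delete state that still belongs to the
// job generation that scheduled the cleanup, and only while that job stays terminal.
private void processPendingJobCleanup(long jobId, JobCleanupRecord record) {
    JobInfo currentJobInfo = runningJobInfoIMap.get(jobId);
    if (currentJobInfo != null
            && currentJobInfo.getInitializationTimestamp() != record.getInitializationTimestamp()) {
        // The job id has been reused by a newer submission; do not touch its state.
        return;
    }
    JobStatus currentStatus = (JobStatus) runningJobStateIMap.get(jobId);
    if (currentStatus != null && !currentStatus.isEndState()) {
        // Unexpectedly non-terminal; keep the record so cleanup can be retried later.
        return;
    }
    runningJobInfoIMap.remove(jobId);
    record.getStateKeys().forEach(runningJobStateIMap::remove);
    record.getStateKeys().forEach(runningJobStateTimestampsIMap::remove);
    pendingJobCleanupIMap.remove(jobId);
}
```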
Local static verification:
- `git diff --stat upstream/dev...seatunnel-review-10687`: 25 files changed, +1272/-45.
- `git diff --name-status upstream/dev...seatunnel-review-10687`: Zeta coordinator/job-master cleanup code, serialization hook, REST visibility, tests, and one E2E are touched.
- `gh pr checks 10687`: `Build` is `CANCELLED`; label and notification checks are successful.
- Local build/tests: not run. This review is based on the local branch, full diff, and the job lifecycle / failover call-chain inspection.
Compatibility, Side Effects, Errors, and Logs
Compatibility impact: mostly compatible, with one intentional operational behavior change. Finished jobs are retained in runtime maps for state-cleanup-delay-ms before cleanup, while JobInfoService.shouldShowAsRunningJob() prevents these tombstones from showing as running jobs. No public API, protocol, or serialization format used by clients is removed, but a new internal Hazelcast data type is added via ResourceDataSerializerHook.
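Conceptually the tombstone filter amounts to something like this (my sketch, not the exact `JobInfoService.shouldShowAsRunningJob()` signature):

```java
// Sketch of the tombstone filter: a terminal job that is only retained while it waits for
// delayed cleanup should not be reported as a running job.
private boolean shouldShowAsRunningJob(long jobId, JobStatus status) {
    boolean delayedCleanupTombstone =
            status != null && status.isEndState() && pendingJobCleanupIMap.containsKey(jobId);
    return !delayedCleanupTombstone;
}
```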
Performance and side effects: the default 60s retention adds bounded temporary IMap entries for terminal jobs/pipelines/tasks and checkpoint state keys. Cleanup scheduling uses the existing monitor service and should not add hot-path CPU/network cost. The owner timestamp guard is important and is present, so delayed cleanup should not remove a newly submitted job with the same id.
Error handling and logs: cleanup failures are logged and retried through retained cleanup records. Late state transitions now log and skip when the state entry is already missing or terminal, instead of trying to force an invalid transition.
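In other words, the late-callback handling now reduces to a guard of this shape (simplified sketch; the real checks live inside the state-transition helpers):

```java
// Sketch of the log-and-skip guard for late callbacks: if the job state entry is already
// gone or terminal, skip the transition instead of dereferencing missing state (the old NPE path).
JobStatus currentStatus = (JobStatus) runningJobStateIMap.get(jobId);
if (currentStatus == null || currentStatus.isEndState()) {
    log.info(
            "Skip late state transition of job {} to {}, current state: {}",
            jobId, targetStatus, currentStatus);
    return;
}
```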
Findings
I did not find a new source-level blocker in the latest code. The remaining blocker is CI status.
Merge Conclusion
Conclusion: can merge after CI is rerun successfully
- Blocking items:
  - CI: `Build` is currently `CANCELLED` on the latest head (17750abb9). This must be rerun and pass before merge.
- Suggested non-blocking items:
  - None from this latest source review.
Overall assessment: the design is a reasonable long-term fix for the terminal-state cleanup race. It keeps state retention bounded, records enough ownership data to avoid deleting a newer job, and includes targeted unit coverage for cleanup ownership, submit blocking, delayed cleanup, and late state transitions. Once Build is green, I think this PR can move forward.
Hi @zhangshenghang, I re-pulled the latest head locally. The new delta I checked is focused on the master-switch cleanup edge case: this closes the zombie terminal-state hole better than only removing `runningJobInfoIMap`.

Conclusion: can merge after fixes

I do not see a new static code blocker in the latest delta. Please wait for the currently running Build check.
Hi @zhangshenghang, I rechecked the current PR head locally. I rechecked the master-switch cleanup path after the latest update: the latest code closes the stale terminal-state map gap better than only removing `runningJobInfoIMap`.

Conclusion: can merge after fixes

Blocking item:
Hi @zhangshenghang, thanks for working on this tricky Zeta cleanup race. I re-reviewed the latest head locally.

What This PR Fixes
Core Flow Reviewed

Findings

Issue 1: Cleanup can drop the pending record after
What This PR Fixes
1. Code Review

1.1 Core Logic Analysis
Before and after for the `currentJobInfo == null` branch:

```java
// Before: the pending cleanup record was dropped without cleaning the recorded state keys.
if (currentJobInfo == null) {
    removePendingJobCleanupRecord(jobId, record);
    return;
}
```

```java
// After: the recorded state maps are cleaned up before the record is dropped.
if (currentJobInfo == null) {
    cleanupPendingJobStateMaps(record);
    removePendingJobCleanupRecord(jobId, record);
    return;
}
```
1.2 Compatibility Impact
1.3 Performance / Side Effects
1.4 Error Handling and Logs

Issue 1: The current Build check is still running
2. Code Quality Assessment

2.1 Code Style
2.2 Test Coverage
2.3 Documentation
3. Architecture

3.1 Elegance of the Solution
3.2 Maintainability
3.3 Extensibility
3.4 Backward Compatibility
4. Issue Summary
5. Merge Conclusion

Conclusion: merge after fixes
davidzollo
left a comment
+1 if CI passes.
Great job, huge thanks for your contribution
Hi @zhangshenghang, I rechecked the current head locally.

What This PR Fixes
1. Code Review

1.1 Core Logic Analysis

Local review basis:
Runtime path I rechecked:

Key findings:
I did not find a blocking issue.

1.2 Compatibility Impact

Verdict: fully compatible.

1.3 Performance / Side Effects

The added delay/recovery logic is not a hot-path amplification, and I do not currently see a new leak or race regression from the latest head.

1.4 Error Handling and Logs

No blocking issue found.

2. Code Quality

2.1 Style

The implementation remains consistent with the engine/server style.

2.2 Tests

The engine/server tests around cleanup delay, follower filtering, and state transition coverage are meaningful for this fix.

2.3 Docs

The Zeta deployment docs were updated in both English and Chinese.

3. Architecture

3.1 Elegance

This is a solid targeted fix.

3.2 Maintainability

The cleanup-record boundary is much easier to reason about than eager terminal deletion.

3.3 Extensibility

It gives a clean base for future terminal-cleanup / failover tightening.

3.4 Historical Compatibility

Fully compatible.

4. Issue Summary

No blocking issue found.

5. Merge Decision

Conclusion: can merge

Blocking items:
Overall, the current head looks stable to me, and the latest discussion did not introduce a new code risk. I am comfortable with merging this.
Hi @zhangshenghang, thanks for the update. I pulled the latest head locally again and rechecked the cleanup / restore path.

What this PR solves
Runtime path I checked

Merge conclusion

Conclusion: can merge
I also rechecked the test coverage direction, especially the terminal-zombie behavior during master switch, and I do not see a new blocker in the current head.
```java
if (cleanupRecord != null) {
    schedulePendingJobCleanup(jobId, cleanupRecord);
} else {
    cleanupTerminalZombieJob(jobId);
```
Could you clarify the assumption behind cleanupTerminalZombieJob()?
It removes the running-state IMaps directly, while the normal JobMaster.cleanJob() path also clears checkpoint state and stores job history / finished job state before removing job IMap data. For a terminal zombie job without a pending cleanup record, do we know that those normal cleanup steps have already completed before master switch? If not, could this direct IMap cleanup make the job disappear from running state without being fully persisted to history?
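For context, the normal terminal path covers roughly these steps in order (paraphrased from `JobMaster.cleanJob()`; argument lists abbreviated), and my question is whether the zombie fallback can rely on all of them having already happened:

```java
// Steps of the normal terminal cleanup contract, in order:
checkpointManager.clearCheckpointIfNeed(jobStatus); // checkpoint storage + checkpoint monitor cleanup
jobHistoryService.storeJobInfo(jobId, jobDAGInfo);  // persist the job into history
jobHistoryService.storeFinishedJobState(this);      // persist the finished job state
scheduleRemoveJobStateMaps();                       // only then drop the runtime state maps
```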
```java
if (runningJobInfo != null) {
    return restoreJobDAGInfo(runningJobInfo);
}
```
Could you please add a test case?
```java
JobImmutableInformation jobImmutableInformation =
        nodeEngine
                .getSerializationService()
                .toObject(
                        nodeEngine
                                .getSerializationService()
                                .toObject(jobInfo.getJobImmutableInformation()));
```
Is the second toObject() intentional? It looks unnecessary to me.
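If the inner call already yields the serialized payload, a single deserialization should presumably be enough, roughly:

```java
// Presumed simplification: deserialize the immutable job information once.
JobImmutableInformation jobImmutableInformation =
        nodeEngine
                .getSerializationService()
                .toObject(jobInfo.getJobImmutableInformation());
```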
Thanks for the follow-up discussion. I re-checked the active-master restore path locally.

What this PR solves
Runtime path

Review findings

Issue 1: The fallback branch for terminal jobs without a cleanup record still skips the normal checkpoint and history cleanup contract
Merge conclusion

Conclusion: Merge after fixes

Blocking items:
Non-blocking suggestions:
CI status:
DanielLeens
left a comment
Hi @zhangshenghang, thanks for the latest follow-up. I pulled the newest head locally again and re-checked the terminal-job fallback path after master switch.
First, the new delta does fix part of the gap I called out earlier:
- the fallback branch now persists job history when the cleanup record is missing
- and it now cleans the recorded running-state / timestamp / checkpoint-state map keys instead of only dropping `runningJobInfoIMap`
So this is definitely moving in the right direction.
What this PR solves:
- User pain: after node failure and active-master switch, terminal-job cleanup can race with restore and leave behind zombie runtime state.
- Fix approach: keep delayed cleanup metadata in `pendingJobCleanupIMap`, reschedule cleanup after master handoff, and add a fallback cleanup path when the cleanup record is missing.
- One-line summary: the overall direction is correct, but the fallback branch still does not fully replay the normal terminal cleanup contract.
Runtime chain I re-verified:
master active switch
-> CoordinatorService.restoreJobFromMasterActiveSwitch() [CoordinatorService.java:928-955]
-> if jobState is end-state
-> pendingJobCleanupIMap.get(jobId)
-> cleanupRecord != null
-> schedulePendingJobCleanup(...)
else
-> cleanupTerminalZombieJob(jobId, jobInfo, finalStatus) [985-989]
-> persistTerminalZombieHistoryIfNecessary(...) [1439-1459]
-> cleanupPendingJobStateMaps(createTerminalZombieCleanupRecord(...))
-> runningJobInfoIMap.remove(jobId)
normal terminal cleanup
-> JobMaster.cleanJob() [JobMaster.java:798-803]
-> checkpointManager.clearCheckpointIfNeed(jobStatus) [799]
-> checkpointStorage.deleteCheckpoint(jobId) [CheckpointManager.java:242-247]
-> checkpointMonitorService.cleanupJob(jobId) [CheckpointManager.java:248-250]
-> jobHistoryService.storeJobInfo(...) [800]
-> jobHistoryService.storeFinishedJobState(this) [801]
-> scheduleRemoveJobStateMaps() [802]
The remaining blocker is:
- The terminal-zombie fallback still skips the checkpoint cleanup part of the normal terminal contract.
- The new `cleanupTerminalZombieJob(...)` now persists history and removes running-state keys, which is good.
- But it still does not perform the `CheckpointManager.clearCheckpointIfNeed(...)` semantics from the normal `JobMaster.cleanJob()` path.
- That means `checkpointStorage.deleteCheckpoint(jobId)` and `checkpointMonitorService.cleanupJob(jobId)` can still be missed when the new master hits this fallback branch before the old master finishes the normal cleanup sequence.
Why I still consider this blocking:
- this PR is specifically about making the terminal cleanup / restore path safe under node failure
- leaving checkpoint artifacts behind means the fallback branch is still only partially equivalent to the normal terminal cleanup path
- for `seatunnel-engine`, that is still a recovery-path contract gap
Suggested fix order:
- Preferred: extract the shared terminal-finalization contract into one reusable helper so both `JobMaster.cleanJob()` and `cleanupTerminalZombieJob(...)` execute the same checkpoint/history/state cleanup semantics (a rough sketch follows below).
- Minimal patch: explicitly add the missing checkpoint storage + checkpoint monitor cleanup in the fallback branch, with the same `FINISHED / CANCELED` and savepoint-end semantics as the normal path.
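To illustrate the preferred direction, a rough sketch with hypothetical names; this is only the shape I mean, not a concrete implementation:

```java
// Hypothetical shared helper so JobMaster.cleanJob() and the terminal-zombie fallback
// run the same finalization contract; method placement and names are illustrative only.
void finalizeTerminalJob(long jobId, JobStatus jobStatus, JobCleanupRecord record) {
    // 1. Checkpoint cleanup, with the same FINISHED / CANCELED and savepoint-end semantics.
    checkpointManager.clearCheckpointIfNeed(jobStatus);
    // 2. Persist job history so the job stays visible after its runtime state is removed.
    persistTerminalZombieHistoryIfNecessary(jobId, jobStatus);
    // 3. Remove the recorded running-state / timestamp / checkpoint-state keys.
    cleanupPendingJobStateMaps(record);
    // 4. Finally drop the job-info entry and the cleanup record itself.
    runningJobInfoIMap.remove(jobId);
    pendingJobCleanupIMap.remove(jobId);
}
```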
Test note:
- the new test coverage is helpful for history + running-state cleanup
- but there is still no coverage proving that fallback cleanup also clears checkpoint storage / checkpoint monitor state
CI note:
- from the current visible checks, this is no longer a CI problem from my side; the remaining blocker is code semantics.
Conclusion: merge after fixes
Blocking items:
- Make the terminal-zombie fallback replay the checkpoint cleanup part of the normal terminal cleanup contract.
Non-blocking suggestion:
- Add one targeted regression test for fallback checkpoint cleanup so this edge case stays closed.
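Something in this spirit would be enough for that regression test (illustrative JUnit sketch; the setup helpers and the storage check are placeholders, not existing test utilities):

```java
// Illustrative sketch: after a master switch hits the terminal-zombie fallback,
// checkpoint artifacts should be cleaned up together with the running-state maps.
@Test
public void fallbackCleanupAlsoClearsCheckpointState() {
    long jobId = submitTerminalJobWithoutCleanupRecord(); // placeholder setup helper
    triggerMasterActiveSwitch();                          // placeholder failover helper

    Awaitility.await()
            .atMost(60, TimeUnit.SECONDS)
            .untilAsserted(
                    () -> {
                        Assertions.assertNull(runningJobInfoIMap.get(jobId));
                        // placeholder check for checkpoint storage / checkpoint monitor state
                        Assertions.assertTrue(checkpointArtifactsRemoved(jobId));
                    });
}
```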
Overall, this branch is getting very close. The latest follow-up meaningfully improves the fallback path, but I would still close the checkpoint-cleanup gap before merge.
Purpose of this pull request
This PR fixes an engine-side terminal-state convergence bug after worker node failure.
When a worker goes offline, the engine can start cleaning distributed state from the running job state maps before all asynchronous task/pipeline/job callbacks have finished. In the current code path,
`PhysicalVertex`, `SubPlan`, and `PhysicalPlan` can observe missing state entries and throw `NullPointerException`, which interrupts terminal-state convergence and may leave the job hanging in an intermediate state.

This PR changes the cleanup strategy instead of relying on local fallback state:
- remove `runningJobInfoIMap` immediately so terminal jobs are not restored on master switch

It also adds:
- E2E coverage for the `BATCH + no checkpoint + job.retry.times=0` no-restore path

Does this PR introduce any user-facing change?
No user-facing API/config change in normal operation. This improves failure handling so jobs are less likely to hang in an intermediate state after node failure.
How was this patch tested?
Verified locally:
- `./mvnw -nsu -pl seatunnel-engine/seatunnel-engine-common spotless:check`
- `./mvnw -nsu -pl seatunnel-engine/seatunnel-engine-server spotless:check`
- `./mvnw -nsu -pl seatunnel-e2e/seatunnel-engine-e2e/connector-seatunnel-e2e-base spotless:check`

Additional notes:
- new unit tests: `StateTransitionCleanupTest`, `JobStateCleanupDelayTest`
- new E2E test: `ClusterFailureNoRestoreIT`
- full local test runs are currently blocked (`seatunnel-engine-server` references missing types in the current checkout, and reactor builds are also blocked by `seatunnel-config-shade` compilation issues), so this PR remains draft.