Conversation

@xiaochen-zhou
Contributor

Purpose of this pull request

Adds a configuration option that defines how many consecutive checkpoint failures are tolerated before the whole job fails over. The default value is 0, which means no checkpoint failures are tolerated and the job fails on the first reported checkpoint failure.
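
For readers of this description, the tolerance rule can be sketched as a small counter. This is only an illustration under stated assumptions, not the code in this PR: the class `TolerableFailureTracker` and its method names are hypothetical, and it assumes that a completed checkpoint resets the count (which is what "consecutive" suggests, but is not confirmed here).

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper for illustration only; not part of the SeaTunnel codebase. */
class TolerableFailureTracker {
    // Maximum number of consecutive failures tolerated; 0 means fail on the first one.
    private final int tolerableFailures;
    private final AtomicInteger consecutiveFailures = new AtomicInteger(0);

    TolerableFailureTracker(int tolerableFailures) {
        if (tolerableFailures < 0) {
            throw new IllegalArgumentException("tolerableFailures must be >= 0");
        }
        this.tolerableFailures = tolerableFailures;
    }

    /** Records one failed checkpoint and returns true if the job should now fail over. */
    boolean onCheckpointFailed() {
        int failedCount = consecutiveFailures.incrementAndGet();
        return failedCount > tolerableFailures;
    }

    /** Assumption: a completed checkpoint resets the consecutive-failure count. */
    void onCheckpointCompleted() {
        consecutiveFailures.set(0);
    }

    int getConsecutiveFailures() {
        return consecutiveFailures.get();
    }
}
```

With a tolerance of 2, the first two consecutive failures would only be logged and the third would trigger failover; with the default of 0, the first failure triggers failover.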

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

Add test: CheckpointCoordinatorTest#testTolerableFailedCheckpoints()

Check list

@github-actions bot added the document, core (SeaTunnel core module), Zeta, and api labels on Dec 21, 2025
Comment on lines +296 to +304
if (tolerableFailures > 0 && failedCount <= tolerableFailures) {
    LOG.warn(
            "Checkpoint failed (consecutive failures: {}/{}): {}",
            failedCount,
            tolerableFailures,
            ExceptionUtils.getMessage(checkpointException));
    cleanFailedCheckpoint(reason);
    return;
}
Contributor

@dybyte commented on Dec 26, 2025

What happens if a checkpoint fails during a savepoint operation? Is there a possibility that the job becomes non-responsive?
