Skip to content

Conversation

@Adamyuanyuan
Copy link
Contributor

@Adamyuanyuan Adamyuanyuan commented Jan 5, 2026

Purpose of this pull request

When running SeaTunnel on Flink in STREAMING mode with Hive sink overwrite: true, the final Hive partition/table directory may lose previously committed files and end up containing only a subset of data (often only files from the last checkpoint).

Reproduction

  • Use the pipeline such as:
    • env.job.mode = "STREAMING"
    • Hive sink: overwrite: true
    • Constant partition (e.g. SQL transform adds '2025-12-16' as pt, so all records go to the same partition)
  • Observe job logs: every completed checkpoint triggers an aggregated commit, and the same target partition directory is deleted repeatedly before renaming/moving the new files.
  • Result: the partition directory is cleared on every checkpoint, so only the newest checkpoint’s files remain.

Root Cause

overwrite: true is normalized to DataSaveMode.DROP_DATA. In HiveSinkAggregatedCommitter#commit(...), the implementation deleted the target table/partition directories before every commit. In Flink streaming, commit() is invoked after every completed checkpoint, so the delete step was executed repeatedly and wiped files committed by earlier checkpoints.

Fix

Implemented overwrite semantics that are safe for streaming checkpoints:

  • Delete each target directory (table directory or partition directory) at most once per job attempt, and only when the commit contains actual files (skip deletion for empty commits).
  • Best-effort recovery protection: parse checkpointId from transactionDir (pattern like .../T_xxx_<subtaskIndex>_<checkpointId>). If the first checkpoint id seen by this committer is > 1 (usually indicates recovery from a previous checkpoint), skip deletion to avoid removing already committed data that matches the restored state.

Does this PR introduce any user-facing change?

no

How was this patch tested?

yes,UT and tested in our test env.

Check list

Copy link
Member

@liunaijie liunaijie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@davidzollo davidzollo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1
LGTM

@davidzollo davidzollo merged commit d32bfce into apache:dev Jan 8, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants