@yzeng1618
Contributor

Purpose of this pull request

This PR introduces a new sink option, file_exists_mode, for File connectors. It controls what happens when the target file already exists during commit (rename temp -> target): OVERWRITE (default) / SKIP / FAIL. The PR also ensures that FAIL mode fails the job on commit, updates the EN/ZH docs, and adds unit + HDFS E2E coverage.

Does this PR introduce any user-facing change?

Yes. New optional config file_exists_mode (default OVERWRITE, so behavior is unchanged unless configured).

How was this patch tested?

  • Unit tests: HadoopFileSystemProxyRenameFileTest, FileSinkAggregatedCommitterFileExistsModeTest
  • E2E: HdfsFileIT with fake_to_hdfs_file_exists_mode_* configs

@chl-wxp
Contributor

chl-wxp commented Jan 5, 2026

This looks functionally redundant with the existing SaveMode design.
For file connectors, SeaTunnel already has:

  • DataSaveMode to control how existing data (files) is handled

  • SchemaSaveMode to control how the target path / directory is handled

File existence is a data-level concern and should be covered by DataSaveMode. Introducing a separate file_exists_mode duplicates this responsibility and fragments behavior control, making the File connector inconsistent with other sinks.
It would be preferable to reuse or extend the existing SaveMode mechanism rather than adding a new, connector-specific option.

@LiJie20190102
Contributor

> DataSaveMode

+1. @yzeng1618, could you explain the difference between file_exists_mode, DataSaveMode, and SchemaSaveMode? Does the combined functionality of DataSaveMode and SchemaSaveMode include file_exists_mode?

@yzeng1618
Contributor Author

> This looks functionally redundant with the existing SaveMode design.
> For file connectors, SeaTunnel already has:
>   • DataSaveMode to control how existing data (files) is handled
>   • SchemaSaveMode to control how the target path / directory is handled
> File existence is a data-level concern and should be covered by DataSaveMode. Introducing a separate file_exists_mode duplicates this responsibility and fragments behavior control, making the File connector inconsistent with other sinks.
> It would be preferable to reuse or extend the existing SaveMode mechanism rather than adding a new, connector-specific option.

SchemaSaveMode/DataSaveMode are executed pre-write and work at the directory (“table”) level via the Catalog: they create/drop/truncate the target path or check whether the directory already contains data files. file_exists_mode is applied at the 2PC commit stage, when a temp file under tmp_path is renamed/moved to the final target file. It controls only what happens when the final target file name already exists (e.g., with single_file_mode, a fixed file_name_expression, or binary with relativePath). So it does not duplicate the existing SaveMode; it complements it by handling a commit-time, single-file conflict that SaveMode currently can’t express.

@yzeng1618
Contributor Author

> DataSaveMode
>
> +1. @yzeng1618, could you explain the difference between file_exists_mode, DataSaveMode, and SchemaSaveMode? Does the combined functionality of DataSaveMode and SchemaSaveMode include file_exists_mode?

Differences:

  • SchemaSaveMode: directory/path-level, executed pre-write (create/recreate/error/ignore).

  • DataSaveMode: data-in-directory-level, executed pre-write (truncate directory, append, or fail if the directory already has data files).

  • file_exists_mode: commit-time, single-target-file-level behavior when renaming a temp file to the final file name (SKIP/OVERWRITE/FAIL on name collision).

So SchemaSaveMode + DataSaveMode does not include file_exists_mode; they don’t cover commit-time rename conflicts for a specific final file name.
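
To make the directory-level side of this comparison concrete, here is a minimal sketch in Python rather than the connector's Java. It is not SeaTunnel code: `apply_data_save_mode` and `demo` are hypothetical names, and only the three modes discussed in this thread are modeled. The point it illustrates is that a pre-write check looks at the directory as a whole and never at an individual final file name.

```python
import tempfile
from enum import Enum
from pathlib import Path


class DataSaveMode(Enum):
    # Only the three modes mentioned in this discussion are modeled.
    DROP_DATA = "DROP_DATA"
    APPEND_DATA = "APPEND_DATA"
    ERROR_WHEN_DATA_EXISTS = "ERROR_WHEN_DATA_EXISTS"


def apply_data_save_mode(target_dir: Path, mode: DataSaveMode) -> None:
    """Hypothetical pre-write step: inspects the directory as a whole."""
    target_dir.mkdir(parents=True, exist_ok=True)
    existing = [p for p in target_dir.iterdir() if p.is_file()]
    if mode is DataSaveMode.DROP_DATA:
        for p in existing:
            p.unlink()  # clear old data files before any write happens
    elif mode is DataSaveMode.ERROR_WHEN_DATA_EXISTS and existing:
        raise RuntimeError(f"{target_dir} already contains data files")
    # APPEND_DATA: the directory is left untouched; per-file name
    # collisions are not examined at this stage.


def demo() -> dict:
    """Run each mode against a directory that already holds test1.txt."""
    results = {}
    for mode in DataSaveMode:
        with tempfile.TemporaryDirectory() as root:
            target = Path(root) / "target"
            target.mkdir()
            (target / "test1.txt").write_text("old")
            try:
                apply_data_save_mode(target, mode)
                results[mode.name] = sorted(p.name for p in target.iterdir())
            except RuntimeError:
                results[mode.name] = "failed before write"
    return results


if __name__ == "__main__":
    print(demo())
```

In this sketch APPEND_DATA leaves the old test1.txt in place and says nothing about what should happen if the job later tries to commit a file with the same name, which is the gap the commit-time option is meant to fill.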

@chl-wxp
Contributor

chl-wxp commented Jan 6, 2026

> OVERWRITE (default) / SKIP / FAIL

Why should we do this in the commit phase? I think what OVERWRITE (default) / SKIP / FAIL does is consistent with the final behavior of DataSaveMode.

@yzeng1618
Contributor Author

yzeng1618 commented Jan 6, 2026

> DataSaveMode
>
> +1. @yzeng1618, could you explain the difference between file_exists_mode, DataSaveMode, and SchemaSaveMode? Does the combined functionality of DataSaveMode and SchemaSaveMode include file_exists_mode?

> OVERWRITE (default) / SKIP / FAIL
>
> Why should we do this in the commit phase? I think what OVERWRITE (default) / SKIP / FAIL does is consistent with the final behavior of DataSaveMode.

file_exists_mode is applied at the commit phase (the 2PC rename/move of temporary files to the final path) because only then do the temporary files receive their final file names in the target directory. Name conflicts therefore arise at the file level, and a deterministic OVERWRITE/SKIP/FAIL decision can only be made at the moment of the rename.
Here is an example to illustrate:

  1. Scenario Initialization
  • Source directory: /tmp/source contains test1.txt and test2.txt.

  • Target directory: /tmp/target already has an existing test1.txt (old file), while test2.txt does not exist.

  • Configuration: path=/tmp/target (the write destination directory). Writes are first persisted to tmp_path, then renamed to /tmp/target/... during commit.

  2. Observing only data_save_mode (takes effect before the task starts, directory-level)
  • DROP_DATA: Clear/recreate /tmp/target before task startup (the old test1.txt will be deleted) → No conflicts occur when writing test1.txt and test2.txt during commit → Result: The target directory contains the new test1.txt + test2.txt.

  • APPEND_DATA: Do not modify /tmp/target before task startup (the old test1.txt remains) → A conflict "the test1.txt to be written already exists" will be encountered during commit, but APPEND_DATA itself does not define how to handle single-file name conflicts → The decision to overwrite/skip/fail depends on file_exists_mode.

  • ERROR_WHEN_DATA_EXISTS: Check /tmp/target before task startup; fail if any data files exist (there is currently test1.txt) → Fail directly without proceeding to the write/commit phase.

  3. Observing only file_exists_mode (takes effect during commit, file-level; assuming data_save_mode=APPEND_DATA)
  • OVERWRITE: When renaming test1.txt during commit and detecting the existing old file → Delete the old test1.txt first, then rename the temporary file to overwrite it; test2.txt is committed normally → Result: The target directory contains the new test1.txt + test2.txt.

  • SKIP: When detecting the existing test1.txt during commit → Retain the old test1.txt, delete the temporary test1.txt, and mark the commit as successful; test2.txt is committed normally → Result: The target directory contains the old test1.txt + new test2.txt.

  • FAIL: When detecting the existing test1.txt during commit → Throw an error and fail immediately (used to explicitly prevent overwrites).
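
The scenario above can be replayed with a small Python sketch. It is a simulation on the local file system, not the connector's Java implementation: `commit_file` and `run_scenario` are hypothetical names, and data_save_mode=APPEND_DATA is assumed, so the old test1.txt is still present when commit starts.

```python
import os
import tempfile
from enum import Enum
from pathlib import Path


class FileExistsMode(Enum):
    OVERWRITE = "OVERWRITE"
    SKIP = "SKIP"
    FAIL = "FAIL"


def commit_file(tmp_file: Path, target_file: Path, mode: FileExistsMode) -> str:
    """Hypothetical commit step: move one temp file to its final name."""
    if target_file.exists():
        if mode is FileExistsMode.FAIL:
            raise FileExistsError(f"target already exists: {target_file}")
        if mode is FileExistsMode.SKIP:
            tmp_file.unlink()  # discard the new data, keep the old file
            return "skipped"
        target_file.unlink()  # OVERWRITE: delete the old file first
    os.replace(tmp_file, target_file)
    return "committed"


def run_scenario(mode: FileExistsMode) -> dict:
    """Target already holds an old test1.txt; commit test1.txt and test2.txt."""
    with tempfile.TemporaryDirectory() as root:
        tmp_dir = Path(root) / "tmp"
        target_dir = Path(root) / "target"
        tmp_dir.mkdir()
        target_dir.mkdir()
        (target_dir / "test1.txt").write_text("old")
        for name in ("test1.txt", "test2.txt"):
            (tmp_dir / name).write_text("new")
        for name in ("test1.txt", "test2.txt"):
            commit_file(tmp_dir / name, target_dir / name, mode)
        # Map each final file name to its content ("old" or "new").
        return {p.name: p.read_text() for p in sorted(target_dir.iterdir())}


if __name__ == "__main__":
    for mode in (FileExistsMode.OVERWRITE, FileExistsMode.SKIP):
        print(mode.name, run_scenario(mode))
    try:
        run_scenario(FileExistsMode.FAIL)
    except FileExistsError as exc:
        print("FAIL", exc)
```

With OVERWRITE both files end up with new content, with SKIP the old test1.txt survives next to the new test2.txt, and with FAIL the first commit raises. Note that the existence check followed by a rename in this sketch is not atomic; how a real file system proxy should guard against concurrent writers is outside what this example shows.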

Everyone is welcome to join the discussion. @zhangshenghang @davidzollo @Carl-Zhou-CN @corgy-w

@TyrantLucifer
Member

-1. From my perspective, the design of this feature is too customized. In actual production scenarios, each task writes to a different path, and the existing save mode feature fully covers the requirements. Connectors in the community version need sufficiently general functionality.
