[Feature] [connector-file] Support file_exists_mode for file sink commit #10266

yzeng1618 · 2026-01-04T01:34:56Z

Purpose of this pull request

This PR introduces a new sink option, file_exists_mode, for File connectors. It controls what happens when the target file already exists during commit (rename temp -> target): OVERWRITE (default) / SKIP / FAIL. It also ensures that FAIL mode fails the job on commit, updates the EN/ZH docs, and adds unit and HDFS E2E coverage.

Does this PR introduce any user-facing change?

Yes. New optional config file_exists_mode (default OVERWRITE, so behavior is unchanged unless configured).
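
To illustrate the "default OVERWRITE" behavior described above, here is a minimal sketch of how such an option could be resolved from a plain options map. The enum name FileExistsMode, the fromOptions helper, and the case-insensitive parsing are assumptions made for illustration; they are not taken from the PR's code.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical names for illustration; the PR's actual classes may differ.
enum FileExistsMode {
    OVERWRITE, SKIP, FAIL;

    // Reads "file_exists_mode" from a plain options map and falls back to
    // OVERWRITE when the option is absent or blank.
    static FileExistsMode fromOptions(Map<String, String> options) {
        String raw = options.get("file_exists_mode");
        if (raw == null || raw.trim().isEmpty()) {
            return OVERWRITE;
        }
        // Unknown values raise IllegalArgumentException from valueOf.
        return valueOf(raw.trim().toUpperCase(Locale.ROOT));
    }
}
```

With this sketch, a job that never sets file_exists_mode resolves to OVERWRITE, which matches the statement that existing behavior is unchanged unless the option is configured.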

How was this patch tested?

Unit tests: HadoopFileSystemProxyRenameFileTest, FileSinkAggregatedCommitterFileExistsModeTest
E2E: HdfsFileIT with fake_to_hdfs_file_exists_mode_* configs

Check list

If any new Jar binary package is added in your PR, please add a License Notice according to the New License Guide
[*] If necessary, please update the documentation to describe the new feature. https://github.com/apache/seatunnel/tree/dev/docs
If necessary, please update incompatible-changes.md to describe the incompatibility caused by this PR.
If you are contributing the connector code, please check that the following files are updated:
1. Update plugin-mapping.properties and add new connector information in it
2. Update the pom file of seatunnel-dist
3. Add ci label in label-scope-conf
4. Add e2e testcase in seatunnel-e2e
5. Update connector plugin_config

chl-wxp · 2026-01-05T08:46:14Z

This looks functionally redundant with the existing SaveMode design.
For file connectors, SeaTunnel already has:
DataSaveMode to control how existing data (files) is handled
SchemaSaveMode to control how the target path / directory is handled
File existence is a data-level concern and should be covered by DataSaveMode. Introducing a separate file_exists_mode duplicates this responsibility and fragments behavior control, making the File connector inconsistent with other sinks.
It would be preferable to reuse or extend the existing SaveMode mechanism rather than adding a new, connector-specific option.

LiJie20190102 · 2026-01-06T01:08:36Z

> DataSaveMode

+1, @yzeng1618, Could you explain the difference between file_exists_mode, DataSaveMode, and SchemaSaveMode? Does the combined functionality of DataSaveMode and SchemaSaveMode include file_exists_mode?

yzeng1618 · 2026-01-06T02:24:57Z

> This looks functionally redundant with the existing SaveMode design.
> For file connectors, SeaTunnel already has:
> DataSaveMode to control how existing data (files) is handled
> SchemaSaveMode to control how the target path / directory is handled
> File existence is a data-level concern and should be covered by DataSaveMode. Introducing a separate file_exists_mode duplicates this responsibility and fragments behavior control, making the File connector inconsistent with other sinks.
> It would be preferable to reuse or extend the existing SaveMode mechanism rather than adding a new, connector-specific option.

SchemaSaveMode/DataSaveMode are executed pre-write and work at a directory (“table”) level via the Catalog: create/drop/truncate the target path or check whether the directory already contains data files. file_exists_mode is applied at the 2PC commit stage, when we rename/move a temp file under tmp_path to the final target file. It specifically controls the behavior when the final target file name already exists (e.g., single_file_mode, fixed file_name_expression, or binary with relativePath). So it’s not duplicating the existing SaveMode; it complements a commit-time, single-file conflict that SaveMode currently can’t express.

yzeng1618 · 2026-01-06T02:26:01Z

> > DataSaveMode
>
> +1, @yzeng1618, Could you explain the difference between file_exists_mode, DataSaveMode, and SchemaSaveMode? Does the combined functionality of DataSaveMode and SchemaSaveMode include file_exists_mode?

Differences:

SchemaSaveMode: directory/path-level, executed pre-write (create/recreate/error/ignore).
DataSaveMode: data-in-directory-level, executed pre-write (truncate directory, append, or fail if the directory already has data files).
file_exists_mode: commit-time, single-target-file-level behavior when renaming a temp file to the final file name (SKIP/OVERWRITE/FAIL on name collision).

So SchemaSaveMode + DataSaveMode does not include file_exists_mode; they don’t cover commit-time rename conflicts for a specific final file name.

chl-wxp · 2026-01-06T03:01:47Z

> OVERWRITE (default) / SKIP / FAIL

Why should we do this in the submission phase? I think what OVERWRITE (default) / SKIP / FAIL does is consistent with the final behavior of DataSaveMode.

yzeng1618 · 2026-01-06T03:39:31Z

> > DataSaveMode
>
> +1, @yzeng1618, Could you explain the difference between file_exists_mode, DataSaveMode, and SchemaSaveMode? Does the combined functionality of DataSaveMode and SchemaSaveMode include file_exists_mode?

> > OVERWRITE (default) / SKIP / FAIL
>
> Why should we do this in the submission phase? I think what OVERWRITE (default) / SKIP / FAIL does is consistent with the final behavior of DataSaveMode.

The reason file_exists_mode is placed at the commit phase (2PC rename/move of temporary files to the final path) is that only at this phase will the "temporary files" be finalized as "final filenames" in the target directory. Therefore, name conflicts occur at the file level, and deterministic decisions of OVERWRITE/SKIP/FAIL can only be made during the rename operation.
Here is an example to illustrate:

Scenario Initialization

Source directory: /tmp/source contains test1.txt and test2.txt.
Target directory: /tmp/target already has an existing test1.txt (old file), while test2.txt does not exist.
Configuration: path=/tmp/target (the write destination directory). Writes are first persisted to tmp_path, then renamed to /tmp/target/... during commit.

Observing only data_save_mode (Takes effect before task starts, directory-level)

DROP_DATA: Clear/recreate /tmp/target before task startup (the old test1.txt will be deleted) → No conflicts occur when writing test1.txt and test2.txt during commit → Result: The target directory contains the new test1.txt + test2.txt.
APPEND_DATA: Do not modify /tmp/target before task startup (the old test1.txt remains) → A conflict "the test1.txt to be written already exists" will be encountered during commit, but APPEND_DATA itself does not define how to handle single-file name conflicts → The decision to overwrite/skip/fail depends on file_exists_mode.
ERROR_WHEN_DATA_EXISTS: Check /tmp/target before task startup; fail if any data files exist (there is currently test1.txt) → Fail directly without proceeding to the write/commit phase.
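
The three pre-write cases above can be sketched roughly as follows. This is a simplified illustration on the local filesystem using java.nio.file, not SeaTunnel's catalog-based implementation; the class PreWriteSaveModeSketch, its prepare method, and the nested enum are hypothetical names, and only the three modes discussed here are modeled.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical sketch on the local filesystem; not SeaTunnel's catalog code.
final class PreWriteSaveModeSketch {
    enum DataSaveMode { DROP_DATA, APPEND_DATA, ERROR_WHEN_DATA_EXISTS }

    // Runs once before any data is written and only looks at the directory as a whole.
    static void prepare(Path targetDir, DataSaveMode mode) throws IOException {
        Files.createDirectories(targetDir);
        switch (mode) {
            case DROP_DATA:
                clear(targetDir);
                break;
            case ERROR_WHEN_DATA_EXISTS:
                if (hasEntries(targetDir)) {
                    throw new IllegalStateException("Data already exists in " + targetDir);
                }
                break;
            case APPEND_DATA:
            default:
                // Existing files stay untouched; per-file name collisions are not decided here.
                break;
        }
    }

    private static boolean hasEntries(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.findAny().isPresent();
        }
    }

    private static void clear(Path dir) throws IOException {
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order deletes children before parents; the root directory is kept.
            walk.sorted(Comparator.reverseOrder())
                .filter(p -> !p.equals(dir))
                .forEach(p -> {
                    try {
                        Files.delete(p);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
        }
    }
}
```

Note that the APPEND_DATA branch does nothing at all, which is the gap being described: the old test1.txt survives into the commit phase with no rule for the collision.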

Observing only file_exists_mode (Takes effect during commit, file-level; assuming data_save_mode=APPEND_DATA)

OVERWRITE: When renaming test1.txt during commit and detecting the existing old file → Delete the old test1.txt first, then rename the temporary file to overwrite it; test2.txt is committed normally → Result: The target directory contains the new test1.txt + test2.txt.
SKIP: When detecting the existing test1.txt during commit → Retain the old test1.txt, delete the temporary test1.txt, and mark the commit as successful; test2.txt is committed normally → Result: The target directory contains the old test1.txt + new test2.txt.
FAIL: When detecting the existing test1.txt during commit → Throw an error and fail immediately (used to explicitly prevent overwrites).
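
The commit-time decision described in these three cases could look roughly like the sketch below. It uses java.nio.file on the local filesystem for brevity, whereas the PR works through the Hadoop file system proxy; the class CommitRenameSketch, its commit method, and the nested Mode enum are hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch on the local filesystem; not the PR's Hadoop-based code.
final class CommitRenameSketch {
    enum Mode { OVERWRITE, SKIP, FAIL }

    // Moves one temp file to its final name and resolves a name collision per mode.
    // Returns true if the temp file became the target, false if it was skipped.
    static boolean commit(Path tempFile, Path targetFile, Mode mode) throws IOException {
        if (Files.exists(targetFile)) {
            switch (mode) {
                case SKIP:
                    Files.delete(tempFile); // keep the old target, discard the new data
                    return false;
                case FAIL:
                    throw new FileAlreadyExistsException(targetFile.toString());
                case OVERWRITE:
                default:
                    Files.delete(targetFile); // remove the old file, then rename below
                    break;
            }
        }
        Path parent = targetFile.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        Files.move(tempFile, targetFile);
        return true;
    }
}
```

In the scenario above, calling commit for test1.txt with Mode.SKIP leaves the old file in place and removes the temp copy, while test2.txt has no collision and is moved regardless of the mode. The existence check and the rename are two separate steps in this sketch, so it makes no claim about atomicity under concurrent writers.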

Everyone is welcome to join the discussion. @zhangshenghang @davidzollo @Carl-Zhou-CN @corgy-w

TyrantLucifer · 2026-01-07T14:53:26Z

-1. From my perspective, the design of this feature is too specialized. In real production scenarios, each task writes to a different path, and the existing save mode feature already covers the requirement. Connectors in the community version should provide sufficiently general functionality.

zengyi added 3 commits December 31, 2025 21:22

[Feature][connector-file] Add file_exists_mode for file sink commit

17d1d1f

[Fix][connector-file] Fail job on commit when file_exists_mode=FAIL

362083a

[Fix][connector-file] fix e2e

2d4e10d

github-actions bot added document connectors-v2 e2e file labels Jan 4, 2026
