[Feature] [connector-file] Support file_exists_mode for file sink commit #10266
base: dev
This looks functionally redundant with the existing SaveMode design.
+1, @yzeng1618, could you explain the difference between
SchemaSaveMode/DataSaveMode are executed pre-write and work at the directory (“table”) level via the Catalog: they create/drop/truncate the target path or check whether the directory already contains data files. file_exists_mode is applied at the 2PC commit stage, when we rename/move a temp file under tmp_path to the final target file. It specifically controls the behavior when the final target file name already exists (e.g., single_file_mode, a fixed file_name_expression, or binary format with relativePath). So it does not duplicate the existing SaveMode; it complements it by handling a commit-time, single-file conflict that SaveMode currently can’t express.
Differences:
So SchemaSaveMode + DataSaveMode do not subsume file_exists_mode: they don’t cover commit-time rename conflicts for a specific final file name.
Why should we do this in the submission phase? I think what
file_exists_mode is placed at the commit phase (the 2PC rename/move of temporary files to the final path) because only in this phase are the temporary files given their final file names in the target directory. Name conflicts therefore occur at the file level, and a deterministic OVERWRITE/SKIP/FAIL decision can only be made during the rename operation.
Welcome everyone to join the discussion. @zhangshenghang @davidzollo @Carl-Zhou-CN @corgy-w
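To make the commit-time decision discussed above concrete, here is a minimal sketch in plain Java using only java.nio.file. It is not the code from this PR: the class name FileExistsModeSketch and the helper commitFile are hypothetical, it targets the local file system rather than the Hadoop file system proxy, and discarding the temp file in SKIP mode is an assumption made for illustration.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical illustration only; not the SeaTunnel implementation.
final class FileExistsModeSketch {

    /** The three option values named in the PR description. */
    enum FileExistsMode { OVERWRITE, SKIP, FAIL }

    /**
     * Moves a temp file to its final name, applying the given mode when the
     * target already exists.
     *
     * @return true if the target now holds the temp file's content,
     *         false if the existing target was kept (SKIP)
     */
    static boolean commitFile(Path tmp, Path target, FileExistsMode mode) throws IOException {
        if (Files.exists(target)) {
            switch (mode) {
                case SKIP:
                    // Assumption: the temp file is discarded when the target is kept.
                    Files.deleteIfExists(tmp);
                    return false;
                case FAIL:
                    // Surface the conflict so the caller can fail the commit.
                    throw new FileAlreadyExistsException(target.toString());
                case OVERWRITE:
                default:
                    break; // fall through to the replacing move below
            }
        }
        if (target.getParent() != null) {
            Files.createDirectories(target.getParent());
        }
        Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
        return true;
    }
}
```

Note that the exists-check followed by the move is not atomic, so two concurrent committers could still race on the same target name; a real implementation would need to account for that on its file system.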
-1. From my perspective, the design of this feature is too customized. In actual production scenarios, each task writes to a different path, and the existing save mode feature can fully cover the requirements. The community version of the connector needs sufficiently universal functionality.
Purpose of this pull request
This PR introduces a new sink option file_exists_mode for File connectors to control what happens when the target file already exists during commit (rename temp -> target): OVERWRITE (default) / SKIP / FAIL. It also ensures FAIL mode fails the job on commit, updates the EN/ZH docs, and adds unit + HDFS E2E coverage.
Does this PR introduce any user-facing change?
Yes. New optional config file_exists_mode (default OVERWRITE, so behavior is unchanged unless configured).
How was this patch tested?
HadoopFileSystemProxyRenameFileTest, FileSinkAggregatedCommitterFileExistsModeTest
HdfsFileIT with fake_to_hdfs_file_exists_mode_* configs
Check list
New License Guide
incompatible-changes.md to describe the incompatibility caused by this PR.