
Conversation

@yzeng1618
Contributor

…e source (binary)

Purpose of this pull request

Add sync_mode=update to the HdfsFile source so that only new or changed files are synced, comparing source and target either by length+mtime or by Hadoop getFileChecksum.

  • Add sync_mode=update for the HdfsFile source (currently only supported with file_format_type=binary).
  • Support update_strategy=distcp (behaves like distcp -update) and update_strategy=strict.
  • Support compare_mode=len_mtime and compare_mode=checksum (checksum is valid only with update_strategy=strict).
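For illustration, a source config combining these options might look like the following sketch (the paths, cluster addresses, and the exact shape of target_hadoop_conf are hypothetical; option names follow the list above):

```hocon
source {
  HdfsFile {
    path = "/data/input"
    file_format_type = "binary"
    fs.defaultFS = "hdfs://source-cluster:8020"

    # new options introduced by this PR
    sync_mode = "update"
    target_path = "/data/output"
    target_hadoop_conf = {
      "fs.defaultFS" = "hdfs://target-cluster:8020"
    }
    update_strategy = "distcp"    # or "strict"
    compare_mode = "len_mtime"    # "checksum" is valid only with "strict"
  }
}
```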

Does this PR introduce any user-facing change?

Yes.

New source options: sync_mode/target_path/target_hadoop_conf/update_strategy/compare_mode.
Default behavior unchanged (sync_mode=full).
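The distcp-like skip decision under compare_mode=len_mtime can be sketched as follows (this is an illustrative standalone class, not the PR's actual code: a file is re-synced only when the target copy is missing or its length/modification time differ from the source):

```java
public final class UpdateDecision {
    // Returns true when the source file should be copied to the target,
    // mirroring distcp -update semantics for compare_mode=len_mtime.
    public static boolean shouldCopy(
            boolean targetExists, long srcLen, long srcMtime, long tgtLen, long tgtMtime) {
        if (!targetExists) {
            return true; // new file: always copy
        }
        // changed file: length or modification time differs
        return srcLen != tgtLen || srcMtime != tgtMtime;
    }

    public static void main(String[] args) {
        System.out.println(shouldCopy(false, 10, 100, 0, 0));   // missing target -> true
        System.out.println(shouldCopy(true, 10, 100, 10, 100)); // unchanged -> false
        System.out.println(shouldCopy(true, 10, 200, 10, 100)); // mtime changed -> true
    }
}
```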

How was this patch tested?

  • Unit test (skipped on Windows by design; runs in Linux CI): mvnw -pl seatunnel-connectors-v2/connector-file/connector-file-base test -Dtest=UpdateSyncModeTest

  • Added E2E tests (run in CI/Docker env): HdfsFileIT#testHdfsBinaryUpdateModeDistcp / HdfsFileIT#testHdfsBinaryUpdateModeStrictChecksum.

Check list

Comment on lines +734 to +757
private boolean fileContentEquals(String sourceFilePath, String targetFilePath)
        throws IOException {
    try (InputStream sourceIn = hadoopFileSystemProxy.getInputStream(sourceFilePath);
            InputStream targetIn =
                    targetHadoopFileSystemProxy.getInputStream(targetFilePath)) {
        byte[] sourceBuffer = new byte[8 * 1024];
        byte[] targetBuffer = new byte[8 * 1024];

        while (true) {
            // readNBytes fills the buffer until full or EOF, so both sides
            // return comparable chunk sizes; a plain read() may return short
            // counts and report a false mismatch on identical files
            int sourceRead = sourceIn.readNBytes(sourceBuffer, 0, sourceBuffer.length);
            int targetRead = targetIn.readNBytes(targetBuffer, 0, targetBuffer.length);
            if (sourceRead != targetRead) {
                return false;
            }
            if (sourceRead == 0) { // both streams exhausted
                return true;
            }
            for (int i = 0; i < sourceRead; i++) {
                if (sourceBuffer[i] != targetBuffer[i]) {
                    return false;
                }
            }
        }
    }
}
Member

If this falls back to byte-by-byte file comparison and the file is large, could that lead to high memory usage? Is there a risk of OOM?

Contributor Author

Memory usage here is constant: we compare via streaming reads with two fixed 8KB buffers (no full-file buffering), so large files shouldn't cause OOM; the only cost is more I/O time.
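The constant-memory claim can be illustrated with a self-contained sketch (plain java.io streams stand in for the Hadoop filesystem proxies; readNBytes is used so partial reads can't cause spurious mismatches):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class StreamCompareDemo {
    // Compares two streams chunk by chunk with two fixed 8KB buffers,
    // so memory stays constant regardless of total stream length.
    static boolean streamsEqual(InputStream a, InputStream b) throws IOException {
        byte[] bufA = new byte[8 * 1024];
        byte[] bufB = new byte[8 * 1024];
        while (true) {
            int nA = a.readNBytes(bufA, 0, bufA.length);
            int nB = b.readNBytes(bufB, 0, bufB.length);
            if (nA != nB || !Arrays.equals(bufA, 0, nA, bufB, 0, nA)) {
                return false;
            }
            if (nA == 0) { // both streams exhausted
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] big = new byte[100_000]; // spans several 8KB chunks
        Arrays.fill(big, (byte) 7);
        byte[] bigDiff = big.clone();
        bigDiff[99_999] = 8; // differs only in the very last byte

        System.out.println(streamsEqual(
                new ByteArrayInputStream(big), new ByteArrayInputStream(big.clone())));
        System.out.println(streamsEqual(
                new ByteArrayInputStream(big), new ByteArrayInputStream(bigDiff)));
    }
}
```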

@davidzollo
Contributor

[image attachment]
