[Feature][Connector-File-Hadoop] Support sync_mode=update for HdfsFile source (binary) #10268
…e source (binary)
    private boolean fileContentEquals(String sourceFilePath, String targetFilePath)
            throws IOException {
        try (InputStream sourceIn = hadoopFileSystemProxy.getInputStream(sourceFilePath);
                InputStream targetIn = targetHadoopFileSystemProxy.getInputStream(targetFilePath)) {
            byte[] sourceBuffer = new byte[8 * 1024];
            byte[] targetBuffer = new byte[8 * 1024];

            while (true) {
                int sourceRead = sourceIn.read(sourceBuffer);
                int targetRead = targetIn.read(targetBuffer);
                if (sourceRead != targetRead) {
                    return false;
                }
                if (sourceRead == -1) {
                    return true;
                }
                for (int i = 0; i < sourceRead; i++) {
                    if (sourceBuffer[i] != targetBuffer[i]) {
                        return false;
                    }
                }
            }
        }
    }
If the comparison falls back to a byte-by-byte check of file contents, could large files lead to high memory usage? Is there a risk of OOM?
Memory usage here is constant: we compare via streaming reads with two fixed 8 KB buffers (no full-file buffering), so large files shouldn't cause OOM; they only cost more I/O time.
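One related point on the streaming loop: `InputStream.read(byte[])` is allowed to return fewer bytes than the buffer holds, so two streams with identical content can report different read counts in the same iteration. The following is a minimal sketch, not the PR's code, of a constant-memory comparison that fills each buffer before comparing. The class name `StreamCompare` and the helpers `readFully` and `contentEquals` are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.InputStream;

public class StreamCompare {
    private static final int BUFFER_SIZE = 8 * 1024;

    // Reads until the buffer is full or the stream ends.
    // Returns the number of bytes placed in buf (0 means end of stream).
    static int readFully(InputStream in, byte[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            int n = in.read(buf, total, buf.length - total);
            if (n == -1) {
                break;
            }
            total += n;
        }
        return total;
    }

    // Compares two streams chunk by chunk using two fixed-size buffers,
    // so memory use does not grow with file size.
    static boolean contentEquals(InputStream a, InputStream b) throws IOException {
        byte[] bufA = new byte[BUFFER_SIZE];
        byte[] bufB = new byte[BUFFER_SIZE];
        while (true) {
            int readA = readFully(a, bufA);
            int readB = readFully(b, bufB);
            if (readA != readB) {
                return false; // one stream ended before the other
            }
            if (readA == 0) {
                return true; // both streams ended with no mismatch
            }
            for (int i = 0; i < readA; i++) {
                if (bufA[i] != bufB[i]) {
                    return false;
                }
            }
        }
    }
}
```

Because each buffer is filled completely unless the stream has ended, differing read counts can only mean the streams have different lengths.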

Purpose of this pull request
Add sync_mode=update to HdfsFile source to only sync new/changed files by comparing source/target (len+mtime or Hadoop getFileChecksum).
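To illustrate the len+mtime idea, here is a minimal sketch of a metadata-only decision. It is an assumption about how such a check could look, not the logic in this PR: the `FileMeta` holder, the `needsSync` method, and the rule "copy when the target is missing, the lengths differ, or the source is newer" are all hypothetical.

```java
public class UpdateDecision {

    // Hypothetical holder for the two attributes being compared.
    static final class FileMeta {
        final long length;
        final long modificationTime;

        FileMeta(long length, long modificationTime) {
            this.length = length;
            this.modificationTime = modificationTime;
        }
    }

    // Returns true when the source file should be copied to the target.
    // Assumed rule: missing target, different length, or newer source.
    static boolean needsSync(FileMeta source, FileMeta target) {
        if (target == null) {
            return true; // target does not exist yet
        }
        if (source.length != target.length) {
            return true; // size changed
        }
        return source.modificationTime > target.modificationTime;
    }
}
```

A metadata check like this avoids reading file contents at all, at the cost of missing changes that preserve both length and modification time; a checksum-based comparison covers that case.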
Does this PR introduce any user-facing change?
Yes.
New source options: sync_mode/target_path/target_hadoop_conf/update_strategy/compare_mode.
Default behavior unchanged (sync_mode=full).
How was this patch tested?
Unit test (Windows skipped by design; CI Linux will run): mvnw -pl seatunnel-connectors-v2/connector-file/connector-file-base test -Dtest=UpdateSyncModeTest
Added E2E tests (run in CI/Docker env): HdfsFileIT#testHdfsBinaryUpdateModeDistcp / HdfsFileIT#testHdfsBinaryUpdateModeStrictChecksum.
Check list
New License Guide
incompatible-changes.md to describe the incompatibility caused by this PR.