
Conversation

@yzeng1618
Contributor

…e source (binary)

Purpose of this pull request

Add sync_mode=update to the HdfsFile source so that only new or changed files are synced, comparing source and target either by length+mtime or by Hadoop getFileChecksum.

  • Add sync_mode=update for the HdfsFile source (currently only supported with file_format_type=binary).
  • Support update_strategy=distcp (behaves like distcp -update) and update_strategy=strict.
  • Support compare_mode=len_mtime and compare_mode=checksum (checksum is valid only with update_strategy=strict).
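For illustration, a source config combining these options might look like the following sketch (the paths, cluster addresses, and the exact shape of target_hadoop_conf are hypothetical; option names follow the list above):

```hocon
source {
  HdfsFile {
    path = "/data/input"
    file_format_type = "binary"
    fs.defaultFS = "hdfs://source-cluster:8020"

    # new options introduced by this PR
    sync_mode = "update"
    target_path = "/data/output"
    target_hadoop_conf = {
      "fs.defaultFS" = "hdfs://target-cluster:8020"
    }
    update_strategy = "distcp"    # or "strict"
    compare_mode = "len_mtime"    # "checksum" is valid only with "strict"
  }
}
```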

Does this PR introduce any user-facing change?

Yes.

New source options: sync_mode/target_path/target_hadoop_conf/update_strategy/compare_mode.
Default behavior unchanged (sync_mode=full).
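The distcp-like skip decision under compare_mode=len_mtime can be sketched as follows (this is an illustrative standalone class, not the PR's actual code: a file is re-synced only when the target copy is missing or its length/modification time differ from the source):

```java
public final class UpdateDecision {
    // Returns true when the source file should be copied to the target,
    // mirroring distcp -update semantics for compare_mode=len_mtime.
    public static boolean shouldCopy(
            boolean targetExists, long srcLen, long srcMtime, long tgtLen, long tgtMtime) {
        if (!targetExists) {
            return true; // new file: always copy
        }
        // changed file: length or modification time differs
        return srcLen != tgtLen || srcMtime != tgtMtime;
    }

    public static void main(String[] args) {
        System.out.println(shouldCopy(false, 10, 100, 0, 0));   // missing target -> true
        System.out.println(shouldCopy(true, 10, 100, 10, 100)); // unchanged -> false
        System.out.println(shouldCopy(true, 10, 200, 10, 100)); // mtime changed -> true
    }
}
```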

How was this patch tested?

  • Unit test (skipped on Windows by design; runs in Linux CI): mvnw -pl seatunnel-connectors-v2/connector-file/connector-file-base test -Dtest=UpdateSyncModeTest

  • Added E2E tests (run in CI/Docker env): HdfsFileIT#testHdfsBinaryUpdateModeDistcp / HdfsFileIT#testHdfsBinaryUpdateModeStrictChecksum.

Check list

Comment on lines +734 to +757
private boolean fileContentEquals(String sourceFilePath, String targetFilePath)
        throws IOException {
    try (InputStream sourceIn = hadoopFileSystemProxy.getInputStream(sourceFilePath);
            InputStream targetIn =
                    targetHadoopFileSystemProxy.getInputStream(targetFilePath)) {
        byte[] sourceBuffer = new byte[8 * 1024];
        byte[] targetBuffer = new byte[8 * 1024];

        while (true) {
            // readNBytes fills the buffer until full or EOF, so both sides
            // return comparable chunk sizes; a plain read() may return short
            // counts and report a false mismatch on identical files
            int sourceRead = sourceIn.readNBytes(sourceBuffer, 0, sourceBuffer.length);
            int targetRead = targetIn.readNBytes(targetBuffer, 0, targetBuffer.length);
            if (sourceRead != targetRead) {
                return false;
            }
            if (sourceRead == 0) { // both streams exhausted
                return true;
            }
            for (int i = 0; i < sourceRead; i++) {
                if (sourceBuffer[i] != targetBuffer[i]) {
                    return false;
                }
            }
        }
    }
}
Member

If this falls back to byte-by-byte file comparison and the file is large, could that lead to high memory usage? Is there a risk of OOM?

Contributor Author

Memory usage here is constant: we compare via streaming reads with two fixed 8KB buffers (no full-file buffering), so large files shouldn't cause OOM; the only cost is more I/O time.
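The constant-memory claim can be illustrated with a self-contained sketch (plain java.io streams stand in for the Hadoop filesystem proxies; readNBytes is used so partial reads can't cause spurious mismatches):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class StreamCompareDemo {
    // Compares two streams chunk by chunk with two fixed 8KB buffers,
    // so memory stays constant regardless of total stream length.
    static boolean streamsEqual(InputStream a, InputStream b) throws IOException {
        byte[] bufA = new byte[8 * 1024];
        byte[] bufB = new byte[8 * 1024];
        while (true) {
            int nA = a.readNBytes(bufA, 0, bufA.length);
            int nB = b.readNBytes(bufB, 0, bufB.length);
            if (nA != nB || !Arrays.equals(bufA, 0, nA, bufB, 0, nA)) {
                return false;
            }
            if (nA == 0) { // both streams exhausted
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] big = new byte[100_000]; // spans several 8KB chunks
        Arrays.fill(big, (byte) 7);
        byte[] bigDiff = big.clone();
        bigDiff[99_999] = 8; // differs only in the very last byte

        System.out.println(streamsEqual(
                new ByteArrayInputStream(big), new ByteArrayInputStream(big.clone())));
        System.out.println(streamsEqual(
                new ByteArrayInputStream(big), new ByteArrayInputStream(bigDiff)));
    }
}
```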

@davidzollo
Contributor

[image attachment]
