
[Feature][Connector-V2][HdfsFile] Support true large-file split for parallel read #10332

Merged
davidzollo merged 9 commits into apache:dev from yzeng1618:dev-hdfs-split on Jan 29, 2026

Conversation

@yzeng1618 (Collaborator) commented Jan 13, 2026

Purpose of this pull request

#10326
The HdfsFile source currently uses "one file = one split", which limits parallelism when there are only a few huge files.

Does this PR introduce any user-facing change?

yes

  1. This PR adds enable_file_split / file_split_size to the HdfsFile source and wires in HDFS-specific split strategies:
  • text/csv/json: split by file_split_size and align each split end to the next row_delimiter (an HDFS seek-based implementation for large files).

  • parquet: split by RowGroup (a logical split), reading footer metadata through a HadoopConf-backed Configuration (works with Kerberos/HA/NameService). Illustrative sketches of both strategies follow the example below.

  2. Example:
enable_file_split = true
file_split_size = 268435456
row_delimiter = "\n" (for text/csv/json)
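For illustration, a minimal sketch of the seek-based delimiter alignment for text/csv/json splits (not the PR's exact code; the class name DelimiterAlign is invented for the example, and production code would buffer reads rather than issue single-byte read() calls):

// Minimal sketch: align a split end to the next row delimiter using seek().
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;

final class DelimiterAlign {
    // Returns the offset just past the next delimiter at or after nominalEnd, or fileSize if none.
    static long alignToNextDelimiter(
            FSDataInputStream in, long nominalEnd, long fileSize, byte[] delimiter)
            throws IOException {
        // Start slightly early so a delimiter straddling the boundary is still found.
        long pos = Math.max(0, nominalEnd - (delimiter.length - 1));
        in.seek(pos); // O(1) positioning on HDFS instead of byte-by-byte skip()
        int matched = 0;
        int b;
        while (pos < fileSize && (b = in.read()) != -1) {
            pos++;
            if ((byte) b == delimiter[matched]) {
                matched++;
                if (matched == delimiter.length) {
                    return pos; // the split ends just after the delimiter
                }
            } else {
                // Naive restart; adequate for short delimiters like "\n" or "\r\n".
                matched = ((byte) b == delimiter[0]) ? 1 : 0;
            }
        }
        return fileSize; // no delimiter found: the last split runs to end of file
    }
}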

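And a sketch of the parquet RowGroup-based logical split, reading footer metadata through a HadoopConf-backed Configuration so Kerberos/HA/NameService settings apply (again illustrative; the class name ParquetRowGroupSplitter is invented for the example):

// Minimal sketch: one (start, length) range per RowGroup, derived from the footer.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

final class ParquetRowGroupSplitter {
    static List<long[]> rowGroupRanges(Configuration conf, String file) throws IOException {
        List<long[]> ranges = new ArrayList<>();
        try (ParquetFileReader reader =
                ParquetFileReader.open(HadoopInputFile.from(new Path(file), conf))) {
            for (BlockMetaData block : reader.getFooter().getBlocks()) {
                // Readers later map each range back to its RowGroup and read only that group.
                ranges.add(new long[] {block.getStartingPos(), block.getCompressedSize()});
            }
        }
        return ranges;
    }
}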
How was this patch tested?

  1. Unit tests
    HdfsFileAccordingToSplitSizeSplitStrategyTest#testReadBySplitsShouldMatchFullRead

  2. E2E
    HdfsFileIT#testHdfsTextReadWithFileSplit

Check list

@yzeng1618 yzeng1618 requested a review from chl-wxp January 14, 2026 02:10
@Carl-Zhou-CN (Member)

Hi, it seems that a parquet use case is still missing.

@zhangshenghang (Member)

Thanks for implementing this important feature! The overall approach looks solid, but I found 2 CRITICAL issues that should be addressed before merging:

1. Poor error observability (GEN-002)
AccordingToSplitSizeSplitStrategy.java:107-109,115-117 - All IOExceptions are mapped to FILE_READ_FAILED, losing the distinction between FileNotFoundException, permission denied, network timeout, etc. This makes troubleshooting difficult. Consider catching specific exception types and mapping them to appropriate error codes (a sketch of such a mapping follows this comment).

2. Severe performance regression for large file splits (GEN-003)
AbstractReadStrategy.java:510-521 - safeSlice() uses InputStream.skip() in a loop, which reads byte-by-byte. For HDFS files with non-first splits (e.g., start=10GB), this causes massive network/disk I/O just to reach the offset. HDFS supports FSDataInputStream.seek(), which should be used instead. Without this fix, enabling splits on large HDFS files will be slower than not splitting at all.

Note on COR-001: The idempotency concern in addSplitsBack() couldn't be fully verified from code review. The FileSourceSplit.equals() implementation suggests duplicates would be detected, but runtime behavior should be validated post-merge.
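A minimal sketch of the error-code mapping suggested in point 1 (the codes other than FILE_READ_FAILED are hypothetical; substitute the connector's real enum constants):

// Map IOException subtypes to distinct error codes instead of a blanket FILE_READ_FAILED.
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InterruptedIOException;
import org.apache.hadoop.security.AccessControlException;

final class ErrorMapping {
    static RuntimeException toConnectorException(String path, IOException e) {
        if (e instanceof FileNotFoundException) {
            return new RuntimeException("FILE_NOT_FOUND: " + path, e);    // hypothetical code
        }
        if (e instanceof AccessControlException) {
            return new RuntimeException("PERMISSION_DENIED: " + path, e); // hypothetical code
        }
        if (e instanceof InterruptedIOException) {
            // SocketTimeoutException extends InterruptedIOException
            return new RuntimeException("READ_TIMEOUT: " + path, e);      // hypothetical code
        }
        return new RuntimeException("FILE_READ_FAILED: " + path, e);
    }
}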

@yzeng1618 (Collaborator, Author)

(quoted @zhangshenghang's review above)

Both issues have been fixed as requested.

@Carl-Zhou-CN (Member) previously approved these changes on Jan 17, 2026:
+1

@yzeng1618 yzeng1618 requested a review from chl-wxp January 19, 2026 10:29
@davidzollo (Contributor)

+1 if the conflicts are solved.

@yzeng1618 (Collaborator, Author)

(quoted @davidzollo's comment above)

The conflict has been resolved.

@davidzollo (Contributor) previously approved these changes on Jan 24, 2026:
+1
LGTM

@DanielCarter-stack

Issue 1: Binary Compatibility Break - Removal of LocalFileAccordingToSplitSizeSplitStrategy

Location:

  • seatunnel-connectors-v2/connector-file/connector-file-local/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/local/source/split/LocalFileAccordingToSplitSizeSplitStrategy.java (deleted)

Related Context:

  • Original class: LocalFileAccordingToSplitSizeSplitStrategy extends AccordingToSplitSizeSplitStrategy
  • Replaced by: Direct use of AccordingToSplitSizeSplitStrategy + LocalFileHadoopConfig

Problem Description:
The PR deleted the LocalFileAccordingToSplitSizeSplitStrategy class. If external users or third-party Connectors inherit from this class, compilation errors or runtime ClassNotFoundException will occur after upgrade.

Potential Risks:

  • Risk 1: External inheritors will fail to compile (if source dependency)
  • Risk 2: Third-party Connectors compiled with old versions will have runtime ClassNotFoundException (if binary dependency)
  • Risk 3: Although File Connector is an internal module, there is no guarantee that users haven't extended based on it

Impact Scope:

  • Direct Impact: None (no inheritance within SeaTunnel)
  • Indirect Impact: Possible third-party Connectors
  • Affected Area: LocalFile Connector

Severity: MINOR

Improvement Suggestions:

// Keep old class as deprecated adapter
@Deprecated
public class LocalFileAccordingToSplitSizeSplitStrategy extends AccordingToSplitSizeSplitStrategy {
    public LocalFileAccordingToSplitSizeSplitStrategy(
            String rowDelimiter, long skipHeaderRowNumber, String encodingName, long splitSize) {
        super(new LocalFileHadoopConfig(), rowDelimiter, skipHeaderRowNumber, encodingName, splitSize);
    }
}

Rationale: Keep one version as a deprecated adapter to give users migration time, and completely remove it in version 2.8.0.


Issue 2: Boundary Condition Defect - delimiterBytes.length == 0 causes seek failure

Location:
AccordingToSplitSizeSplitStrategy.java:207-209

long scanStart = Math.max(0, startPos - (delimiterBytes.length - 1));
input.seek(scanStart);

Related Context:

  • Constructor validation: if (rowDelimiter == null || rowDelimiter.isEmpty()) guards against null, but not against length 0
  • Actually "".isEmpty() == true, so the constructor will intercept, but this is an implicit dependency

Problem Description:
Although the constructor will intercept rowDelimiter.isEmpty(), the algorithm in findNextDelimiterWithSeek assumes delimiterBytes.length >= 1. If someone modifies constructor validation logic in the future (e.g., allowing empty delimiter to indicate no splitting), a bug will occur here.

A subtler point: when delimiterBytes.length == 1, scanStart = startPos - 0 = startPos and the algorithm is trivially correct; for multi-byte delimiters (already supported), the algorithm relies on scanStart being less than startPos so that a delimiter straddling the split boundary is still captured.

Potential Risks:

  • Risk 1: If constructor validation is bypassed (e.g., reflection, deserialization attack), delimiterBytes.length == 0 will cause scanStart = startPos - (-1) = startPos + 1, seek exceeds file range
  • Risk 2: Code maintainers may not understand why length - 1, mistakenly changing it to other values

Impact Scope:

  • Direct Impact: AccordingToSplitSizeSplitStrategy.findNextDelimiterWithSeek()
  • Indirect Impact: All scenarios using text/csv/json splitting
  • Affected Area: HdfsFile, LocalFile

Severity: MINOR

Improvement Suggestions:

private long findNextDelimiterWithSeek(FSDataInputStream input, long startPos, long fileSize)
        throws IOException {
    // Explicit assertion to clarify algorithm assumptions
    if (delimiterBytes.length == 0) {
        throw new IllegalStateException("delimiterBytes must not be empty");
    }
    
    // Scan back by (delimiter length - 1) bytes to catch boundary-straddling delimiters
    long scanStart = Math.max(0, startPos - (delimiterBytes.length - 1));
    input.seek(scanStart);
    // ...
}

Rationale: Add defensive assertions to make algorithm assumptions explicit even though the constructor has validated, facilitating future maintenance.


Issue 3: API Compatibility - AccordingToSplitSizeSplitStrategy changed from abstract to concrete

Location:
AccordingToSplitSizeSplitStrategy.java:50

// Old version
public abstract class AccordingToSplitSizeSplitStrategy implements FileSplitStrategy {
    protected abstract InputStream getInputStream(String filePath) throws IOException;
    protected abstract long getFileSize(String filePath) throws IOException;
}

// New version
public class AccordingToSplitSizeSplitStrategy implements FileSplitStrategy, Closeable {
    public AccordingToSplitSizeSplitStrategy(HadoopConf hadoopConf, ...) {
        this.hadoopFileSystemProxy = new HadoopFileSystemProxy(hadoopConf);
    }
}

Related Context:

  • Subclass: LocalFileAccordingToSplitSizeSplitStrategy (deleted)
  • No other known subclasses

Problem Description:
AccordingToSplitSizeSplitStrategy changed from abstract class to concrete class, and constructor signature changed from (String, long, String, long) to (HadoopConf, String, long, String, long). This is a breaking change.

Although there are no other subclasses within SeaTunnel, if external extensions inherit from this class, they will:

  1. Compilation error: unable to instantiate parent class
  2. Abstract methods getInputStream()/getFileSize() disappear, subclass compilation fails

Potential Risks:

  • Risk 1: Third-party Connectors inheriting this class will fail to compile
  • Risk 2: In a plugin architecture, user-defined File Sources may be affected

Impact Scope:

  • Direct Impact: All classes inheriting AccordingToSplitSizeSplitStrategy (none internally)
  • Indirect Impact: Possible third-party extensions
  • Affected Area: File Connector ecosystem

Severity: MINOR

Improvement Suggestions:
Option 1 (Recommended): Explicitly mark Breaking Changes in release notes, requiring users upgrading to 2.7.x to check if they inherit this class.

Option 2: Keep abstract base class, add concrete implementation

// Keep old abstract class (deprecated)
@Deprecated
public abstract class LegacyAccordingToSplitSizeSplitStrategy implements FileSplitStrategy {
    // Old abstract method
    protected abstract InputStream getInputStream(String filePath) throws IOException;
    protected abstract long getFileSize(String filePath) throws IOException;
}

// New concrete implementation
public class AccordingToSplitSizeSplitStrategy implements FileSplitStrategy, Closeable {
    // New implementation
}

Rationale: Although there are no internal subclasses currently, as a framework library, API stability should be maintained, at least marking @Deprecated to provide a transition period.


Issue 4: Code Duplication - HdfsFileSource and LocalFileSource have identical initialization logic

Location:

  • HdfsFileSource.java:36-48 (newly added)
  • LocalFileSource.java:35-47 (modified)

Related Context:

// HdfsFileSource.java
private static FileSplitStrategy initFileSplitStrategy(MultipleTableHdfsFileSourceConfig config) {
    Map<String, FileSplitStrategy> splitStrategies = new HashMap<>();
    for (BaseFileSourceConfig fileSourceConfig : config.getFileSourceConfigs()) {
        String tableId = fileSourceConfig.getCatalogTable().getTableId().toTablePath().toString();
        splitStrategies.put(tableId,
            FileSplitStrategyFactory.initFileSplitStrategy(
                fileSourceConfig.getBaseFileSourceConfig(),
                fileSourceConfig.getHadoopConfig()));
    }
    return new MultipleTableFileSplitStrategy(splitStrategies);
}

// LocalFileSource.java - Exactly the same!
private static FileSplitStrategy initFileSplitStrategy(MultipleTableLocalFileSourceConfig sourceConfig) {
    Map<String, FileSplitStrategy> splitStrategies = new HashMap<>();
    for (BaseFileSourceConfig fileSourceConfig : sourceConfig.getFileSourceConfigs()) {
        String tableId = fileSourceConfig.getCatalogTable().getTableId().toTablePath().toString();
        splitStrategies.put(tableId,
            FileSplitStrategyFactory.initFileSplitStrategy(
                fileSourceConfig.getBaseFileSourceConfig(),
                fileSourceConfig.getHadoopConfig()));
    }
    return new MultipleTableFileSplitStrategy(splitStrategies);
}

Problem Description:
The initFileSplitStrategy() method logic in both classes is completely identical, violating the DRY (Don't Repeat Yourself) principle. If initialization logic needs to be modified in the future (e.g., adding caching, logging), two places need to be modified simultaneously.

Potential Risks:

  • Risk 1: Increased maintenance cost (synchronized modifications to two places)
  • Risk 2: Easy to create inconsistencies (e.g., modifying only one place, forgetting the other)

Impact Scope:

  • Direct Impact: HdfsFileSource, LocalFileSource
  • Indirect Impact: Newly added FileSources in the future (e.g., S3FileSource)
  • Affected Area: All File Sources

Severity: MINOR

Improvement Suggestions:

// BaseMultipleTableFileSource.java
public abstract class BaseMultipleTableFileSource {
    protected static FileSplitStrategy initFileSplitStrategy(
            BaseMultipleTableFileSourceConfig sourceConfig) {
        Map<String, FileSplitStrategy> splitStrategies = new HashMap<>();
        for (BaseFileSourceConfig fileSourceConfig : sourceConfig.getFileSourceConfigs()) {
            String tableId = fileSourceConfig.getCatalogTable().getTableId().toTablePath().toString();
            splitStrategies.put(tableId,
                FileSplitStrategyFactory.initFileSplitStrategy(
                    fileSourceConfig.getBaseFileSourceConfig(),
                    fileSourceConfig.getHadoopConfig()));
        }
        return new MultipleTableFileSplitStrategy(splitStrategies);
    }
}

// HdfsFileSource.java
private HdfsFileSource(MultipleTableHdfsFileSourceConfig sourceConfig) {
    super(sourceConfig, initFileSplitStrategy(sourceConfig));  // Call parent class method
}

// LocalFileSource.java
private LocalFileSource(MultipleTableLocalFileSourceConfig sourceConfig) {
    super(sourceConfig, initFileSplitStrategy(sourceConfig));  // Call parent class method
}

Rationale: Extract common logic to parent class to reduce duplication and improve maintainability.


Issue 6: Incomplete Documentation - Missing performance recommendations and limitation notes

Location:
docs/en/connectors/source/HdfsFile.md:256-265

Related Context:
Current documentation:

### enable_file_split [boolean]
Turn on the file splitting function, the default is false. ...

### file_split_size [long]
File split size, which can be filled in when the enable_file_split parameter is true. ...
The default value is the number of bytes of 128MB, which is 134217728.

Problem Description:
The documentation is missing the following key information:

  1. Performance Recommendations: In which scenarios should it be enabled? In which scenarios should it not be?
  2. Limitation Notes: Maximum split number limit? Memory overhead?
  3. Tuning Guide: How to choose an appropriate file_split_size?
  4. Known Issues: For example, compressed files, archive files do not support splitting

Potential Risks:

  • Risk 1: Users abuse the feature (e.g., enabling splitting for small files) resulting in performance degradation
  • Risk 2: Users set too small split_size (e.g., 1KB) causing OOM
  • Risk 3: Users expect compressed files to also be splittable, actually not supported

Impact Scope:

  • Direct Impact: User experience
  • Indirect Impact: Increased support cost
  • Affected Area: HdfsFile, LocalFile

Severity: MINOR

Improvement Suggestions:

### enable_file_split [boolean]
Turn on the file splitting function, the default is false. ...

**Performance Recommendations:**
- **Enable when**: Reading few large files (> 1GB each) with parallelism > 4
- **Disable when**: Reading many small files (< 128MB each) or parallelism = 1
- **Not supported for**: Compressed files (gzip, bzip2), archive files (zip, tar)

**Limitations:**
- Maximum splits per file: 100,000 (configurable via `seatunnel.source.file.split.max-count`)
- Memory overhead: ~100 bytes per split

### file_split_size [long]
File split size, which can be filled in when the enable_file_split parameter is true. ...

**Tuning Guidelines:**
- **Recommended**: 128MB - 512MB for HDFS, 64MB - 256MB for local file
- **Too small (< 16MB)**: Excessive splits, high overhead
- **Too large (> 1GB)**: Insufficient parallelism
- **Formula**: `file_split_size = total_file_size / desired_parallelism`

Rationale: Add performance guidelines and limitation notes to help users use the feature correctly. A worked example of the sizing formula follows.
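A worked example of the formula above (illustrative numbers, not from the PR): a single 10 GiB file read with desired parallelism 8 gives file_split_size ≈ 10 GiB / 8 = 1.25 GiB, which exceeds the suggested 1 GB ceiling; rounding down to 512 MB instead yields 20 splits, enough to keep all 8 readers busy.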


Issue 7: Architectural Complexity - LocalFile requiring HadoopConf configuration concept overflow

Location:
LocalFileSource.java:39-42

splitStrategies.put(tableId,
    FileSplitStrategyFactory.initFileSplitStrategy(
        fileSourceConfig.getBaseFileSourceConfig(),
        fileSourceConfig.getHadoopConfig()));  // ← Does LocalFile need HadoopConf?

Related Context:

  • LocalFileHadoopConfig: LocalFile's HadoopConf implementation (actually just configuring file:///)
  • AccordingToSplitSizeSplitStrategy: requires HadoopConf to create HadoopFileSystemProxy

Problem Description:
To reuse AccordingToSplitSizeSplitStrategy, LocalFile scenarios are also forced to introduce the HadoopConf concept. Although technically feasible (through LocalFileHadoopConfig), it is conceptually unnatural:

  • LocalFile users typically don't care about Hadoop configuration
  • Error messages may contain "Hadoop" wording, confusing LocalFile users

Potential Risks:

  • Risk 1: User confusion ("Why do I need Hadoop configuration to read local files?")
  • Risk 2: Misleading error messages (e.g., "Hadoop configuration error")

Impact Scope:

  • Direct Impact: LocalFile Source user experience
  • Indirect Impact: Documentation complexity
  • Affected Area: LocalFile Connector

Severity: MINOR

Improvement Suggestions:
Option 1 (Recommended): Explicitly state in documentation

For **LocalFile**, the `hadoop_conf` is optional and defaults to `file:///` schema.
You rarely need to configure it unless you have custom Hadoop settings.

Option 2 (Complex): Abstract FileSystemProxy interface

public interface FileSystemProxy {
    FSDataInputStream getInputStream(String path) throws IOException;
    FileStatus getFileStatus(String path) throws IOException;
}

public class HadoopFileSystemProxy implements FileSystemProxy { ... }
public class LocalFileSystemProxy implements FileSystemProxy { ... }  // Pure Java NIO

Rationale: Option 1 is simple and effective, Option 2 has clearer architecture but requires more work. The current solution is acceptable.


Issue 8: Checkpoint Compatibility Not Fully Validated

Location:
FileSourceSplit.java:44-49

public FileSourceSplit(String tableId, String filePath, long start, long length) {
    this.tableId = tableId;
    this.filePath = filePath;
    this.start = start;      // ← New field assigned in constructor
    this.length = length;    // ← New field assigned in constructor
}

Related Context:

  • serialVersionUID = 1L
  • Old version FileSourceSplit may only have tableId/filePath, start/length use default values 0/-1

Problem Description:
The PR did not validate the behavior of restoring from old version (without file split feature) checkpoints to the new version:

  1. After deserializing FileSourceSplit from old checkpoint, start=0, length=-1
  2. New code will read by byte range, length=-1 may cause problems
  3. Need to check if AbstractReadStrategy.actualRead() correctly handles length=-1

Validation:
Check AbstractReadStrategy.safeSlice():

protected static InputStream safeSlice(InputStream in, long start, long length) throws IOException {
    // ...
    return new BoundedInputStream(in, length);  // How does BoundedInputStream handle length=-1?
}

Apache Commons IO's BoundedInputStream(-1) will read to the end of the stream, so the behavior is correct. But this is an implicit dependency.

Potential Risks:

  • Risk 1: Jobs upgrading from 2.6.x to 2.7.x may see data inconsistency after checkpoint recovery
  • Risk 2: The semantics of length=-1 are not clear enough (does it mean "read to end of file" or "disable split"?)

Impact Scope:

  • Direct Impact: Job recovery for upgrading users
  • Indirect Impact: Data consistency
  • Affected Area: All jobs using File Source

Severity: MAJOR

Improvement Suggestions:

// FileSourceSplit.java
public FileSourceSplit(String tableId, String filePath, long start, long length) {
    this.tableId = tableId;
    this.filePath = filePath;
    this.start = start;
    // Clarify semantics: -1 means "entire file" (compatible with old version)
    this.length = length;
}

// AbstractReadStrategy.java
protected static InputStream safeSlice(InputStream in, long start, long length) throws IOException {
    if (start > 0) {
        if (in instanceof Seekable) {
            ((Seekable) in).seek(start);
        } else {
            // ...
        }
    }
    // length=-1 means unlimited (compatible with old checkpoint)
    if (length < 0) {
        return in;
    }
    return new BoundedInputStream(in, length);
}

Also add tests:

@Test
void testDeserializeFromOldCheckpoint() {
    // Simulate FileSourceSplit from old version serialization
    FileSourceSplit oldSplit = new FileSourceSplit("table", "path");  // start=0, length=-1
    // Verify reading entire file
    List<String> records = readBySplit(oldSplit);
    Assertions.assertEquals(fullFileRecords, records);
}

Rationale: Clarify the semantics of length=-1, add tests to verify upgrade compatibility.


@yzeng1618 (Collaborator, Author)

(quoted @DanielCarter-stack's review above)

The above suggestions have been implemented.

@Carl-Zhou-CN (Member) left a comment:
LGTM

@davidzollo (Contributor) left a comment:
+1

@davidzollo davidzollo merged commit 4c110c8 into apache:dev Jan 29, 2026
5 checks passed
