[Feature][Connector-V2][HdfsFile] Support true large-file split for parallel read #10332
davidzollo merged 9 commits into apache:dev from
Conversation
Hi, it seems that a Parquet use case is still missing.
Thanks for implementing this important feature! The overall approach looks solid, but I found 2 CRITICAL issues that should be addressed before merging:

1. Poor error observability (GEN-002)
2. Severe performance regression for large file splits (GEN-003)

Note on COR-001: the idempotency concern in
The requested fixes have been completed.
**Issue 1: Binary Compatibility Break - Removal of `LocalFileAccordingToSplitSizeSplitStrategy`**

Severity: MINOR

Improvement Suggestions:

```java
// Keep the old class as a deprecated adapter
@Deprecated
public class LocalFileAccordingToSplitSizeSplitStrategy extends AccordingToSplitSizeSplitStrategy {
    public LocalFileAccordingToSplitSizeSplitStrategy(
            String rowDelimiter, long skipHeaderRowNumber, String encodingName, long splitSize) {
        super(new LocalFileHadoopConfig(), rowDelimiter, skipHeaderRowNumber, encodingName, splitSize);
    }
}
```

Rationale: Keep one version as a deprecated adapter to give users time to migrate, then remove it completely in version 2.8.0.

**Issue 2: Boundary Condition Defect - `delimiterBytes.length == 0` causes seek failure**

Location:

```java
long scanStart = Math.max(0, startPos - (delimiterBytes.length - 1));
input.seek(scanStart);
```
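As context for this boundary logic, here is a self-contained sketch (a hypothetical helper operating on an in-memory `byte[]`, not the PR's actual `FSDataInputStream` code) of why the scan must start `delimiterBytes.length - 1` bytes before the split boundary:

```java
// Sketch of the boundary-scan idea: a reader that starts at startPos must begin
// scanning delimiter.length - 1 bytes earlier, otherwise a multi-byte delimiter
// straddling the split boundary is missed.
public class DelimiterScanSketch {

    // Returns the position just past the first delimiter that ends after startPos,
    // or data.length if none is found.
    static long findNextDelimiterEnd(byte[] data, long startPos, byte[] delimiter) {
        if (delimiter.length == 0) {
            throw new IllegalStateException("delimiter must not be empty");
        }
        // Back up so a delimiter straddling startPos is still seen - the same
        // Math.max(0, startPos - (delimiter.length - 1)) trick as in the PR.
        long scanStart = Math.max(0, startPos - (delimiter.length - 1));
        for (long pos = scanStart; pos + delimiter.length <= data.length; pos++) {
            boolean match = true;
            for (int i = 0; i < delimiter.length; i++) {
                if (data[(int) pos + i] != delimiter[i]) {
                    match = false;
                    break;
                }
            }
            // Only accept delimiters ending after startPos so the previous
            // split's final delimiter is not counted twice.
            if (match && pos + delimiter.length > startPos) {
                return pos + delimiter.length;
            }
        }
        return data.length;
    }
}
```

For example, with `\r\n` as the delimiter and a split boundary falling between `\r` and `\n`, backing up one byte lets the scan still recognize the straddling delimiter.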
Problem Description: A more serious problem is: when

Severity: MINOR

Improvement Suggestions:

```java
private long findNextDelimiterWithSeek(FSDataInputStream input, long startPos, long fileSize)
        throws IOException {
    // Explicit assertion to clarify the algorithm's assumptions
    if (delimiterBytes.length == 0) {
        throw new IllegalStateException("delimiterBytes must not be empty");
    }
    // Extract magic numbers as constants
    long scanStart = Math.max(0, startPos - (delimiterBytes.length - 1));
    input.seek(scanStart);
    // ...
}
```

Rationale: Add a defensive assertion to make the algorithm's assumptions explicit even though the constructor has already validated the delimiter; this eases future maintenance.

**Issue 3: API Compatibility - `AccordingToSplitSizeSplitStrategy` changed from abstract to concrete**

Location:

```java
// Old version
public abstract class AccordingToSplitSizeSplitStrategy implements FileSplitStrategy {
    protected abstract InputStream getInputStream(String filePath) throws IOException;
    protected abstract long getFileSize(String filePath) throws IOException;
}

// New version
public class AccordingToSplitSizeSplitStrategy implements FileSplitStrategy, Closeable {
    public AccordingToSplitSizeSplitStrategy(HadoopConf hadoopConf, ...) {
        this.hadoopFileSystemProxy = new HadoopFileSystemProxy(hadoopConf);
    }
}
```
Problem Description: Although there are no other subclasses within SeaTunnel, if external extensions inherit from this class, they will fail to compile, since the abstract methods they override no longer exist.

Severity: MINOR

Improvement Suggestions: Option 2: keep an abstract base class and add a concrete implementation.

```java
// Keep the old abstract class (deprecated)
@Deprecated
public abstract class LegacyAccordingToSplitSizeSplitStrategy implements FileSplitStrategy {
    // Old abstract methods
    protected abstract InputStream getInputStream(String filePath) throws IOException;
    protected abstract long getFileSize(String filePath) throws IOException;
}

// New concrete implementation
public class AccordingToSplitSizeSplitStrategy implements FileSplitStrategy, Closeable {
    // New implementation
}
```

Rationale: Although there are no internal subclasses at present, SeaTunnel is a framework library and should maintain API stability, at least marking the old class as deprecated.

**Issue 4: Code Duplication - `HdfsFileSource` and `LocalFileSource` have identical initialization logic**
Related Context:

```java
// HdfsFileSource.java
private static FileSplitStrategy initFileSplitStrategy(MultipleTableHdfsFileSourceConfig config) {
    Map<String, FileSplitStrategy> splitStrategies = new HashMap<>();
    for (BaseFileSourceConfig fileSourceConfig : config.getFileSourceConfigs()) {
        String tableId = fileSourceConfig.getCatalogTable().getTableId().toTablePath().toString();
        splitStrategies.put(tableId,
                FileSplitStrategyFactory.initFileSplitStrategy(
                        fileSourceConfig.getBaseFileSourceConfig(),
                        fileSourceConfig.getHadoopConfig()));
    }
    return new MultipleTableFileSplitStrategy(splitStrategies);
}

// LocalFileSource.java - exactly the same!
private static FileSplitStrategy initFileSplitStrategy(MultipleTableLocalFileSourceConfig sourceConfig) {
    Map<String, FileSplitStrategy> splitStrategies = new HashMap<>();
    for (BaseFileSourceConfig fileSourceConfig : sourceConfig.getFileSourceConfigs()) {
        String tableId = fileSourceConfig.getCatalogTable().getTableId().toTablePath().toString();
        splitStrategies.put(tableId,
                FileSplitStrategyFactory.initFileSplitStrategy(
                        fileSourceConfig.getBaseFileSourceConfig(),
                        fileSourceConfig.getHadoopConfig()));
    }
    return new MultipleTableFileSplitStrategy(splitStrategies);
}
```

Severity: MINOR

Improvement Suggestions:

```java
// BaseMultipleTableFileSource.java
public abstract class BaseMultipleTableFileSource {
    protected static FileSplitStrategy initFileSplitStrategy(
            BaseMultipleTableFileSourceConfig sourceConfig) {
        Map<String, FileSplitStrategy> splitStrategies = new HashMap<>();
        for (BaseFileSourceConfig fileSourceConfig : sourceConfig.getFileSourceConfigs()) {
            String tableId = fileSourceConfig.getCatalogTable().getTableId().toTablePath().toString();
            splitStrategies.put(tableId,
                    FileSplitStrategyFactory.initFileSplitStrategy(
                            fileSourceConfig.getBaseFileSourceConfig(),
                            fileSourceConfig.getHadoopConfig()));
        }
        return new MultipleTableFileSplitStrategy(splitStrategies);
    }
}

// HdfsFileSource.java
private HdfsFileSource(MultipleTableHdfsFileSourceConfig sourceConfig) {
    super(sourceConfig, initFileSplitStrategy(sourceConfig)); // call the parent class method
}

// LocalFileSource.java
private LocalFileSource(MultipleTableLocalFileSourceConfig sourceConfig) {
    super(sourceConfig, initFileSplitStrategy(sourceConfig)); // call the parent class method
}
```

Rationale: Extract the common logic into a parent class to reduce duplication and improve maintainability.

**Issue 6: Incomplete Documentation - Missing performance recommendations and limitation notes**
Severity: MINOR

Improvement Suggestions:

### enable_file_split [boolean]

Turn on the file splitting function; the default is false. ...

**Performance Recommendations:**
- **Enable when**: reading a few large files (> 1GB each) with parallelism > 4
- **Disable when**: reading many small files (< 128MB each) or parallelism = 1
- **Not supported for**: compressed files (gzip, bzip2), archive files (zip, tar)

**Limitations:**
- Maximum splits per file: 100,000 (configurable via `seatunnel.source.file.split.max-count`)
- Memory overhead: ~100 bytes per split

### file_split_size [long]

File split size, which can be set when the enable_file_split parameter is true. ...

**Tuning Guidelines:**
- **Recommended**: 128MB - 512MB for HDFS, 64MB - 256MB for local files
- **Too small (< 16MB)**: excessive splits, high overhead
- **Too large (> 1GB)**: insufficient parallelism
- **Formula**: `file_split_size = total_file_size / desired_parallelism`

Rationale: Add performance guidelines and limitation notes to help users use the feature correctly.

**Issue 7: Architectural Complexity - LocalFile requiring HadoopConf leaks a Hadoop configuration concept into the local connector**

Location:

```java
splitStrategies.put(tableId,
        FileSplitStrategyFactory.initFileSplitStrategy(
                fileSourceConfig.getBaseFileSourceConfig(),
                fileSourceConfig.getHadoopConfig())); // ← does LocalFile need HadoopConf?
```
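The tuning formula suggested for `file_split_size` above (`total_file_size / desired_parallelism`) can be sanity-checked with a small arithmetic sketch (hypothetical helper class, not part of the PR):

```java
// Arithmetic sketch of the tuning guidance above: derive file_split_size from
// the desired parallelism, and note that the split count is a ceiling division.
public class SplitSizingSketch {

    // Number of splits a file of fileSize bytes yields at a given split size.
    static long splitCount(long fileSize, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        return (fileSize + splitSize - 1) / splitSize; // ceiling division
    }

    // file_split_size = total_file_size / desired_parallelism
    static long suggestedSplitSize(long totalFileSize, int desiredParallelism) {
        return totalFileSize / desiredParallelism;
    }
}
```

For instance, a single 10GB file at a 256MB split size yields 40 splits, enough to keep a parallelism of 4-8 busy.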
Severity: MINOR

Improvement Suggestions:

Option 1 (Simple): document it, e.g.:

For **LocalFile**, the `hadoop_conf` is optional and defaults to the `file:///` schema.
You rarely need to configure it unless you have custom Hadoop settings.

Option 2 (Complex): abstract a filesystem interface:

```java
public interface FileSystemProxy {
    FSDataInputStream getInputStream(String path) throws IOException;
    FileStatus getFileStatus(String path) throws IOException;
}

public class HadoopFileSystemProxy implements FileSystemProxy { ... }
public class LocalFileSystemProxy implements FileSystemProxy { ... } // pure Java NIO
```

Rationale: Option 1 is simple and effective; Option 2 has a clearer architecture but requires more work. The current solution is acceptable.

**Issue 8: Checkpoint Compatibility Not Fully Validated**

Location:

```java
public FileSourceSplit(String tableId, String filePath, long start, long length) {
    this.tableId = tableId;
    this.filePath = filePath;
    this.start = start;   // ← new field assigned in the constructor
    this.length = length; // ← new field assigned in the constructor
}
```

Validation:

```java
protected static InputStream safeSlice(InputStream in, long start, long length) throws IOException {
    // ...
    return new BoundedInputStream(in, length); // how does BoundedInputStream handle length = -1?
}
```

Apache Commons IO's `BoundedInputStream` behavior for a negative length needs to be verified.

Severity: MAJOR

Improvement Suggestions:

```java
// FileSourceSplit.java
public FileSourceSplit(String tableId, String filePath, long start, long length) {
    this.tableId = tableId;
    this.filePath = filePath;
    this.start = start;
    // Clarify the semantics: -1 means "entire file" (compatible with the old version)
    this.length = length;
}

// AbstractReadStrategy.java
protected static InputStream safeSlice(InputStream in, long start, long length) throws IOException {
    if (start > 0) {
        if (in instanceof Seekable) {
            ((Seekable) in).seek(start);
        } else {
            // ...
        }
    }
    // length = -1 means unlimited (compatible with old checkpoints)
    if (length < 0) {
        return in;
    }
    return new BoundedInputStream(in, length);
}
```

Also add tests:

```java
@Test
void testDeserializeFromOldCheckpoint() {
    // Simulate a FileSourceSplit serialized by the old version
    FileSourceSplit oldSplit = new FileSourceSplit("table", "path"); // start=0, length=-1
    // Verify that the entire file is read
    List<String> records = readBySplit(oldSplit);
    Assertions.assertEquals(fullFileRecords, records);
}
```

Rationale: Clarify the semantics of `length = -1` so splits restored from old checkpoints remain readable.
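To make the suggested `length = -1` semantics concrete, here is a minimal, self-contained sketch (a hypothetical `readSlice` helper using `ByteArrayInputStream` in place of a real `FSDataInputStream`, not SeaTunnel's actual code):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Sketch of the proposed semantics: length < 0 is treated as "read to the end
// of the file", so a split deserialized from an old checkpoint
// (start = 0, length = -1) still covers the whole file.
public class SafeSliceSketch {

    static byte[] readSlice(InputStream in, long start, long length) throws IOException {
        // Stand-in for Seekable.seek(start) on a real filesystem stream.
        long skipped = 0;
        while (skipped < start) {
            long n = in.skip(start - skipped);
            if (n <= 0) {
                break; // reached EOF before start
            }
            skipped += n;
        }
        if (length < 0) {
            // Old-checkpoint compatibility: unbounded read to EOF.
            return in.readAllBytes();
        }
        byte[] buf = new byte[(int) length];
        int read = in.readNBytes(buf, 0, (int) length);
        return Arrays.copyOf(buf, read);
    }
}
```

The real implementation would keep `BoundedInputStream` for the bounded case; the point of the sketch is only the explicit `length < 0` branch rather than relying on whatever a downstream library does with a negative bound.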
The above suggestions have been implemented.


Purpose of this pull request
#10326
HdfsFile source currently uses “one file = one split”, which limits parallelism when there are only a few huge files.
Does this PR introduce any user-facing change?
yes
text/csv/json: split by file_split_size and align split end to the next row_delimiter (HDFS seek-based implementation for large files).
parquet: split by RowGroup (logical split) and read footer metadata using HadoopConf-backed Configuration (works with Kerberos/HA/NameService).
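As a usage illustration only: the option names `enable_file_split` and `file_split_size` come from this PR's documentation, while the source name, cluster address, path, and other settings below are placeholders. A job config exercising the split feature might look like:

```hocon
source {
  HdfsFile {
    # Placeholder cluster address and path
    fs.defaultFS = "hdfs://namenode:8020"
    path = "/data/events"
    file_format_type = "text"
    row_delimiter = "\n"
    # New options introduced by this PR
    enable_file_split = true
    file_split_size = 268435456  # 256 MB, within the recommended 128MB-512MB range for HDFS
  }
}
```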
How was this patch tested?
Unit tests
HdfsFileAccordingToSplitSizeSplitStrategyTest#testReadBySplitsShouldMatchFullRead
E2E
HdfsFileIT#testHdfsTextReadWithFileSplit
Check list
New License Guide
incompatible-changes.md to describe the incompatibility caused by this PR.