
Conversation

yzeng1618 (Contributor) commented on Jan 13, 2026:

Purpose of this pull request

#10326
The HdfsFile source currently uses "one file = one split", so read parallelism is capped by the file count; this limits throughput when there are only a few huge files.

Does this PR introduce any user-facing change?

yes

  1. This PR adds enable_file_split / file_split_size to the HdfsFile source and wires in HDFS-specific split strategies:
  • text/csv/json: split by file_split_size and align each split end to the next row_delimiter (an HDFS seek-based implementation suited to large files); see the first sketch after the example below.

  • parquet: split by RowGroup (a logical split) and read the footer metadata through a HadoopConf-backed Configuration (so it works with Kerberos/HA/NameService); see the second sketch below.

  2. Example:
     enable_file_split = true
     file_split_size = 268435456   (256 MB)
     row_delimiter = "\n"          (for text/csv/json)
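
For illustration, here is a minimal sketch of the seek-based delimiter alignment; it is not the PR's actual code, and the helper name alignToDelimiter and its signature are assumptions. The point is that the reader seeks straight to the tentative split end and scans forward only until the next delimiter byte, instead of streaming the whole file to place split boundaries:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Sketch only: align a tentative split end to the next row delimiter. */
public class SplitAlignmentSketch {

    /**
     * Returns the first offset at or after tentativeEnd that lies just past a
     * delimiter byte, or fileLength if the tail holds no further delimiter.
     */
    static long alignToDelimiter(
            FileSystem fs, Path file, long tentativeEnd, long fileLength, byte delimiter)
            throws IOException {
        if (tentativeEnd >= fileLength) {
            return fileLength;
        }
        try (FSDataInputStream in = fs.open(file)) {
            in.seek(tentativeEnd); // jump directly; no scan of the preceding bytes
            long pos = tentativeEnd;
            int b;
            while ((b = in.read()) != -1) {
                pos++;
                if ((byte) b == delimiter) {
                    return pos; // split ends just past '\n', so no row is cut in half
                }
            }
            return fileLength; // last split absorbs the remainder
        }
    }
}
```

Because each split ends just past a delimiter, the next split can start at exactly that offset and no row is lost or read twice.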
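And a sketch of the Parquet side, using the standard parquet-hadoop footer API (ParquetFileReader, HadoopInputFile). The surrounding class is hypothetical, but passing a HadoopConf-backed Configuration is what lets the Kerberos/HA/NameService settings take effect when reading the footer:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

/**
 * Sketch: read only the footer and emit one {startingPos, rowCount} pair per
 * RowGroup. Not the PR's code; class and method layout are illustrative.
 */
public class ParquetRowGroupSplitSketch {

    static List<long[]> rowGroupSplits(Path file, Configuration hadoopConf) throws IOException {
        List<long[]> splits = new ArrayList<>();
        try (ParquetFileReader reader =
                ParquetFileReader.open(HadoopInputFile.from(file, hadoopConf))) {
            for (BlockMetaData block : reader.getFooter().getBlocks()) {
                // A RowGroup is the natural split unit: a reader can jump
                // straight to its starting offset without scanning the file.
                splits.add(new long[] {block.getStartingPos(), block.getRowCount()});
            }
        }
        return splits;
    }
}
```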

How was this patch tested?

  1. Unit test:
     HdfsFileAccordingToSplitSizeSplitStrategyTest#testReadBySplitsShouldMatchFullRead
     (a sketch of this equivalence check follows the list)

  2. E2E test:
     HdfsFileIT#testHdfsTextReadWithFileSplit
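
The equivalence property behind that unit test can be sketched as follows; every helper here is a stand-in, not the PR's actual test API:

```java
import java.util.ArrayList;
import java.util.List;

import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Test;

/** Sketch: reading split by split must yield exactly the full-file contents. */
class SplitEquivalenceSketchTest {

    @Test
    void readBySplitsShouldMatchFullRead() throws Exception {
        List<String> full = readWholeFile(); // baseline: one reader, no splitting
        List<String> bySplits = new ArrayList<>();
        for (long[] range : computeSplits()) { // {start, length} pairs
            bySplits.addAll(readRange(range[0], range[1]));
        }
        // Delimiter alignment must guarantee no row is lost or duplicated.
        Assertions.assertEquals(full, bySplits);
    }

    // Stubs standing in for real file I/O (assumptions, not the PR's helpers).
    private List<String> readWholeFile() { return new ArrayList<>(); }
    private List<long[]> computeSplits() { return new ArrayList<>(); }
    private List<String> readRange(long start, long length) { return new ArrayList<>(); }
}
```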

Check list

import java.util.ArrayList;
import java.util.List;

public class HdfsFileAccordingToSplitSizeSplitStrategy implements FileSplitStrategy, Closeable {
Reviewer comment (Contributor):

Suggested change:
- public class HdfsFileAccordingToSplitSizeSplitStrategy implements FileSplitStrategy, Closeable {
+ public class HdfsFileAccordingToSplitSizeSplitStrategy extends AccordingToSplitSizeSplitStrategy {

1. Modify AccordingToSplitSizeSplitStrategy to introduce HadoopFileSystemProxy.
2. Delete LocalFileAccordingToSplitSizeSplitStrategy.
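
For illustration, the refactor might take roughly this shape; the field and constructor signatures are assumptions, not the exact proposal, and the FileSplitStrategy methods are omitted for brevity:

```java
import java.util.ArrayList;
import java.util.List;

// Assumed import path; HadoopFileSystemProxy is SeaTunnel's existing Hadoop FS wrapper.
import org.apache.seatunnel.connectors.seatunnel.file.hadoop.HadoopFileSystemProxy;

/**
 * Sketch: the size-based sharding logic lives once in the base class, and all
 * connectors reach the file system through HadoopFileSystemProxy instead of
 * keeping per-connector copies of the same code.
 */
public class AccordingToSplitSizeSplitStrategy {

    protected final HadoopFileSystemProxy fileSystemProxy;
    protected final long splitSize;

    public AccordingToSplitSizeSplitStrategy(HadoopFileSystemProxy proxy, long splitSize) {
        this.fileSystemProxy = proxy;
        this.splitSize = splitSize;
    }

    /**
     * Carve [0, fileLength) into {start, length} ranges of at most splitSize
     * bytes; delimiter alignment is then applied per range.
     */
    protected List<long[]> carve(long fileLength) {
        List<long[]> ranges = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            ranges.add(new long[] {start, Math.min(splitSize, fileLength - start)});
        }
        return ranges;
    }
}
```

With the proxy injected, the same carving logic can serve local, HDFS, S3, and OSS alike.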

yzeng1618 (Contributor, Author) replied:

Okay, so you still suggest reusing the base code as much as possible? My main concern is that it might affect the logic of the localfile code.

Reviewer (Contributor) replied:

> Okay, so you still suggest reusing the base code as much as possible? My main concern is that it might affect the logic of the localfile code.

Yes, the sharding logic should be maintained in one place in the base class, so there is not a lot of redundant code across connectors. That also makes it convenient to add this feature to other file system connectors such as S3 and OSS.


import static org.apache.seatunnel.connectors.seatunnel.file.config.FileBaseSourceOptions.DEFAULT_ROW_DELIMITER;

public class HdfsFileSplitStrategyFactory {
Reviewer comment (Contributor):

Suggested change:
- public class HdfsFileSplitStrategyFactory {
+ public class FileSplitStrategyFactory {

Also delete LocalFileSplitStrategyFactory.
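
Illustratively, the unified factory could dispatch on file format like this. The strategy names come from this PR's discussion, while OneFileOneSplitStrategy, the FileFormat import path, and the constructor shapes are all assumptions:

```java
// Assumed import path; FileFormat is the connector's existing format enum.
import org.apache.seatunnel.connectors.seatunnel.file.config.FileFormat;
import org.apache.seatunnel.connectors.seatunnel.file.hadoop.HadoopFileSystemProxy;

/** Sketch of a single connector-agnostic factory. */
public class FileSplitStrategyFactory {

    public static FileSplitStrategy create(
            FileFormat format, HadoopFileSystemProxy proxy,
            boolean enableFileSplit, long splitSize) {
        if (!enableFileSplit) {
            return new OneFileOneSplitStrategy(); // assumed name for the legacy behavior
        }
        switch (format) {
            case PARQUET:
                // RowGroups give natural split points; no delimiter needed.
                return new ParquetFileSplitStrategy(proxy);
            case TEXT:
            case CSV:
            case JSON:
                return new AccordingToSplitSizeSplitStrategy(proxy, splitSize);
            default:
                // Formats without safe split points keep one file = one split.
                return new OneFileOneSplitStrategy();
        }
    }
}
```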

import java.util.ArrayList;
import java.util.List;

public class HdfsParquetFileSplitStrategy implements FileSplitStrategy, Closeable {
Reviewer comment (Contributor):

Remove HdfsParquetFileSplitStrategy and enhance ParquetFileSplitStrategy

yzeng1618 requested a review from chl-wxp on January 14, 2026.