[Feature][Connector-V2][HdfsFile] Support true large-file split for parallel read #10332
Conversation
import java.util.ArrayList;
import java.util.List;

public class HdfsFileAccordingToSplitSizeSplitStrategy implements FileSplitStrategy, Closeable {
Suggested change:
- public class HdfsFileAccordingToSplitSizeSplitStrategy implements FileSplitStrategy, Closeable {
+ public class HdfsFileAccordingToSplitSizeSplitStrategy extends AccordingToSplitSizeSplitStrategy {
1. Modify AccordingToSplitSizeSplitStrategy to introduce HadoopFileSystemProxy.
2. Delete LocalFileAccordingToSplitSizeSplitStrategy.
Okay, so you still suggest reusing the base code as much as possible? My main concern is that it might affect the logic of the localfile code.
> Okay, so you still suggest reusing the base code as much as possible? My main concern is that it might affect the logic of the localfile code.
Yes, the split logic should be maintained uniformly in the base class so there is no code duplication. That also makes it easy to extend this feature to other file system connectors such as S3 and OSS.
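For reference, a minimal sketch of the hierarchy being suggested here. FileSplitStrategy and HadoopFileSystemProxy are the PR's/SeaTunnel's types; getFileLength, the FileSourceSplit constructor, and the getFileStatus call are illustrative assumptions, not the actual API:

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public abstract class AccordingToSplitSizeSplitStrategy implements FileSplitStrategy, Closeable {

    protected final long splitSize;

    protected AccordingToSplitSizeSplitStrategy(long splitSize) {
        this.splitSize = splitSize;
    }

    // Subclasses supply file-system access (local FS, HDFS, and later S3/OSS).
    protected abstract long getFileLength(String path) throws IOException;

    public List<FileSourceSplit> split(String path) throws IOException {
        long length = getFileLength(path);
        List<FileSourceSplit> splits = new ArrayList<>();
        // Size-based splitting lives only in the base class.
        for (long start = 0; start < length; start += splitSize) {
            splits.add(new FileSourceSplit(path, start, Math.min(start + splitSize, length)));
        }
        return splits;
    }
}

// In a separate file.
class HdfsFileAccordingToSplitSizeSplitStrategy extends AccordingToSplitSizeSplitStrategy {

    private final HadoopFileSystemProxy fileSystemProxy;

    HdfsFileAccordingToSplitSizeSplitStrategy(HadoopFileSystemProxy proxy, long splitSize) {
        super(splitSize);
        this.fileSystemProxy = proxy;
    }

    @Override
    protected long getFileLength(String path) throws IOException {
        // Assumes HadoopFileSystemProxy exposes getFileStatus; adjust to the real API.
        return fileSystemProxy.getFileStatus(path).getLen();
    }

    @Override
    public void close() throws IOException {
        fileSystemProxy.close();
    }
}
```

This keeps the size-based loop in one place while each connector only supplies file metadata access, which is what makes the later S3/OSS extension cheap.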
import static org.apache.seatunnel.connectors.seatunnel.file.config.FileBaseSourceOptions.DEFAULT_ROW_DELIMITER;

public class HdfsFileSplitStrategyFactory {
Suggested change:
- public class HdfsFileSplitStrategyFactory {
+ public class FileSplitStrategyFactory {
Delete LocalFileSplitStrategyFactory
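A rough sketch of what the unified factory could look like. FileFormat is assumed to be SeaTunnel's format enum; WholeFileSplitStrategy and the constructor signatures are illustrative assumptions:

```java
import org.apache.seatunnel.connectors.seatunnel.file.config.FileFormat;

public class FileSplitStrategyFactory {

    public static FileSplitStrategy create(
            FileFormat format, HadoopFileSystemProxy proxy, long splitSize) {
        switch (format) {
            case TEXT:
            case CSV:
            case JSON:
                // Row-oriented formats: split by size, align split ends to the row delimiter.
                return new HdfsFileAccordingToSplitSizeSplitStrategy(proxy, splitSize);
            case PARQUET:
                // Columnar format: split on RowGroup boundaries rather than raw byte ranges.
                return new ParquetFileSplitStrategy(proxy);
            default:
                // No safe intra-file split point: keep the one-file-one-split behavior.
                return new WholeFileSplitStrategy();
        }
    }
}
```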
import java.util.ArrayList;
import java.util.List;

public class HdfsParquetFileSplitStrategy implements FileSplitStrategy, Closeable {
Remove HdfsParquetFileSplitStrategy and enhance ParquetFileSplitStrategy
Purpose of this pull request
#10326
The HdfsFile source currently uses "one file = one split", which limits read parallelism when the input contains only a few huge files.
Does this PR introduce any user-facing change?
Yes.
- text/csv/json: split by file_split_size and align each split end to the next row_delimiter (HDFS seek-based implementation for large files); see the first sketch below.
- parquet: split by RowGroup (logical split) and read footer metadata through a HadoopConf-backed Configuration (works with Kerberos/HA/NameService); see the second sketch below.
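Since the PR code itself isn't quoted here, a hedged sketch of what seek-based delimiter alignment typically looks like against the stock Hadoop FileSystem API; alignToNextDelimiter is an illustrative name, not the PR's method:

```java
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;

public final class SplitAlignment {

    /**
     * Returns the first offset at or after tentativeEnd that lies just past a row
     * delimiter, so the split always ends on a complete row. Scans at most to EOF,
     * so the final split simply ends at the file length.
     */
    static long alignToNextDelimiter(
            FileSystem fs, Path file, long tentativeEnd, byte[] delimiter, long fileLength)
            throws IOException {
        if (tentativeEnd >= fileLength) {
            return fileLength;
        }
        try (FSDataInputStream in = fs.open(file)) {
            in.seek(tentativeEnd); // jump straight to the candidate boundary; no prefix read
            long pos = tentativeEnd;
            int matched = 0;
            int b;
            // Byte-by-byte scan for clarity; a production version would read a buffer.
            while ((b = in.read()) != -1) {
                pos++;
                // Simple matcher; assumes the delimiter has no self-repeating prefix
                // (holds for "\n" and "\r\n").
                if (b == (delimiter[matched] & 0xFF)) {
                    matched++;
                } else {
                    matched = (b == (delimiter[0] & 0xFF)) ? 1 : 0;
                }
                if (matched == delimiter.length) {
                    return pos; // split ends immediately after the delimiter
                }
            }
            return fileLength; // no delimiter after tentativeEnd: extend to EOF
        }
    }
}
```

The matching reader convention: every split except the first skips forward to the first delimiter after its start offset before emitting rows, so each row is read exactly once across splits.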
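And for the parquet side, a minimal sketch of footer-driven RowGroup enumeration via the standard parquet-hadoop API, using a HadoopConf-backed Configuration so Kerberos/HA/NameService settings apply; RowGroupSplit is a hypothetical holder type:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public final class ParquetRowGroupSplitter {

    /** Hypothetical holder: one logical split per RowGroup. */
    public static final class RowGroupSplit {
        public final String path;
        public final int rowGroupIndex;
        public final long rowCount;

        RowGroupSplit(String path, int rowGroupIndex, long rowCount) {
            this.path = path;
            this.rowGroupIndex = rowGroupIndex;
            this.rowCount = rowCount;
        }
    }

    public static List<RowGroupSplit> split(Configuration hadoopConf, String filePath)
            throws IOException {
        List<RowGroupSplit> splits = new ArrayList<>();
        // Only the footer is read during enumeration; no row data is materialized.
        try (ParquetFileReader reader =
                ParquetFileReader.open(HadoopInputFile.fromPath(new Path(filePath), hadoopConf))) {
            List<BlockMetaData> rowGroups = reader.getFooter().getBlocks();
            for (int i = 0; i < rowGroups.size(); i++) {
                splits.add(new RowGroupSplit(filePath, i, rowGroups.get(i).getRowCount()));
            }
        }
        return splits;
    }
}
```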
How was this patch tested?
Unit tests:
- HdfsFileAccordingToSplitSizeSplitStrategyTest#testReadBySplitsShouldMatchFullRead
E2E:
- HdfsFileIT#testHdfsTextReadWithFileSplit
Check list
- New License Guide
- Update incompatible-changes.md to describe the incompatibility caused by this PR.