Search before asking
Description
Currently, connector-file-hadoop’s HdfsFile source still uses the default split behavior: one file -> one split. When there are only a few files but a single file is huge (tens of GB), read parallelism cannot scale, so the job effectively reads with a single degree of concurrency.
connector-file-local already added large-file splitting support in PR #10142 (the split strategy is selected by config: row-delimiter split for Text/CSV/JSON, RowGroup split for Parquet). However, HdfsFile is not covered by that change.
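For reference, a minimal sketch of the row-delimiter strategy applied to HDFS, assuming a plain Hadoop `FileSystem`/`FSDataInputStream` reader and a single-byte row_delimiter such as `\n`; class and method names here are illustrative, not the connector's actual API. The enumerator cuts the file into fixed-size byte ranges, and each reader aligns its range to the delimiter so every row is read exactly once:

```java
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class DelimiterAlignedSplitSketch {

    /** Nominal byte range [start, start + length) of one file. */
    static class ByteRangeSplit {
        final long start;
        final long length;
        ByteRangeSplit(long start, long length) { this.start = start; this.length = length; }
    }

    /** Enumerator side: cut the file into fixed-size byte ranges (file_split_size). */
    static List<ByteRangeSplit> computeSplits(long fileLength, long splitSize) {
        List<ByteRangeSplit> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            splits.add(new ByteRangeSplit(start, Math.min(splitSize, fileLength - start)));
        }
        return splits;
    }

    /**
     * Reader side: read exactly the rows owned by one split.
     *  - A split that does not start at byte 0 skips everything up to and including the first
     *    delimiter, because that (possibly partial) row belongs to the previous split.
     *  - A row whose first byte lies at or before the nominal end is read completely, even if
     *    it crosses the boundary; the next split skips it. So no row is broken, duplicated or lost.
     */
    static void readSplit(FileSystem fs, Path file, ByteRangeSplit split, byte rowDelimiter)
            throws IOException {
        try (FSDataInputStream in = fs.open(file)) {
            long pos = split.start;
            in.seek(pos);
            if (pos > 0) {
                int b;
                while ((b = in.read()) != -1 && b != rowDelimiter) {
                    pos++;
                }
                if (b == rowDelimiter) {
                    pos++;
                }
            }
            long end = split.start + split.length;
            while (pos <= end) {
                ByteArrayOutputStream row = new ByteArrayOutputStream();
                int b;
                while ((b = in.read()) != -1 && b != rowDelimiter) {
                    pos++;
                    row.write(b);
                }
                if (b == rowDelimiter) {
                    pos++;
                }
                if (row.size() > 0 || b == rowDelimiter) {
                    // a real reader would decode row.toByteArray() (e.g. as UTF-8) and emit it here
                }
                if (b == -1) {
                    break; // end of file
                }
            }
        }
    }
}
```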
Usage Scenario
- Ingest a single file or a few extremely large files (CSV / plain log / NDJSON, tens of GB) stored in HDFS.
- Current behavior: only one split is generated per file, so only one reader does the work even if env.parallelism is high.
- Expected behavior: when enable_file_split=true, the large file is split into multiple splits and read in parallel:
  - Text/CSV/JSON: split by file_split_size and align split boundaries to row_delimiter (no broken rows, no duplicated or missing records).
  - Parquet: split by RowGroup (each RowGroup becomes a split, or RowGroups are packed into splits by size); see the RowGroup-packing sketch below.
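For the Parquet case, a minimal sketch of RowGroup packing, assuming the enumerator can read the footer through the parquet-hadoop API; the split class, method names, and size threshold below are illustrative, not the connector's actual types:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class RowGroupSplitSketch {

    /** One split = a consecutive range of RowGroup indexes of a single Parquet file. */
    static class RowGroupSplit {
        final Path file;
        final int firstRowGroup;
        final int rowGroupCount;
        RowGroupSplit(Path file, int firstRowGroup, int rowGroupCount) {
            this.file = file;
            this.firstRowGroup = firstRowGroup;
            this.rowGroupCount = rowGroupCount;
        }
    }

    /**
     * Reads only the footer, then packs consecutive RowGroups into splits until the target
     * size is reached, so small RowGroups do not explode the split count while a huge file
     * still fans out across readers. A reader later opens the file and processes only the
     * RowGroup indexes assigned to its split.
     */
    static List<RowGroupSplit> planSplits(Path file, Configuration conf, long targetSplitSize)
            throws IOException {
        List<RowGroupSplit> splits = new ArrayList<>();
        try (ParquetFileReader reader =
                ParquetFileReader.open(HadoopInputFile.fromPath(file, conf))) {
            List<BlockMetaData> rowGroups = reader.getFooter().getBlocks();
            int first = 0;
            long packedBytes = 0;
            for (int i = 0; i < rowGroups.size(); i++) {
                packedBytes += rowGroups.get(i).getTotalByteSize();
                boolean isLast = i == rowGroups.size() - 1;
                if (packedBytes >= targetSplitSize || isLast) {
                    splits.add(new RowGroupSplit(file, first, i - first + 1));
                    first = i + 1;
                    packedBytes = 0;
                }
            }
        }
        return splits;
    }
}
```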
Related issues
#10129
Are you willing to submit a PR?
Code of Conduct