
[Feature][Connector-V2][HdfsFile] Support true large-file split for parallel read (byte-range/row-delimiter + Parquet RowGroup) #10326

@yzeng1618

Search before asking

  • I have searched the existing issues and found no similar feature request.

Description

Currently, connector-file-hadoop’s HdfsFile source still uses the default split behavior: one file -> one split. When there are only a few files but a single file is huge (tens of GB), read parallelism cannot scale, so the job effectively reads with single concurrency.

connector-file-local already gained large-file splitting support in PR #10142 (the split strategy is selected by config: row-delimiter splitting for Text/CSV/JSON, RowGroup splitting for Parquet). However, HdfsFile is not covered yet.
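For the delimiter-based strategy, the core idea is to give each reader a byte range and align its boundaries to the row delimiter so that no line is broken, duplicated, or lost. Below is a minimal sketch of that convention against the Hadoop FileSystem API; the class and method names are illustrative, the delimiter is assumed to be '\n', and this is not the actual code from #10142.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Illustrative only: reads the rows that belong to one byte-range split [start, end). */
public class AlignedRangeReader {

    /*
     * Ownership convention that avoids broken, duplicated, or missing lines:
     *  - every split except the one starting at offset 0 skips the (possibly partial)
     *    line it lands in, because the previous split owns it;
     *  - every split keeps reading past its end offset until the next delimiter,
     *    so the line that crosses the boundary is read by exactly one split.
     */
    public static void readSplit(Configuration conf, Path file, long start, long end)
            throws IOException {
        FileSystem fs = file.getFileSystem(conf);
        try (FSDataInputStream in = fs.open(file)) {
            in.seek(start);
            if (start > 0) {
                skipToNextDelimiter(in);
            }
            long pos = in.getPos();
            // A line is owned by this split if it starts at or before `end`.
            while (pos <= end) {
                String line = readLine(in);
                if (line == null) {
                    break; // end of file
                }
                process(line);
                pos = in.getPos();
            }
        }
    }

    private static void skipToNextDelimiter(FSDataInputStream in) throws IOException {
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            // skip bytes until the row delimiter (assumed '\n' here)
        }
    }

    private static String readLine(FSDataInputStream in) throws IOException {
        // Naive single-byte decoding, good enough for a sketch.
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
            sb.append((char) b);
        }
        return (b == -1 && sb.length() == 0) ? null : sb.toString();
    }

    private static void process(String line) {
        System.out.println(line); // placeholder for the actual row handling
    }
}
```

This convention is what makes the "no broken lines, no duplicates/missing" guarantee in the expected behavior below possible: a line that crosses a split boundary is always owned by the split in which it starts.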

Usage Scenario

  1. Ingest single / few extremely large files (CSV / plain log / NDJSON, tens of GB) stored in HDFS.
  2. Current behavior: only one split is generated per file, so only one reader does work even if env.parallelism is high.
  3. Expected behavior: when enable_file_split=true, split the large file into multiple splits and read in parallel:
  • Text/CSV/JSON: split by file_split_size and align to row_delimiter (no broken lines, no duplicates/missing).
  • Parquet: split by RowGroup (each RowGroup as a split, or pack RowGroups by size); a sketch of the packing idea follows this list.
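
For Parquet, RowGroup boundaries are available from the file footer, so splits can be built by packing whole RowGroups up to a target size (e.g. the proposed file_split_size). The sketch below uses the parquet-hadoop footer API; the RowGroupSplit type and the packing logic are hypothetical and only illustrate the idea, they are not the implementation from #10142.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

/** Illustrative only: packs Parquet RowGroups into splits of at most targetSplitSize bytes. */
public class ParquetRowGroupSplitter {

    /** Hypothetical split description: which row-group indexes one reader should consume. */
    public static class RowGroupSplit {
        final Path file;
        final List<Integer> rowGroupIndexes;

        RowGroupSplit(Path file, List<Integer> rowGroupIndexes) {
            this.file = file;
            this.rowGroupIndexes = rowGroupIndexes;
        }
    }

    public static List<RowGroupSplit> split(Configuration conf, Path file, long targetSplitSize)
            throws IOException {
        List<RowGroupSplit> splits = new ArrayList<>();
        try (ParquetFileReader reader =
                ParquetFileReader.open(HadoopInputFile.fromPath(file, conf))) {
            // RowGroup sizes come from the footer; no data pages are read here.
            List<BlockMetaData> rowGroups = reader.getFooter().getBlocks();
            List<Integer> current = new ArrayList<>();
            long currentSize = 0;
            for (int i = 0; i < rowGroups.size(); i++) {
                long groupSize = rowGroups.get(i).getCompressedSize();
                // Start a new split once adding this RowGroup would exceed the target size.
                if (!current.isEmpty() && currentSize + groupSize > targetSplitSize) {
                    splits.add(new RowGroupSplit(file, current));
                    current = new ArrayList<>();
                    currentSize = 0;
                }
                current.add(i);
                currentSize += groupSize;
            }
            if (!current.isEmpty()) {
                splits.add(new RowGroupSplit(file, current));
            }
        }
        return splits;
    }
}
```

Each reader then opens the file and consumes only its assigned RowGroups, so every group is read exactly once regardless of parallelism.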

Related issues

#10129

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct
