Bug Report
The issue comes up when importing large volumes of compressed Parquet files. The on-disk size can be well below 500GB (around 90GB in our case), but the data expands past 500GB once decompressed. Because the split decision is based on the compressed file size rather than the actual uncompressed data, the import ends up with far fewer subtasks than it should, which hurts parallelism.
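To illustrate the effect, here is a minimal sketch that assumes a purely hypothetical rule of one subtask per 500GB of input; the importer's real split logic may differ. The `subtask_count` helper, the threshold constant, and the 540GB uncompressed figure are assumptions made for illustration, not values taken from the importer.

```python
GB = 10**9

# Hypothetical threshold based on the 500GB figure in this report;
# the real split rule may be different.
SPLIT_THRESHOLD_BYTES = 500 * GB


def subtask_count(size_bytes, threshold_bytes=SPLIT_THRESHOLD_BYTES):
    """Return how many subtasks a size-based split would yield (at least 1)."""
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return max(1, -(-size_bytes // threshold_bytes))


compressed_size = 90 * GB     # on-disk size from this report
uncompressed_size = 540 * GB  # assumed decompressed size, illustration only

print("subtasks from compressed size:  ", subtask_count(compressed_size))
print("subtasks from uncompressed size:", subtask_count(uncompressed_size))
```

Under these assumptions the compressed size stays below the threshold and yields a single subtask, while the uncompressed size crosses it and yields more, which is the gap in parallelism described above.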