Description
- Part of [EPIC] Avoid sort for already sorted Parquet files that do not overlap values on condition #6672
Describe the bug
As @xudong963 mentions in
And also brought up again in
When table_schema differs from file_schema, the current statistics merging code merges statistics incorrectly.
Specifically, it merges column statistics based on their ordinal position (order in the file) rather than by logical column.
Currently this isn't a huge problem, as the statistics are only used in a limited way for some optimizations, but as we start to rely on statistics for correctness, such as in #6672, it becomes more important.
To Reproduce
If we have two files:
- File 1: (a int32, b int32)
- File 2: (b int32, a int32)
I think the code on main will combine the statistics for column a in File 1 with those for column b in File 2.
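To make the suspected behavior concrete, here is a minimal standalone sketch of a positional merge. It does not use DataFusion's real types: `ColStats`, `merge`, and `merge_by_position` are hypothetical stand-ins, and the min/max values are made up for illustration.

```rust
/// Hypothetical, simplified stand-in for per-column statistics.
#[derive(Debug, Clone, PartialEq)]
struct ColStats {
    min: Option<i32>,
    max: Option<i32>,
}

/// Merge two sets of column statistics: take the widest min/max range.
/// If either side is unknown (None), the merged bound is unknown too.
fn merge(a: &ColStats, b: &ColStats) -> ColStats {
    ColStats {
        min: a.min.zip(b.min).map(|(x, y)| x.min(y)),
        max: a.max.zip(b.max).map(|(x, y)| x.max(y)),
    }
}

/// Positional merge: pairs the i-th column of each file, ignoring names.
fn merge_by_position(f1: &[ColStats], f2: &[ColStats]) -> Vec<ColStats> {
    f1.iter().zip(f2).map(|(a, b)| merge(a, b)).collect()
}

fn main() {
    // File 1 is (a, b): a in [1, 10], b in [100, 200]
    let file1 = vec![
        ColStats { min: Some(1), max: Some(10) },
        ColStats { min: Some(100), max: Some(200) },
    ];
    // File 2 is (b, a): b in [300, 400], a in [20, 30]
    let file2 = vec![
        ColStats { min: Some(300), max: Some(400) },
        ColStats { min: Some(20), max: Some(30) },
    ];

    // Position 0 mixes File 1's `a` with File 2's `b`, and position 1
    // mixes File 1's `b` with File 2's `a`.
    let merged = merge_by_position(&file1, &file2);
    println!("{:?}", merged);
}
```

In this sketch the first merged column comes out as [1, 400], which describes neither a (actually [1, 30]) nor b (actually [100, 400]).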
Expected behavior
I expect that only statistics from the same logical column are merged together.
Additional context
After #15661 is merged, I suggest:
- Adding some function that knows how to map columns from a file schema --> table schema (filling in any missing columns with ColumnStatistics::new_unknown) before combining them
- Adding tests
Maybe we can simply reuse the existing SchemaMapper / factory 🤔 so we are sure the statistics merging is consistent with runtime
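A rough sketch of the suggested mapping step, written as standalone Rust rather than against DataFusion's actual API. `ColStats`, `map_to_table_schema`, and `merge` are hypothetical names; `ColStats::new_unknown` only mimics the role of ColumnStatistics::new_unknown, and the real fix would presumably go through SchemaMapper instead of a name lookup like this one.

```rust
/// Hypothetical, simplified stand-in for per-column statistics.
#[derive(Debug, Clone, PartialEq)]
struct ColStats {
    min: Option<i32>,
    max: Option<i32>,
}

impl ColStats {
    /// Statistics for a column we know nothing about.
    fn new_unknown() -> Self {
        ColStats { min: None, max: None }
    }
}

/// Reorder a file's statistics into table-schema order, matching by name.
/// Table columns missing from the file get unknown statistics.
fn map_to_table_schema(
    table_cols: &[&str],
    file_cols: &[&str],
    file_stats: &[ColStats],
) -> Vec<ColStats> {
    table_cols
        .iter()
        .map(|name| {
            file_cols
                .iter()
                .position(|c| c == name)
                .map(|i| file_stats[i].clone())
                .unwrap_or_else(ColStats::new_unknown)
        })
        .collect()
}

/// Merge statistics that are already in table-schema order.
/// A bound is unknown if it is unknown on either side.
fn merge(a: &[ColStats], b: &[ColStats]) -> Vec<ColStats> {
    a.iter()
        .zip(b)
        .map(|(x, y)| ColStats {
            min: x.min.zip(y.min).map(|(p, q)| p.min(q)),
            max: x.max.zip(y.max).map(|(p, q)| p.max(q)),
        })
        .collect()
}

fn main() {
    let table = ["a", "b"];
    // File 1 is (a, b); File 2 is (b, a). Values are illustrative.
    let s1 = vec![
        ColStats { min: Some(1), max: Some(10) },
        ColStats { min: Some(100), max: Some(200) },
    ];
    let s2 = vec![
        ColStats { min: Some(300), max: Some(400) },
        ColStats { min: Some(20), max: Some(30) },
    ];

    // Map each file into table order first, then merge.
    let m1 = map_to_table_schema(&table, &["a", "b"], &s1);
    let m2 = map_to_table_schema(&table, &["b", "a"], &s2);
    println!("{:?}", merge(&m1, &m2));
}
```

Treating a missing column as unknown, and letting unknown win during the merge, is the conservative choice in this sketch: the merged statistics never claim a tighter range than the files actually support.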