ListingTable statistics improperly merges statistics when files have different schemas #15689

Open
@alamb

Description

Describe the bug

As @xudong963 mentions in

And also brought up again in

When table_schema differs from file_schema, the current statistics merging code merges statistics incorrectly.

Specifically, it merges column statistics based on their ordinal position (their order in the file) rather than by logical column.

Currently this isn't a huge problem, as the statistics are only used in a limited way for some optimizations. But as we start to rely on statistics for correctness, such as in #6672, it becomes more important.

To Reproduce

If we have two files:

  • File 1: (a int32, b int32)
  • File 2: (b int32, a int32)

I think the code on main will combine the statistics for column a in File 1 with those for column b in File 2.
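To make the failure mode concrete, here is a minimal sketch of position-based merging. It does not use DataFusion's actual code or types: `ColStats`, `FileStats`, `range`, `merge`, and `merge_by_position` are all hypothetical stand-ins, and the min/max values are made up for illustration.

```rust
// Hypothetical, simplified stand-ins for illustration only;
// these are not DataFusion's actual statistics types.
#[derive(Debug, Clone, PartialEq)]
struct ColStats {
    min: Option<i32>,
    max: Option<i32>,
}

/// Per-file statistics: one `ColStats` per column, in file order.
struct FileStats {
    columns: Vec<&'static str>,
    stats: Vec<ColStats>,
}

/// Convenience constructor for a fully known [min, max] range.
fn range(min: i32, max: i32) -> ColStats {
    ColStats { min: Some(min), max: Some(max) }
}

/// Combines two ranges; an unknown bound on either side stays unknown.
fn merge(a: &ColStats, b: &ColStats) -> ColStats {
    ColStats {
        min: a.min.zip(b.min).map(|(x, y)| x.min(y)),
        max: a.max.zip(b.max).map(|(x, y)| x.max(y)),
    }
}

/// Merges slot 0 with slot 0, slot 1 with slot 1, and so on,
/// never consulting `columns`. This is the problematic behavior.
fn merge_by_position(f1: &FileStats, f2: &FileStats) -> Vec<ColStats> {
    f1.stats
        .iter()
        .zip(f2.stats.iter())
        .map(|(a, b)| merge(a, b))
        .collect()
}
```

Suppose File 1 has a in [1, 10] and b in [100, 200], and File 2 has b in [300, 400] and a in [5, 20]. Merging by position gives [1, 400] for the first slot, which matches neither the true range of a ([1, 20]) nor that of b ([100, 400]).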

Expected behavior

I expect that only statistics from the same logical column are merged together.

Additional context

After #15661 is merged, I suggest:

  1. Adding a function that knows how to map columns from a file schema --> table schema (filling in any missing columns with ColumnStatistics::new_unknown) before combining them
  2. Adding tests

Maybe we can simply reuse the existing SchemaMapper / factory 🤔 so we are sure the statistics merging is consistent with runtime
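The suggestion above could look roughly like the following. This is only a sketch under simplifying assumptions, not a proposed implementation: `ColStats`, `range`, `map_to_table_schema`, and `merge_aligned` are hypothetical names, columns are matched by name only, and `ColStats::new_unknown` here is a local stand-in rather than the real `ColumnStatistics` API. A real fix might instead go through the existing SchemaMapper.

```rust
use std::collections::HashMap;

// Hypothetical, simplified stand-in; not DataFusion's ColumnStatistics.
#[derive(Debug, Clone, PartialEq)]
struct ColStats {
    min: Option<i32>,
    max: Option<i32>,
}

impl ColStats {
    /// Stand-in for "nothing is known about this column".
    fn new_unknown() -> Self {
        ColStats { min: None, max: None }
    }
}

/// Convenience constructor for a fully known [min, max] range.
fn range(min: i32, max: i32) -> ColStats {
    ColStats { min: Some(min), max: Some(max) }
}

/// Reorders one file's statistics into table-schema order, matching
/// columns by name and filling columns missing from the file with
/// unknown statistics.
fn map_to_table_schema(
    table_columns: &[&str],
    file_columns: &[&str],
    file_stats: &[ColStats],
) -> Vec<ColStats> {
    let by_name: HashMap<&str, &ColStats> =
        file_columns.iter().copied().zip(file_stats.iter()).collect();
    table_columns
        .iter()
        .map(|name| match by_name.get(*name) {
            Some(stats) => (**stats).clone(),
            None => ColStats::new_unknown(),
        })
        .collect()
}

/// Merges two statistics vectors that are already in table-schema order.
/// An unknown bound on either side stays unknown.
fn merge_aligned(a: &[ColStats], b: &[ColStats]) -> Vec<ColStats> {
    a.iter()
        .zip(b.iter())
        .map(|(x, y)| ColStats {
            min: x.min.zip(y.min).map(|(l, r)| l.min(r)),
            max: x.max.zip(y.max).map(|(l, r)| l.max(r)),
        })
        .collect()
}
```

Because every file's statistics are first brought into table-schema order, the positional merge that follows only ever combines statistics for the same logical column. A column that is absent from any file ends up with unknown statistics, which is the conservative choice.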

Metadata

Labels

bug: Something isn't working
