Delta Lake: Build the stats based on the table data columns #27953
+16
−8
Description
If the Parquet data file statistics contain more columns (e.g. the partition columns) than the data columns of the Delta Lake table, an NPE is thrown when encoding the min/max stats.
Flip the existing logic to build the stats based on the columns of the table instead of the stats of the parquet data file.
NOTE that this means that, when writing with Trino, the min/max stats will be missing the partition columns, even though the data files contain such stats.
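A minimal sketch of the flipped direction follows. It is not the code changed by this PR: the class `TableColumnStats`, both methods, and the string-based `encode` helper are hypothetical stand-ins, and the type names are plain strings rather than Trino types. It only illustrates why iterating over the file statistics can hit a null type for a column the table does not declare, while iterating over the table's data columns cannot.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; names and types are assumptions, not Trino's API.
final class TableColumnStats {
    // Stand-in for a type-aware min/max encoder.
    // Throws NullPointerException when `type` is null.
    static String encode(String type, Object value) {
        return type.equals("varchar") ? "\"" + value + "\"" : String.valueOf(value);
    }

    // Old direction: driven by the columns present in the file statistics.
    // A column unknown to the table (e.g. a materialized partition column)
    // resolves to a null type and fails inside encode().
    static Map<String, String> fromFileStats(Map<String, String> tableColumnTypes, Map<String, Object> fileMinValues) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : fileMinValues.entrySet()) {
            String type = tableColumnTypes.get(entry.getKey());
            result.put(entry.getKey(), encode(type, entry.getValue()));
        }
        return result;
    }

    // New direction: driven by the table's data columns.
    // Extra columns in the file statistics are simply never looked up.
    static Map<String, String> fromTableColumns(Map<String, String> tableColumnTypes, Map<String, Object> fileMinValues) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> column : tableColumnTypes.entrySet()) {
            Object value = fileMinValues.get(column.getKey());
            if (value != null) {
                result.put(column.getKey(), encode(column.getValue(), value));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> tableColumnTypes = new LinkedHashMap<>();
        tableColumnTypes.put("id", "bigint");
        tableColumnTypes.put("name", "varchar");

        Map<String, Object> fileMinValues = new LinkedHashMap<>();
        fileMinValues.put("id", 1L);
        fileMinValues.put("name", "a");
        fileMinValues.put("part", "2024"); // partition column present only in the file stats

        System.out.println(fromTableColumns(tableColumnTypes, fileMinValues));
        try {
            fromFileStats(tableColumnTypes, fileMinValues);
        } catch (NullPointerException e) {
            System.out.println("file-driven encoding failed with an NPE");
        }
    }
}
```

The trade-off matches the note above: the table-driven variant drops the `part` entry, so the resulting min/max stats omit the partition column even though the file carries a value for it.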
Relevant stacktrace
Additional context and related issues
The issue can be easily reproduced with the TestDeltaLakeWriteDatabricksCompatibility.testCaseUpdateInPartition test on a Databricks 14.x runtime.
It seems that the partition columns are being materialized in the data files even without the materializePartitionColumns feature being present in the table.
https://github.com/delta-io/delta/blob/master/PROTOCOL.md#materialize-partition-columns
Release notes
(x) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text: