Spark: when doing rewrite_data_files, check for partitioning schema compatibility #12651
Context - the implementation as seen in the current release allows for this kind of scenario:

- A table partitioned with a coarse schema (roughly 200 partitions of ~100GB each) is given a new, much finer partitioning schema (roughly 2000 partitions).
- On the first `rewrite_data_files`, we obtain 200 file-groups (assigned randomly from the parquet-files) x 100GB each. This is caused by a bit of logic that says "if those files do not match the latest partitioning schema, assume they're unpartitioned".
- After this first `rewrite_data_files`, we're left with 2000 fine partitions, but since every file-group can (at least in theory) write to every partition, the expected result is something like 2000 partitions x 200 files x 50MB.
- Those files are too small, so we have to run a second `rewrite_data_files`.
- After the second `rewrite_data_files`, we're finally left with 2000 partitions x 20 files x 512MB.

This pull request proposes an algorithm that simplifies the scenario:
When building the file-groups for the first `rewrite_data_files`, check if the old partitioning schema is a coarser variant of the current schema. If that's the case, try to build the file-groups using that partitioning system. The scenario now becomes:

- On the first `rewrite_data_files`, we obtain 200 file-groups x 100GB each (based on the old partitioning schema).
- After this first `rewrite_data_files`, we're left with 2000 fine partitions, but since every fine partition can be obtained from a single parent old-partition, the expected result is something like 2000 partitions x 20 files x 512MB.

This is a significant improvement in terms of time taken to apply the new partitioning schema.
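As a rough illustration of the idea (not the code in this PR), the planning step could bucket the candidate files by the partition tuple they carry under the old spec instead of treating them as unpartitioned. `groupByOldPartition` below is a hypothetical helper, and it assumes all the given tasks were written under the old spec:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.StructLike;
import org.apache.iceberg.util.StructLikeMap;

// Hypothetical sketch, not the implementation in this PR: group the files selected for
// rewriting by the partition tuple they were written with under the old spec, so that each
// file-group only produces data for the fine partitions derived from one coarse parent.
class OldPartitionGrouping {
  static Map<StructLike, List<FileScanTask>> groupByOldPartition(
      PartitionSpec oldSpec, List<FileScanTask> tasks) {
    // StructLikeMap compares partition tuples structurally, using the old spec's partition type
    Map<StructLike, List<FileScanTask>> groups = StructLikeMap.create(oldSpec.partitionType());
    for (FileScanTask task : tasks) {
      groups.computeIfAbsent(task.file().partition(), p -> new ArrayList<>()).add(task);
    }
    return groups;
  }
}
```

A real implementation would also have to deal with files written under other historical specs, which this sketch glosses over.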
The criteria to determine if the new partitioning is "finer or the same" than the old partitioning look something like this:

- every field of the old partitioning schema has a corresponding field in the new one, AND
- `old.field[i]` and `new.field[i]` are built on the same source column, AND
- `new.field[i].transformation` is the same as, or more specific than, `old.field[i].transformation`:

| `old.field[i].transformation` | `new.field[i].transformation` considered the same or more specific |
| --- | --- |
| `month` | `day`, `hour`, `identity` |
| `day` | `hour`, `identity` |
| `hour` | `identity` |
| `identity` | `identity` |
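For instance, here is an old/new spec pair that these criteria would accept; the schema and field names are made up for illustration:

```java
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

// Illustrative only: an old coarse spec and a new finer spec over the same (made-up) schema.
public class SpecExample {
  public static void main(String[] args) {
    Schema schema = new Schema(
        Types.NestedField.required(1, "ts", Types.TimestampType.withoutZone()),
        Types.NestedField.required(2, "id", Types.LongType.get()));

    // old: month(ts) -- the coarse layout the existing files were written with
    PartitionSpec oldSpec = PartitionSpec.builderFor(schema).month("ts").build();

    // new: day(ts), identity(id) -- same source column for field 0, a finer transform,
    // plus an extra trailing field, so the criteria above should accept it
    PartitionSpec newSpec = PartitionSpec.builderFor(schema).day("ts").identity("id").build();

    System.out.println(oldSpec + " -> " + newSpec);
  }
}
```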
For the third bullet-point in the list of criteria, I have found that the `boolean Transform.satisfiesOrderOf(Transform a)` method implements that predicate pretty well - except maybe for the `bucket` case, for which it'll fall back to the "unpartitioned" scenario.
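Putting the three criteria together, a minimal sketch of the compatibility check might look like the following. `isFinerOrSame` is a hypothetical helper, not the method added by this PR; it leans on `Transform.satisfiesOrderOf` as described above, so `bucket` fields simply fail the check and keep today's "unpartitioned" fallback:

```java
import java.util.List;
import org.apache.iceberg.PartitionField;
import org.apache.iceberg.PartitionSpec;

// Minimal sketch of the "new partitioning is finer or the same" predicate described above.
// This is an assumption-level illustration, not the code introduced by the PR.
public class PartitionSpecCompatibility {
  static boolean isFinerOrSame(PartitionSpec oldSpec, PartitionSpec newSpec) {
    List<PartitionField> oldFields = oldSpec.fields();
    List<PartitionField> newFields = newSpec.fields();

    // criterion 1: the old spec must not have more fields than the new one
    if (oldFields.size() > newFields.size()) {
      return false;
    }

    for (int i = 0; i < oldFields.size(); i++) {
      PartitionField oldField = oldFields.get(i);
      PartitionField newField = newFields.get(i);

      // criterion 2: corresponding fields must be built on the same source column
      if (oldField.sourceId() != newField.sourceId()) {
        return false;
      }

      // criterion 3: the new transform must be the same or more specific; satisfiesOrderOf
      // covers the time-based transforms, while bucket() fails here and keeps the existing
      // "unpartitioned" behaviour
      if (!newField.transform().satisfiesOrderOf(oldField.transform())) {
        return false;
      }
    }

    return true;
  }
}
```

With the specs from the earlier example, `isFinerOrSame(oldSpec, newSpec)` would be expected to return true, while a `bucket(...)` field on either side would make it return false.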