Spark: when doing rewrite_data_files, check for partitioning schema compatibility #12651

Open · wants to merge 2 commits into main

Conversation


@adrians commented Mar 26, 2025

Context - the implementation in the current release allows for this kind of scenario (a rough sketch of the sequence follows the list):

  • We start with a table of 20TB of data-files, divided into 200 coarse partitions (200 partitions x 200 parquet files x 512MB).
  • We want to do a partitioning schema evolution, splitting every partition into 10 smaller partitions.
  • Doing a first rewrite_data_files, we obtain 200 file-groups x 100GB each, with parquet files assigned to them essentially at random. This is caused by a bit of logic that says "if these files do not match the latest partitioning schema, assume they are unpartitioned".
  • After the first rewrite_data_files, we're left with 2000 fine partitions, but since every file-group can (at least in theory) write to every partition, the expected result is something like 2000 partitions x 200 files x 50MB.
  • At this point, we need to compact those data-files, so we run a second rewrite_data_files.
  • After the second rewrite_data_files, we're finally left with 2000 partitions x 20 files x 512MB.
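
The sketch below shows the kind of sequence described above using the Iceberg Java API. The table, the `id` column, and the bucket counts (200 evolved to 2000, i.e. every old partition split into 10) are illustrative assumptions, not taken from this PR:

```java
// Illustration only: hypothetical table/column names and bucket counts.
import org.apache.iceberg.Table;
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

public class RepartitionScenario {

  public static void run(SparkSession spark, Table table) {
    // Partitioning schema evolution: replace the coarse bucket with a finer one,
    // splitting every old partition into 10 new ones.
    table.updateSpec()
        .removeField(Expressions.bucket("id", 200))
        .addField(Expressions.bucket("id", 2000))
        .commit();

    // First rewrite_data_files: with the current release, files written under the
    // old spec are planned as if they were unpartitioned, producing something like
    // 2000 partitions x 200 small files.
    SparkActions.get(spark)
        .rewriteDataFiles(table)
        .option("rewrite-all", "true")
        .execute();

    // Second rewrite_data_files: compacts the small files left by the first pass.
    SparkActions.get(spark)
        .rewriteDataFiles(table)
        .execute();
  }
}
```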

This pull request proposes an algorithm that simplifies the scenario:

When building the file-groups for the first rewrite_data_files, check whether the old partitioning schema is a coarser variant of the current schema. If that's the case, build the file-groups along the old partitioning (sketched in code below). The scenario now becomes:

  • Doing a first rewrite_data_files, we obtain 200 file-groups x 100GB each (based on the old partitioning schema).
  • After the first rewrite_data_files, we're left with 2000 fine partitions, but since every fine-partition can be obtained from a single parent old-partition, the expected result is something like 2000 partitions x 20 files x 512MB.
  • The second pass is not necessary. (In practice, if the coarse-partitions are slightly larger than 100GB, they might be split into 2 file-groups, so there might be some small parquet-files to compact, but this task is orders of magnitude faster now).

This is a significant improvement in terms of time taken to apply the new partitioning schema.
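
This is not the PR's actual implementation, only a hedged illustration of the grouping idea: when the old spec is a coarser variant of the current one, the files to rewrite can be grouped by the partition tuple they were written with, instead of being planned as one unpartitioned pool. `groupByOldPartition` is a name chosen here for illustration:

```java
// Illustration only; groupByOldPartition is a hypothetical helper, not code from this PR.
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.PartitionSpec;

public class OldSpecGrouping {

  /** Groups scan tasks by the partition path of their existing (coarse) spec. */
  static Map<String, List<FileScanTask>> groupByOldPartition(
      PartitionSpec oldSpec, List<FileScanTask> tasks) {
    return tasks.stream()
        .collect(Collectors.groupingBy(
            // Each data file keeps the partition tuple it was written with; the
            // old spec can render that tuple as a partition path for grouping.
            task -> oldSpec.partitionToPath(task.file().partition())));
  }
}
```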

The criteria to determine whether the new partitioning is "finer than or the same as" the old partitioning look something like this:

  • the new (finer) partitioning spec has at least as many fields as the old (coarse) one;
    AND
  • the first N source-columns of the new (finer) partitioning spec must be the same as the N source-columns of the old partitioning spec (N = number of fields in the old partition spec);
    AND
  • the first N fields of the new (finer) partitioning spec must have transformations that are the same as, or finer than, the corresponding N fields in the old spec (N = number of fields in the old partition spec) - see table below.
| if old.field[i].transformation is | then new.field[i].transformation (the same or more specific) can be |
|---|---|
| identity | identity |
| year | year, month, day, hour, identity |
| month | month, day, hour, identity |
| day | day, hour, identity |
| hour | hour, identity |
| truncate(x) | truncate(y) with y ≥ x; identity |
| bucket(x) | bucket(y) with y ≥ x and y % x = 0; identity |

For the third bullet-point in the list of criteria, I have found that the existing boolean Transform.satisfiesOrderOf(Transform a) method implements that predicate pretty well - except maybe for the bucket case, for which it will fall back to the "unpartitioned" scenario.
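
A minimal sketch of the three checks above is shown below. The helper name `isFinerOrSame` is an assumption made here for illustration (not necessarily the method added by this PR); the per-field check delegates to the existing Transform.satisfiesOrderOf predicate mentioned above:

```java
// Sketch only; isFinerOrSame is a hypothetical name, not code from this PR.
import java.util.List;
import org.apache.iceberg.PartitionField;
import org.apache.iceberg.PartitionSpec;

public class SpecCompatibility {

  /** Returns true if newSpec partitions the data at least as finely as oldSpec. */
  static boolean isFinerOrSame(PartitionSpec newSpec, PartitionSpec oldSpec) {
    List<PartitionField> newFields = newSpec.fields();
    List<PartitionField> oldFields = oldSpec.fields();

    // 1. The new spec must have at least as many fields as the old one.
    if (newFields.size() < oldFields.size()) {
      return false;
    }

    for (int i = 0; i < oldFields.size(); i++) {
      PartitionField oldField = oldFields.get(i);
      PartitionField newField = newFields.get(i);

      // 2. The first N fields must reference the same source columns.
      if (oldField.sourceId() != newField.sourceId()) {
        return false;
      }

      // 3. Each new transform must be the same as, or finer than, the old one
      //    (see the table above); satisfiesOrderOf covers most of these cases.
      if (!newField.transform().satisfiesOrderOf(oldField.transform())) {
        return false;
      }
    }

    return true;
  }
}
```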
