Description
Environment
Delta-rs version: 0.17.3
Binding: python
Bug
What happened:
In the docs, compact is described as idempotent:

"This operation is idempotent; if run twice on the same table (assuming it has not been updated) it will do nothing the second time."
In one of my delta tables, I noticed this is not true. Looking at the optimize algorithm, it's pretty simple: it groups files into bins based on their current size, up to target_size. However, the file actually written out for a bin is often smaller than bin.total_file_size(), because the data is re-compressed when the new parquet file is written.
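As an illustration, here is a simplified, hedged sketch of that binning step (the function name and the exact packing order are my own; this is not the delta-rs implementation): files are grouped by their current on-disk size, and each bin is then rewritten as a single parquet file that typically compresses to less than the bin total.

```python
def plan_bins(file_sizes: list[int], target_size: int = 104857600) -> list[list[int]]:
    """Greedy sketch: pack existing file sizes into bins of at most target_size."""
    bins: list[list[int]] = []
    current: list[int] = []
    current_total = 0
    for size in sorted(file_sizes, reverse=True):
        # Close the current bin once adding this file would exceed the target.
        if current and current_total + size > target_size:
            bins.append(current)
            current, current_total = [], 0
        current.append(size)
        current_total += size
    if current:
        bins.append(current)
    return bins

# Each bin is rewritten as one new parquet file. Because the rewrite
# re-encodes and re-compresses the data, the new file is usually smaller
# than sum(bin), so a later compact run can pack those outputs together again.
```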
I ran into a scenario where the sizes of the bins were the following (where the target size is the default 104857600):
[104856888, 104687238, 104754998, 104857489, 104679957, 4207383]
But the actual sizes were:
[61364358, 60037383, 58517127, 56870681, 53391180, 3111870]
which means the table can actually be compacted again; the second run is not a no-op.
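For reference, a hedged sketch of how the on-disk sizes can be checked with the deltalake Python API (`get_add_actions` and the `size_bytes` column are as I understand them in 0.17.x; the path is hypothetical):

```python
from deltalake import DeltaTable

dt = DeltaTable("path/to/table")  # hypothetical table path

# Sizes of the files currently referenced by the table, as recorded in the log.
actions = dt.get_add_actions(flatten=True)
sizes = actions.column("size_bytes").to_pylist()
print(sorted(sizes, reverse=True))
# After one compact pass these were all well below the 104857600 target,
# so a second pass can bin them together again.
```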
What you expected to happen:
I don't think this is a trivial problem to solve (it would presumably need to write temporary parquet files to learn the actual post-compression sizes, and then run optimize in a loop based on those sizes), so I would be satisfied with simply removing the idempotency claim from the docs.
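In the meantime, a hedged user-side workaround sketch (assuming the metrics dict returned by `optimize.compact()` exposes "numFilesAdded", as recent releases appear to do) is to re-run compact until a pass rewrites nothing:

```python
from deltalake import DeltaTable

dt = DeltaTable("path/to/table")  # hypothetical table path

# Keep compacting until a pass adds no new files; cap the passes defensively.
for _ in range(10):
    metrics = dt.optimize.compact()
    if metrics["numFilesAdded"] == 0:
        break
    dt.update_incremental()  # re-sync table state; may be redundant on newer versions
```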