Performing multiple optimize operations after creating a checkpoint can lead to data errors and exponential data bloat. #3047

@Curricane

Description

Environment

Delta-rs version: 0.22.2

Binding: python3

Environment:

  • Cloud provider:
  • OS: centos8
  • Other:

Bug

What happened:

  1. Inserted 4000 rows into a Delta table.
  2. Created a checkpoint.
  3. Ran optimize.compact more than once.
  4. The table now contains 12000 rows.

In [109]: dt.create_checkpoint()

In [110]: dt.optimize.compact(target_size=1024*256)
Out[110]:
{'numFilesAdded': 2,
 'numFilesRemoved': 2,
 'filesAdded': '{"avg":112461.0,"max":168540,"min":56382,"totalFiles":2,"totalSize":224922}',
 'filesRemoved': '{"avg":127678.5,"max":168540,"min":86817,"totalFiles":2,"totalSize":255357}',
 'partitionsOptimized': 1,
 'numBatches': 4,
 'totalConsideredFiles': 2,
 'totalFilesSkipped': 0,
 'preserveInsertionOrder': True}

In [111]: dt.version()
Out[111]: 5

In [112]: dt.to_pandas()
Out[112]:
        id                                              value
0     1000  value-1000-4ryilo616rsw4pz8on92tbyi2o04hgkrug0...
1     1001  value-1001-3eh8aav3x21jwkme3h9e56dyc4lrdhlzur1...
2     1002  value-1002-bpxn3ndnll87fq6f17tv1ij0pqhra7wj0jx...
3     1003  value-1003-g0bssmsjxrt21a3p95a7g8q2mic043ym511...
4     1004  value-1004-6yjmva7ezuwtwlw0vymf1ldzq60ih4yzvmc...
...    ...                                                ...
3995   995  value-995-x96gnl6173qdzuev650z9o2dfb0pg3wzthyq...
3996   996  value-996-0883a2ltvaic6wfsu1wk7quj6n04kawgnnfx...
3997   997  value-997-kjp4433vk5x37ly1yrf0ozboqzvn4mfh5u94...
3998   998  value-998-6bvhlcqlbkn1jsr8rh3xes3ggm4glwd3pk7i...
3999   999  value-999-r4qcrvung6u0kvq6slgcw9jp4plutst3109h...

[4000 rows x 2 columns]

In [113]: dt.optimize.compact(target_size=1024*256)
Out[113]:
{'numFilesAdded': 2,
 'numFilesRemoved': 2,
 'filesAdded': '{"avg":112461.0,"max":168540,"min":56382,"totalFiles":2,"totalSize":224922}',
 'filesRemoved': '{"avg":112461.0,"max":168540,"min":56382,"totalFiles":2,"totalSize":224922}',
 'partitionsOptimized': 1,
 'numBatches': 4,
 'totalConsideredFiles': 2,
 'totalFilesSkipped': 0,
 'preserveInsertionOrder': True}

In [114]: dt=DeltaTable(path)

In [115]: dt.optimize.compact(target_size=1024*256)
Out[115]:
{'numFilesAdded': 4,
 'numFilesRemoved': 4,
 'filesAdded': '{"avg":112461.0,"max":168540,"min":56382,"totalFiles":4,"totalSize":449844}',
 'filesRemoved': '{"avg":120069.75,"max":168540,"min":56382,"totalFiles":4,"totalSize":480279}',
 'partitionsOptimized': 1,
 'numBatches': 8,
 'totalConsideredFiles': 4,
 'totalFilesSkipped': 0,
 'preserveInsertionOrder': True}

In [116]: dt=DeltaTable(path)

In [117]: dt.to_pandas()
Out[117]:
         id                                              value
0      1000  value-1000-4ryilo616rsw4pz8on92tbyi2o04hgkrug0...
1      1001  value-1001-3eh8aav3x21jwkme3h9e56dyc4lrdhlzur1...
2      1002  value-1002-bpxn3ndnll87fq6f17tv1ij0pqhra7wj0jx...
3      1003  value-1003-g0bssmsjxrt21a3p95a7g8q2mic043ym511...
4      1004  value-1004-6yjmva7ezuwtwlw0vymf1ldzq60ih4yzvmc...
...     ...                                                ...
11995   995  value-995-x96gnl6173qdzuev650z9o2dfb0pg3wzthyq...
11996   996  value-996-0883a2ltvaic6wfsu1wk7quj6n04kawgnnfx...
11997   997  value-997-kjp4433vk5x37ly1yrf0ozboqzvn4mfh5u94...
11998   998  value-998-6bvhlcqlbkn1jsr8rh3xes3ggm4glwd3pk7i...
11999   999  value-999-r4qcrvung6u0kvq6slgcw9jp4plutst3109h...

[12000 rows x 2 columns]

What you expected to happen:
After multiple optimize operations, the table should still contain 4000 rows.
How to reproduce it:
Create a checkpoint, then run optimize.compact more than once, as in the session above. The extra rows do not appear if no checkpoint is created before optimizing.
More details:
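The observed 4000 → 12000 growth is consistent with a reader that keeps files listed in the checkpoint visible even after later remove actions mark them as deleted. The sketch below is a toy model of Delta-style log replay, not the actual delta-rs implementation; the file names and row counts are invented for illustration. It shows that honouring removes yields 4000 rows, while ignoring removes written after the checkpoint yields exactly the 12000 rows seen here.

```python
# Toy model of Delta log replay. Hypothesis only, not delta-rs internals:
# if removes recorded after a checkpoint are not applied, files rewritten
# by compaction stay visible alongside their replacements.

def replay(checkpoint_files, actions, apply_removes=True):
    """Build the active file set from a checkpoint snapshot plus log actions."""
    files = set(checkpoint_files)
    for kind, name in actions:
        if kind == "add":
            files.add(name)
        elif kind == "remove" and apply_removes:
            files.discard(name)
    return files

# Checkpoint taken while the table holds 4000 rows in two data files
# (hypothetical names and row counts).
checkpoint = {"part-0", "part-1"}
rows = {"part-0": 2000, "part-1": 2000,
        "compact-a": 2000, "compact-b": 2000,   # first compaction output
        "compact-c": 2000, "compact-d": 2000}   # second compaction output

actions = [
    # first optimize.compact: rewrite the two original files
    ("remove", "part-0"), ("remove", "part-1"),
    ("add", "compact-a"), ("add", "compact-b"),
    # second optimize.compact: rewrite again
    ("remove", "compact-a"), ("remove", "compact-b"),
    ("add", "compact-c"), ("add", "compact-d"),
]

good = replay(checkpoint, actions, apply_removes=True)
bad = replay(checkpoint, actions, apply_removes=False)

print(sum(rows[f] for f in good))  # 4000  — removes honoured
print(sum(rows[f] for f in bad))   # 12000 — matches the observed bloat
```

In this model, each compaction round that fails to hide its removed files adds another full copy of the data, which matches the jump from 4000 to 12000 rows after two compactions.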

Metadata

Assignees: no one assigned

Labels: bug (Something isn't working), on-hold (Issues and Pull Requests that are on hold for some reason)