What happens?
We have a situation where some of our DuckLake tables in our application do not appear to get compacted when calling ducklake_merge_adjacent_files.
One of the tables that we have been testing with contains about 2838 rows; looking through the ducklake_data_files table and filtering by the table_id of that table, we see that there are 2555 data files linked to the table, most of them containing only 1 row, with the notable exception of the initial file, which contained 157 rows. There are no entries in ducklake_deleted_files associated with the table, and none of the existing files contain any lightweight snapshot entries in the partial_file_info column. We also do not have a custom target_file_size set. Yet the table is not compacted at all, and as a result queries against it slow down significantly.
I have attempted to reproduce the issue but have not been able to so far. I have created tables with approximately the same number of rows as well as distribution amongst the files, but while I have sometimes seen a table compact only "partially" (in the sense that a handful of single-row files survive the compaction), I have not been able to reproduce the issue where no files at all are getting compacted.
I have attached the rows of the ducklake_data_files table that pertain to the table that we investigated.
ducklake_data_file_tb_13.csv
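For anyone who wants to check the attached rows, here is a minimal sketch of how an export like this could be summarized using only the Python standard library. It assumes the CSV has a header row with the column names table_id, record_count, end_snapshot and partial_file_info, and that NULLs were exported as empty strings or the literal NULL. The helper name summarize_data_files, the sample rows and the table id 13 are hypothetical and only illustrate the shape of the check; they are not the actual attachment.

```python
import csv
import io
from collections import Counter

# Values treated as SQL NULL in the exported CSV (an assumption about the export).
NULL_MARKERS = {"", "NULL", None}


def summarize_data_files(rows, table_id):
    """Summarize data-file metadata rows for one table.

    `rows` is any iterable of dicts, e.g. a csv.DictReader over the export.
    """
    files = 0
    total_rows = 0
    rows_per_file = Counter()
    with_partial_file_info = 0
    with_end_snapshot = 0

    for row in rows:
        if row["table_id"] != str(table_id):
            continue  # skip rows belonging to other tables
        files += 1
        record_count = int(row["record_count"])
        total_rows += record_count
        rows_per_file[record_count] += 1
        if row.get("partial_file_info") not in NULL_MARKERS:
            with_partial_file_info += 1
        if row.get("end_snapshot") not in NULL_MARKERS:
            with_end_snapshot += 1

    return {
        "files": files,
        "total_rows": total_rows,
        "single_row_files": rows_per_file[1],
        "with_partial_file_info": with_partial_file_info,
        "with_end_snapshot": with_end_snapshot,
    }


# Synthetic sample, NOT the attached file: one larger initial file plus
# single-row files for a hypothetical table 13, and one row for another table.
SAMPLE_CSV = """data_file_id,table_id,end_snapshot,record_count,partial_file_info
1,13,,157,
2,13,,1,
3,13,,1,
4,14,,10,
"""

summary = summarize_data_files(csv.DictReader(io.StringIO(SAMPLE_CSV)), 13)
print(summary)
```

To run it against a real export, replace the io.StringIO(SAMPLE_CSV) argument with an opened file handle. A high single_row_files count together with zero with_partial_file_info and zero with_end_snapshot would match the situation described above.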
To Reproduce
Unfortunately I do not have a fully reproducible example at this point, but I am still trying to get one and will update here if I succeed.
OS:
macOS 15.5
DuckDB Version:
1.4.1
DuckLake Version:
f134ad8
DuckDB Client:
Python
Hardware:
No response
Full Name:
Oliver Hsu
Affiliation:
Ascend.io
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a stable release
Did you include all relevant data sets for reproducing the issue?
No - Other reason (please specify in the issue body)
Did you include all code required to reproduce the issue?
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?