I'm curious how files are identified as being "adjacent" in a scenario where no partitioning is present on a table.
I have a test application that performs a number of small inserts every X minutes. Initially I had a few hundred smaller files (expected) so I ran the merge_adjacent_files() function and combined them. This created a larger file ~82KB in size.
After performing additional inserts into the same table and again running the merge_adjacent_files() function, I then had the original 82KB file and a new 10KB file.
In this case I was expecting the new records to be combined into the larger file, since both files are much smaller than the configured max file size.
Is this the expected behavior? I'm assuming DuckLake determined that these files were not adjacent, but I'm curious why. There's no partitioning configured on the table.
Update:
After further testing, this seems to be related to expiring snapshots. It seems that once the snapshot that created a file has been expired, that file can no longer be merged with adjacent files. I'm not sure whether this is expected behavior.
Steps:
1. A series of many inserts into a table, creating a number of small files
2. merge_adjacent_files(), successfully combining all the small files
3. Expire snapshots before "now"
4. Perform another series of small inserts, creating small files
5. merge_adjacent_files(): at this point only the files created in step 4 get combined, and they do not combine with the larger file from step 2
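
For anyone who wants to try this, the steps above could be sketched roughly as the following DuckDB script. This is an untested outline rather than my actual test application: the catalog name `lake`, the table `events`, and the paths `repro.ducklake` and `repro_data/` are made up for illustration, and I'm assuming the `ducklake_merge_adjacent_files`, `ducklake_expire_snapshots`, and `ducklake_list_files` functions behave as described in the DuckLake documentation.

```sql
-- Hypothetical names throughout: repro.ducklake, repro_data/, lake, events.
INSTALL ducklake;
LOAD ducklake;
ATTACH 'ducklake:repro.ducklake' AS lake (DATA_PATH 'repro_data/');
USE lake;

-- Unpartitioned table, as in the scenario described above.
CREATE TABLE events (id INTEGER, payload VARCHAR);

-- Step 1: several small inserts, each committed as its own snapshot.
INSERT INTO events VALUES (1, 'a');
INSERT INTO events VALUES (2, 'b');
INSERT INTO events VALUES (3, 'c');

-- Step 2: compact the small files.
CALL ducklake_merge_adjacent_files('lake');

-- Step 3: expire every snapshot older than "now".
CALL ducklake_expire_snapshots('lake', older_than => now());

-- Step 4: another round of small inserts.
INSERT INTO events VALUES (4, 'd');
INSERT INTO events VALUES (5, 'e');
INSERT INTO events VALUES (6, 'f');

-- Step 5: compact again.
CALL ducklake_merge_adjacent_files('lake');

-- Inspect which data files the current snapshot of the table references.
FROM ducklake_list_files('lake', 'events');
```

Commenting out step 3 and rerunning against a fresh catalog should show whether expiring snapshots is what prevents the second merge from including the larger file.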