Method for user performed compaction? #329
javafanboy
started this conversation in
Ideas
Replies: 3 comments 5 replies
-
I'm working on a super rough proof of concept (https://github.com/Alex-Monahan/ducklake/tree/ordered-compaction) for a way to order files during compaction. The initial plan would be to use the existing logic to determine how many files to compact and then sort within those pre-selected files. A few open questions / thoughts:
- I was thinking of calling this default_sort since this is approximate (and that is similar to Iceberg's approach: https://iceberg.apache.org/spec/#sorting).
- I was planning to add an optional parameter to ducklake_merge_adjacent_files, as well as a setting (so that CHECKPOINT can pick it up as well). I am super flexible on naming. Maybe default_sort? (A rough sketch of possible syntax is included below.)
- Should the setting be settable only at the table level? Or should it follow the typical global/schema/table hierarchy?
- In the future, I could see another optimization for picking which files to combine for each specific compaction operation (each destination file) using the file-level statistics.
  - If we are sorting by column A, we could first sort by the min (or max, or (min+max)/2) of column A according to the file stats. Then we could at least approximately sort the files prior to selecting the right ones to compact according to the file size cutoff.
  - My thought is that it would not take too much computation/overhead to do this since the stats would already be in the metadata DB and not at the file level.

cc @guillesd - I would love your input!
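For concreteness, here is a minimal sketch of how the proposed knob might be wired up. Both the default_sort parameter on ducklake_merge_adjacent_files and the set_option-style table setting are proposals from this thread / assumptions for illustration, not existing DuckLake API:

```sql
-- Hypothetical: pass the sort expression directly to the merge call ...
CALL ducklake_merge_adjacent_files('my_table', default_sort => 'event_time');

-- ... or configure it as a table-level setting so that CHECKPOINT-triggered
-- compaction picks it up as well (spelling of the option call is illustrative).
CALL my_lake.set_option('default_sort', 'event_time', table_name => 'my_table');
CHECKPOINT;
```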
4 replies
-
Not sure what you mean by "approximate" sorting, but IMHO the sorting of the merged data should be "exact", as even a few misplaced values have a quite negative effect on row group selectivity...
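To make the selectivity point concrete: a single misplaced value widens a row group's min/max range, so a filter that could otherwise skip the row group can no longer do so. A minimal way to observe this in DuckDB (file name and column name are illustrative; parquet_metadata() column names are as of recent DuckDB versions):

```sql
-- Per-row-group statistics for the sort column of a (hypothetical) data file.
-- One late timestamp landing in an otherwise early row group stretches its
-- stats_max_value, so a predicate such as ts >= '2025-06-01' can no longer
-- prune that row group.
SELECT row_group_id, stats_min_value, stats_max_value
FROM parquet_metadata('events.parquet')
WHERE path_in_schema = 'ts'
ORDER BY row_group_id;
```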
1 reply
-
Assuming we are ingesting new data by date/time (and possibly even partitioning on, say, hours), then reading as many "adjacent" objects (by date/time) as we can safely hold in memory, sorting them, and writing a new, more optimally sized and internally well "indexed" object will work well...
…On Sun, Oct 5, 2025, 14:36 Alex Monahan wrote:

Good question! I am a bit worried about trying to sort all uncompacted data in a single sort operation. It may be a very large operation that is not feasible to complete. So, instead I was planning to sort each destination file, not the entire set of uncompacted files. That would make sorting exact within a file, but approximate at the table level. Here is some of my thought process from this thread (#389).

1. Sort all data in files eligible for compaction
   - Pros: Gets the best performance for reads
   - Cons: Takes potentially quite a lot of memory and CPU during compaction, especially if compacting for the first time or not regularly
2. Sort data within the files selected to be compacted into a single file
   - Ex: 25 total files being compacted, but every 5 files get compacted into 1 file since that aligns with the target size. This would just sort each set of 5 files independently.
   - Pros: Relatively bounded memory use (could be configured with the target file size)
   - Cons: Less optimal performance for reads. Could drive towards a larger file size to get more sorting benefits.

Iceberg uses approach 2, based on my reading of the Iceberg spec (https://iceberg.apache.org/spec/#sorting).
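To illustrate approach 2 together with the earlier idea of pre-ordering candidate files by their column statistics, here is a minimal sketch against the DuckLake metadata catalog. The table and column names (ducklake_data_file, ducklake_file_column_statistics, min_value, file_size_bytes) and the literal IDs are assumptions about the metadata layout and may differ by version; the bucketing logic is purely illustrative:

```sql
-- Approximate plan: order candidate data files by the min of the sort column,
-- taken from the metadata DB only (no file I/O), then group them into
-- compaction buckets so each destination file lands near the target size.
-- Each bucket would then be rewritten with an exact ORDER BY, giving exact
-- order within a file and approximate order across the table.
WITH candidates AS (
    SELECT f.data_file_id,
           f.file_size_bytes,
           s.min_value AS sort_key_min
    FROM ducklake_data_file f
    JOIN ducklake_file_column_statistics s
      ON s.data_file_id = f.data_file_id
     AND s.column_id = 42                        -- id of the sort column (illustrative)
    WHERE f.table_id = 7                         -- table being compacted (illustrative)
)
SELECT data_file_id,
       sort_key_min,
       CAST(floor(
            (SUM(file_size_bytes) OVER (ORDER BY sort_key_min ROWS UNBOUNDED PRECEDING)
             - file_size_bytes) / (100 * 1024 * 1024)    -- ~100 MB target file size
       ) AS INTEGER) AS compaction_bucket
FROM candidates
ORDER BY sort_key_min;
```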
0 replies
-
I think it would be really neat if, similarly to adding a file/object, there existed a method for compacting one or more files/objects belonging to the same table/partition into a new file/object (as with the built-in support, both the original and the new file would be retained until snapshots are removed).
This would provide users with fine-grained control over which files/objects to compact and, in particular, allow custom sorting of the compacted file.
Sorting of compacted files is desirable for optimal performance in cases where ordering is based on a composite key - just appending several small, individually sorted files may in this case produce a compacted file whose row group metrics will not provide an efficient performance improvement...
The concept of rowid used in DuckLake may, depending on how it is implemented, be problematic for implementing sorting of compacted files...
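A minimal sketch of what such a user-performed compaction could look like in DuckDB SQL. The file paths, the composite key columns, and the registration call (written here as ducklake_add_data_files) are assumptions for illustration; whether and how the catalog would atomically retire the source files is exactly the open question of this discussion:

```sql
-- Read the small source files, sort exactly by the composite key, and write
-- one larger, internally well-ordered Parquet file with bounded row groups.
COPY (
    SELECT *
    FROM read_parquet([
        's3://bucket/my_table/part-0001.parquet',   -- illustrative paths
        's3://bucket/my_table/part-0002.parquet',
        's3://bucket/my_table/part-0003.parquet'
    ])
    ORDER BY tenant_id, event_time                  -- hypothetical composite key
) TO 's3://bucket/my_table/compacted-0001.parquet'
  (FORMAT parquet, ROW_GROUP_SIZE 122880);

-- Registering the result and retiring the source files would need catalog
-- support; the call below is hypothetical and only indicates the shape of it.
-- CALL ducklake_add_data_files('my_lake', 'my_table',
--                              's3://bucket/my_table/compacted-0001.parquet');
```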