Replies: 6 comments 34 replies
-
I think this is generally not the point of OLAP systems on top of Lakehouse formats like DuckLake. Delta and Iceberg also rely on column statistics and other probabilistic structures to determine whether files should be read or not. Usually the focus is on the data layout to improve data locality (for example by ordering according to multiple dimensions), and this will generally improve file skipping and therefore reduce IO. We will probably dive deeper into optimizations in the future, but I am not sure column indexing at the file level will be the solution!
-
I'm not sure I follow. What is it that is "not the point"? What I'm mostly after is finding optimizations that can drastically reduce the number of files that need to be scanned. All those data lake formats already have some version of DuckLake's column statistics, as you say, so expanding on that idea (whether it takes the shape I'm outlining here or another) seems aligned with what's already being done, no?
I get that, but I think this forces a tradeoff that could be partly mitigated by more granular column statistics (or indexes, as I've called them here; maybe "granular column statistics" is a better name for the idea). Data layout makes queries that are known ahead of time incredibly fast, but when doing more exploratory work around a dataset, the data layout you have might only be a partial fit. As far as I know, that leaves the choice of either accepting scans over most of the files for the other dimensions, or maintaining additional copies of the data with different layouts.
I'd love to see something more flexible, where an initial data layout can bring you a lot further without having to maintain multiple copies of the data. For context, if that's useful: what I'm looking at is mostly observability/events data. We have the data sorted by the event timestamp, which is a great default sort for that use case. But the ability to efficiently drill down on all the other dimensions of those events is very valuable too. And since it's likely that more than one dimension is going to be used to investigate any interesting occurrence in the data, I don't know that there's really a single data layout that can match that. Thanks for the response.
-
These seem to include some interesting overviews, bits, and concepts:
-
I'm considering a DuckLake side project and wondering if maybe adding a Bloom filter at the file/column level would be feasible for a pretty rookie C++ person... Would the `extra_stats` column in the DuckLake metadata catalog be a good place for something like that to live? (That was added here for spatial indexes: #412)
-
Hi Alex:
Bloom filters in DuckLake would work especially well in situations where queries can skip whole files. But on the other hand they should not be too large, and these two constraints often do not line up. For instance, skipping would work very well for key=value queries on unique-key columns, but unique columns have the maximum number of distinct values, which needs a big Bloom filter (think at least 5 bits per tuple, and that is already cutting it too close). On the other hand, if a column has few distinct values, the probability that a file contains any given value rises quickly (an extreme example would be a boolean column), and that implies no skipping.
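As a rough sizing sketch (standard Bloom filter math; the target rates are illustrative numbers, not from this thread): for a filter of $m$ bits holding $n$ keys, the optimal bits-per-key for a target false-positive rate $p$ is

$$\frac{m}{n} \approx \frac{\log_2(1/p)}{\ln 2} \approx 1.44\,\log_2\frac{1}{p},$$

which comes out to roughly 4.8 bits per key at $p = 10\%$ and 9.6 bits per key at $p = 1\%$, so 5 bits per tuple corresponds to roughly a 10% false-positive rate, consistent with the "cutting it too close" remark above.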
Where it can help is with temporal shifts in the value distribution (like iPhone 17 sales starting only on Sept 19), in cases where the newly appearing values are not necessarily ordered (they fall in the middle of a min-max range, making min-max skipping ineffective). Probably you had that in mind already?
Peter
-
It would be nice to have the row and column ranges within the files as well. So something like
-
👋 I'm curious about the possibility of adding something akin to indexes to DuckLake. For some types of fields and queries, it could significantly reduce the work needed to answer a query.
Right now, for each underlying Parquet file, DuckLake knows the range of values for each column represented in the file. This info is stored in `ducklake_file_column_statistics`, whose schema looks roughly like the sketch below.
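A hedged sketch of that table (column names recalled from the DuckLake specification and possibly out of date; check the spec for the authoritative definition):

```sql
-- Approximate shape of ducklake_file_column_statistics in the metadata catalog.
CREATE TABLE ducklake_file_column_statistics (
    data_file_id      BIGINT,   -- which Parquet file these stats describe
    table_id          BIGINT,
    column_id         BIGINT,
    column_size_bytes BIGINT,
    value_count       BIGINT,
    null_count        BIGINT,
    min_value         VARCHAR,  -- used for min/max file skipping
    max_value         VARCHAR,
    contains_nan      BOOLEAN
    -- newer versions also add an extra_stats column (see the #412 mention earlier in the thread)
);
```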
So we can identify which files to query based on the min and max values of a column. This works well when the data is at least mostly sorted based on that column. In our case, data is mostly sorted based on some timestamp and queries that include that timestamp as a filter can quickly get to the right files.
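As a hedged illustration of that pruning (the `column_id` and the date literals are made up for the example, and `min_value`/`max_value` are assumed to be stored as lexicographically sortable strings), the metadata-level check for a timestamp filter might look like:

```sql
-- Keep only the files whose [min_value, max_value] range overlaps the filter range.
SELECT data_file_id
FROM ducklake_file_column_statistics
WHERE column_id = 3                        -- hypothetical id of the event timestamp column
  AND min_value <= '2022-06-30 23:59:59'   -- file starts before the end of the filter range
  AND max_value >= '2022-06-01 00:00:00';  -- file ends after the start of the filter range
```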
One of the fields in the table is a `user_id`; its unique values number in the millions, while the number of records in the table is in the many billions. Any given Parquet file is extremely likely to contain at least one row where the `user_id` is at the beginning of the range (`aaa...`) and one where the `user_id` is at the end of the range (`zzz...`). This makes the min/max column statistic close to useless when we want to query data for a specific `user_id`.
But with the number of distinct `user_id` values being multiple orders of magnitude lower than the number of records, it seems like there is an opportunity to store, in DuckLake's metadata RDBMS, an index of which files include each specific `user_id`. Currently a query like `SELECT * WHERE user_id = 'maaaaaaa'` has to scan every single Parquet file, because that value sits right in the middle of the range and every file will have at least one record with a `user_id` before it and one after it.
I'm not sure yet exactly which form this would take; at this point I'm mostly gauging whether there is interest in this beyond just myself. A very basic idea is that it could take this kind of shape:
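Something along these lines (hypothetical names and types, just to make the idea concrete; it matches the description that follows):

```sql
-- One row per indexed value, with the list of data files that contain it.
CREATE TABLE ducklake_column_value_index (
    table_id      BIGINT,
    column_id     BIGINT,     -- e.g. the user_id column
    column_value  VARCHAR,    -- e.g. 'maaaaaaa'
    data_file_ids BIGINT[]    -- e.g. [1, 2, 7, 89, 645]
);
```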
At first glance, listing all the `data_file_id`s in a single column seems sufficient (versus adding one record per `column_value`/`data_file_id` combination) and would keep the overall index table a lot smaller. I'm also not clear whether each index should be its own table in the backing RDBMS or whether they should all live within one giant table.
With something like this in place, a query like `SELECT * WHERE user_id = 'maaaaaaa'` could restrict itself to only 5 files (`1, 2, 7, 89, 645`) instead of thousands.
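As a hedged sketch of the metadata lookup against the hypothetical table above (the ids are made up; `unnest` as in DuckDB/PostgreSQL):

```sql
-- Resolve the candidate files for user_id = 'maaaaaaa' before touching any Parquet data.
SELECT unnest(data_file_ids) AS data_file_id
FROM ducklake_column_value_index
WHERE table_id = 1              -- made-up table id
  AND column_id = 7             -- made-up id of the user_id column
  AND column_value = 'maaaaaaa';
```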
What about non-sparse columns?
The case I presented here is for a column like `user_id`, which has a finite set of values that come back in every data file. But what if we wanted to create an index on something like a timestamp? Sometimes you'll have multiple timestamps on a single record, and the data files will only be sorted on one of them. For example, say the data contains the timestamp of an event, the `user_id` concerned by the event, and the `created_at` timestamp for that `user_id`. If my service has been operating from 2015 to 2025, then looking for users created in June 2022 within that data is again likely to require scanning absolutely all files.
But mapping that `user_created_at` column to the closest 30-minute interval would create very small ranges that are unlikely to all be present in every data file, and 10 years of 30-minute intervals is under 200k distinct values to index.
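A hedged sketch of that bucketing (DuckDB-flavoured functions; the table name is made up). Ten years at 48 buckets per day is about 10 × 365 × 48 ≈ 175,000 distinct values, consistent with the sub-200k figure:

```sql
-- The value that would be indexed for user_created_at is the start of its
-- 30-minute bucket (1800 seconds) rather than the raw timestamp.
SELECT
    user_created_at,
    to_timestamp(floor(epoch(user_created_at) / 1800) * 1800) AS user_created_at_bucket
FROM events;  -- made-up table name
```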
Thoughts?
Would there be any interest in exploring this further and trying to bring something like this to DuckLake?
I have not found any conversation around that in my searches.