Replies: 6 comments 34 replies
-
I think this is generally not the point of OLAP systems on top of Lakehouse formats like DuckLake. Delta and Iceberg also rely on column statistics and other probabilistic structures to determine whether files should be read or not. Usually the focus is on the data layout to improve data locality (for example by ordering according to multiple dimensions), and this will generally improve file skipping and therefore reduce IO. We will probably dive deeper into optimizations in the future, but I am not sure column indexing at the file level will be the solution!
-
I'm not sure I follow. What is it that is "not the point"? What I'm mostly after is finding optimizations that can drastically reduce the number of files that need to be scanned. All those data lake formats already have some version of DuckLake's column statistics, as you say, so expanding on that idea (whether it takes the shape I'm outlining here or another) seems aligned with what's already being done, no?
I get that, but I think this forces a tradeoff that could be partly mitigated by more granular column statistics (or indexes, as I've called them here; maybe "granular column statistics" is a better name for the idea). Data layout makes queries that are known ahead of time incredibly fast, but when doing more exploratory work around a dataset, the data layout you have might only be a partial fit. As far as I know, that leaves the choice of either accepting scans over most of the files for the other dimensions, or maintaining additional copies of the data with different layouts.
I'd love to see something more flexible, where an initial data layout can bring you a lot further without having to maintain multiple copies of the data. For context, if that's useful: what I'm looking at is mostly observability/events data. We have the data sorted by the event timestamp, which is a great default sort for that use case. But the ability to efficiently drill down on all the other dimensions of those events is very valuable too. And since it's likely that more than one dimension is going to be used to investigate any interesting occurrence in the data, I don't know that there's really a single data layout that can match that. Thanks for the response.
-
These seem to include some interesting overviews, bits, and concepts:
-
I'm considering a DuckLake side project and wondering if maybe adding a Bloom filter at the file/column level would be feasible for a pretty rookie C++ person... Would the `extra_stats` column in the DuckLake metadata catalog be a good place for something like that to live? (That was added here for spatial indexes: #412)
-
Hi Alex:
Bloom filters in DuckLake would work especially well in situations where queries can skip whole files. But on the other hand they should not be too large, and these two constraints often do not line up. For instance, skipping would work very well for key=value queries on unique-key columns, but unique columns have the maximum number of distinct values, which needs a big Bloom filter (think at least 5 bits per tuple, and that is already cutting it too close). On the other hand, if a column has few distinct values, the probability that a file contains any given value rises quickly (an extreme example would be a boolean column), and that implies no skipping.
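As a rough sizing sketch (standard Bloom filter math; the target rates are illustrative numbers, not from this thread): for a filter of $m$ bits holding $n$ keys, the optimal bits-per-key for a target false-positive rate $p$ is

$$\frac{m}{n} \approx \frac{\log_2(1/p)}{\ln 2} \approx 1.44\,\log_2\frac{1}{p},$$

which comes out to roughly 4.8 bits per key at $p = 10\%$ and 9.6 bits per key at $p = 1\%$, so 5 bits per tuple corresponds to roughly a 10% false-positive rate, consistent with the "cutting it too close" remark above.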
Where it can help is with temporal shifts in the value distribution (like iPhone 17 sales starting only on Sept 19), in cases where the newly appearing values are not necessarily ordered (they fall in the middle of a min-max range, making min-max skipping ineffective). Probably you had that in mind already?
Peter
-
It would be nice to have the row and column ranges within the files as well. So something like
-
👋 I'm curious about the possibility of adding something akin to indexes to DuckLake. For some types of fields and queries, it could significantly reduce the work needed to answer a query.
Right now, for each underlying Parquet file, DuckLake knows the range of values for each column represented in the file. This info is stored in `ducklake_file_column_statistics`, whose schema looks roughly like the sketch below.
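A hedged sketch of that table (column names recalled from the DuckLake specification and possibly out of date; check the spec for the authoritative definition):

```sql
-- Approximate shape of ducklake_file_column_statistics in the metadata catalog.
CREATE TABLE ducklake_file_column_statistics (
    data_file_id      BIGINT,   -- which Parquet file these stats describe
    table_id          BIGINT,
    column_id         BIGINT,
    column_size_bytes BIGINT,
    value_count       BIGINT,
    null_count        BIGINT,
    min_value         VARCHAR,  -- used for min/max file skipping
    max_value         VARCHAR,
    contains_nan      BOOLEAN
    -- newer versions also add an extra_stats column (see the #412 mention earlier in the thread)
);
```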
So we can identify which files to query based on the min and max values of a column. This works well when the data is at least mostly sorted based on that column. In our case, data is mostly sorted based on some timestamp and queries that include that timestamp as a filter can quickly get to the right files.
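As a hedged illustration of that pruning (the `column_id` and the date literals are made up for the example, and `min_value`/`max_value` are assumed to be stored as lexicographically sortable strings), the metadata-level check for a timestamp filter might look like:

```sql
-- Keep only the files whose [min_value, max_value] range overlaps the filter range.
SELECT data_file_id
FROM ducklake_file_column_statistics
WHERE column_id = 3                        -- hypothetical id of the event timestamp column
  AND min_value <= '2022-06-30 23:59:59'   -- file starts before the end of the filter range
  AND max_value >= '2022-06-01 00:00:00';  -- file ends after the start of the filter range
```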
One of the fields in the table is a `user_id`; its unique values number in the millions, while the number of records in the table is in the many billions. Any given Parquet file is extremely likely to contain at least one row where the `user_id` is at the beginning of the range (`aaa...`) and one where the `user_id` is at the end of the range (`zzz...`). This makes the min/max column statistic close to useless when we want to query data for a specific `user_id`.
But with the number of distinct `user_id` values being multiple orders of magnitude lower than the number of records, it seems like there is an opportunity to store, in DuckLake's metadata RDBMS, an index of which files include each specific `user_id`. Currently a query like `SELECT * WHERE user_id = 'maaaaaaa'` has to scan every single Parquet file, because that value sits right in the middle of the range and every file will have at least one record with a `user_id` before it and one after it.
I'm not sure yet exactly which form this would take; at this point I'm mostly gauging whether there is interest in this beyond just myself. A very basic idea is that it could take this kind of shape:
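Something along these lines (hypothetical names and types, just to make the idea concrete; it matches the description that follows):

```sql
-- One row per indexed value, with the list of data files that contain it.
CREATE TABLE ducklake_column_value_index (
    table_id      BIGINT,
    column_id     BIGINT,     -- e.g. the user_id column
    column_value  VARCHAR,    -- e.g. 'maaaaaaa'
    data_file_ids BIGINT[]    -- e.g. [1, 2, 7, 89, 645]
);
```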
At first glance, listing all the `data_file_id`s in a single column seems sufficient (versus adding one record per `column_value`/`data_file_id` combination) and would keep the overall index table a lot smaller. I'm also not clear whether each index should be its own table in the backing RDBMS or whether they should all live within one giant table.
With something like this in place, a query like `SELECT * WHERE user_id = 'maaaaaaa'` could restrict itself to only 5 files (`1, 2, 7, 89, 645`) instead of thousands.
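As a hedged sketch of the metadata lookup against the hypothetical table above (the ids are made up; `unnest` as in DuckDB/PostgreSQL):

```sql
-- Resolve the candidate files for user_id = 'maaaaaaa' before touching any Parquet data.
SELECT unnest(data_file_ids) AS data_file_id
FROM ducklake_column_value_index
WHERE table_id = 1              -- made-up table id
  AND column_id = 7             -- made-up id of the user_id column
  AND column_value = 'maaaaaaa';
```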
What about non-sparse columns?
The case I presented here is for a column like `user_id`, which has a finite set of values that come back in every data file. But what if we wanted to create an index on something like a timestamp? Sometimes you'll have multiple timestamps on a single record, and the data files will only be sorted on one of them. For example, say the data contains the timestamp of an event, the `user_id` concerned by the event, and the `created_at` timestamp for that `user_id`. If my service has been operating from 2015 to 2025, then looking for users created in June 2022 within that data is again likely to require scanning absolutely all files.
But mapping that `user_created_at` column to the closest 30-minute interval would create very small ranges that are unlikely to all be present in every data file, and 10 years of 30-minute intervals is under 200k distinct values to index.
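A hedged sketch of that bucketing (DuckDB-flavoured functions; the table name is made up). Ten years at 48 buckets per day is about 10 × 365 × 48 ≈ 175,000 distinct values, consistent with the sub-200k figure:

```sql
-- The value that would be indexed for user_created_at is the start of its
-- 30-minute bucket (1800 seconds) rather than the raw timestamp.
SELECT
    user_created_at,
    to_timestamp(floor(epoch(user_created_at) / 1800) * 1800) AS user_created_at_bucket
FROM events;  -- made-up table name
```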
Thoughts?
Would there be any interest in exploring this further and trying to bring something like this to DuckLake?
I have not found any conversation around that in my searches.