Holy lake, this looks like a game changer for the data lake ecosystem. Great job @Mytherin and team for prioritizing simplicity and maintainability. We are sold on this and are deploying it to production. (We are aware the docs say it's not production-ready, and we are OK with dealing with changes/migrations.)
Here are a couple of observations/questions during our testing.
Partitioned tables need to be loaded in a particular order for efficiency. This could be mentioned in the documentation under "Best practices".
Let's say we have a ducklake table called "event_history"
CREATE TABLE event_history
(
    event_date DATE NOT NULL,
    country    VARCHAR NOT NULL,
    user_id    VARCHAR NOT NULL,
    data       JSON
);
ALTER TABLE event_history SET PARTITIONED BY (event_date, country);
CALL ducklake_test.set_option('parquet_compression', 'zstd');
CALL ducklake_test.set_option('parquet_version', '2');
CALL ducklake_test.set_option('target_file_size', '100MB'); -- HAS NO EFFECT FOR PARTITIONED TABLE
When we insert or upsert data from another source and do not order by the partition columns, we end up with many small files even when the data could have fit in a single large file.
insert into event_history (event_date,
                           country,
                           user_id,
                           data)
select load_date as event_date,
       country,
       user_id,
       data
from postgres.postgres_event_history
where load_date = '2025-07-08'
order by load_date, country, user_id -- WITHOUT THIS WE END UP WITH MANY SMALL FILES
;
This important loading step could be mentioned under best practices.
What's the best practice for query performance when filtering on non-partitioned columns like user_id? Apart from min/max pruning, is there anything to be done from a data pipeline standpoint to squeeze out more performance?
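To make the question concrete, the kind of pipeline-side maintenance we could imagine is periodically rewriting a partition ordered by user_id, so the files within that partition end up with tight, non-overlapping user_id min/max ranges. This is only a rough sketch we haven't validated (the date/country values are placeholders), not an established DuckLake recipe:

-- rough re-clustering sketch (unvalidated): rewrite one partition ordered by user_id
-- so that min/max pruning on user_id becomes selective within that partition
create temp table tmp_partition as
    select * from event_history
    where event_date = date '2025-07-08' and country = 'US';

begin transaction;
delete from event_history
    where event_date = date '2025-07-08' and country = 'US';
insert into event_history
    select * from tmp_partition
    order by user_id;  -- files written per partition now cover narrow user_id ranges
commit;

drop table tmp_partition;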
When the loading process into data storage (R2) is killed (manually or externally), the orphaned files in the R2 bucket are not cleaned up, and the metadata tables have no entries for them.
The ducklake metadata table ducklake_files_scheduled_for_deletion wouldn't have entries because the transaction never finished.
What's the best practice to deal with orphaned or aborted pending files in case of fatal errors?
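For now, the only approach we can think of is an out-of-band reconciliation job that lists the bucket and anti-joins against what the catalog knows about. Below is a rough sketch of that idea. It assumes the metadata catalog is attached as __ducklake_metadata_ducklake_test, that the data-file metadata lives in ducklake_data_file as in the DuckLake spec, and that the bucket/prefix in the glob is replaced with the real data path; the catalog may store paths relative to the data path, so the join may need adjusting:

-- rough orphan-file reconciliation sketch (assumptions noted above)
with bucket_files as (
    select file as path
    from glob('r2://my-bucket/event_history/**/*.parquet')  -- placeholder bucket/prefix
),
known_files as (
    select path from __ducklake_metadata_ducklake_test.ducklake_data_file
    union all
    select path from __ducklake_metadata_ducklake_test.ducklake_delete_file
    union all
    select path from __ducklake_metadata_ducklake_test.ducklake_files_scheduled_for_deletion
)
select b.path
from bucket_files b
left join known_files k on b.path like '%' || k.path  -- tolerate relative vs. absolute paths
where k.path is null;  -- files the catalog has never heard of: orphan candidates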