Holy lake, this looks like a game changer for the data lake ecosystem. Great job @Mytherin and team for prioritizing simplicity and maintainability. We are sold on this and are deploying it to production. (We are aware the docs say it's not production-ready, and we are OK with dealing with changes/migrations.)
Here are a couple of observations/questions during our testing.
Partitioned tables need to be loaded in a particular order for efficiency. This could be mentioned in the documentation under "Best practices".
Let's say we have a ducklake table called "event_history"
CREATE TABLE event_history
(
    event_date DATE NOT NULL,
    country    VARCHAR NOT NULL,
    user_id    VARCHAR NOT NULL,
    data       JSON
);
ALTER TABLE event_history SET PARTITIONED BY (event_date, country);
CALL ducklake_test.set_option('parquet_compression', 'zstd');
CALL ducklake_test.set_option('parquet_version', '2');
CALL ducklake_test.set_option('target_file_size', '100MB'); -- HAS NO EFFECT FOR PARTITIONED TABLE
When we insert or upsert data from another source and do not order by the partition columns, we end up with many small files even when the data could have fit in a single large file.
insert into event_history (event_date,
                           country,
                           user_id,
                           data)
select load_date as event_date,
       country,
       user_id,
       data
from postgres.postgres_event_history
where load_date = '2025-07-08'
order by load_date, country, user_id -- WITHOUT THIS WE END UP WITH MANY SMALL FILES
;
This important loading step could be mentioned under best practices.
What's the best practice for query performance when filtering on non-partitioned columns like user_id? Apart from min/max pruning, is there anything to be done from a data pipeline standpoint to squeeze out more performance?
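To make the question concrete, the kind of pipeline-side maintenance we could imagine is periodically rewriting a partition ordered by user_id, so the files within that partition end up with tight, non-overlapping user_id min/max ranges. This is only a rough sketch we haven't validated (the date/country values are placeholders), not an established DuckLake recipe:

-- rough re-clustering sketch (unvalidated): rewrite one partition ordered by user_id
-- so that min/max pruning on user_id becomes selective within that partition
create temp table tmp_partition as
    select * from event_history
    where event_date = date '2025-07-08' and country = 'US';

begin transaction;
delete from event_history
    where event_date = date '2025-07-08' and country = 'US';
insert into event_history
    select * from tmp_partition
    order by user_id;  -- files written per partition now cover narrow user_id ranges
commit;

drop table tmp_partition;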
When the loading process into data storage (R2) is killed (manually or externally), the orphaned files in the R2 bucket are not cleaned up, and the metadata tables have no entries for them.
The ducklake metadata table ducklake_files_scheduled_for_deletion wouldn't have entries because the transaction never finished.
What's the best practice to deal with orphaned or aborted pending files in case of fatal errors?
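For now, the only approach we can think of is an out-of-band reconciliation job that lists the bucket and anti-joins against what the catalog knows about. Below is a rough sketch of that idea. It assumes the metadata catalog is attached as __ducklake_metadata_ducklake_test, that the data-file metadata lives in ducklake_data_file as in the DuckLake spec, and that the bucket/prefix in the glob is replaced with the real data path; the catalog may store paths relative to the data path, so the join may need adjusting:

-- rough orphan-file reconciliation sketch (assumptions noted above)
with bucket_files as (
    select file as path
    from glob('r2://my-bucket/event_history/**/*.parquet')  -- placeholder bucket/prefix
),
known_files as (
    select path from __ducklake_metadata_ducklake_test.ducklake_data_file
    union all
    select path from __ducklake_metadata_ducklake_test.ducklake_delete_file
    union all
    select path from __ducklake_metadata_ducklake_test.ducklake_files_scheduled_for_deletion
)
select b.path
from bucket_files b
left join known_files k on b.path like '%' || k.path  -- tolerate relative vs. absolute paths
where k.path is null;  -- files the catalog has never heard of: orphan candidates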