Replies: 5 comments 5 replies
-
I'm guessing it's because DuckLake is designed to be engine-agnostic. Parquet is an open format supported by pretty much every query engine in existence, whether it be Spark, Trino, Polars, etc. DuckDB files, on the other hand, are supported only by DuckDB. Using DuckDB files for DuckLake creates a vendor lock-in situation where people are forced to use DuckDB if they want to use DuckLake. This goes against one of the main tenets of lakehouses.
-
DuckDB files are also considerably larger than equivalent Parquet files.
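For example, a quick way to check this yourself is to write the same data out in both formats and compare the file sizes on disk. A minimal sketch (table and file names are placeholders):

```sql
-- Generate some sample data, store it both as a native DuckDB file and as Parquet,
-- then compare the resulting file sizes on disk.
ATTACH 'example.duckdb' AS db;
CREATE TABLE db.events AS
    SELECT range AS id,
           range % 100 AS category,
           random() AS value
    FROM range(10000000);
CHECKPOINT db;                                        -- make sure the DuckDB file is fully written

COPY db.events TO 'events.parquet' (FORMAT PARQUET);  -- same rows as a Parquet file
-- Compare the sizes of example.duckdb and events.parquet (e.g. with ls -lh).
```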
-
DuckDB files could be used instead of Parquet files in theory, and we might optionally enable this in the future, but many of the benefits of DuckDB files are not present for DuckLake. DuckLake manages schemas and tables outside of the actual files, so we don't need to fit multiple tables into one file. The ACID properties that DuckDB files have are also not relevant when writing to blob storage such as S3, since we cannot edit the files in-place; they are instead written once and then never modified again. We went for Parquet as a data format for various reasons.
That being said, DuckDB database files work very well as a catalog server for storing the metadata of DuckLake.
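For instance, the getting-started flow looks roughly like this, with a plain DuckDB file holding the DuckLake metadata and Parquet files holding the data (paths and names are placeholders; check the DuckLake docs for the exact options):

```sql
-- The DuckLake catalog lives in an ordinary DuckDB database file,
-- while the table data is written out as Parquet files under DATA_PATH.
INSTALL ducklake;
LOAD ducklake;

ATTACH 'ducklake:metadata.ducklake' AS my_lake (DATA_PATH 'data_files/');
USE my_lake;

CREATE TABLE demo AS SELECT 42 AS answer;   -- rows land in Parquet, metadata in metadata.ducklake
```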
-
There are features that are supported in DuckLake but not in DuckDB files, for example time travel and replication. But you're right: if end-user value, rather than interoperability with large bodies of existing data, were the main concern, the path you're suggesting would make more sense, and Mark isn't ruling it out in the future. Something I wrote on the topic a couple of days ago: https://x.com/arundsharma/status/1933564289220272315
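As an illustration of the time-travel point, DuckLake lets you query earlier snapshots of a table. A rough sketch (table name, version number, and timestamp are placeholders; check the DuckLake docs for the exact syntax):

```sql
-- Query an earlier snapshot of a DuckLake table by version or by timestamp.
SELECT * FROM my_lake.demo AT (VERSION => 2);
SELECT * FROM my_lake.demo AT (TIMESTAMP => TIMESTAMP '2025-06-01 00:00:00');
```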
-
Thank you for all the good work on DuckDB and now DuckLake. DuckLake could fulfill a nice bridge function between OLTP and OLAP. DuckDB may be used to handle transactional workloads. In that scenario, it seems preferable to persist to native DuckDB files on block-level storage (e.g. EBS,...). It would be great to be able to register the metadata of these transactional databases in DuckLake. Adding support for DuckDB files has the potential to result in an elegant redefinition of the medallion architecture: a foundation for lineage straight back to the transactional data. This aligns well with the design objectives.
-
I am trying to understand why we don't just use DuckDB files instead of Parquet files in DuckLake.
We can already connect to multiple DuckDB databases using ATTACH, so it seems technically possible. In my testing, performance is better when the data is natively in DuckDB format versus Parquet.
Each DuckDB file can hold multiple schemas, tables, and views in a single file. Due to schema evolution, table schemas may change over time, but with UNION ALL BY NAME we have a way to merge the same data from multiple DuckDB tables with different schemas.
Implementation: we would have a storage location with multiple DuckDB files (similar to Parquet), but the main difference is that while a Parquet file gives one table, a DuckDB file can hold multiple tables/views, which makes it possible to create a data model that is not possible with Parquet. The main job of the DuckLake extension would be to provide the ability to SELECT * from a pool of DuckDB files, e.g. select * from read_duckdb('/store/*.duckdb') (a rough sketch follows below).
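For illustration, here is a minimal sketch of this idea using only features that exist today, i.e. ATTACH plus UNION ALL BY NAME; the file, schema, and table names are placeholders, and read_duckdb is hypothetical and not used here:

```sql
-- Sketch: treat a pool of DuckDB files as one logical table.
ATTACH 'store/part1.duckdb' AS part1 (READ_ONLY);
ATTACH 'store/part2.duckdb' AS part2 (READ_ONLY);

CREATE VIEW all_events AS
    SELECT * FROM part1.main.events
    UNION ALL BY NAME
    SELECT * FROM part2.main.events;   -- tolerates schema drift between the files

SELECT count(*) FROM all_events;
```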
I am sure I am missing something. Can someone shed some light on why the above approach is not sound?