Replies: 5 comments 5 replies
-
I'm guessing it's because DuckLake is designed to be engine-agnostic. Parquet is an open format supported by pretty much every query engine in existence, whether it be Spark, Trino, Polars, etc. DuckDB files, on the other hand, are supported only by DuckDB. Using DuckDB files for DuckLake creates a vendor lock-in situation where people are forced to use DuckDB if they want to use DuckLake. This goes against one of the main tenets of lakehouses.
-
DuckDB files are also considerably larger than equivalent Parquet files.
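For example, a quick way to check this yourself is to write the same data out in both formats and compare the file sizes on disk. A minimal sketch (table and file names are placeholders):

```sql
-- Generate some sample data, store it both as a native DuckDB file and as Parquet,
-- then compare the resulting file sizes on disk.
ATTACH 'example.duckdb' AS db;
CREATE TABLE db.events AS
    SELECT range AS id,
           range % 100 AS category,
           random() AS value
    FROM range(10000000);
CHECKPOINT db;                                        -- make sure the DuckDB file is fully written

COPY db.events TO 'events.parquet' (FORMAT PARQUET);  -- same rows as a Parquet file
-- Compare the sizes of example.duckdb and events.parquet (e.g. with ls -lh).
```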
-
DuckDB files could be used instead of Parquet files in theory, and we might optionally enable this in the future, but many of the benefits of DuckDB files are not present for DuckLake. DuckLake manages schemas and tables outside of the actual files, so we don't need to fit multiple tables into one file. The ACID properties that DuckDB files have are also not relevant when writing to blob storage such as S3, since we cannot edit the files in-place; they are instead written once and then never modified again. We went for Parquet as a data format for various reasons.
That being said, DuckDB database files work very well as a catalog server for storing the metadata of DuckLake.
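For instance, the getting-started flow looks roughly like this, with a plain DuckDB file holding the DuckLake metadata and Parquet files holding the data (paths and names are placeholders; check the DuckLake docs for the exact options):

```sql
-- The DuckLake catalog lives in an ordinary DuckDB database file,
-- while the table data is written out as Parquet files under DATA_PATH.
INSTALL ducklake;
LOAD ducklake;

ATTACH 'ducklake:metadata.ducklake' AS my_lake (DATA_PATH 'data_files/');
USE my_lake;

CREATE TABLE demo AS SELECT 42 AS answer;   -- rows land in Parquet, metadata in metadata.ducklake
```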
-
There are features that are supported in DuckLake but not in DuckDB files, for example time travel and replication. But you're right: if end-user value, rather than interoperability with large bodies of existing data, were the main concern, the path you're suggesting would make more sense, and Mark isn't ruling it out in the future. Something I wrote on the topic a couple of days ago: https://x.com/arundsharma/status/1933564289220272315
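As an illustration of the time-travel point, DuckLake lets you query earlier snapshots of a table. A rough sketch (table name, version number, and timestamp are placeholders; check the DuckLake docs for the exact syntax):

```sql
-- Query an earlier snapshot of a DuckLake table by version or by timestamp.
SELECT * FROM my_lake.demo AT (VERSION => 2);
SELECT * FROM my_lake.demo AT (TIMESTAMP => TIMESTAMP '2025-06-01 00:00:00');
```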
-
Thank you for all the good work on DuckDB and now DuckLake. DuckLake could fulfill a nice bridge function between OLTP and OLAP. DuckDB may be used to handle transactional workloads. In that scenario, it seems preferable to persist to native DuckDB files on block-level storage (e.g. EBS,...). It would be great to be able to register the metadata of these transactional databases in DuckLake. Adding support for DuckDB files has the potential to result in an elegant redefinition of the medallion architecture: a foundation for lineage straight back to the transactional data. This aligns well with the design objectives.
-
I am trying to understand why we don't just use DuckDB files instead of Parquet files in DuckLake.
We can already connect to multiple DuckDB databases using ATTACH, so it seems technically possible. In my testing, performance is better when the data is natively in DuckDB format versus Parquet.
Each DuckDB file can hold multiple schemas, tables, and views in a single file. Due to schema evolution, table schemas may change over time, but with UNION ALL BY NAME we have a way to merge the same data from multiple DuckDB tables with different schemas.
Implementation: we would have a storage location with multiple DuckDB files (similar to Parquet), but the main difference is that while a Parquet file gives one table, a DuckDB file can hold multiple tables/views, which makes it possible to create a data model that is not possible with Parquet. The main job of the DuckLake extension would be to provide the ability to SELECT * from a pool of DuckDB files, e.g. select * from read_duckdb('/store/*.duckdb') (a rough sketch follows below).
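For illustration, here is a minimal sketch of this idea using only features that exist today, i.e. ATTACH plus UNION ALL BY NAME; the file, schema, and table names are placeholders, and read_duckdb is hypothetical and not used here:

```sql
-- Sketch: treat a pool of DuckDB files as one logical table.
ATTACH 'store/part1.duckdb' AS part1 (READ_ONLY);
ATTACH 'store/part2.duckdb' AS part2 (READ_ONLY);

CREATE VIEW all_events AS
    SELECT * FROM part1.main.events
    UNION ALL BY NAME
    SELECT * FROM part2.main.events;   -- tolerates schema drift between the files

SELECT count(*) FROM all_events;
```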
I am sure I am missing something. Can someone shed some light on why the above approach is not sound?