Replies: 2 comments 1 reply
-
I think it would be great if Spark could work with the DuckLake catalog directly, without relying on JDBC and the DuckLake extension. Delete files should be registered in the ducklake_delete_file table, and rows listed in delete files should be filtered out when running queries.
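A minimal PySpark sketch of what such a catalog-driven scan could look like, assuming a PostgreSQL-hosted catalog and the DuckLake catalog table/column names as I understand them (ducklake_data_file, ducklake_delete_file, begin_snapshot, end_snapshot, data_file_id, path; worth double-checking against the spec). Only the metadata lookup touches the catalog database; the data path would stay in Spark's native Parquet reader:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("ducklake-catalog-poc").getOrCreate()

# Hypothetical connection details for a PostgreSQL-hosted DuckLake catalog.
catalog_url = "jdbc:postgresql://catalog-host:5432/ducklake"
props = {"user": "ducklake", "password": "...", "driver": "org.postgresql.Driver"}

def read_catalog(table: str):
    return spark.read.jdbc(catalog_url, table, properties=props)

snapshot_id = 42  # the snapshot we want to read (placeholder)

def live_at(df, snap):
    # A file is visible at `snap` if it was added at or before that snapshot
    # and has not been removed yet (end_snapshot is NULL or later).
    return df.where(
        (F.col("begin_snapshot") <= snap)
        & (F.col("end_snapshot").isNull() | (F.col("end_snapshot") > snap))
    )

data_files = live_at(read_catalog("ducklake_data_file"), snapshot_id)
delete_files = live_at(read_catalog("ducklake_delete_file"), snapshot_id)

# Pair each data file with the delete file (if any) that applies to it.
scan_plan = (
    data_files.alias("d")
    .join(delete_files.alias("x"), F.col("d.data_file_id") == F.col("x.data_file_id"), "left")
    .select(F.col("d.path").alias("data_path"), F.col("x.path").alias("delete_path"))
)
```

Each row of `scan_plan` is then one unit of work: read `data_path` with Spark's Parquet reader and drop the rows listed in `delete_path` (see the position-based sketch further down the thread).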
-
While the question about rowids, and the intention to make them better, still stand, I have further thoughts on Spark and DuckLake. If we keep thinking about using Spark with DuckLake, we inevitably hit this dilemma: should we go the DeltaLake way (a DeltaTable object) or the Iceberg way (pure SQL)? Either path might lead us away from simplicity (the DuckLake way).

Why are we talking about Spark in the first place? Distributed processing (scale-out). So what about not reinventing the wheel and taking the best from examples that already exist?

Apache Hive: one of its unique features is the "execution engine" concept (Hive can run over MapReduce, Spark and Tez). It sounds strange, but Hive-on-Spark runs faster than Spark itself, at least on some queries (I showed this to students during my Hadoop course). What if we provide a "Spark execution engine" for DuckLake?

Spark Connect: communication between driver and executors in Spark is bidirectional and far too complicated. That is not the only reason for Spark Connect, but I consider it the major one. A Spark Connect app communicates with the driver via gRPC and Spark logical plans (which were part of Spark anyway): it sends a plan to the driver and receives back Arrow batches (to put it simply).

Let the DuckLake metadata server create a (detailed) execution plan, and let the execution engine of choice (e.g. Spark) execute it. This way we keep SQL and allow simple (I hope) addition of whatever execution engine is required.
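To make the "plan in, results out" idea a bit more concrete, here is a deliberately tiny sketch. The plan format is invented purely for illustration (nothing like it exists in DuckLake today), and the "execution engine" is just plain PySpark; delete-file handling is omitted for brevity:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ducklake-plan-exec-poc").getOrCreate()

# Hypothetical plan, as a metadata server might hand it to an execution engine:
# a list of file scans plus a projection and a filter (all paths are placeholders).
plan = {
    "scans": [
        {"data_path": "s3://bucket/t/data_0.parquet", "delete_path": None},
        {"data_path": "s3://bucket/t/data_1.parquet", "delete_path": "s3://bucket/t/del_1.parquet"},
    ],
    "projection": ["id", "amount"],
    "filter": "amount > 100",
}

def execute(spark, plan):
    """Turn the plan into a Spark job (delete files ignored here for brevity)."""
    parts = [spark.read.parquet(scan["data_path"]) for scan in plan["scans"]]
    df = parts[0]
    for part in parts[1:]:
        df = df.unionByName(part)
    return df.where(plan["filter"]).select(*plan["projection"])

result = execute(spark, plan)
```

The driver side of such an engine would then stream `result` back to the client as Arrow batches, which is essentially the Spark Connect contract described above.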
-
One can scale any of DuckLake's components (storage, compute, metadata); let us consider one dimension here: compute.
There might be situations where the data size is such that a distributed engine, simply by being distributed, is faster than the current DuckDB extension implementation. To be concrete, consider Apache Spark, a well-known and mature distributed engine.
Saying "working at scale" implies the word "well": working at scale well. In a big-data context, this "well" more or less means "well from a distributed-processing standpoint".
What do we have now in the case of Spark:
(I am a Spark person and can explain the statements above separately if necessary, AND I am a DuckDB/DuckLake fan; I wish to make them better.)
I love DuckLake's simplicity; Iceberg/Delta are way too complicated. It would be great to have a native way to work with DuckLake from Spark (if this is possible, it proves the simplicity; if it is possible and can be done efficiently, DuckLake scales!). Being Spark-aware, I tried to do a sort of POC (with PySpark).
After just a while I ran into this: what is ROWID (it seems undocumented), and how can I work with rowids at scale (i.e. from Spark)? I consider this a crucial point for DuckLake adoption.
In my "happy-path" starting example I have two data files and one delete file: how do I efficiently apply the delete file to the first data file with just Spark and the metadata?
I have some ideas, but before that I have this question: is there any documentation/description of rowids? The "row ID" concept in a distributed world is a bit complicated (generally speaking)...
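For the happy-path example above, here is a minimal sketch of one possible approach in PySpark. It assumes (an assumption about the on-disk layout, not something verified against the spec) that the delete file is a Parquet file whose `pos` column holds the 0-based positions of deleted rows within the data file; all paths are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import LongType, StructField, StructType

spark = SparkSession.builder.appName("ducklake-delete-poc").getOrCreate()

data_path = "s3://bucket/t/data_file_1.parquet"      # placeholder paths
delete_path = "s3://bucket/t/delete_file_1.parquet"

# Read the data file and attach each row's 0-based position within the file.
# zipWithIndex numbers rows in partition order, which for a single-file read
# follows the file's row order in practice; on Spark 3.5+ the Parquet reader's
# _metadata.row_index column could be used instead, if available.
data_df = spark.read.parquet(data_path)
with_pos_schema = StructType(data_df.schema.fields + [StructField("pos", LongType(), False)])
indexed = spark.createDataFrame(
    data_df.rdd.zipWithIndex().map(lambda ri: tuple(ri[0]) + (ri[1],)),
    with_pos_schema,
)

# Positions of deleted rows, assumed to live in a `pos` column of the delete file.
deleted_pos = spark.read.parquet(delete_path).select("pos")

# Keep only the rows whose position is not marked as deleted.
live_rows = indexed.join(deleted_pos, on="pos", how="left_anti").drop("pos")
```

Whether the join key should be a per-file position like this or a global rowid is exactly where documentation of the rowid concept would help.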