Replies: 2 comments 1 reply
-
I think it would be great if Spark could work with the DuckLake catalog directly, without relying on JDBC and the DuckLake extension. Delete files should be registered in the ducklake_delete_file table, and rows listed in delete files should be filtered out when running queries.
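A minimal PySpark sketch of what such a catalog-driven scan could look like, assuming a PostgreSQL-hosted catalog and the DuckLake catalog table/column names as I understand them (ducklake_data_file, ducklake_delete_file, begin_snapshot, end_snapshot, data_file_id, path; worth double-checking against the spec). Only the metadata lookup touches the catalog database; the data path would stay in Spark's native Parquet reader:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("ducklake-catalog-poc").getOrCreate()

# Hypothetical connection details for a PostgreSQL-hosted DuckLake catalog.
catalog_url = "jdbc:postgresql://catalog-host:5432/ducklake"
props = {"user": "ducklake", "password": "...", "driver": "org.postgresql.Driver"}

def read_catalog(table: str):
    return spark.read.jdbc(catalog_url, table, properties=props)

snapshot_id = 42  # the snapshot we want to read (placeholder)

def live_at(df, snap):
    # A file is visible at `snap` if it was added at or before that snapshot
    # and has not been removed yet (end_snapshot is NULL or later).
    return df.where(
        (F.col("begin_snapshot") <= snap)
        & (F.col("end_snapshot").isNull() | (F.col("end_snapshot") > snap))
    )

data_files = live_at(read_catalog("ducklake_data_file"), snapshot_id)
delete_files = live_at(read_catalog("ducklake_delete_file"), snapshot_id)

# Pair each data file with the delete file (if any) that applies to it.
scan_plan = (
    data_files.alias("d")
    .join(delete_files.alias("x"), F.col("d.data_file_id") == F.col("x.data_file_id"), "left")
    .select(F.col("d.path").alias("data_path"), F.col("x.path").alias("delete_path"))
)
```

Each row of `scan_plan` is then one unit of work: read `data_path` with Spark's Parquet reader and drop the rows listed in `delete_path` (see the position-based sketch further down the thread).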
-
While the question about rowids, and the intention to make them better, still stand, I have further thoughts on Spark and DuckLake. If we keep thinking about using Spark with DuckLake, we inevitably hit this dilemma: should we go the DeltaLake way (a DeltaTable object) or the Iceberg way (pure SQL)? Either path might lead us away from simplicity (the DuckLake way).

Why are we talking about Spark in the first place? Distributed processing (scale-out). So what about not reinventing the wheel and taking the best from examples that already exist?

Apache Hive: one of its unique features is the "execution engine" concept (Hive can run over MapReduce, Spark and Tez). It sounds strange, but Hive-on-Spark runs faster than Spark itself, at least on some queries (I showed this to students during my Hadoop course). What if we provide a "Spark execution engine" for DuckLake?

Spark Connect: communication between driver and executors in Spark is bidirectional and far too complicated. That is not the only reason for Spark Connect, but I consider it the major one. A Spark Connect app communicates with the driver via gRPC and Spark logical plans (which were part of Spark anyway): it sends a plan to the driver and receives back Arrow batches (to put it simply).

Let the DuckLake metadata server create a (detailed) execution plan, and let the execution engine of choice (e.g. Spark) execute it. This way we keep SQL and allow simple (I hope) addition of whatever execution engine is required.
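To make the "plan in, results out" idea a bit more concrete, here is a deliberately tiny sketch. The plan format is invented purely for illustration (nothing like it exists in DuckLake today), and the "execution engine" is just plain PySpark; delete-file handling is omitted for brevity:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ducklake-plan-exec-poc").getOrCreate()

# Hypothetical plan, as a metadata server might hand it to an execution engine:
# a list of file scans plus a projection and a filter (all paths are placeholders).
plan = {
    "scans": [
        {"data_path": "s3://bucket/t/data_0.parquet", "delete_path": None},
        {"data_path": "s3://bucket/t/data_1.parquet", "delete_path": "s3://bucket/t/del_1.parquet"},
    ],
    "projection": ["id", "amount"],
    "filter": "amount > 100",
}

def execute(spark, plan):
    """Turn the plan into a Spark job (delete files ignored here for brevity)."""
    parts = [spark.read.parquet(scan["data_path"]) for scan in plan["scans"]]
    df = parts[0]
    for part in parts[1:]:
        df = df.unionByName(part)
    return df.where(plan["filter"]).select(*plan["projection"])

result = execute(spark, plan)
```

The driver side of such an engine would then stream `result` back to the client as Arrow batches, which is essentially the Spark Connect contract described above.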
-
One can scale any of DuckLake's components (storage, compute, metadata); let us consider one dimension here: compute.
There might be situations where the data size is such that a distributed engine, simply by being distributed, is faster than the current DuckDB extension implementation. To be concrete, consider Apache Spark, a well-known and mature distributed engine.
Saying "working at scale" implies the word "well": working at scale well. In a big-data context, this "well" more or less means "well from a distributed-processing standpoint".
What do we have now in the case of Spark:
(I am a Spark person and can explain the statements above separately if necessary, AND I am a DuckDB/DuckLake fan; I wish to make them better.)
I love DuckLake's simplicity; Iceberg/Delta are way too complicated. It would be great to have a native way to work with DuckLake from Spark (if this is possible, it proves the simplicity; if it is possible and can be done efficiently, DuckLake scales!). Being Spark-aware, I tried to do a sort of POC (with PySpark).
After just a while I ran into this: what is ROWID (it seems undocumented), and how can I work with rowids at scale (i.e. from Spark)? I consider this a crucial point for DuckLake adoption.
In my "happy-path" starting example I have two data files and one delete file: how do I efficiently apply the delete file to the first data file with just Spark and the metadata?
I have some ideas, but before that I have this question: is there any documentation/description of rowids? The "row ID" concept in a distributed world is a bit complicated (generally speaking)...
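For the happy-path example above, here is a minimal sketch of one possible approach in PySpark. It assumes (an assumption about the on-disk layout, not something verified against the spec) that the delete file is a Parquet file whose `pos` column holds the 0-based positions of deleted rows within the data file; all paths are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import LongType, StructField, StructType

spark = SparkSession.builder.appName("ducklake-delete-poc").getOrCreate()

data_path = "s3://bucket/t/data_file_1.parquet"      # placeholder paths
delete_path = "s3://bucket/t/delete_file_1.parquet"

# Read the data file and attach each row's 0-based position within the file.
# zipWithIndex numbers rows in partition order, which for a single-file read
# follows the file's row order in practice; on Spark 3.5+ the Parquet reader's
# _metadata.row_index column could be used instead, if available.
data_df = spark.read.parquet(data_path)
with_pos_schema = StructType(data_df.schema.fields + [StructField("pos", LongType(), False)])
indexed = spark.createDataFrame(
    data_df.rdd.zipWithIndex().map(lambda ri: tuple(ri[0]) + (ri[1],)),
    with_pos_schema,
)

# Positions of deleted rows, assumed to live in a `pos` column of the delete file.
deleted_pos = spark.read.parquet(delete_path).select("pos")

# Keep only the rows whose position is not marked as deleted.
live_rows = indexed.join(deleted_pos, on="pos", how="left_anti").drop("pos")
```

Whether the join key should be a per-file position like this or a global rowid is exactly where documentation of the rowid concept would help.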