Writing to DuckLake from Spark via JDBC: each DataFrame row becomes a Parquet file #491
-
I am planning to work with DuckLake from Apache Spark. I got the initial ideas from here: https://motherduck.com/blog/spark-ducklake-getting-started. My first simple write attempt raised this question; here is the code snippet (I used get_jdbc_writer() from the blog post above):
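A rough equivalent of the write, assuming get_jdbc_writer() from the blog simply wraps Spark's standard JDBC DataFrameWriter (the connection URL and table name below are placeholders rather than the original values):

```python
# Not the original snippet -- a minimal PySpark sketch with placeholder connection details.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ducklake-jdbc-write").getOrCreate()

# spark.range(10) yields a single column named "id"; coalesce(1) forces one partition,
# so the resulting file count is not about partitioning.
range_df = spark.range(10).coalesce(1)

(range_df.write
    .format("jdbc")
    .option("driver", "org.duckdb.DuckDBDriver")
    .option("url", "jdbc:duckdb:/tmp/meta.duckdb")  # placeholder; see the blog for the DuckLake catalog setup
    .option("dbtable", "range_10")                  # placeholder table in the attached DuckLake catalog
    .mode("append")
    .save())
```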
It worked, but it produced 10 Parquet files in the target directory. I checked that this is not related to partitioning (rangeDf has a single partition); if I change the range() parameter to e.g. 4, I get 4 Parquet files. I also tried with a bigger DataFrame (several thousand rows) and the process never finished (too many files in the directory, I could not even list them). This is easy to reproduce; I initially hit it with older versions of DuckDB/DuckLake/JDBC as well. Is it a bug, or is it "by design"? Current versions: DuckDB 1.4.0, DuckLake latest, JDBC driver 1.4.0, Spark 3.5.1.
Replies: 1 comment · 14 replies
-
Hi! What is happening here is that Spark issues queries like `INSERT INTO range_10 ("id") VALUES (?)`, so each insert ends up as a separate Parquet file in DuckLake. Perhaps one of the following options can be used for writing to DuckLake from Spark:
For options 2 and 3, if you can use the Java API, then using the JDBC Appender to DuckDB + …
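To see where the files come from without Spark in the picture, here is a small sketch (placeholder paths and a local DuckDB-file catalog, not the setup from the question) that mimics the per-row prepared-statement executes:

```python
# Sketch only: the question goes through JDBC and the thread mentions a Postgres catalog,
# but the effect is the same whenever data inlining is not in play.
import glob
import duckdb

con = duckdb.connect()
con.execute("INSTALL ducklake; LOAD ducklake")
con.execute("ATTACH 'ducklake:/tmp/meta.ducklake' AS lake (DATA_PATH '/tmp/lake_data/')")
con.execute("CREATE TABLE lake.range_10 (id BIGINT)")

# Ten separate single-row INSERTs, like the per-row `INSERT ... VALUES (?)` Spark issues.
for i in range(10):
    con.execute("INSERT INTO lake.range_10 VALUES (?)", [i])

# Each INSERT commits as its own snapshot, so each one flushes its own data file.
print(len(glob.glob("/tmp/lake_data/**/*.parquet", recursive=True)))  # -> 10
```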
Batch inserts are supported by the JDBC driver; it is just that, coming from Spark's `dataframe.save()`, they result in the following sequence:

The problem is that, until Data Inlining is implemented for the Postgres catalog, DuckLake will write a Parquet file immediately on every `EXECUTE` call.

Besides `ducklake_add_data_files` and `COPY` mentioned above, this problem can also be avoided using:

But there seems to be no easy way to use this approach from Spark, so for now I think that d…
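For reference, a minimal sketch of the "stage as Parquet, then bulk-load" route mentioned above (the `ducklake_add_data_files` / `COPY` options), assuming a local DuckDB-file catalog and placeholder paths; the load here is done with a single `INSERT ... SELECT` over `read_parquet()` for simplicity:

```python
# Sketch only: paths, catalog name, and table name are placeholders.
# Step 1: write the DataFrame out as Parquet with Spark (no per-row JDBC round-trips).
# Step 2: load those files into DuckLake in a single bulk statement.
from pyspark.sql import SparkSession
import duckdb

spark = SparkSession.builder.appName("ducklake-bulk-load").getOrCreate()
spark.range(100_000).write.mode("overwrite").parquet("/tmp/staged_range")

con = duckdb.connect()
con.execute("INSTALL ducklake; LOAD ducklake")
con.execute("ATTACH 'ducklake:/tmp/meta.ducklake' AS lake (DATA_PATH '/tmp/lake_data/')")
con.execute("CREATE TABLE IF NOT EXISTS lake.range_bulk (id BIGINT)")

# One bulk insert -> a handful of data files instead of one per row.
con.execute(
    "INSERT INTO lake.range_bulk "
    "SELECT id FROM read_parquet('/tmp/staged_range/*.parquet')"
)
# The thread's COPY / ducklake_add_data_files options can instead register the staged
# files directly; check the DuckLake docs for the exact syntax in your version.
```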