Writing to DuckLake from Spark via JDBC: each DataFrame row becomes a Parquet file #491
-
I am planning to work with DuckLake from Apache Spark. I got the initial ideas from here: https://motherduck.com/blog/spark-ducklake-getting-started. My first simple write attempt raised this question; here is the code snippet (I used get_jdbc_writer() from the blog post above):
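A rough equivalent of the write, assuming get_jdbc_writer() from the blog simply wraps Spark's standard JDBC DataFrameWriter (the connection URL and table name below are placeholders rather than the original values):

```python
# Not the original snippet -- a minimal PySpark sketch with placeholder connection details.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ducklake-jdbc-write").getOrCreate()

# spark.range(10) yields a single column named "id"; coalesce(1) forces one partition,
# so the resulting file count is not about partitioning.
range_df = spark.range(10).coalesce(1)

(range_df.write
    .format("jdbc")
    .option("driver", "org.duckdb.DuckDBDriver")
    .option("url", "jdbc:duckdb:/tmp/meta.duckdb")  # placeholder; see the blog for the DuckLake catalog setup
    .option("dbtable", "range_10")                  # placeholder table in the attached DuckLake catalog
    .mode("append")
    .save())
```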
It worked, but it produced 10 Parquet files in the target directory. I checked that this is not related to partitioning (rangeDf has a single partition); if I change the range() parameter to e.g. 4, I get 4 Parquet files. I also tried with a bigger DataFrame (several thousand rows) and the process never finished (too many files in the directory, I could not even list them). This is easy to reproduce; I initially hit it with older versions of DuckDB/DuckLake/JDBC as well. Is it a bug, or is it "by design"? Current versions: DuckDB 1.4.0, DuckLake latest, JDBC driver 1.4.0, Spark 3.5.1.
Replies: 1 comment · 14 replies
-
Hi! What is happening here is that Spark issues queries like `INSERT INTO range_10 ("id") VALUES (?)`, so each insert ends up as a separate Parquet file in DuckLake. Perhaps one of the following options can be used for writing to DuckLake from Spark:
For options 2 and 3, if you can use the Java API, then using the JDBC Appender to DuckDB + …
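To see where the files come from without Spark in the picture, here is a small sketch (placeholder paths and a local DuckDB-file catalog, not the setup from the question) that mimics the per-row prepared-statement executes:

```python
# Sketch only: the question goes through JDBC and the thread mentions a Postgres catalog,
# but the effect is the same whenever data inlining is not in play.
import glob
import duckdb

con = duckdb.connect()
con.execute("INSTALL ducklake; LOAD ducklake")
con.execute("ATTACH 'ducklake:/tmp/meta.ducklake' AS lake (DATA_PATH '/tmp/lake_data/')")
con.execute("CREATE TABLE lake.range_10 (id BIGINT)")

# Ten separate single-row INSERTs, like the per-row `INSERT ... VALUES (?)` Spark issues.
for i in range(10):
    con.execute("INSERT INTO lake.range_10 VALUES (?)", [i])

# Each INSERT commits as its own snapshot, so each one flushes its own data file.
print(len(glob.glob("/tmp/lake_data/**/*.parquet", recursive=True)))  # -> 10
```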
Batch inserts are supported by the JDBC driver; it is just that, coming from Spark's `dataframe.save()`, they result in the following sequence:

The problem is that, until Data Inlining is implemented for the Postgres catalog, DuckLake will write a Parquet file immediately on every `EXECUTE` call.

Besides `ducklake_add_data_files` and `COPY` mentioned above, this problem can also be avoided using:

But there seems to be no easy way to use this approach from Spark, so for now I think that d…
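For reference, a minimal sketch of the "stage as Parquet, then bulk-load" route mentioned above (the `ducklake_add_data_files` / `COPY` options), assuming a local DuckDB-file catalog and placeholder paths; the load here is done with a single `INSERT ... SELECT` over `read_parquet()` for simplicity:

```python
# Sketch only: paths, catalog name, and table name are placeholders.
# Step 1: write the DataFrame out as Parquet with Spark (no per-row JDBC round-trips).
# Step 2: load those files into DuckLake in a single bulk statement.
from pyspark.sql import SparkSession
import duckdb

spark = SparkSession.builder.appName("ducklake-bulk-load").getOrCreate()
spark.range(100_000).write.mode("overwrite").parquet("/tmp/staged_range")

con = duckdb.connect()
con.execute("INSTALL ducklake; LOAD ducklake")
con.execute("ATTACH 'ducklake:/tmp/meta.ducklake' AS lake (DATA_PATH '/tmp/lake_data/')")
con.execute("CREATE TABLE IF NOT EXISTS lake.range_bulk (id BIGINT)")

# One bulk insert -> a handful of data files instead of one per row.
con.execute(
    "INSERT INTO lake.range_bulk "
    "SELECT id FROM read_parquet('/tmp/staged_range/*.parquet')"
)
# The thread's COPY / ducklake_add_data_files options can instead register the staged
# files directly; check the DuckLake docs for the exact syntax in your version.
```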