Creating DuckLake Parquet data outside of DuckDB? #309
javafanboy started this conversation in General
Replies: 2 comments
-
DuckLake provides a function that allows you to add external Parquet files not written by DuckDB: https://ducklake.select/docs/stable/duckdb/metadata/adding_files The limitation is that the function currently does not work with Hive-partitioned data files, where some of the information resides in the directory names. Support for adding external Hive-partitioned Parquet files will be added after DuckDB v1.4 is released: #252
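For illustration, a minimal sketch of registering an externally written file with the function from the linked docs page (catalog name, table name, and paths are placeholders; check the docs for the exact options):

```sql
-- Attach a DuckLake catalog (names and paths are illustrative).
INSTALL ducklake;
LOAD ducklake;
ATTACH 'ducklake:metadata.ducklake' AS my_lake (DATA_PATH 's3://my-bucket/lake/');

-- Register a Parquet file written outside of DuckDB so DuckLake
-- starts tracking it without rewriting the data.
CALL ducklake_add_data_files('my_lake', 'my_table', 's3://my-bucket/external/file1.parquet');
```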
-
Thanks for the info - I will come back to this after the new release then!
-
Today I have a solution where I use plain Java (i.e. no Spark etc., only Carpet) to create Parquet objects in streaming mode (sent to S3 via the multipart upload API as buffers are filled), forming fairly large objects (avoiding the need for compaction as a later stage) in a partitioned "directory structure". My partitioning is based not only on date/time but also on a few custom attributes. I then register these objects/partitions in the Glue data catalogue as they are added and access them using AWS Athena. I have also accessed the Parquet objects directly with DuckDB for "local analytics", bypassing the catalogue and only taking advantage of the known partitioning structure - all of this works quite well today.
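For context, the direct DuckDB access over the known partition layout looks roughly like this (bucket, prefix, and partition columns are placeholders, not my actual schema):

```sql
-- Query Hive-partitioned Parquet on S3 directly, without any catalogue
-- (bucket name, prefix, and partition columns are illustrative).
INSTALL httpfs;
LOAD httpfs;
CREATE SECRET (TYPE s3, PROVIDER credential_chain);

SELECT count(*)
FROM read_parquet('s3://my-bucket/events/*/*/*.parquet',
                  hive_partitioning = true)
WHERE event_date = DATE '2024-06-01'
  AND tenant = 'acme';
```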
Now I would like to try, as an alternative (or, until AWS supports DuckLake natively, a complement) to this, adding the same Parquet files to a DuckLake, allowing more efficient and convenient access from DuckDB on our local machines. I do not currently see much need to update the Parquet files once uploaded, but this may become a future requirement. To allow access from multiple machines I intend, as discussed in a previous thread, to try using AWS DSQL as the metadata server (otherwise falling back to AWS's regular hosted Postgres service, Aurora Serverless), but I am seeing some potential challenges with this that I would like some feedback on.
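For reference, a rough sketch of the shared setup I have in mind, with a Postgres-backed DuckLake catalog and S3 data path (all connection details, names, and paths are placeholders):

```sql
-- Attach a DuckLake whose metadata lives in a shared Postgres database
-- and whose data files live on S3 (hosts and names are illustrative).
INSTALL ducklake;
INSTALL postgres;
LOAD ducklake;

ATTACH 'ducklake:postgres:dbname=ducklake_catalog host=my-aurora-endpoint user=lake'
    AS shared_lake (DATA_PATH 's3://my-bucket/lake/');

-- Each analyst machine attaches the same catalog and sees the same tables.
SELECT * FROM shared_lake.my_table LIMIT 10;
```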
Thoughts on the above, or any other potential challenges anybody can think of with implementing the described workflow in DuckLake, are very welcome!
P.S. My code for creating the large Parquet objects makes use of a lot of custom features to keep the size of the data down (all attributes that are not used for sorting are dictionary encoded until I create the Parquet records from them, etc.), and as mentioned the Parquet data is streamed to S3 to avoid having to fit both the full source and its Parquet representation in RAM at once. If I used DuckDB to create the Parquet objects, I assume I would need to store the source and Parquet data on disk, as the in-memory representation is most likely more space-consuming and would not allow as large Parquet objects to be created if held in RAM.
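For comparison, if DuckDB were used to write the Parquet instead, the partitioned write would look roughly like this, with an on-disk database holding the source so it does not all have to sit in RAM (table, columns, and target path are placeholders):

```sql
-- Write Hive-partitioned Parquet from a disk-backed DuckDB database
-- (table, columns, and target path are illustrative).
INSTALL httpfs;
LOAD httpfs;
ATTACH 'staging.duckdb' AS staging;

COPY (SELECT * FROM staging.events)
TO 's3://my-bucket/events'
(FORMAT parquet, PARTITION_BY (event_date, tenant), COMPRESSION zstd);
```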