Creating DuckLake Parquet data outside of DuckDB? #309
javafanboy started this conversation in General
Replies: 2 comments
-
DuckLake provides a function that allows you to add external Parquet files not written by DuckDB: https://ducklake.select/docs/stable/duckdb/metadata/adding_files The limitation is that the function currently does not work with Hive-partitioned data files, where some of the information resides in the directory names. Support for adding external Hive-partitioned Parquet files will be added after DuckDB v1.4 is released: #252
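For illustration, a minimal sketch of registering an externally written file with the function from the linked docs page (catalog name, table name, and paths are placeholders; check the docs for the exact options):

```sql
-- Attach a DuckLake catalog (names and paths are illustrative).
INSTALL ducklake;
LOAD ducklake;
ATTACH 'ducklake:metadata.ducklake' AS my_lake (DATA_PATH 's3://my-bucket/lake/');

-- Register a Parquet file written outside of DuckDB so DuckLake
-- starts tracking it without rewriting the data.
CALL ducklake_add_data_files('my_lake', 'my_table', 's3://my-bucket/external/file1.parquet');
```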
-
Thanks for the info - I will come back to this after the new release then!
-
Today I have a solution where I use plain Java (i.e. no Spark etc., only Carpet) to create Parquet objects in streaming mode (sent to S3 via the multipart upload API as buffers are filled), forming fairly large objects (avoiding the need for compaction as a later stage) in a partitioned "directory structure". My partitioning is based not only on date/time but also on a few custom attributes. I then register these objects/partitions in the Glue data catalogue as they are added and access them using AWS Athena. I have also accessed the Parquet objects directly with DuckDB for "local analytics", bypassing the catalogue and only taking advantage of the known partitioning structure - all of this works quite well today.
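For context, the direct DuckDB access over the known partition layout looks roughly like this (bucket, prefix, and partition columns are placeholders, not my actual schema):

```sql
-- Query Hive-partitioned Parquet on S3 directly, without any catalogue
-- (bucket name, prefix, and partition columns are illustrative).
INSTALL httpfs;
LOAD httpfs;
CREATE SECRET (TYPE s3, PROVIDER credential_chain);

SELECT count(*)
FROM read_parquet('s3://my-bucket/events/*/*/*.parquet',
                  hive_partitioning = true)
WHERE event_date = DATE '2024-06-01'
  AND tenant = 'acme';
```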
Now I would like to try, as an alternative (or, until AWS supports DuckLake natively, a complement) to this, adding the same Parquet files to a DuckLake, allowing more efficient and convenient access from DuckDB on our local machines. I do not currently see much need to update the Parquet files once uploaded, but this may become a future requirement. To allow access from multiple machines I intend, as discussed in a previous thread, to try using AWS DSQL as the metadata server (otherwise falling back to AWS's regular hosted Postgres service, Aurora Serverless), but I am seeing some potential challenges with this that I would like some feedback on.
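For reference, a rough sketch of the shared setup I have in mind, with a Postgres-backed DuckLake catalog and S3 data path (all connection details, names, and paths are placeholders):

```sql
-- Attach a DuckLake whose metadata lives in a shared Postgres database
-- and whose data files live on S3 (hosts and names are illustrative).
INSTALL ducklake;
INSTALL postgres;
LOAD ducklake;

ATTACH 'ducklake:postgres:dbname=ducklake_catalog host=my-aurora-endpoint user=lake'
    AS shared_lake (DATA_PATH 's3://my-bucket/lake/');

-- Each analyst machine attaches the same catalog and sees the same tables.
SELECT * FROM shared_lake.my_table LIMIT 10;
```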
Thoughts on the above, or any other potential challenges anybody can think of with implementing the described workflow in DuckLake, are very welcome!
P.S. My code for creating the large Parquet objects makes use of a lot of custom features to keep the size of the data down (all attributes that are not used for sorting are dictionary encoded until I create the Parquet records from them, etc.), and as mentioned the Parquet data is streamed to S3 to avoid having to fit both the full source and its Parquet representation in RAM at once. If I used DuckDB to create the Parquet objects, I assume I would need to store the source and Parquet data on disk, as the in-memory representation is most likely more space-consuming and would not allow as large Parquet objects to be created if held in RAM.
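For comparison, if DuckDB were used to write the Parquet instead, the partitioned write would look roughly like this, with an on-disk database holding the source so it does not all have to sit in RAM (table, columns, and target path are placeholders):

```sql
-- Write Hive-partitioned Parquet from a disk-backed DuckDB database
-- (table, columns, and target path are illustrative).
INSTALL httpfs;
LOAD httpfs;
ATTACH 'staging.duckdb' AS staging;

COPY (SELECT * FROM staging.events)
TO 's3://my-bucket/events'
(FORMAT parquet, PARTITION_BY (event_date, tenant), COMPRESSION zstd);
```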