Replies: 2 comments
-
I think the main differentiator of DuckLake is keeping all the metadata in the database rather than in the file system. It's probably better to keep all this logic in DuckLake and offer it as a client library to other tools.
-
Sorry for taking forever to contemplate. I would argue that the need to use a metadata store in DuckLake (or Iceberg or Delta Lake, etc.) is an inconvenience -- it's necessary for a lot of purposes but kind of annoying. With a bucket full of Parquet files, all that's needed to query them is the ability to list the bucket and read the files -- no extra credentials, no port assignments, no special software, no coordination between readers and anyone else [0], etc. With a more complex system, you may need credentials to read the metadata (if that metadata isn't in the same bucket), you need the ability to connect to the metadata system (which means wiring up access to a database port if the metadata is stored in a DBMS), and you need at least some amount of configuration (at the very least, you need to know that it's DuckLake and not, say, Iceberg).

So if I have a data lake and I want to give some analyst or AI tool or employee query access, that should in principle have no effect whatsoever on writers (readers and writers generally don't have to coordinate). But in DuckLake, I would need to get the MySQL/Postgres DBA to create credentials and open ports or set up VPN access, and that's a lot more work than just creating an object storage access rule for them or handing them a hard disk. (And I wouldn't need to do this in Iceberg.)

And I suspect that column min/max is, by a rather large margin, the most important statistic for getting decent query performance in many use cases. So I think it would be a neat feature for column min/max to be exposed in an efficiently readable manner in the actual underlying Parquet data, so that query tools (and maybe eventually not just ones from MotherDuck) could find them and optimize queries accordingly.

[0] Okay, if you want to read data that isn't the most recently written/committed, you need to speak the transaction language.
-
Hi everyone-
I mentioned something related to this on HN and it got some interest, so I'll propose it for real-ish here:
I have a personal pet peeve about buckets full of Parquet files: almost every tool supports Hive partitioning (DuckDB is even the top Google hit right now for the phrase "hive partitioning"!). But Hive partitioning is rather weak for a lot of applications, including timestamped data. Every data lake layer on top of Parquet supports something better, but those layers are generally incompatible with each other, and their optimizations are lost if you try to query the underlying Parquet data without consuming the data lake metadata.
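To illustrate the timestamp weakness: with Hive partitioning you can only partition on a coarse derived column (say, a date), so pruning happens at that granularity, and a filter on the timestamp itself doesn't cut down the file list at all. A hedged sketch, assuming a hypothetical layout and using DuckDB's actual read_parquet / hive_partitioning options:

```python
# Sketch only -- the paths are hypothetical. Assume a writer produced
#   lake/events/date=2024-01-01/part-0.parquet, date=2024-01-02/..., etc.
import duckdb

# DuckDB can prune whole date=... directories, but only when the query
# filters on the partition column "date". A filter on the finer-grained
# "ts" column alone prunes nothing, because nothing in the path bounds ts.
duckdb.sql("""
    SELECT count(*)
    FROM read_parquet('lake/events/*/*.parquet', hive_partitioning = true)
    WHERE ts BETWEEN TIMESTAMP '2024-01-01 09:00:00'
                 AND TIMESTAMP '2024-01-01 10:00:00'
""")
```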
So here's my strawman suggestion: maybe some enlightened warehouse/lake/lakehouse/whatever-you-call-it, e.g. DuckLake, could offer an extension to put column min/max values into filenames. Think Hive partitioning, but better. As a probably terrible concrete proposal: right now a Hive-partitioned table can have a directory called "key=1", and the column "key" is omitted from the Parquet files. A better-than-Hive ranged table would have a directory called "key_range=1...10", the column "key" would be present in the Parquet files, and query tools that support this would infer that the min is at least 1 and the max is at most 10.
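The parsing on the reader side would be trivial. A minimal sketch, assuming the made-up "<column>_range=<min>...<max>" directory convention above (infer_bounds is a hypothetical helper name, not an existing API):

```python
# A minimal sketch of the reader side, assuming the made-up
# "<column>_range=<min>...<max>" directory convention proposed above.
# Nothing here is an existing DuckLake or Hive feature.
import re
from pathlib import PurePosixPath

RANGE_SEGMENT = re.compile(r"^(?P<col>\w+)_range=(?P<lo>-?\d+)\.\.\.(?P<hi>-?\d+)$")

def infer_bounds(path: str) -> dict[str, tuple[int, int]]:
    """Return {column: (min, max)} hints parsed from the path segments."""
    bounds = {}
    for segment in PurePosixPath(path).parts:
        m = RANGE_SEGMENT.match(segment)
        if m:
            bounds[m.group("col")] = (int(m.group("lo")), int(m.group("hi")))
    return bounds

print(infer_bounds("lake/events/key_range=1...10/part-0.parquet"))
# -> {'key': (1, 10)}
```

A real convention would need to handle escaping, non-integer types, open-ended ranges, and so on, but the point stands that the parse is cheap and needs no metadata connection.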
DuckLake could perhaps support this as an option on a table (or a schema? or everywhere?) that gives the Parquet files path names which include the data -- which is just the relevant values from ducklake_file_column_statistics. DuckDB could then maybe learn how to read this without even needing to connect to the DuckLake metadata. And if this got popular, maybe someone could teach polars.scan_parquet to do the same thing. Then people could use DuckLake on the write side to write time-series data and query it painlessly (not quite optimally, but a whole lot better than today) from polars or other tools without needing to touch the database.
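The read-side payoff would then look something like this -- a few lines that prune files before polars.scan_parquet ever opens them, reusing the hypothetical infer_bounds helper from the sketch above:

```python
# Sketch only: the file layout is hypothetical, and infer_bounds comes
# from the sketch above. Only files whose path-encoded range can contain
# the predicate value are handed to polars at all.
import polars as pl

files = [
    "lake/events/key_range=1...10/part-0.parquet",
    "lake/events/key_range=11...20/part-0.parquet",
    "lake/events/key_range=21...30/part-0.parquet",
]

def might_contain(path: str, col: str, value: int) -> bool:
    """Can this file possibly hold rows where col == value?"""
    lo, hi = infer_bounds(path).get(col, (float("-inf"), float("inf")))
    return lo <= value <= hi

# For the predicate key == 15, only the second file survives pruning,
# so polars never even opens the other two.
candidates = [f for f in files if might_contain(f, "key", 15)]
lazy = pl.scan_parquet(candidates).filter(pl.col("key") == 15)
```

A real implementation would also push the predicate into the Parquet row groups, but even this file-level pruning is arguably the bulk of the win for time-series scans.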
Is this a good idea? Would DuckLake be at all interested in such an extension?