-
DuckLake by itself is not a cloud service. But one could, for example, host it on RDS Postgres and have different EC2 query worker machines access that service using the DuckDB Postgres catalog, servicing different users concurrently. These EC2 DuckDB workers could be provided in many ways, and one of those ways could be MotherDuck. That covers bullet point 1.

As you mention, the design of DuckLake (a metadata database) does not at all rule out executing queries over the data lake in a scale-out system. For instance, BigQuery or Snowflake could adopt DuckLake. Neither DuckDB nor MotherDuck provides scale-out execution out of the box (they don't even want to), but other systems could, and one could even use DuckDB as a component to construct a scale-out system. So indeed, scale-out execution on DuckLakes is not available currently.

However, when using systems less efficient than DuckDB, the need for scale-out may appear larger than it is in reality, or in fact be almost completely absent, as discussed in the context of the AWS Redshift workload analysis here: https://motherduck.com/blog/redshift-files-hunt-for-big-data/
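To make the setup above concrete, here is a minimal sketch of what each EC2 worker would run, using DuckDB's `ducklake` and `postgres` extensions. The RDS endpoint, database name, credentials, bucket, and table name are all placeholders, not values from this thread:

```sql
-- On each EC2 query worker (DuckDB CLI or embedded):
INSTALL ducklake;
INSTALL postgres;

-- Hypothetical RDS endpoint and S3 bucket; substitute your own.
ATTACH 'ducklake:postgres:dbname=ducklake_catalog host=mydb.example.us-east-1.rds.amazonaws.com user=duck'
    AS lake (DATA_PATH 's3://my-bucket/ducklake/');
USE lake;

-- Workers on different machines share the same lake; Postgres
-- transactions on the metadata keep concurrent access consistent.
SELECT count(*) FROM my_table;
```

Since the catalog lives in Postgres rather than in any single worker, adding concurrency is just a matter of pointing more DuckDB instances at the same connection string.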
-
Wouldn't using the DuckDB Wasm app running in wasmtime on K8s be a perfect fit to scale point one?

Edit: if this looks interesting, this video from wasmCloud gives a good overview of what else is possible with Wasm, e.g. working with components: https://www.youtube.com/watch?v=PexRsU8sVIs

Edit 2: hmm, using wasmCloud to manage the Wasm apps, this might be possible even more simply, without using K8s.
-
First off, everything here seems excellent, but after reading through all the available docs, I can't help but notice there's a bit of a gap when it comes to distributed compute node deployment and coordination.
What I'm referring to here is the situation where you have a relatively large lake (say, backed by S3), an external metadata store (think postgres just as an example), and the need for query compute which is beyond what a single node can service. There are three relevant scenarios that I'm thinking about here:
- Many concurrent users running independent queries against the lake
- Individual queries whose compute requirements exceed a single node
- How `GROUP BY` (and similar) operations work under such circumstances

The first one can be handled with a simple load balancer, but the second two, I believe, require a higher-level distributed query planner. While the DuckLake design is manifestly compatible with this, I don't see any evidence that it exists. Am I missing something? Secondarily, has there been any work on standard practices for the first bullet?