-
DuckLake by itself is not a cloud service. But one could, for example, host it on RDS Postgres and have different EC2 query worker machines access that service using the DuckDB Postgres catalog, servicing different users concurrently. These EC2 DuckDB workers could be provided in many ways, and one of those ways could be MotherDuck. That covers bullet point 1.

As you mention, the design of DuckLake (a metadata database) does not at all rule out executing queries over the data lake in a scale-out system. For instance, BigQuery or Snowflake could adopt DuckLake. Neither DuckDB nor MotherDuck provides scale-out execution out of the box (they don't even want to), but other systems could, and one could even use DuckDB as a component to construct a scale-out system. So indeed, scale-out execution on DuckLakes is not available currently.

However, when using systems less efficient than DuckDB, the need for scale-out may appear larger than it is in reality, or in fact be almost completely absent, as discussed in the context of the AWS Redshift workload analysis here: https://motherduck.com/blog/redshift-files-hunt-for-big-data/
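To make the setup above concrete, here is a minimal sketch of what each EC2 worker would run, using DuckDB's `ducklake` and `postgres` extensions. The RDS endpoint, database name, credentials, bucket, and table name are all placeholders, not values from this thread:

```sql
-- On each EC2 query worker (DuckDB CLI or embedded):
INSTALL ducklake;
INSTALL postgres;

-- Hypothetical RDS endpoint and S3 bucket; substitute your own.
ATTACH 'ducklake:postgres:dbname=ducklake_catalog host=mydb.example.us-east-1.rds.amazonaws.com user=duck'
    AS lake (DATA_PATH 's3://my-bucket/ducklake/');
USE lake;

-- Workers on different machines share the same lake; Postgres
-- transactions on the metadata keep concurrent access consistent.
SELECT count(*) FROM my_table;
```

Since the catalog lives in Postgres rather than in any single worker, adding concurrency is just a matter of pointing more DuckDB instances at the same connection string.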
-
Wouldn't using the DuckDB Wasm app running in wasmtime on K8s be a perfect fit to scale point one?

Edit: if this looks interesting, this video from wasmCloud gives a good overview of what else is possible with Wasm, e.g. working with components: https://www.youtube.com/watch?v=PexRsU8sVIs

Edit 2: hmm, using wasmCloud to manage the Wasm apps, this might be possible even more simply, without using K8s.
-
First off, everything here seems excellent, but after reading through all the available docs, I can't help but notice there's a bit of a gap when it comes to distributed compute node deployment and coordination.
What I'm referring to here is the situation where you have a relatively large lake (say, backed by S3), an external metadata store (think postgres just as an example), and the need for query compute which is beyond what a single node can service. There are three relevant scenarios that I'm thinking about here:
- Many concurrent users running independent queries against the lake
- Individual queries whose compute requirements exceed a single node
- How `GROUP BY` (and similar) operations work under such circumstances

The first one can be handled with a simple load balancer, but the second two, I believe, require a higher-level distributed query planner. While the DuckLake design is manifestly compatible with this, I don't see any evidence that it exists. Am I missing something? Secondarily, has there been any work on standard practices for the first bullet?