DuckDB + Ducklake as main DB #394
Replies: 2 comments
-
I looked into using DuckLake, but I found that it's simpler to insert data into a primary data store like DuckDB/Postgres and then periodically shift older data to S3. The tooling around this is not ideal, but it's the fastest and most flexible option.
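The tiering approach described above can be sketched in DuckDB SQL. The table name (`events`), bucket path, and 90-day cutoff are all hypothetical; the `httpfs` extension and configured S3 credentials are assumed:

```sql
-- Assumes the httpfs extension and S3 credentials are already configured.
INSTALL httpfs;
LOAD httpfs;

-- Export rows older than 90 days to Parquet on S3 (names are placeholders).
COPY (
    SELECT * FROM events
    WHERE created_at < now() - INTERVAL 90 DAY
) TO 's3://my-bucket/archive/events_batch.parquet' (FORMAT parquet);

-- Then drop the archived rows from the primary store.
DELETE FROM events
WHERE created_at < now() - INTERVAL 90 DAY;
```

Running this on a schedule (cron, Airflow, etc.) keeps the hot store small while older data stays queryable from S3 via `read_parquet`.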
-
I think DuckLake was born to be a lightweight lakehouse. You switched from MongoDB to DuckLake + DuckDB, which means you switched from a data warehouse architecture to a lakehouse architecture. In my opinion, if your source data comes from only one or two producers (your application or something similar), you should keep the data warehouse architecture with Mongo or Postgres and export data to S3 on a schedule for analytics with DuckDB. Make sure you understand that DuckLake is not just an extension; it's more than that.
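The scheduled-export pattern suggested here can be sketched with DuckDB's `postgres` extension. The connection string, table, and bucket names are placeholders:

```sql
-- Attach the operational Postgres database read-only via the postgres extension.
INSTALL postgres;
LOAD postgres;
ATTACH 'dbname=app host=db.internal user=reader' AS pg (TYPE postgres, READ_ONLY);

-- Export yesterday's data to S3 as Parquet for analytics (hypothetical names).
COPY (
    SELECT * FROM pg.public.orders
    WHERE order_date = current_date - 1
) TO 's3://analytics-bucket/orders/daily.parquet' (FORMAT parquet);
```

The warehouse stays the source of truth; DuckDB only reads from it and writes analytic copies to S3.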
-
We’re currently exploring a fairly radical shift in our backend architecture, and I’d love to get some feedback.
Our current system is based on MongoDB combined with Atlas Search. We're considering replacing it entirely with DuckDB + DuckLake, working directly on Parquet files stored in S3, without any additional database layer.
• Users can update data via the UI, which we plan to support using inline updates (DuckDB writes).
• Analytical jobs that update millions of records currently take hours – with DuckDB, we’ve seen they could take just minutes.
• All data is stored in columnar format and compressed, which significantly reduces both cost and latency for analytic workloads.
To support DuckLake, we'll be using PostgreSQL as the catalog backend, while the actual data remains in S3.
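A minimal sketch of that setup, assuming the `ducklake` extension; the catalog connection details, bucket path, table, and column names are all hypothetical:

```sql
INSTALL ducklake;
INSTALL postgres;

-- Postgres holds the DuckLake catalog; Parquet data files live in S3.
ATTACH 'ducklake:postgres:dbname=lake_catalog host=catalog.internal' AS lake
    (DATA_PATH 's3://my-bucket/lake/');
USE lake;

-- Inline updates go through normal DML; DuckLake tracks the
-- resulting data files through the catalog.
UPDATE records SET status = 'done' WHERE id = 42;
```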
The only real pain point we’re struggling with is retrieving a record by ID efficiently, which is trivial in MongoDB.
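One common mitigation for point lookups on Parquet (not specific to DuckLake) is to keep files sorted or partitioned on the lookup key, so per-file min/max statistics let the engine skip most files. A sketch with hypothetical table names:

```sql
-- Writing data ordered by id keeps each Parquet file's id range narrow,
-- so file-level min/max statistics can prune most files on a point lookup.
INSERT INTO lake.records SELECT * FROM staging ORDER BY id;

-- This scan can skip every file whose id range cannot contain the key.
SELECT * FROM lake.records WHERE id = 123456;
```

It will still be slower than a MongoDB `_id` lookup, since at least one object fetch from S3 is needed, but pruning keeps it to roughly one file read instead of a full scan.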
So here’s my question: Does it sound completely unreasonable to build a production-grade system that relies solely on Ducklake (on S3) as the primary datastore, assuming we handle write scenarios via inline updates and optimize access patterns?
Would love to hear from others who have tried something similar, or any thoughts on potential pitfalls.