Replies: 7 comments
-
Hey JelteF, I've been involved in the pg_duckdb community for about six months, and really appreciate all the help along the way. Our main workload reads Parquet files from S3, performs ETL (joins, aggregations), and writes results to Postgres heap tables for ad hoc queries. We rely on:
To reduce S3 latency, we also implemented a cache layer on top of httpfs, leveraging our existing cache infrastructure. Initially, we loaded Parquet files into Postgres for updates and ran ETL on heap tables with DuckDB, but scan performance lagged behind. We also tried pg_mooncake v0.1 (Delta Lake), but it's being re-architected and v0.1 is no longer maintained. Now the data sits in S3 and we query it much like a foreign table. Our pipeline works for daily needs, but we face:
To address these:
In addition to our internal use, we plan to make
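For illustration, a rough sketch of what this kind of pipeline can look like with pg_duckdb (the bucket, file pattern, and column names are invented, S3 credentials are assumed to be configured separately, and the `r['col']` column-access syntax depends on the pg_duckdb version):

```sql
-- Read raw Parquet from S3 with DuckDB, aggregate, and materialize the result
-- into a plain Postgres heap table for ad hoc queries.
-- The path and column names below are placeholders.
CREATE TABLE daily_revenue AS
SELECT
    r['order_date']  AS order_date,
    r['customer_id'] AS customer_id,
    sum(r['amount']) AS revenue
FROM read_parquet('s3://example-bucket/events/*.parquet') r
GROUP BY 1, 2;

-- The heap table is then indexed and queried like any other Postgres table.
CREATE INDEX ON daily_revenue (order_date);
```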
-
Hi @JelteF, we want to use it to query Iceberg tables through Postgres. We have a lot of users already using Postgres and don't want to migrate to another engine just to use the Iceberg tables. Therefore, we're really interested in at least querying the Iceberg tables.
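For anyone evaluating the same thing, a minimal sketch of what this looks like in pg_duckdb (the S3 path and columns are placeholders; `iceberg_scan` becomes available after installing DuckDB's iceberg extension, and column-access syntax varies with the pg_duckdb version):

```sql
-- One-time: install DuckDB's iceberg extension through pg_duckdb.
SELECT duckdb.install_extension('iceberg');

-- Query an Iceberg table in S3 from a regular Postgres session.
SELECT r['event_date'] AS event_date, count(*) AS events
FROM iceberg_scan('s3://example-bucket/warehouse/events') r
GROUP BY 1
ORDER BY 1;
```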
-
Hi, we are also using it for accessing Iceberg on S3 via Postgres. We use Postgres views to restrict access so the user doesn't need direct access to the S3 bucket, and we can also apply restrictions on what data they can view, for example only the last month's data or a particular category. We use it with postgres_fdw, so we loop the view back into the same Postgres database, which then allows us to join it to a normal Postgres table and import data back into a normal Postgres table, or just treat the Iceberg table as a normal Postgres table with the Postgres engine. You can also update the view to push queries down to the DuckDB engine. We want to look at caching the HTTPS layer to speed things up, but it works fairly well even without it. In our setup Postgres acts as a buffer to collect changes that are then merged out to Iceberg periodically. With the loopback postgres_fdw it's possible to join the Postgres data with the Iceberg data to get a single, current view of the table in Postgres without keeping all the records in Postgres. It also lets you update records, with rules on the view capturing the changes for the merge and the join. A sketch of the pattern is below.
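A rough sketch of the view-plus-loopback pattern described above, under the assumption that all object names, the S3 path, and the one-month filter are illustrative (the postgres_fdw part is plain Postgres and works the same with or without pg_duckdb):

```sql
-- A view over the Iceberg table that hides the S3 location and restricts
-- which rows a reporting role can see.
CREATE VIEW recent_events AS
SELECT r['event_time'] AS event_time,
       r['category']   AS category,
       r['amount']     AS amount
FROM iceberg_scan('s3://example-bucket/warehouse/events') r
WHERE r['event_time'] >= now() - interval '1 month';

GRANT SELECT ON recent_events TO reporting_role;

-- Loop the view back into the same database with postgres_fdw so it can be
-- joined with ordinary heap tables and used in INSERT ... SELECT.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;
CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'localhost', dbname 'mydb');
CREATE USER MAPPING FOR CURRENT_USER SERVER loopback
    OPTIONS (user 'reporting_role', password 'secret');
CREATE FOREIGN TABLE recent_events_ft (
    event_time timestamptz,
    category   text,
    amount     numeric
) SERVER loopback OPTIONS (schema_name 'public', table_name 'recent_events');

-- Join the Iceberg data (via the loopback) with a normal Postgres table.
SELECT c.name, sum(e.amount)
FROM recent_events_ft e
JOIN categories c ON c.name = e.category
GROUP BY c.name;
```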
-
We're looking to try and replace the now defunct . The general idea though is that we get the benefit of using Postgres (lots of support across different applications and platforms for connecting) but lose the requirement to dump and load data into Postgres, since we can push queries down to DuckDB over Parquet and Iceberg tables in S3. Effectively it makes Postgres act as just a query engine instead of a data store, which is especially important as we have some fairly large datasets.
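For context, a hedged sketch of what that pushdown looks like in practice (path and column names are placeholders; queries that reference DuckDB functions run in DuckDB automatically, while `duckdb.force_execution` extends that to queries over regular heap tables):

```sql
-- Queries that use read_parquet()/iceberg_scan() are executed by DuckDB;
-- Postgres acts purely as the protocol endpoint, not the data store.
SELECT r['region'] AS region, avg(r['latency_ms']) AS avg_latency
FROM read_parquet('s3://example-bucket/metrics/*.parquet') r
GROUP BY 1;

-- Optionally force DuckDB execution for queries over ordinary heap tables too.
SET duckdb.force_execution = true;
```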
-
I am looking to use pg_duckdb for a couple of things in a large prod environment. Many databases have huge archive partitions that are rarely used but make database backups extremely difficult due to their sheer size. Moving these to S3/Parquet, I have seen an 80 percent reduction in size. In addition, many of the queries that run against the Parquet files via pg_duckdb see a 90 percent reduction in runtime. A few challenges I see
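For illustration, a hedged sketch of that archive-offload pattern (partition, bucket, and column names are invented; credentials and exact COPY options may vary by pg_duckdb version):

```sql
-- Export a rarely used archive partition to Parquet in S3 via DuckDB.
COPY orders_2019 TO 's3://example-archive/orders/orders_2019.parquet';

-- Once verified, detach and drop the partition so it no longer bloats backups.
ALTER TABLE orders DETACH PARTITION orders_2019;
DROP TABLE orders_2019;

-- The archived data stays queryable through pg_duckdb when needed.
SELECT count(*), min(r['order_date']), max(r['order_date'])
FROM read_parquet('s3://example-archive/orders/orders_2019.parquet') r;
```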
-
Hi there, At Postgres Professional (https://postgrespro.com/), we're developing a PostgreSQL extension that implements a Parquet file catalog compatible with the DuckLake specification. This extension leverages a customized version of pg_duckdb to integrate the DuckDB analytics engine, with the catalog itself being hostable remotely.

Our primary goal is to create a seamless environment for building data warehouses within the PostgreSQL ecosystem. This approach allows users to leverage PostgreSQL's strengths for both OLTP and performant OLAP workloads, all within a single, unified system. The extension supports creating analytical tables, ingesting data from Parquet files, and copying data directly from existing PostgreSQL tables. These analytical tables are then accessible as standard PostgreSQL views, making them easy to query.

Along the way, we've tackled several interesting challenges, such as implementing granular access control for analytical tables and ensuring the catalog and underlying storage remain consistent, even in the event of failures. The DuckDB engine is already integrated into the Postgres Pro Enterprise distribution via our customized pg_duckdb extension, and the DuckLake catalog extension will be released soon; see https://postgrespro.com/blog/pgsql/5972234
-
Hi there, Heavy user of DuckDB here, looking at pg_duckdb as a way to reduce copying from local DuckDB into databases. I am the ETL, BI and Python developer. The setup: So yes, there is a lot of copying data around because DuckDB is not accessible over the network.
-
With an open source project it's always hard to know how people are using it. If you're using pg_duckdb in production, could you share here a little bit on how? Also, if you're currently evaluating pg_duckdb for some use case, can you share that? What are the use cases you're using it for, and how well does it work for you?
(Tagging some people that have been active on the repo in the hope that they respond @YuweiXiao @askyx @ggnmstr @saygoodbyye @chestnutsj @sysadminmike @wasd171)