Replies: 7 comments
-
Hey JelteF, I've been involved in the pg_duckdb community for about six months, and really appreciate all the help along the way. Our main workload reads Parquet files from S3, performs ETL (joins, aggregations), and writes results to Postgres heap tables for ad hoc queries. We rely on:
To reduce S3 latency, we also implemented a cache layer on top of httpfs, leveraging our existing cache infrastructure. Initially, we loaded Parquet files into Postgres for updates and ran ETL on heap tables with DuckDB, but scan performance lagged behind. We also tried pg_mooncake v0.1 (Delta Lake), but it's being re-architected and v0.1 is no longer maintained. Now the data sits in S3 and we query it much like a foreign table. Our pipeline works for daily needs, but we face:
To address these:
In addition to our internal use, we plan to make
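For illustration, a rough sketch of what this kind of pipeline can look like with pg_duckdb (the bucket, file pattern, and column names are invented, S3 credentials are assumed to be configured separately, and the `r['col']` column-access syntax depends on the pg_duckdb version):

```sql
-- Read raw Parquet from S3 with DuckDB, aggregate, and materialize the result
-- into a plain Postgres heap table for ad hoc queries.
-- The path and column names below are placeholders.
CREATE TABLE daily_revenue AS
SELECT
    r['order_date']  AS order_date,
    r['customer_id'] AS customer_id,
    sum(r['amount']) AS revenue
FROM read_parquet('s3://example-bucket/events/*.parquet') r
GROUP BY 1, 2;

-- The heap table is then indexed and queried like any other Postgres table.
CREATE INDEX ON daily_revenue (order_date);
```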
-
Hi @JelteF, we want to use it to query Iceberg tables through Postgres. We have a lot of users already using Postgres and don't want to migrate to another engine just to use the Iceberg tables. Therefore, we're really interested in at least querying the Iceberg tables.
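For anyone evaluating the same thing, a minimal sketch of what this looks like in pg_duckdb (the S3 path and columns are placeholders; `iceberg_scan` becomes available after installing DuckDB's iceberg extension, and column-access syntax varies with the pg_duckdb version):

```sql
-- One-time: install DuckDB's iceberg extension through pg_duckdb.
SELECT duckdb.install_extension('iceberg');

-- Query an Iceberg table in S3 from a regular Postgres session.
SELECT r['event_date'] AS event_date, count(*) AS events
FROM iceberg_scan('s3://example-bucket/warehouse/events') r
GROUP BY 1
ORDER BY 1;
```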
-
Hi, we are also using it for accessing Iceberg on S3 via Postgres. We use Postgres views to restrict access so the user doesn't need direct access to the S3 bucket, and we can also apply restrictions on what data they can view, for example only the last month's data or a particular category. We use it with postgres_fdw, so we loop the view back into the same Postgres database, which then allows us to join it to a normal Postgres table and import data back into a normal Postgres table, or just treat the Iceberg table as a normal Postgres table with the Postgres engine. You can also update the view to push queries down to the DuckDB engine. We want to look at caching the HTTPS layer to speed things up, but it works fairly well even without it. In our setup Postgres acts as a buffer to collect changes that are then merged out to Iceberg periodically. With the loopback postgres_fdw it's possible to join the Postgres data with the Iceberg data to get a single, current view of the table in Postgres without keeping all the records in Postgres. It also lets you update records, with rules on the view capturing the changes for the merge and the join. A sketch of the pattern is below.
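A rough sketch of the view-plus-loopback pattern described above, under the assumption that all object names, the S3 path, and the one-month filter are illustrative (the postgres_fdw part is plain Postgres and works the same with or without pg_duckdb):

```sql
-- A view over the Iceberg table that hides the S3 location and restricts
-- which rows a reporting role can see.
CREATE VIEW recent_events AS
SELECT r['event_time'] AS event_time,
       r['category']   AS category,
       r['amount']     AS amount
FROM iceberg_scan('s3://example-bucket/warehouse/events') r
WHERE r['event_time'] >= now() - interval '1 month';

GRANT SELECT ON recent_events TO reporting_role;

-- Loop the view back into the same database with postgres_fdw so it can be
-- joined with ordinary heap tables and used in INSERT ... SELECT.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;
CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'localhost', dbname 'mydb');
CREATE USER MAPPING FOR CURRENT_USER SERVER loopback
    OPTIONS (user 'reporting_role', password 'secret');
CREATE FOREIGN TABLE recent_events_ft (
    event_time timestamptz,
    category   text,
    amount     numeric
) SERVER loopback OPTIONS (schema_name 'public', table_name 'recent_events');

-- Join the Iceberg data (via the loopback) with a normal Postgres table.
SELECT c.name, sum(e.amount)
FROM recent_events_ft e
JOIN categories c ON c.name = e.category
GROUP BY c.name;
```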
-
We're looking to try and replace the now defunct . The general idea though is that we get the benefit of using Postgres (lots of support across different applications and platforms for connecting) but lose the requirement to dump and load data into Postgres, since we can push queries down to DuckDB over Parquet and Iceberg tables in S3. Effectively it makes Postgres act as just a query engine instead of a data store, which is especially important as we have some fairly large datasets.
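For context, a hedged sketch of what that pushdown looks like in practice (path and column names are placeholders; queries that reference DuckDB functions run in DuckDB automatically, while `duckdb.force_execution` extends that to queries over regular heap tables):

```sql
-- Queries that use read_parquet()/iceberg_scan() are executed by DuckDB;
-- Postgres acts purely as the protocol endpoint, not the data store.
SELECT r['region'] AS region, avg(r['latency_ms']) AS avg_latency
FROM read_parquet('s3://example-bucket/metrics/*.parquet') r
GROUP BY 1;

-- Optionally force DuckDB execution for queries over ordinary heap tables too.
SET duckdb.force_execution = true;
```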
-
I am looking to use pg_duckdb for a couple of things in a large prod environment. Many databases have huge archive partitions that are rarely used but make database backups extremely difficult due to their sheer size. Moving these to S3/Parquet, I have seen an 80 percent reduction in size. In addition, many of the queries that run against the Parquet files via pg_duckdb see a 90 percent reduction in runtime. A few challenges I see
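For illustration, a hedged sketch of that archive-offload pattern (partition, bucket, and column names are invented; credentials and exact COPY options may vary by pg_duckdb version):

```sql
-- Export a rarely used archive partition to Parquet in S3 via DuckDB.
COPY orders_2019 TO 's3://example-archive/orders/orders_2019.parquet';

-- Once verified, detach and drop the partition so it no longer bloats backups.
ALTER TABLE orders DETACH PARTITION orders_2019;
DROP TABLE orders_2019;

-- The archived data stays queryable through pg_duckdb when needed.
SELECT count(*), min(r['order_date']), max(r['order_date'])
FROM read_parquet('s3://example-archive/orders/orders_2019.parquet') r;
```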
-
Hi there, At Postgres Professional (https://postgrespro.com/), we're developing a PostgreSQL extension that implements a Parquet file catalog compatible with the DuckLake specification. This extension leverages a customized version of pg_duckdb to integrate the DuckDB analytics engine, with the catalog itself being hostable remotely.

Our primary goal is to create a seamless environment for building data warehouses within the PostgreSQL ecosystem. This approach allows users to leverage PostgreSQL's strengths for both OLTP and performant OLAP workloads, all within a single, unified system. The extension supports creating analytical tables, ingesting data from Parquet files, and copying data directly from existing PostgreSQL tables. These analytical tables are then accessible as standard PostgreSQL views, making them easy to query.

Along the way, we've tackled several interesting challenges, such as implementing granular access control for analytical tables and ensuring the catalog and underlying storage remain consistent, even in the event of failures. The DuckDB engine is already integrated into the Postgres Pro Enterprise distribution via our customized pg_duckdb extension, and the DuckLake catalog extension will be released soon; see https://postgrespro.com/blog/pgsql/5972234
-
Hi there, Heavy user of DuckDB here, looking at pg_duckdb as a way to reduce copying from local DuckDB into databases. I am the ETL, BI and Python developer. The setup: So yes, there is a lot of copying data around because DuckDB is not accessible over the network.
-
With an open source project it's always hard to know how people are using it. If you're using pg_duckdb in production, could you share here a little bit on how? Also, if you're currently evaluating pg_duckdb for some use case, can you share that? What are the use cases you're using it for, and how well does it work for you?
(Tagging some people that have been active on the repo in the hope that they respond @YuweiXiao @askyx @ggnmstr @saygoodbyye @chestnutsj @sysadminmike @wasd171)