
Fix memory leak for JSON/LIST type during Postgres table scan #784


Merged

JelteF merged 7 commits into duckdb:main on May 16, 2025

Conversation

YuweiXiao
Contributor

Fixes #750

@YuweiXiao
Contributor Author

Hey @JelteF, to establish a clear functional boundary for the memory context, I keep it scoped within PostgresScan and initialize & reset it alongside the scan's global state.

@YuweiXiao
Contributor Author

It seems the pycheck regression also fails on my local build of the main branch :(

JelteF added a commit that referenced this pull request May 12, 2025
For ease of programming we have a `cur.sql()` method that can be used to
execute any sql command. Fetching rows for DDL statements doesn't work
though, so we were catching that error and ignoring it. [In psycopg
3.2.8 the exact error message was changed][1].

Instead of detecting the new error message, this change makes sure that
we don't trigger this error at all. This is done by detecting whether
our result has rows, before calling `fetchall()`.

Originally reported by @Alphaxxxxx in
#784 (comment)

[1]: psycopg/psycopg@1eb7e5a
@YuweiXiao YuweiXiao force-pushed the issue_json_list_mem_leak branch from bd5364d to bece715 on May 12, 2025 10:54
Comment on lines 288 to 290
if (MemoryContextMemAllocated(duckdb_pg_scan_memory_ctx, false) > 8 * 1024 * 1024L) {
MemoryContextReset(duckdb_pg_scan_memory_ctx);
}
Collaborator

Is this one really needed? I'd rather keep it simple and only do the reset after we are done with the vector. Doing stuff with duckdb is always at the vector level. Then we also don't have to find a decent magic value, like the 8MB you use now.
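For illustration, a minimal sketch of this per-vector reset (the struct and function names here are hypothetical, not the actual pg_duckdb code):

extern "C" {
#include "postgres.h"
#include "utils/memutils.h"
}

struct ScanState {
	MemoryContext scan_ctx; // created once when the scan starts
};

static void
FillOneVector(ScanState *state) {
	MemoryContext old_ctx = MemoryContextSwitchTo(state->scan_ctx);

	// ... convert up to one vector's worth of tuples; conversions such as
	// JsonbToCString palloc scratch buffers into scan_ctx ...

	MemoryContextSwitchTo(old_ctx);
	// One reset per vector frees all the scratch memory at once, so no
	// magic threshold is needed.
	MemoryContextReset(state->scan_ctx);
}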

Contributor Author

Yes, let's simplify by removing it. I was concerned about wide tables or large JSON columns. For example, a single JSON row can be several MB in size, and 2048 such rows would quickly consume a lot of memory. However, in that case the DuckDB data chunk itself also consumes a significant amount of memory.

Comment on lines 185 to 181
duckdb_pg_scan_memory_ctx =
AllocSetContextCreate(CurrentMemoryContext, "DuckDBPerTupleContext", ALLOCSET_DEFAULT_MINSIZE,
ALLOCSET_DEFAULT_INITSIZE, ALLOCSET_DEFAULT_MAXSIZE);
Collaborator

This is incorrect. Every postgres scan will now create a separate memory context, but they all assign it to the same global variable. The cleanup that sets the variable to nullptr is also dangerous if one scan finishes before another. I think the following would be better (see the sketch after the suggested change below):

  1. Store this context in the "global state" of the scan, instead of in a process-global variable.
  2. Switch to this context right after we take the GlobalProcessLock in PostgresScanFunction, and switch back right before we release it.

Also the name of the context could be improved imo

Suggested change
-duckdb_pg_scan_memory_ctx =
-    AllocSetContextCreate(CurrentMemoryContext, "DuckDBPerTupleContext", ALLOCSET_DEFAULT_MINSIZE,
-                          ALLOCSET_DEFAULT_INITSIZE, ALLOCSET_DEFAULT_MAXSIZE);
+duckdb_pg_scan_memory_ctx =
+    AllocSetContextCreate(CurrentMemoryContext, "DuckdbScanContext", ALLOCSET_DEFAULT_MINSIZE,
+                          ALLOCSET_DEFAULT_INITSIZE, ALLOCSET_DEFAULT_MAXSIZE);
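For illustration, a rough sketch of the two-step suggestion above (the member layout, constructor/destructor placement, and lock handling are all assumptions, not the actual pg_duckdb code):

#include <mutex>

static std::mutex GlobalProcessLock; // illustrative stand-in for pg_duckdb's lock

struct PostgresScanGlobalState {
	MemoryContext scan_memory_ctx = nullptr;

	PostgresScanGlobalState() {
		scan_memory_ctx =
		    AllocSetContextCreate(CurrentMemoryContext, "DuckdbScanContext", ALLOCSET_DEFAULT_MINSIZE,
		                          ALLOCSET_DEFAULT_INITSIZE, ALLOCSET_DEFAULT_MAXSIZE);
	}
	~PostgresScanGlobalState() {
		MemoryContextDelete(scan_memory_ctx); // safe: every scan owns its own context
	}
};

static void
PostgresScanFunction(PostgresScanGlobalState &gstate /* , ... */) {
	std::lock_guard<std::mutex> lock(GlobalProcessLock);
	MemoryContext old_ctx = MemoryContextSwitchTo(gstate.scan_memory_ctx);
	// ... read tuples and convert them into the DuckDB chunk ...
	MemoryContextSwitchTo(old_ctx);
	// The lock is released when `lock` goes out of scope.
}

With the context stored per scan, concurrent scans can no longer clobber each other's pointer, and cleanup order stops mattering.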

Contributor Author

ah yes, you're right.

Comment on lines 1517 to 1518
pfree(str->data);
pfree(str);
Collaborator

Is this still needed? If not, let's keep this simple and just rely on the memory context resets.

@YuweiXiao YuweiXiao force-pushed the issue_json_list_mem_leak branch from bece715 to c964b56 on May 13, 2025 02:45
@YuweiXiao YuweiXiao requested a review from JelteF May 13, 2025 02:53
Comment on lines 14 to 19
extern "C" {
#include "postgres.h"

#include "utils/memutils.h"
}

Collaborator

Instead of including postgres headers here (in a file that we managed to remove all postgres includes from), let's create some basic wrapper functions in a new file in src/pg/, maybe called memory.cpp. AllocSetContextCreate and MemoryContextReset also need to be wrapped in a PostgresFunctionGuard when doing so.
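For illustration, the wrappers could look roughly like this (the file layout and the exact PostgresFunctionGuard invocation style are assumptions):

// src/pg/memory.cpp (hypothetical)
extern "C" {
#include "postgres.h"
#include "utils/memutils.h"
}

namespace pgduckdb {

MemoryContext
MemoryContextCreate(const char *name) {
	// PostgresFunctionGuard turns Postgres ERRORs into C++ exceptions; the
	// call style shown here is assumed.
	return PostgresFunctionGuard(AllocSetContextCreate, CurrentMemoryContext, name,
	                             ALLOCSET_DEFAULT_MINSIZE, ALLOCSET_DEFAULT_INITSIZE,
	                             ALLOCSET_DEFAULT_MAXSIZE);
}

void
MemoryContextResetGuarded(MemoryContext context) {
	PostgresFunctionGuard(MemoryContextReset, context);
}

} // namespace pgduckdb

Callers outside src/pg/ would then include only a small header declaring these wrappers, instead of the Postgres headers themselves.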

Contributor Author

@YuweiXiao YuweiXiao May 13, 2025

Good point! I was bothered by the introduction of the C header.

-auto jsonb_str = JsonbToCString(NULL, &jsonb->root, VARSIZE(jsonb));
-duckdb::string_t str(jsonb_str);
+StringInfo str = makeStringInfo();
+auto json_str = JsonbToCString(str, &jsonb->root, VARSIZE(jsonb));
Collaborator

Nice change to remove the additional copy. This change does make me realize that we need to wrap AppendJsonb in a PostgresFunctionGuard. And we probably need the same for the list one, since that one also allocates, right?

Collaborator

Actually, it doesn't seem to save an additional copy: AddString internally will still create a duckdb::string_t out of json_str + str->len. But it does save an additional strlen, so this still seems like a good change.

Contributor Author

@YuweiXiao YuweiXiao May 13, 2025

I believe we saved one copy of the string. The conversion steps were as follows:

  1. JsonbToCString creates the string buffer.
  2. duckdb::string_t str(jsonb_str); duplicates the string buffer into a C++ string.
  3. AppendString performs the string -> string copy internally.

With this PR, step 2 is eliminated. For the LIST conversion, I wrapped the PostgreSQL function with the guard.
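As an aside, a fragment-level sketch of the two paths being compared (vec and jsonb stand in for the surrounding variables; this is not the literal pg_duckdb code):

// Before: the string_t constructor runs strlen over jsonb_str, and
// AddString then copies the bytes into the vector.
char *jsonb_str = JsonbToCString(NULL, &jsonb->root, VARSIZE(jsonb));
duckdb::string_t tmp(jsonb_str);
duckdb::StringVector::AddString(vec, tmp);

// After: the StringInfo already tracks the length, so no strlen is
// needed; AddString still performs the one unavoidable copy.
StringInfo out = makeStringInfo();
JsonbToCString(out, &jsonb->root, VARSIZE(jsonb));
duckdb::StringVector::AddString(vec, out->data, out->len);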

Collaborator

@JelteF JelteF left a comment

CI is failing (and see the rest of my comments).

@YuweiXiao
Contributor Author

The CI passed in my local run with the latest change. Could you please trigger the pipeline, @JelteF? Thank you!

@YuweiXiao YuweiXiao requested a review from JelteF May 15, 2025 13:40
@JelteF JelteF merged commit 90be71e into duckdb:main May 16, 2025
11 of 12 checks passed
Successfully merging this pull request may close these issues.

Memory leak when scanning postgres table with jsonb & list type