
Fix memory leak for JSON/LIST type during Postgres table scan #784


Merged

JelteF merged 7 commits into duckdb:main on May 16, 2025

Conversation

YuweiXiao
Contributor

Fixes #750

@YuweiXiao
Contributor Author

Hey @JelteF, to establish a clear functional boundary for the memory context, I keep it scoped within PostgresScan and initialize & reset it alongside the scan's global state.

@YuweiXiao
Contributor Author

It seems the pycheck regression also fails on my local build of the main branch :(

JelteF added a commit that referenced this pull request May 12, 2025
For ease of programming we have a `cur.sql()` method that can be used to
execute any sql command. Fetching rows for DDL statements doesn't work
though, so we were catching that error and ignoring it. [In psycopg
3.2.8 the exact error message was changed][1].

Instead of detecting the new error message, this change makes sure that
we don't trigger this error at all. This is done by detecting whether
our result has rows, before calling `fetchall()`.

Originally reported by @Alphaxxxxx in
#784 (comment)

[1]: psycopg/psycopg@1eb7e5a
@YuweiXiao YuweiXiao force-pushed the issue_json_list_mem_leak branch from bd5364d to bece715 on May 12, 2025 10:54
Comment on lines 288 to 290
if (MemoryContextMemAllocated(duckdb_pg_scan_memory_ctx, false) > 8 * 1024 * 1024L) {
MemoryContextReset(duckdb_pg_scan_memory_ctx);
}
Collaborator

Is this one really needed? I'd rather keep it simple and only do the reset after we are done with the vector. Doing stuff with duckdb is always at the vector level. Then we also don't have to find a decent magic value, like the 8MB you use now.
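For illustration, a minimal sketch of this per-vector reset (the struct and function names here are hypothetical, not the actual pg_duckdb code):

extern "C" {
#include "postgres.h"
#include "utils/memutils.h"
}

struct ScanState {
	MemoryContext scan_ctx; // created once when the scan starts
};

static void
FillOneVector(ScanState *state) {
	MemoryContext old_ctx = MemoryContextSwitchTo(state->scan_ctx);

	// ... convert up to one vector's worth of tuples; conversions such as
	// JsonbToCString palloc scratch buffers into scan_ctx ...

	MemoryContextSwitchTo(old_ctx);
	// One reset per vector frees all the scratch memory at once, so no
	// magic threshold is needed.
	MemoryContextReset(state->scan_ctx);
}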

Contributor Author

Yes, let's simplify by removing it. I was concerned about wide tables or large JSON columns. For example, a single JSON row can be several MB in size, and 2048 such rows would quickly consume a lot of memory. However, in that case the DuckDB data chunk itself also consumes a significant amount of memory.

Comment on lines 185 to 181
duckdb_pg_scan_memory_ctx =
AllocSetContextCreate(CurrentMemoryContext, "DuckDBPerTupleContext", ALLOCSET_DEFAULT_MINSIZE,
ALLOCSET_DEFAULT_INITSIZE, ALLOCSET_DEFAULT_MAXSIZE);
Collaborator

This is incorrect. Every postgres scan will now create a separate memory context, but they all assign it to the same global variable. The cleanup that sets the variable to nullptr is also dangerous if one scan finishes before another. I think the following would be better (see the sketch after the suggested change below):

  1. Store this context in the "global state" of the scan, instead of in a process-global variable.
  2. Switch to this context right after we take the GlobalProcessLock in PostgresScanFunction, and switch back right before we release it.

Also the name of the context could be improved imo

Suggested change
-duckdb_pg_scan_memory_ctx =
-    AllocSetContextCreate(CurrentMemoryContext, "DuckDBPerTupleContext", ALLOCSET_DEFAULT_MINSIZE,
-                          ALLOCSET_DEFAULT_INITSIZE, ALLOCSET_DEFAULT_MAXSIZE);
+duckdb_pg_scan_memory_ctx =
+    AllocSetContextCreate(CurrentMemoryContext, "DuckdbScanContext", ALLOCSET_DEFAULT_MINSIZE,
+                          ALLOCSET_DEFAULT_INITSIZE, ALLOCSET_DEFAULT_MAXSIZE);
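For illustration, a rough sketch of the two-step suggestion above (the member layout, constructor/destructor placement, and lock handling are all assumptions, not the actual pg_duckdb code):

#include <mutex>

static std::mutex GlobalProcessLock; // illustrative stand-in for pg_duckdb's lock

struct PostgresScanGlobalState {
	MemoryContext scan_memory_ctx = nullptr;

	PostgresScanGlobalState() {
		scan_memory_ctx =
		    AllocSetContextCreate(CurrentMemoryContext, "DuckdbScanContext", ALLOCSET_DEFAULT_MINSIZE,
		                          ALLOCSET_DEFAULT_INITSIZE, ALLOCSET_DEFAULT_MAXSIZE);
	}
	~PostgresScanGlobalState() {
		MemoryContextDelete(scan_memory_ctx); // safe: every scan owns its own context
	}
};

static void
PostgresScanFunction(PostgresScanGlobalState &gstate /* , ... */) {
	std::lock_guard<std::mutex> lock(GlobalProcessLock);
	MemoryContext old_ctx = MemoryContextSwitchTo(gstate.scan_memory_ctx);
	// ... read tuples and convert them into the DuckDB chunk ...
	MemoryContextSwitchTo(old_ctx);
	// The lock is released when `lock` goes out of scope.
}

With the context stored per scan, concurrent scans can no longer clobber each other's pointer, and cleanup order stops mattering.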

Contributor Author

ah yes, you're right.

Comment on lines 1517 to 1518
pfree(str->data);
pfree(str);
Collaborator

Is this still needed? If not, let's keep this simple and just rely on the memory context resets.

@YuweiXiao YuweiXiao force-pushed the issue_json_list_mem_leak branch from bece715 to c964b56 on May 13, 2025 02:45
@YuweiXiao YuweiXiao requested a review from JelteF May 13, 2025 02:53
Comment on lines 14 to 19
extern "C" {
#include "postgres.h"

#include "utils/memutils.h"
}

Collaborator

Instead of including postgres headers here (in a file that we managed to remove all postgres includes from), let's create some basic wrapper functions in a new file in src/pg/, maybe called memory.cpp. AllocSetContextCreate and MemoryContextReset also need to be wrapped in a PostgresFunctionGuard when doing so.
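For illustration, the wrappers could look roughly like this (the file layout and the exact PostgresFunctionGuard invocation style are assumptions):

// src/pg/memory.cpp (hypothetical)
extern "C" {
#include "postgres.h"
#include "utils/memutils.h"
}

namespace pgduckdb {

MemoryContext
MemoryContextCreate(const char *name) {
	// PostgresFunctionGuard turns Postgres ERRORs into C++ exceptions; the
	// call style shown here is assumed.
	return PostgresFunctionGuard(AllocSetContextCreate, CurrentMemoryContext, name,
	                             ALLOCSET_DEFAULT_MINSIZE, ALLOCSET_DEFAULT_INITSIZE,
	                             ALLOCSET_DEFAULT_MAXSIZE);
}

void
MemoryContextResetGuarded(MemoryContext context) {
	PostgresFunctionGuard(MemoryContextReset, context);
}

} // namespace pgduckdb

Callers outside src/pg/ would then include only a small header declaring these wrappers, instead of the Postgres headers themselves.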

Contributor Author

@YuweiXiao YuweiXiao May 13, 2025

Good point! I was bothered by the introduction of the C header.

-auto jsonb_str = JsonbToCString(NULL, &jsonb->root, VARSIZE(jsonb));
-duckdb::string_t str(jsonb_str);
+StringInfo str = makeStringInfo();
+auto json_str = JsonbToCString(str, &jsonb->root, VARSIZE(jsonb));
Collaborator

Nice change to remove the additional copy. This change does make me realize that we need to wrap AppendJsonb in a PostgresFunctionGuard. And we probably need the same for the list one, since that one also allocates, right?

Collaborator

Actually, it doesn't seem to save an additional copy: AddString internally will still create a duckdb::string_t out of json_str + str->len. But it does save an additional strlen, so this still seems like a good change.

Contributor Author

@YuweiXiao YuweiXiao May 13, 2025

I believe we saved one copy of the string. The conversion steps were as follows:

  1. JsonbToCString creates the string buffer.
  2. duckdb::string_t str(jsonb_str); duplicates the string buffer into a C++ string.
  3. AppendString performs the string -> string copy internally.

With this PR, step 2 is eliminated. For the LIST conversion, I wrapped the PostgreSQL function with the guard.
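As an aside, a fragment-level sketch of the two paths being compared (vec and jsonb stand in for the surrounding variables; this is not the literal pg_duckdb code):

// Before: the string_t constructor runs strlen over jsonb_str, and
// AddString then copies the bytes into the vector.
char *jsonb_str = JsonbToCString(NULL, &jsonb->root, VARSIZE(jsonb));
duckdb::string_t tmp(jsonb_str);
duckdb::StringVector::AddString(vec, tmp);

// After: the StringInfo already tracks the length, so no strlen is
// needed; AddString still performs the one unavoidable copy.
StringInfo out = makeStringInfo();
JsonbToCString(out, &jsonb->root, VARSIZE(jsonb));
duckdb::StringVector::AddString(vec, out->data, out->len);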

Collaborator

@JelteF JelteF left a comment

CI is failing (and see the rest of my comments).

@YuweiXiao
Contributor Author

The CI passed in my local run with the latest change. Could you please trigger the pipeline, @JelteF? Thank you!

@YuweiXiao YuweiXiao requested a review from JelteF May 15, 2025 13:40
@JelteF JelteF merged commit 90be71e into duckdb:main May 16, 2025
11 of 12 checks passed
Successfully merging this pull request may close these issues.

Memory leak when scanning postgres table with jsonb & list type