Fix memory leak for JSON/LIST type during Postgres table scan #784
Conversation
Hey @JelteF, to establish a clear functional boundary for the memory context, I keep it scoped within PostgresScan and initialize and reset it alongside the scan's global state.

It seems the pycheck regression also fails on my local build of the main branch :(
For ease of programming we have a `cur.sql()` method that can be used to execute any SQL command. Fetching rows for DDL statements doesn't work though, so we were catching that error and ignoring it. [In psycopg 3.2.8 the exact error message was changed][1]. Instead of detecting the new error message, this change makes sure that we don't trigger this error at all. This is done by detecting whether our result has rows before calling `fetchall()`. Originally reported by @Alphaxxxxx in #784 (comment)

[1]: psycopg/psycopg@1eb7e5a
Force-pushed from bd5364d to bece715
src/scan/postgres_scan.cpp (Outdated)

```cpp
if (MemoryContextMemAllocated(duckdb_pg_scan_memory_ctx, false) > 8 * 1024 * 1024L) {
	MemoryContextReset(duckdb_pg_scan_memory_ctx);
}
```
Is this one really needed? I'd rather keep it simple and only do the reset after we are done with the vector. Doing stuff with duckdb is always at the vector level. Then we also don't have to find a decent magic value, like the 8MB you use now.
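The reset-after-each-vector pattern suggested above can be sketched without Postgres itself. The following is a minimal, self-contained illustration using a toy arena in place of a real PostgreSQL `MemoryContext` (the `ToyArena` type and `ScanOneChunk` function are hypothetical stand-ins, not pg_duckdb's API):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a PostgreSQL MemoryContext: a bump-style
// arena that tracks how many bytes are currently allocated.
struct ToyArena {
	std::vector<std::vector<char>> blocks;
	std::size_t allocated = 0;

	void *Alloc(std::size_t n) {
		blocks.emplace_back(n);
		allocated += n;
		return blocks.back().data();
	}
	// Mirrors MemoryContextReset: frees every allocation at once.
	void Reset() {
		blocks.clear();
		allocated = 0;
	}
};

// One DuckDB data chunk ("vector") worth of tuple conversions, followed
// by a single reset -- no per-row size check or magic 8MB threshold.
std::size_t ScanOneChunk(ToyArena &arena, std::size_t rows, std::size_t bytes_per_row) {
	for (std::size_t i = 0; i < rows; i++) {
		arena.Alloc(bytes_per_row); // per-tuple scratch, e.g. JSONB text
	}
	std::size_t peak = arena.allocated;
	arena.Reset(); // done with the vector: release all scratch at once
	return peak;
}
```

Because DuckDB processes at most 2048 rows per chunk, the peak scratch usage is bounded by one chunk's worth of conversions, which is why a separate size-based trigger adds little.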
Yes, let's simplify by removing it. I was concerned about wide tables or large JSON columns. For example, a single JSON row can be several MB in size, and 2048 such rows would quickly consume a lot of memory. However, in that case the DuckDB data chunk itself also consumes a significant amount of memory.
src/scan/postgres_scan.cpp (Outdated)

```cpp
duckdb_pg_scan_memory_ctx =
    AllocSetContextCreate(CurrentMemoryContext, "DuckDBPerTupleContext", ALLOCSET_DEFAULT_MINSIZE,
                          ALLOCSET_DEFAULT_INITSIZE, ALLOCSET_DEFAULT_MAXSIZE);
```
This is incorrect. Every Postgres scan will now create a separate memory context, but they all assign it to the same global variable. The cleanup that sets it to nullptr is also dangerous if one scan finishes before another. I think the following would be better:

- Store this context in the "global state" of the scan, instead of in a process-global variable.
- Switch to this context right after we take the `GlobalProcessLock` in `PostgresScanFunction`, and switch back right before we release it.

Also, the name of the context could be improved imo.
Suggested change:

```diff
 duckdb_pg_scan_memory_ctx =
-    AllocSetContextCreate(CurrentMemoryContext, "DuckDBPerTupleContext", ALLOCSET_DEFAULT_MINSIZE,
+    AllocSetContextCreate(CurrentMemoryContext, "DuckdbScanContext", ALLOCSET_DEFAULT_MINSIZE,
                           ALLOCSET_DEFAULT_INITSIZE, ALLOCSET_DEFAULT_MAXSIZE);
```
ah yes, you're right.
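The concurrency problem behind this suggestion can be sketched in isolation: when each scan owns its context inside its own state object, one scan finishing cannot invalidate another's context. The types and names below (`ToyContext`, `ScanGlobalState`, `InitScanState`) are hypothetical illustrations, not pg_duckdb's actual structs:

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for a Postgres memory context handle.
struct ToyContext {
	std::string name;
};

// Per-scan "global state", as the review suggests: each concurrent scan
// owns its own context, instead of every scan writing to one shared
// process-global pointer (where one scan's cleanup could null out a
// context another scan is still using).
struct ScanGlobalState {
	std::unique_ptr<ToyContext> scan_memory_ctx;
};

ScanGlobalState InitScanState() {
	ScanGlobalState state;
	state.scan_memory_ctx = std::make_unique<ToyContext>();
	state.scan_memory_ctx->name = "DuckdbScanContext";
	return state;
}
```

Two concurrent scans built this way hold distinct contexts, and each context's lifetime is tied to its owning scan state rather than to a process-wide variable.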
src/pgduckdb_types.cpp (Outdated)

```cpp
pfree(str->data);
pfree(str);
```
Is this still needed? If not, let's keep this simple and just rely on the memory context resets.
Force-pushed from bece715 to c964b56
src/scan/postgres_scan.cpp (Outdated)

```cpp
extern "C" {
#include "postgres.h"

#include "utils/memutils.h"
}
```
Instead of including Postgres headers here (in a file that we managed to remove all Postgres includes from), let's create some basic wrapper functions in a new file in `src/pg/`, maybe called `memory.cpp`. Also, `AllocSetContextCreate` and `MemoryContextReset` need to be wrapped in a `PostgresFunctionGuard` when doing so.
Good point! I was bothered by the introduction of the C header.
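The guard idea can be illustrated without Postgres. The sketch below assumes a C-style function that signals failure via a return code (real Postgres code reports errors via `ereport`/longjmp, which `PostgresFunctionGuard` translates at the C/C++ boundary); `simulated_pg_reset` and `GuardedReset` are invented names for illustration only:

```cpp
#include <cassert>
#include <stdexcept>

// Simulated Postgres-style C function: signals failure through a return
// code rather than a C++ exception.
static int simulated_pg_reset(bool fail) {
	return fail ? -1 : 0;
}

// Hedged sketch of the wrapper pattern proposed for src/pg/memory.cpp:
// every Postgres call goes through a guard so that failures surface as
// C++ exceptions, and no Postgres header leaks into the callers.
void GuardedReset(bool fail) {
	if (simulated_pg_reset(fail) != 0) {
		throw std::runtime_error("MemoryContextReset failed");
	}
}
```

The payoff is that files like `postgres_scan.cpp` only see a small C++ interface, keeping the "no Postgres includes" property the reviewer wants to preserve.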
src/pgduckdb_types.cpp
Outdated
auto jsonb_str = JsonbToCString(NULL, &jsonb->root, VARSIZE(jsonb)); | ||
duckdb::string_t str(jsonb_str); | ||
StringInfo str = makeStringInfo(); | ||
auto json_str = JsonbToCString(str, &jsonb->root, VARSIZE(jsonb)); |
Nice change to remove the additional copy. This change does make me realize that we need to wrap `AppendJsonb` in a `PostgresFunctionGuard`. And we probably need the same for the list one, since that also allocates, right?
Actually, it doesn't seem to save an additional copy. `AddString` internally will still create a `duckdb::string_t` out of `json_str` + `str->len`. But it does save an additional `strlen`, so this still seems like a good change.
I believe we saved one copy of the string. The conversion steps are as follows:

1. `JsonbToCString` creates the string buffer.
2. `duckdb::string_t str(jsonb_str);` duplicates the string buffer into a C++ string.
3. `AppendString` performs the string-to-string copy internally.

With this PR, step 2 is eliminated. For LIST conversion, I wrap the PostgreSQL function with the guard.
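The `strlen`-saving point in this exchange can be made concrete with a minimal stand-in for `duckdb::string_t` (a pointer-plus-length view; `ToyStringT` and `FromStringInfo` below are illustrative names, not DuckDB's implementation):

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Minimal stand-in for duckdb::string_t: a (pointer, length) pair.
struct ToyStringT {
	const char *ptr;
	std::size_t len;
	// From a bare C string: must walk the bytes with strlen.
	explicit ToyStringT(const char *s) : ptr(s), len(std::strlen(s)) {}
	// From a buffer whose length is already known (as with a
	// StringInfo, which carries str->len): no strlen needed.
	ToyStringT(const char *s, std::size_t n) : ptr(s), len(n) {}
};

// After the change, the scan can build the view directly from the
// buffer and its recorded length.
ToyStringT FromStringInfo(const char *data, std::size_t known_len) {
	return ToyStringT(data, known_len);
}
```

The final copy into DuckDB's vector still happens either way; what is skipped is the redundant length scan over a potentially multi-megabyte JSON string.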
CI is failing (and see rest of my comments).
The CI passed in my local run with the latest change. Could you please trigger the pipeline, @JelteF? Thank you!
Fixes #750