Support parallel DuckDB threads for Postgres table scan #762

Merged · 16 commits · Jun 16, 2025

Conversation

YuweiXiao
Contributor

Currently, we use a single DuckDB thread for Postgres table scan, even though multiple Postgres workers will be initialized. This leads to a performance bottleneck when scanning large amounts of data.

This PR parallelizes the conversion from Postgres tuple to DuckDB data chunk. Below are benchmark results on a 5GB TPCH lineitem table.

  • Benchmark query: select * from lineitem order by 1 limit 1
  • Other GUC setups: duckdb.max_workers_per_postgres_scan = 2
| Threads (duckdb.threads_for_postgres_scan) | Cost (seconds) |
|---|---|
| 1 | 15.8 |
| 2 | 8.7 |
| 4 | 5.8 |

Collaborator

@JelteF JelteF left a comment

Very cool! The perf differences you report are very impressive. I think a few more code comments would be quite helpful to make this easier to understand.

Similarly to #688 I'm postponing this until after 1.0 though, given it touches a very core part of pg_duckdb.

@@ -1505,6 +1505,7 @@ AppendString(duckdb::Vector &result, Datum value, idx_t offset, bool is_bpchar)

static void
AppendJsonb(duckdb::Vector &result, Datum value, idx_t offset) {
std::lock_guard<std::recursive_mutex> lock(GlobalProcessLock::GetLock());
Collaborator

Are these the only types that need this additional locking now? It would be good to explicitly state in a comment, for each other type, why it is thread-safe. That way we won't forget to check when introducing support for new types.

Contributor Author

Yes, only JSON and LIST. Additionally, as mentioned in #750, both of these types have memory issues.

I will add a comment to ConvertPostgresToDuckValue. The overall rule is that the conversion is thread-safe as long as it does not use a Postgres memory context.

Comment on lines 268 to 279
bool is_parallel_scan = local_state.global_state->MaxThreads() > 1;
if (!is_parallel_scan) {
std::lock_guard<std::recursive_mutex> lock(GlobalProcessLock::GetLock());
Collaborator

I think this difference in behaviour between 1 and more than one threads doesn't completely make sense. Even if max_threads_per_postgres_scan is 1, it's still possible to have two different postgres scans running in parallel. Those two concurrent postgres scans would still benefit from not holding a lock during InsertTupleIntoChunk.

Collaborator

After reading more I now realize this is probably important for the case where we don't use background workers for the scan. So maybe we should keep this functionality. But I think it's worth refactoring and/or commenting this a bit more, because right now the logic is quite hard to follow.

Contributor Author

Ah, okay. I was trying to keep the original code unchanged for the single-thread case.

Collaborator

I think it would be nice to have the two cases share a bit more code. Now it's unclear if the places where they are different are on purpose or by accident.
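
One way to let the two cases share a body (purely a sketch under assumed names, not pg_duckdb's actual code) is a `std::unique_lock` constructed with `std::defer_lock`, so the locked and unlocked paths run through identical code:

```cpp
#include <mutex>
#include <vector>

// Hypothetical sketch, not pg_duckdb's actual code: one shared body for the
// locked (non-parallel) and unlocked (parallel) scan cases.
static std::recursive_mutex global_process_lock;

int ProcessTuples(const std::vector<int> &tuples, bool is_parallel_scan) {
	// std::defer_lock constructs the guard without locking, so we can decide
	// at runtime and still fall through to the same processing code.
	std::unique_lock<std::recursive_mutex> lock(global_process_lock, std::defer_lock);
	if (!is_parallel_scan) {
		lock.lock();
	}
	int processed = 0;
	for (int t : tuples) {
		processed += t; // stand-in for InsertTupleIntoChunk
	}
	return processed; // the lock (if taken) is released by RAII here
}
```

Both branches then execute the same loop, so any behavioural difference between them is necessarily intentional.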

void
SlotGetAllAttrsUnsafe(TupleTableSlot *slot) {
slot_getallattrs(slot);
}
Collaborator

This seemed scary to me, but looking closely at the implementation of slot_getallattrs it doesn't use memory contexts nor can it throw an error. The only place where it throws an error is:

	if (unlikely(attnum > slot->tts_tupleDescriptor->natts))
		elog(ERROR, "invalid attribute number %d", attnum);

But that condition can never be true, because slot->tts_tupleDescriptor->natts is exactly what is passed as attnum.

Could you merge this function with SlotGetAllAttrs? And add the above information in a code comment for it.

Contributor Author

Yes, sure. I am using the term "unsafe" to refer to the fact that it is not protected by PostgresFunctionGuard, even though it does not actually require that protection :)

};

// Local State

#define LOCAL_STATE_SLOT_BATCH_SIZE 32
Collaborator

Why 32? Maybe we should make this configurable?

Contributor Author

I was concerned about burdening users with another GUC hyperparameter.

I tested batch sizes of 8, 16, 32, and 64, and found that 32 performs best. BTW, the batch size helps amortize the lock overhead across threads.
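
The amortization argument can be illustrated with a self-contained toy (all names here are illustrative, not pg_duckdb's): the shared source is locked once per batch of 32 tuples rather than once per tuple:

```cpp
#include <mutex>
#include <vector>

// Toy sketch of why batching amortizes lock overhead: the shared source is
// locked once per batch of kBatchSize items, not once per item.
constexpr size_t kBatchSize = 32;

size_t FetchBatch(std::vector<int> &source, std::mutex &source_lock,
                  std::vector<int> &batch) {
	std::lock_guard<std::mutex> lock(source_lock); // one lock per batch
	size_t n = 0;
	while (n < kBatchSize && !source.empty()) {
		batch.push_back(source.back());
		source.pop_back();
		++n;
	}
	return n; // number of slots filled; 0 means the scan is exhausted
}
```

With N threads and T tuples, the lock is contended roughly T / 32 times instead of T times, at the cost of each thread buffering up to 32 tuples locally.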

TupleTableSlot *InitTupleSlot();
bool
IsParallelScan() const {
return nworkers_launched > 0;
Collaborator

I cannot find the place where you change nworkers_launched.

Contributor Author

It is assigned during the initialization of the Postgres parallel workers, which is part of the original code logic. I simply added the zero-initialization and this accessor.

SlotGetAllAttrs(slot);
InsertTupleIntoChunk(output, local_state, slot);
for (size_t j = 0; j < valid_slots; j++) {
MinimalTuple minimal_tuple = reinterpret_cast<MinimalTuple>(local_state.minimal_tuple_buffer[j].data());
Collaborator

@Y-- I need your C++ knowledge. Is this a good way to keep a buffer of MinimalTuples?

One thought I had is that now we do two copies of the minimal tuple:

  1. Once from the stack into the buffer (in GetNextMinimalTuple)
  2. Once from the buffer back to the stack (here).

I think if we instead have an array of MinimalTuple that we realloc instead of using vectors of bytes, then we only need to copy once and we can pass the minimal tuple from the buffer directly into ExecStoreMinimalTupleUnsafe.

Collaborator

This doesn't do any copy (and thus doesn't extend/modify its lifetime).
It forces the compiler to accept that the bytes stored in the minimal_tuple_buffer[j] vector are a MinimalTuple (aka MinimalTupleData *, where MinimalTupleData is itself a struct).

I haven't read the code yet, but my first question would be: why are they vector<uint8_t> instead of vector<MinimalTuple> in the first place?

Contributor Author

As @Y-- pointed out, only one copy occurs (from Postgres parallel workers' shared memory to the buffer).

One benefit of using vector<uint8_t> is that we have an off-the-shelf API to enlarge or shrink the buffer (i.e., resize). Additionally, there is no need to worry about memory leaks, as they are handled by RAII.
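
The single-copy flow being described might be sketched like this, with a hypothetical `ToyTuple` standing in for `MinimalTupleData` (this is not the PR's actual code):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Illustrative stand-in for MinimalTupleData: a length header plus payload.
struct ToyTuple {
	uint32_t len;  // total size in bytes, like MinimalTupleData's t_len
	char data[1];  // flexible-array-style payload
};

// Copy the tuple once, from its source into the reusable per-slot buffer.
// resize() gives off-the-shelf enlarge/shrink and RAII-managed memory.
void StoreTuple(std::vector<uint8_t> &buffer, const ToyTuple *src) {
	buffer.resize(src->len);
	std::memcpy(buffer.data(), src, src->len);
}

// Later, view the bytes in place without a second copy. As noted in the
// discussion, the cast doesn't copy or extend the lifetime of anything.
const ToyTuple *ViewTuple(const std::vector<uint8_t> &buffer) {
	return reinterpret_cast<const ToyTuple *>(buffer.data());
}
```

So only `StoreTuple` copies bytes; `ViewTuple` merely reinterprets the buffer, which is why alignment of the vector's storage matters (discussed below).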

Collaborator

It forces the compiler to accept that the bytes stored in the minimal_tuple_buffer[j] vector are a MinimalTuple (aka MinimalTupleData *, where MinimalTupleData is itself a struct).

Sounds like that could cause alignment problems.

Contributor Author

Ah, yes. Let me confirm the alignment issue.

Contributor Author

Double-checked that there won't be any alignment issues. std::vector and MemoryContext (which internally uses malloc) follow the same rule to align to at least alignof(max_align_t).
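
That claim is easy to spot-check: a `std::vector<uint8_t>`'s heap storage comes from `operator new`, which C++ requires to be suitably aligned for any type with fundamental alignment, i.e. at least `alignof(std::max_align_t)`:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Quick check of the alignment claim above: the vector's heap storage is
// allocated via operator new, which must be aligned to at least
// alignof(std::max_align_t), so reinterpreting it as a struct is safe
// alignment-wise.
bool IsMaxAligned(const std::vector<uint8_t> &buffer) {
	auto addr = reinterpret_cast<std::uintptr_t>(buffer.data());
	return addr % alignof(std::max_align_t) == 0;
}
```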

}

TupleTableSlot *
ExecStoreMinimalTupleUnsafe(MinimalTuple minimal_tuple, TupleTableSlot *slot, bool shouldFree) {
Collaborator

Similarly to the comment I left above. Let's add a comment why this is safe to use without the lock. Something like:

It's safe to call ExecStoreMinimalTuple without the PostgresFunctionGuard because it does not allocate in memory contexts and the only error it can throw is when the slot is not a minimal slot. That error is an obvious programming error so we can ignore it here.

And just like the function above let's drop the Unsafe from the name. (you probably need to change the body to call the original like ::ExecStoreMinimalTuple(...))

Contributor Author

ExecStoreMinimalTuple might call pfree if the slot owns the tuple (TTS_SHOULDFREE(slot)). I added a comment about it.

@JelteF JelteF added this to the 1.1.0 milestone May 7, 2025
@YuweiXiao
Contributor Author

@JelteF Thanks for the review! Yes, going for 1.1.0 is reasonable.

@JelteF
Collaborator

JelteF commented May 30, 2025

Do you plan on addressing the review feedback? I'm considering merging this into 1.0 anyway if it's in a good state.

@YuweiXiao
Contributor Author

Yeah, that would be nice! Let me resolve the conflict first.

@YuweiXiao YuweiXiao force-pushed the issue_parallel_postgres_scan branch from 5adfaf6 to 88b7d36 Compare May 30, 2025 13:26
@YuweiXiao YuweiXiao requested a review from JelteF May 30, 2025 13:34
MinimalTuple minimal_tuple = reinterpret_cast<MinimalTuple>(local_state.minimal_tuple_buffer[j].data());
local_state.slot = ExecStoreMinimalTupleUnsafe(minimal_tuple, local_state.slot, false);
SlotGetAllAttrs(local_state.slot);
InsertTupleIntoChunk(output, local_state, local_state.slot);
Collaborator

We're not switching to the tuple memory context here. That will cause the leaks again. I think we probably want to pass the memory context into InsertTupleIntoChunk, because we only want to switch to it when the type requires it.

Contributor Author

It is switched before the for loop. But you remind me that doing the switch is not thread-safe. Let me check how we can resolve the leak here.

Collaborator

We shouldn't set it before the loop, because then it will also be used when getting the next tuple, which caused problems here: #805

Contributor Author

Passing the memory context down seems to be the only way to maintain parallelism. Should we fall back to single-threaded processing at the very beginning when we encounter JSON/LIST? That would eliminate the need for switching here. For LIST/JSON, parallelism does not help much anyway, since we keep taking the lock in the middle of the conversion.

Collaborator

Actually I guess maybe it'd be okay, because that was only a problem when we were not using background workers to do the actual reading. And this threading logic only kicks in when we do use background workers, right? Still, it seems nice to align the behaviour of the threaded and non-threaded code for easier maintainability and understanding.

Contributor Author

Yes, threading only comes along with the background scan workers. Let me try reorganizing these two parts of the code.

* GlobalProcessLock should be held before calling this.
*/
bool
PostgresTableReader::GetNextMinimalTuple(std::vector<uint8_t> &minimal_tuple_buffer) {
Collaborator

It's unclear that this requires using background workers for the reading. Let's change the name and update the comment.

Suggested change
PostgresTableReader::GetNextMinimalTuple(std::vector<uint8_t> &minimal_tuple_buffer) {
PostgresTableReader::GetNextMinimalWorkerTuple(std::vector<uint8_t> &minimal_tuple_buffer) {

@YuweiXiao YuweiXiao requested a review from JelteF May 31, 2025 09:19
@@ -1868,6 +1869,7 @@ ConvertPostgresToDuckValue(Oid attr_type, Datum value, duckdb::Vector &result, i
break;
}
case duckdb::LogicalTypeId::LIST: {
std::lock_guard<std::recursive_mutex> lock(GlobalProcessLock::GetLock());
Contributor Author

Maybe this is not necessary, since we fall back to a single thread for LIST/JSON?

@@ -145,9 +146,12 @@ InitGUC() {
DefineCustomVariable("duckdb.log_pg_explain", "Logs the EXPLAIN plan of a Postgres scan at the NOTICE log level",
&duckdb_log_pg_explain);

DefineCustomVariable("duckdb.threads_for_postgres_scan",
"Maximum number of DuckDB threads used for a single Postgres scan",
&duckdb_threads_for_postgres_scan, 1, MAX_PARALLEL_WORKER_LIMIT, PGC_SUSET);
Collaborator

No need to make this PGC_SUSET.

Suggested change
&duckdb_threads_for_postgres_scan, 1, MAX_PARALLEL_WORKER_LIMIT, PGC_SUSET);
&duckdb_threads_for_postgres_scan, 1, MAX_PARALLEL_WORKER_LIMIT);

}

SlotGetAllAttrs(slot);
// This memory context is use as a scratchpad space for any allocation required to add the tuple
Collaborator

Suggested change
// This memory context is use as a scratchpad space for any allocation required to add the tuple
// This memory context is used as a scratchpad space for any allocation required to add the tuple

MinimalTuple
PostgresTableReader::GetNextWorkerTuple() {
int nvisited = 0;
TupleQueueReader *reader = NULL;
MinimalTuple minimal_tuple = NULL;
bool readerdone = false;
for (;;) {
for (; next_parallel_reader < nreaders;) {
Collaborator

Why is this suddenly needed?

Contributor Author

In the multi-threaded scenario, one thread might read all worker tuples, perform cleanup, and then release the global lock. When other threads call this function afterward, they will attempt to index an empty array, potentially causing a segmentation fault.
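
A minimal sketch of that guard (toy types, not the PR's actual reader code): re-check the bound under the lock, so a thread arriving after cleanup sees an empty array instead of indexing past the end:

```cpp
#include <mutex>
#include <vector>

// Toy illustration of the race described above: after one thread drains the
// readers and cleans up, later callers must re-check the bound while holding
// the lock instead of indexing the (now empty) reader array blindly.
struct ToyReaderPool {
	std::mutex lock;
	std::vector<int> readers; // stand-in for the TupleQueueReader array
	size_t next_reader = 0;

	// Returns -1 once all readers are gone (scan finished and cleaned up).
	int NextTuple() {
		std::lock_guard<std::mutex> guard(lock);
		if (next_reader >= readers.size()) { // the post-cleanup guard
			return -1;
		}
		int value = readers[next_reader];
		readers.erase(readers.begin() + next_reader); // reader exhausted
		return value;
	}
};
```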

@@ -36,15 +36,18 @@ struct PostgresScanGlobalState : public duckdb::GlobalTableFunctionState {
std::ostringstream scan_query;
duckdb::shared_ptr<PostgresTableReader> table_reader_global_state;
MemoryContext duckdb_scan_memory_ctx;
int max_threads;
Collaborator

nit- should we use idx_t here? (since this is what we return in MaxThreads above?)

Contributor Author

YES!

if (cleaned_up) {
return;
return NULL;
Collaborator

In which case can the InitTupleSlot be called when the reader was cleaned up?

Contributor Author

I encountered a case where one thread read all tuples and performed cleanup while another thread was still initializing.

Collaborator

Thanks - maybe worth a comment? I'm sure I will forget in approximately 10 minutes :-)

Contributor Author

sure thing

@YuweiXiao YuweiXiao requested a review from JelteF June 3, 2025 10:58
@JelteF JelteF modified the milestones: 1.1.0, 1.0.0 Jun 3, 2025
@YuweiXiao YuweiXiao force-pushed the issue_parallel_postgres_scan branch 2 times, most recently from d65fb9d to 1fd2abf Compare June 4, 2025 11:55
@YuweiXiao YuweiXiao requested a review from Y-- June 4, 2025 11:55
Comment on lines 410 to 412
// - The scan includes JSON or LIST columns, since parallelism is inefficient for these types. This is because
// converting these types requires calling Postgres functions, which use the Postgres memory context and
// require holding the global lock, limiting parallel efficiency.
Collaborator

I'm wondering whether this really makes sense. If such columns are NULL there's no need for locking. And even if they are not, there might still be enough other columns that don't need locking. Also the locking is only needed for JSONB columns, not for regular JSON columns.

Collaborator

@JelteF JelteF Jun 6, 2025

Also, whichever route we go. We should do the same for the varbit/bit type too. That one also allocates internally.

Contributor Author

Uh, you are right, varbit/bit should be handled. I'm not sure there is a way to guard against unexpected palloc calls in a multi-threaded setup, which makes this hard to maintain when adding new types or changing the conversion implementation.

The alternative solution, passing down the memory context and locking for these types, scatters the protection logic around the code, so it is not maintainable either.

I have an idea: make the conversion column-based (only when multi-threaded) and lock when necessary based on a type check. Something like:

for (size_t i = 0; i < valid_slots; ++i) {
      // construct slots from tuple buffer
      buffer_slots[i] = ...
}
for (size_t i = 0; i < num_columns; ++i) {
      bool unsafe = is_type_unsafe(desc[i])
      if (unsafe)
           locking & setup memory
      InsertTupleIntoChunkColumns(output, local_state, buffer_slots, valid_slots, i);
      if (unsafe)
           unlock & reset memory
}

Collaborator

@JelteF JelteF left a comment

I think this is close to being merge-able. I left a final comment about the JSON/LIST/VARBIT stuff. And apart from that this needs merge conflicts resolved. But other than that I think this is good.

@YuweiXiao YuweiXiao force-pushed the issue_parallel_postgres_scan branch from 1fd2abf to 47f6805 Compare June 9, 2025 07:03
@YuweiXiao
Contributor Author

YuweiXiao commented Jun 9, 2025

@JelteF Hey, restrictions on unsafe types like JSON/LIST have been removed by converting Postgres slots into DuckDB data chunks in a columnar fashion. If any other unsafe type is supported in the future, one only needs to add it to IsThreadSafeTypeForPostgresToDuckDB.

BTW, the columnar conversion could be optimized further by eliminating the if-else branches (and the switch statement), but that may involve a large amount of refactoring.

@YuweiXiao YuweiXiao requested a review from JelteF June 10, 2025 00:43
@JelteF JelteF merged commit d8f548b into duckdb:main Jun 16, 2025
6 checks passed