Support parallel DuckDB threads for Postgres table scan #762

Open
wants to merge 3 commits into
base: main
3 changes: 3 additions & 0 deletions include/pgduckdb/pg/relations.hpp
@@ -21,6 +21,9 @@ Form_pg_attribute GetAttr(const TupleDesc tupleDesc, int i);
bool TupleIsNull(TupleTableSlot *slot);

void SlotGetAllAttrs(TupleTableSlot *slot);
void SlotGetAllAttrsUnsafe(TupleTableSlot *slot);

TupleTableSlot *ExecStoreMinimalTupleUnsafe(MinimalTuple minmal_tuple, TupleTableSlot *slot, bool shouldFree);

double EstimateRelSize(Relation rel);

1 change: 1 addition & 0 deletions include/pgduckdb/pgduckdb_guc.h
@@ -11,6 +11,7 @@ extern bool duckdb_allow_community_extensions;
extern bool duckdb_allow_unsigned_extensions;
extern bool duckdb_autoinstall_known_extensions;
extern bool duckdb_autoload_known_extensions;
extern int duckdb_threads_for_postgres_scan;
extern int duckdb_max_workers_per_postgres_scan;
extern char *duckdb_postgres_role;
extern char *duckdb_motherduck_session_hint;
7 changes: 5 additions & 2 deletions include/pgduckdb/scan/postgres_scan.hpp
@@ -18,7 +18,7 @@ struct PostgresScanGlobalState : public duckdb::GlobalTableFunctionState {
~PostgresScanGlobalState();
idx_t
MaxThreads() const override {
return 1;
return max_threads;
}
void ConstructTableScanQuery(const duckdb::TableFunctionInitInput &input);

@@ -35,15 +35,18 @@ struct PostgresScanGlobalState : public duckdb::GlobalTableFunctionState {
std::atomic<std::uint32_t> total_row_count;
std::ostringstream scan_query;
duckdb::shared_ptr<PostgresTableReader> table_reader_global_state;
int max_threads;
};

// Local State

#define LOCAL_STATE_SLOT_BATCH_SIZE 32
Collaborator

Why 32? Maybe we should make this configurable?

Contributor Author

I was concerned about burdening users with another GUC hyperparameter.

I tested batch sizes of 8, 16, 32, and 64, and found that 32 performs the best. BTW, the batch size helps to amortize the lock overhead across threads.
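
To make the amortization argument concrete, here is a minimal sketch (not the PR's code) in which a consumer takes the shared lock once per batch instead of once per row, and does the per-row work outside the lock. `SharedSource`, `FetchBatch`, `DrainAndSum`, and `kBatchSize` are hypothetical names; the mutex only stands in for `GlobalProcessLock`.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical stand-in for the shared Postgres reader: hands out 0..limit-1.
struct SharedSource {
    std::mutex lock;           // plays the role of GlobalProcessLock
    int next = 0;              // next row to hand out
    int limit = 0;             // total number of rows
    int lock_acquisitions = 0; // counts how often the lock was taken
};

constexpr std::size_t kBatchSize = 32; // mirrors LOCAL_STATE_SLOT_BATCH_SIZE

// Pull up to kBatchSize rows while holding the lock once.
// Returns the number of rows fetched; 0 means the source is exhausted.
std::size_t FetchBatch(SharedSource &src, std::vector<int> &out) {
    out.clear();
    std::lock_guard<std::mutex> guard(src.lock);
    src.lock_acquisitions++;
    while (out.size() < kBatchSize && src.next < src.limit) {
        out.push_back(src.next++);
    }
    return out.size();
}

// Drain the source; each row is processed after the lock has been released.
long DrainAndSum(SharedSource &src) {
    std::vector<int> batch;
    long sum = 0;
    while (FetchBatch(src, batch) > 0) {
        for (int row : batch) {
            sum += row; // per-row work happens without the lock
        }
    }
    return sum;
}
```

With 100 rows this takes the lock five times (four non-empty batches plus the final empty one) rather than once per row. A larger batch reduces lock traffic further but keeps other threads waiting longer per acquisition, which may be why a mid-sized value measured best.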

struct PostgresScanLocalState : public duckdb::LocalTableFunctionState {
PostgresScanLocalState(PostgresScanGlobalState *global_state);
~PostgresScanLocalState() override;

PostgresScanGlobalState *global_state;
TupleTableSlot *slot;
std::vector<uint8_t> minimal_tuple_buffer[LOCAL_STATE_SLOT_BATCH_SIZE];

size_t output_vector_size;
bool exhausted_scan;
8 changes: 8 additions & 0 deletions include/pgduckdb/scan/postgres_table_reader.hpp
@@ -2,6 +2,8 @@

#include "pgduckdb/pg/declarations.hpp"

#include <vector>

#include "pgduckdb/utility/cpp_only_file.hpp" // Must be last include.

namespace pgduckdb {
@@ -11,7 +13,13 @@ class PostgresTableReader {
PostgresTableReader(const char *table_scan_query, bool count_tuples_only);
~PostgresTableReader();
TupleTableSlot *GetNextTuple();
bool GetNextMinimalTuple(std::vector<uint8_t> &minimal_tuple_buffer);
void PostgresTableReaderCleanup();
TupleTableSlot *InitTupleSlot();
bool
IsParallelScan() const {
return nworkers_launched > 0;
Collaborator

I cannot find the place where you change nworkers_launched.

Contributor Author

It is assigned during the initialization of Postgres parallel workers, which is part of the original code logic. I simply added 0 initialization and the interface.

}

private:
MinimalTuple GetNextWorkerTuple();
10 changes: 10 additions & 0 deletions src/pg/relations.cpp
@@ -69,6 +69,16 @@ SlotGetAllAttrs(TupleTableSlot *slot) {
PostgresFunctionGuard(slot_getallattrs, slot);
}

void
SlotGetAllAttrsUnsafe(TupleTableSlot *slot) {
slot_getallattrs(slot);
}
Comment on lines +72 to +75
Collaborator

This seemed scary to me, but looking closely at the implementation of slot_getallattrs it doesn't use memory contexts nor can it throw an error. The only place where it throws an error is:

	if (unlikely(attnum > slot->tts_tupleDescriptor->natts))
		elog(ERROR, "invalid attribute number %d", attnum);

But that condition can never be true, because slot->tts_tupleDescriptor->natts is passed as attnum.

Could you merge this function with SlotGetAllAttrs and add the above information in a code comment for it?

Contributor Author

Yes, sure. I am using the term "unsafe" to refer to the fact that it is not protected by PostgresFunctionGuard, even though it does not actually require that protection :)


TupleTableSlot *
ExecStoreMinimalTupleUnsafe(MinimalTuple minmal_tuple, TupleTableSlot *slot, bool shouldFree) {
Collaborator

Similar to the comment I left above: let's add a comment explaining why this is safe to use without the lock. Something like:

It's safe to call ExecStoreMinimalTuple without the PostgresFunctionGuard because it does not allocate in memory contexts and the only error it can throw is when the slot is not a minimal slot. That error is an obvious programming error so we can ignore it here.

And just like the function above let's drop the Unsafe from the name. (you probably need to change the body to call the original like ::ExecStoreMinimalTuple(...))

return ExecStoreMinimalTuple(minmal_tuple, slot, shouldFree);
}

Relation
OpenRelation(Oid relationId) {
if (PostgresFunctionGuard(check_enable_rls, relationId, InvalidOid, false) == RLS_ENABLED) {
4 changes: 4 additions & 0 deletions src/pgduckdb.cpp
@@ -28,6 +28,7 @@ bool duckdb_force_execution = false;
bool duckdb_unsafe_allow_mixed_transactions = false;
bool duckdb_log_pg_explain = false;
int duckdb_max_workers_per_postgres_scan = 2;
int duckdb_threads_for_postgres_scan = 2;
char *duckdb_motherduck_session_hint = strdup("");
char *duckdb_postgres_role = strdup("");

@@ -164,6 +165,9 @@ DuckdbInitGUC() {
DefineCustomVariable("duckdb.log_pg_explain", "Logs the EXPLAIN plan of a Postgres scan at the NOTICE log level",
&duckdb_log_pg_explain);

DefineCustomVariable("duckdb.threads_for_postgres_scan",
"Maximum number of DuckDB threads used for a single Postgres scan",
&duckdb_threads_for_postgres_scan, 1, MAX_PARALLEL_WORKER_LIMIT, PGC_SUSET);
DefineCustomVariable("duckdb.max_workers_per_postgres_scan",
"Maximum number of PostgreSQL workers used for a single Postgres scan",
&duckdb_max_workers_per_postgres_scan, 0, MAX_PARALLEL_WORKER_LIMIT);
1 change: 1 addition & 0 deletions src/pgduckdb_detoast.cpp
@@ -134,6 +134,7 @@ ToastFetchDatum(struct varlena *attr) {
return result;
}

// This function is thread-safe and does not utilize the PostgreSQL memory context.
Datum
DetoastPostgresDatum(struct varlena *attr, bool *should_free) {
struct varlena *toasted_value = nullptr;
2 changes: 2 additions & 0 deletions src/pgduckdb_types.cpp
@@ -1505,6 +1505,7 @@ AppendString(duckdb::Vector &result, Datum value, idx_t offset, bool is_bpchar)

static void
AppendJsonb(duckdb::Vector &result, Datum value, idx_t offset) {
std::lock_guard<std::recursive_mutex> lock(GlobalProcessLock::GetLock());
Collaborator

Are these the only types that need this additional locking now? It would be good to explicitly state in a comment for each other type why it is thread-safe. That way we won't forget to check when introducing support for new types.

Contributor Author

Yes, only JSON and LIST. Additionally, as mentioned in #750, both of these types have memory issues.

I will add a comment for ConvertPostgresToDuckValue. The overall rule is that it is thread-safe as long as it does not use Postgres MemContext.
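
One way to keep that rule visible when new types are added is to record the locking decision in a single function. The sketch below is only an illustration of that idea, not code from this PR: `ColumnKind`, `NeedsProcessLock`, `ConvertWithPolicy`, and `ProcessLock` are hypothetical names, and the `Integer`/`Double` entries are placeholder examples of fixed-width types. Only the JSON and LIST entries come from the discussion above.

```cpp
#include <mutex>

// Hypothetical column categories; only Json and List are taken from the thread.
enum class ColumnKind { Integer, Double, Json, List };

// Stand-in for GlobalProcessLock::GetLock().
inline std::recursive_mutex &ProcessLock() {
    static std::recursive_mutex lock;
    return lock;
}

// Single place that documents which conversions need the process lock.
inline bool NeedsProcessLock(ColumnKind kind) {
    switch (kind) {
    case ColumnKind::Json: // per the thread: needs the lock
    case ColumnKind::List: // per the thread: needs the lock
        return true;
    case ColumnKind::Integer: // assumed: plain by-value copy
    case ColumnKind::Double:  // assumed: plain by-value copy
        return false;
    }
    return true; // anything unlisted defaults to the safe side
}

// Run `convert` under the lock only when the kind requires it.
template <typename Fn>
auto ConvertWithPolicy(ColumnKind kind, Fn &&convert) -> decltype(convert()) {
    std::unique_lock<std::recursive_mutex> guard(ProcessLock(), std::defer_lock);
    if (NeedsProcessLock(kind)) {
        guard.lock();
    }
    return convert(); // guard unlocks on scope exit if it was locked
}
```

Because the `switch` enumerates every kind, most compilers can warn when a new enumerator is added without a locking decision.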

auto jsonb = DatumGetJsonbP(value);
auto jsonb_str = JsonbToCString(NULL, &jsonb->root, VARSIZE(jsonb));
duckdb::string_t str(jsonb_str);
@@ -1810,6 +1811,7 @@ ConvertPostgresToDuckValue(Oid attr_type, Datum value, duckdb::Vector &result, i
break;
}
case duckdb::LogicalTypeId::LIST: {
std::lock_guard<std::recursive_mutex> lock(GlobalProcessLock::GetLock());
// Convert Datum to ArrayType
auto array = DatumGetArrayTypeP(value);

60 changes: 51 additions & 9 deletions src/scan/postgres_scan.cpp
@@ -5,6 +5,7 @@
#include "pgduckdb/pgduckdb_types.hpp"
#include "pgduckdb/pgduckdb_utils.hpp"
#include "pgduckdb/pg/relations.hpp"
#include "pgduckdb/pgduckdb_guc.h"

#include "pgduckdb/pgduckdb_process_lock.hpp"
#include "pgduckdb/logger.hpp"
@@ -170,6 +171,12 @@ PostgresScanGlobalState::PostgresScanGlobalState(Snapshot _snapshot, Relation _r
ConstructTableScanQuery(input);
table_reader_global_state =
duckdb::make_shared_ptr<PostgresTableReader>(scan_query.str().c_str(), count_tuples_only);
// Default to a single thread if `count_tuples_only` is true or if the reader does not initialize a parallel scan.
max_threads = 1;
if (table_reader_global_state->IsParallelScan() && !count_tuples_only) {
max_threads = duckdb_threads_for_postgres_scan;
}

pd_log(DEBUG2, "(DuckDB/PostgresSeqScanGlobalState) Running %" PRIu64 " threads -- ", (uint64_t)MaxThreads());
}

@@ -182,6 +189,7 @@ PostgresScanGlobalState::~PostgresScanGlobalState() {

PostgresScanLocalState::PostgresScanLocalState(PostgresScanGlobalState *_global_state)
: global_state(_global_state), exhausted_scan(false) {
slot = global_state->table_reader_global_state->InitTupleSlot();
}

PostgresScanLocalState::~PostgresScanLocalState() {
@@ -256,17 +264,51 @@ PostgresScanTableFunction::PostgresScanFunction(duckdb::ClientContext &, duckdb:

local_state.output_vector_size = 0;

std::lock_guard<std::recursive_mutex> lock(GlobalProcessLock::GetLock());
for (size_t i = 0; i < STANDARD_VECTOR_SIZE; i++) {
TupleTableSlot *slot = local_state.global_state->table_reader_global_state->GetNextTuple();
if (pgduckdb::TupleIsNull(slot)) {
local_state.global_state->table_reader_global_state->PostgresTableReaderCleanup();
local_state.exhausted_scan = true;
break;
// Normal scan without parallelism
bool is_parallel_scan = local_state.global_state->MaxThreads() > 1;
if (!is_parallel_scan) {
std::lock_guard<std::recursive_mutex> lock(GlobalProcessLock::GetLock());
Comment on lines +268 to +270
Collaborator

I think this difference in behaviour between one thread and more than one thread doesn't completely make sense. Even if duckdb.threads_for_postgres_scan is 1, it's still possible to have two different Postgres scans running in parallel. Those two concurrent Postgres scans would still benefit from not holding a lock during InsertTupleIntoChunk.

Collaborator

After reading more, I now realize this is probably important for the case where we don't use background workers for the scan. So maybe we should keep this functionality. But I think it's worth refactoring and/or commenting this a bit more, because the logic is currently quite hard to follow.

Contributor Author

Ah, okay. I was trying to keep the original code unchanged for the single-thread case.

for (size_t i = 0; i < STANDARD_VECTOR_SIZE; i++) {
TupleTableSlot *slot = local_state.global_state->table_reader_global_state->GetNextTuple();
if (pgduckdb::TupleIsNull(slot)) {
local_state.global_state->table_reader_global_state->PostgresTableReaderCleanup();
local_state.exhausted_scan = true;
break;
}

SlotGetAllAttrs(slot);
InsertTupleIntoChunk(output, local_state, slot);
}
SetOutputCardinality(output, local_state);
return;
}

D_ASSERT(STANDARD_VECTOR_SIZE % LOCAL_STATE_SLOT_BATCH_SIZE == 0);
for (size_t i = 0; i < STANDARD_VECTOR_SIZE / LOCAL_STATE_SLOT_BATCH_SIZE; i++) {
int valid_slots = 0;
{
std::lock_guard<std::recursive_mutex> lock(GlobalProcessLock::GetLock());
for (size_t j = 0; j < LOCAL_STATE_SLOT_BATCH_SIZE; j++) {
bool ret = local_state.global_state->table_reader_global_state->GetNextMinimalTuple(
local_state.minimal_tuple_buffer[j]);
if (!ret) {
local_state.global_state->table_reader_global_state->PostgresTableReaderCleanup();
local_state.exhausted_scan = true;
break;
}
valid_slots++;
}
}

SlotGetAllAttrs(slot);
InsertTupleIntoChunk(output, local_state, slot);
for (size_t j = 0; j < valid_slots; j++) {
MinimalTuple minmal_tuple = reinterpret_cast<MinimalTuple>(local_state.minimal_tuple_buffer[j].data());
Collaborator

@Y-- I need your C++ knowledge. Is this a good way to keep a buffer of MinimalTuples?

One thought I had is that now we do two copies of the minimal tuple:

  1. Once from the stack into the buffer (in GetNextMinimalTuple)
  2. Once from the buffer back to the stack (here).

I think if we instead have an array of MinimalTuple that we realloc instead of using vectors of bytes, then we only need to copy once and we can pass the minimal tuple from the buffer directly into ExecStoreMinimalTupleUnsafe.

Collaborator

This doesn't do any copy (and thus doesn't extend/modify its lifetime).
It forces the compiler to accept that the bytes stored in the minimal_tuple_buffer[j] vector are a MinimalTuple (aka MinimalTupleData *, where MinimalTupleData is itself a struct).

I haven't read the code yet, but my first question would be: why are they vector<uint8_t> instead of vector<MinimalTuple> in the first place?

Contributor Author

As @Y-- pointed out, only one copy occurs (from Postgres parallel workers' shared memory to the buffer).

One benefit of using vector<uint8_t> is that we have an off-the-shelf API to enlarge or shrink the buffer (i.e., resize). Additionally, there is no need to worry about memory leaks, as they are handled by RAII.
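
As an illustration of the `resize`-based approach described here, the following is a minimal sketch, independent of the Postgres types, of deep-copying a variable-length record into a reusable byte buffer. `CopyRecord` is a hypothetical helper, not a function in this PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Deep-copy `len` bytes of a variable-length record into a reusable buffer.
// The vector owns the memory, so nothing has to be freed by hand.
void CopyRecord(const void *src, std::size_t len, std::vector<std::uint8_t> &buffer) {
    buffer.resize(len); // grows when needed; shrinking keeps the existing allocation
    if (len > 0) {
        std::memcpy(buffer.data(), src, len); // the single copy
    }
}
```

Reusing the same vector across calls means that, once it has grown to the largest record seen so far, later copies do not allocate.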

Collaborator

> It forces the compiler to accept that the bytes stored in the minimal_tuple_buffer[j] vector are a MinimalTuple (aka MinimalTupleData *, where MinimalTupleData is itself a struct).

Sounds like that could cause alignment problems.

Contributor Author

Ah, yes. Let me confirm the alignment issue.
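
For reference, the alignment concern can be checked mechanically. The sketch below is a hypothetical helper (`IsAlignedFor` is not part of this PR) that tests whether a pointer satisfies the alignment requirement of a given type before it is reinterpreted as that type.

```cpp
#include <cstddef>
#include <cstdint>

// Returns true if `ptr` satisfies the alignment requirement of T,
// i.e. whether reinterpreting it as a T* is safe with respect to alignment.
template <typename T>
bool IsAlignedFor(const void *ptr) {
    return reinterpret_cast<std::uintptr_t>(ptr) % alignof(T) == 0;
}
```

A `std::vector<uint8_t>` with the default allocator obtains its storage from `operator new`, which typically returns memory aligned for any fundamental type, so `data()` itself is usually fine. The risk would appear if a tuple were ever read from an offset inside a buffer rather than from its start; a debug assertion built on a helper like this one could catch that case.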

local_state.slot = ExecStoreMinimalTupleUnsafe(minmal_tuple, local_state.slot, false);
SlotGetAllAttrsUnsafe(local_state.slot);
InsertTupleIntoChunk(output, local_state, local_state.slot);
}

if (valid_slots == 0)
break;
}

SetOutputCardinality(output, local_state);
44 changes: 38 additions & 6 deletions src/scan/postgres_table_reader.cpp
@@ -28,8 +28,8 @@ extern "C" {
namespace pgduckdb {

PostgresTableReader::PostgresTableReader(const char *table_scan_query, bool count_tuples_only)
: parallel_executor_info(nullptr), parallel_worker_readers(nullptr), nreaders(0), next_parallel_reader(0),
entered_parallel_mode(false), cleaned_up(false) {
: parallel_executor_info(nullptr), parallel_worker_readers(nullptr), nworkers_launched(0), nreaders(0),
next_parallel_reader(0), entered_parallel_mode(false), cleaned_up(false) {

std::lock_guard<std::recursive_mutex> lock(GlobalProcessLock::GetLock());
PostgresScopedStackReset scoped_stack_reset;
@@ -119,17 +119,27 @@ PostgresTableReader::PostgresTableReader(const char *table_scan_query, bool coun
table_scan_planstate->ps_ResultTupleDesc, &TTSOpsMinimalTuple);
}

PostgresTableReader::~PostgresTableReader() {
// The caller should hold GlobalProcessLock to ensure thread-safety
TupleTableSlot *
PostgresTableReader::InitTupleSlot() {
if (cleaned_up) {
return;
return NULL;
}
return PostgresFunctionGuard(MakeTupleTableSlot, table_scan_planstate->ps_ResultTupleDesc, &TTSOpsMinimalTuple);
}

PostgresTableReader::~PostgresTableReader() {
std::lock_guard<std::recursive_mutex> lock(GlobalProcessLock::GetLock());
PostgresTableReaderCleanup();
}

// The caller should hold GlobalProcessLock to ensure thread-safety
void
PostgresTableReader::PostgresTableReaderCleanup() {
D_ASSERT(!cleaned_up);
if (cleaned_up) {
return;
}

cleaned_up = true;
PostgresScopedStackReset scoped_stack_reset;

@@ -279,13 +289,33 @@ PostgresTableReader::GetNextTuple() {
return PostgresFunctionGuard(ExecClearTuple, slot);
}

/*
* Get the next minimal tuple from the table scan into the provided buffer.
* Returns true if a tuple was read, false if the scan is finished.
* GlobalProcessLock should be held before calling this.
*/
bool
PostgresTableReader::GetNextMinimalTuple(std::vector<uint8_t> &minimal_tuple_buffer) {
MinimalTuple worker_minmal_tuple = GetNextWorkerTuple();
if (HeapTupleIsValid(worker_minmal_tuple)) {
// deep copy worker_minmal_tuple to destination buffer
Size tuple_size = worker_minmal_tuple->t_len + MINIMAL_TUPLE_DATA_OFFSET;
minimal_tuple_buffer.resize(tuple_size);
memcpy(minimal_tuple_buffer.data(), worker_minmal_tuple, tuple_size);
return true;
}

minimal_tuple_buffer.resize(0);
return false;
}

MinimalTuple
PostgresTableReader::GetNextWorkerTuple() {
int nvisited = 0;
TupleQueueReader *reader = NULL;
MinimalTuple minimal_tuple = NULL;
bool readerdone = false;
for (;;) {
for (; next_parallel_reader < nreaders;) {
reader = (TupleQueueReader *)parallel_worker_readers[next_parallel_reader];

minimal_tuple = PostgresFunctionGuard(TupleQueueReaderNext, reader, true, &readerdone);
@@ -325,6 +355,8 @@ PostgresTableReader::GetNextWorkerTuple() {
nvisited = 0;
}
}

return NULL;
}

} // namespace pgduckdb