Async, order-preserving batch INSERTs#191
Open
guillesd wants to merge 9 commits into
Open
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
#189 moved the server-side append onto a real
INSERT INTO t SELECT * FROM scan_data_from_quack_client('<id>')driven byPhysicalBatchInsert, but left the client deliberately dumb: it sent one chunk perQUACK_SEND_DATA_REQUESTand blocked the execution thread on each POST. That serializes the whole upload behind the network, and it throws away source order the moment more than one thread is producing.This PR is the async/parallel client #189 said would come later, plus the server-side machinery needed to put the rows back in order. The upload now runs off the execution threads, multiple requests are in flight at once, and a remote
INSERT ... SELECTlands rows in the same order it would locally. This supersedes #179 and is also 2 to 2.5x faster in localhost.This builds on #189 and keeps the read path untouched.
Async upload
The sink no longer POSTs inline. A regular execution thread serializes its buffered chunks into a
QUACK_SEND_DATA_REQUESTand registers the payload with a per-statementManagedAsyncTaskQueue; an ASYNC-pool thread then does the blocking POST and checks the ack. So the producer threads keep scanning and serializing while the network work happens elsewhere, and several sends are outstanding at the same time.To make that split clean I added
QuackClient::PostRaw, which takes already-serialized bytes and returns the raw response body. The serialize step runs on the producer thread (where theClientContextis live), and the POST runs on the async pool withcontext=nullptr, falling back to database-level params and logging.FINALIZEdrains the queue before it returns, so an autocommit insert is still atomic on the server.Ordering modes
Order preservation is decided at plan time in
ConfigureOrdering, mirroring core'splan_insert.cpp, and stored on the operator as anAppendOrderMode:UNORDERED(preserve_insertion_order=false): the fast path. No stamps, the server inserts chunks as they arrive.PARALLEL_ORDERED: parallel sources (table/parquet scans). Each thread stamps its chunks with(batch_index, sequence_index, is_last_in_batch)taken from the executor's batch index, and the server reassembles.SERIAL_ORDERED: single-threaded sources likerange()that have no executor batch index. The lone producer mints a fresh batch per flush.RequiredPartitionInfoonly asks for executor batch indices inPARALLEL_ORDERED, so the executor's batch-index assertion never fires for sources that don't supply one.ParallelSinkis true for everything exceptSERIAL_ORDERED.Wire protocol
SEND_DATA_REQUESTnow carries a vector of chunks instead of a single one (one message is one flush), plus the ordering stamp and abatch_watermark.FINALIZEcarries amin_batch_watermark.The watermark is the piece that makes ordered delivery safe. It is the minimum batch index that will ever appear in the stream, piggybacked on every ordered message. Without it the server cannot tell whether a low batch is still coming or simply never existed, so it could deliver batch 5 before batch 4 ever arrives and trip
PhysicalBatchInsert's monotonicity check. With the watermark the server initialises its delivery cursor and only drains a batch once every batch below it is accounted for.FINALIZErepeats it as a safety net in case all messages arrived before any per-message watermark was valid.Server side: reassembly and parallel insert
QuackDataStreamgrew an ordered-assembly stage.PushOrderedbuffers(batch, seq)fragments and, as soon as a batch's full sequence range has arrived, drains it as one unit into aready_batches_queue in ascending batch order. Unordered streams skip all of that:PushUnordereddrops one wire message's chunks straight onto the same queue.On the scan side,
scan_data_from_quack_clientnow raisesMaxThreadsto the live thread count. Each scan task claims one complete batch fromready_batches_viaTryPopBatchand owns it exclusively until it is drained, soPhysicalBatchInsertsees a stable, unique batch index per task and can write row groups in parallel without interleaving rows across a batch boundary. When no batch is ready the task returnsBLOCKEDwith an ASYNC-pool wait task, same as before.Unordered streams take a different operator entirely: the bind signals
NO_ORDER, which makes the planner pickPhysicalInsert(parallel=true)instead ofPhysicalBatchInsert. That sink ignores batch indices, so concurrent tasks can append freely. Both modes share the oneready_batches_queue and the one scan loop; the only thing that differs is how the producer fills the queue and which insert operator drains it.Follow-ups
ready_batches_is unbounded for now. Real flow control / backpressure management (the server handing the client an accept budget through theSEND_DATA_RESPONSEplaceholder) is out of scope for this PR.QuackDataStreamis not really tracked by the buffer manager. Maybe we'd like this to be the case although I am not sure. Maybe just doing a good job on the backpressure side is enough. Or as Mark mentioned, make CollumnDataCollection serializable and use this in the transport (ColumnDataCollection is buffer managed)