Async appends for quack#179
Conversation
…pend # Conflicts: # src/storage/quack_insert.cpp
|
Awesome performance numbers! Changes look clean too :) Codex found two issues: # name: test/temp.test
# description: quack async append should preserve insert order and client-side append logging
# group: [temp]
require quack
require httpfs
test-env QUACK_SERVER quack:localhost
set ignore_error_messages
statement ok
LOAD quack;
statement ok quack_db:quack_server_con
set variable actual_listen_uri = (select listen_uri from quack_serve('{QUACK_SERVER}:0', token='asdf'));
query I quack_db:quack_server_con
set variable listen_uri = concat('''', getvariable('actual_listen_uri'), '''');
set variable server <variable:quack_server_con:listen_uri>
statement ok
ATTACH {server} AS rpc (token 'asdf');
# Client-side append requests should be logged. The async append path must keep
# using the Quack request logging path, not only the raw HTTP transport path.
statement ok
CALL enable_logging('Quack', storage='memory');
statement ok
CREATE TABLE local_log_probe(i INTEGER);
statement ok
INSERT INTO local_log_probe VALUES (1), (2), (3);
statement ok
CREATE TABLE rpc.append_log_probe(i INTEGER);
statement ok
INSERT INTO rpc.append_log_probe SELECT * FROM local_log_probe;
statement ok
CREATE TEMP TABLE append_log_result AS
SELECT count(*) AS client_append_logs
FROM duckdb_logs_parsed('Quack')
WHERE message_type = 'APPEND_REQUEST'
AND server IS NOT NULL
AND server != '';
# Parallel inserts into Quack tables must respect preserve_insertion_order. The
# low flush threshold forces multiple async append batches to expose reordering.
statement ok
SET preserve_insertion_order=true;
statement ok
SET threads=4;
statement ok
SET quack_append_flush_rows=2048;
statement ok
CREATE TABLE rpc.order_probe(i BIGINT);
statement ok
INSERT INTO rpc.order_probe SELECT * FROM range(1000000) t(i);
statement ok
CREATE TEMP TABLE order_result AS
SELECT count(*) AS order_inversions
FROM (
SELECT i, row_number() OVER () - 1 AS rn
FROM rpc.order_probe
)
WHERE i != rn;
statement ok quack_db:quack_server_con
call quack_stop({server});
# Expected baseline behavior: the small append emits one client-side
# APPEND_REQUEST log, and the parallel append preserves row order exactly.
query II
SELECT client_append_logs, order_inversions
FROM append_log_result, order_result;
----
1 0 |
263dd87 to
64e30d5
Compare
64e30d5 to
52e1954
Compare
|
Quick follow-up on this: while testing the async path I noticed we weren't actually preserving insertion order. With My first instinct was to stamp each Then I went and looked at how Couple of nice side effects: the scan stays fully parallel (the re-map is just a quick locked hand-off, the heavy serialize/send stays outside the lock), and we only ever buffer the in-flight window on both sides instead of the whole insert. Performance is roughly the same as the unordered path for a 20M parquet file copy over the wire.
|
|
Thanks for the changes! Codex still finds some issues with regards to ordering and an issue for failed append leaving partial rows: # name: test/temp.test
# description: quack async append order and failed-statement atomicity repros
# group: [temp]
require quack
require httpfs
test-env QUACK_SERVER quack:localhost
set ignore_error_messages
statement ok
LOAD quack;
statement ok quack_db:quack_server_con
set variable actual_listen_uri = (select listen_uri from quack_serve('{QUACK_SERVER}:0', token='asdf'));
query I quack_db:quack_server_con
set variable listen_uri = concat('''', getvariable('actual_listen_uri'), '''');
set variable server <variable:quack_server_con:listen_uri>
statement ok
ATTACH {server} AS rpc (token 'asdf');
statement ok
CREATE TEMP TABLE repro_results(name VARCHAR, failures BIGINT);
# Parallel table-scan appends should preserve source order when preserve_insertion_order is true.
statement ok
SET preserve_insertion_order=true;
statement ok
SET threads=4;
statement ok
CREATE TABLE src AS SELECT * FROM range(300000) t(i);
statement ok
CREATE TABLE rpc.t_scan(i BIGINT);
statement ok
INSERT INTO rpc.t_scan SELECT i FROM src;
# This currently records 300000: the server table starts at row 245760, not 0.
statement ok
INSERT INTO repro_results
SELECT 'table_scan_order_inversions', count(*)
FROM (
SELECT i, row_number() OVER () - 1 AS rn
FROM rpc.t_scan
)
WHERE i != rn;
# A failed append statement should not leave earlier async batches committed.
statement ok
CREATE TABLE rpc.checked(i BIGINT CHECK (i < 250000));
statement error
INSERT INTO rpc.checked SELECT * FROM range(500000) t(i);
----
CHECK constraint failed
# This currently records 204800: one default append flush batch remains visible.
statement ok
INSERT INTO repro_results
SELECT 'partial_rows_after_failed_insert', count(*)
FROM rpc.checked;
statement ok quack_db:quack_server_con
call quack_stop({server});
query TI
SELECT name, failures
FROM repro_results
ORDER BY name;
----
partial_rows_after_failed_insert 0
table_scan_order_inversions 0 |
|
@lnkuiper the atomicity bug is pre-existing in main. I can take a look if it's an easy fix The other one seems fair! |
|
I can't reproduce bug 1 locally. It does preserve order. I added the test! |
e6347fd to
3b38ba6
Compare
| statement ok | ||
| SET threads=4; | ||
|
|
||
| # --- EXECUTOR path: parallel table-scan source must land in source order despite parallel producers + async. |
This PR introduces the
ManagedAsyncTaskQueuefrom duckdb/duckdb#23404 as to offload theAPPEND_REQUESTmessages to our Async IO thread pool. This follows up on #171 which was batchingAPPEND_REQUESTmessages in vectors of data chunks.After implementing this, I optimised it further by parallelising the requests using our regular thread pool. This was already a massive speed up vs main. This is what I call the
optimizedbranch in the follow up benchmarks that I will show.This PR introduces our async thread pool, which gets tasks handed off by
ManagedAsyncTaskQueue. Regular threads read data, crunch and serialize, and the async threads just posts over the wire. This is a clean way of getting "async like" performance without going purely async on the protocol side (non-blocking event loop or multiplexed protocol). So to clarify, the protocol is not becoming async, it is just being more efficient.The numbers
I ran two main benchmarks, one with multiple smaller batches (micro-batching) and one with a bigger batch of 1M rows.
Single 1 M row batch
mainoptimizeasyncAs you can see the append path was not really optimized, so right out of the back we get massive performance gains. Async is considerably better over WAN (which is great, shows that we are covering the network costs). Also better in EC2 to EC2 which I believe is the common use case (ETL using DuckDB inserting into another DuckDB over Quack).
Multi-batch (20 x 50k row inserts)
mainoptimizeasyncThis is considerably better than main too (8x-13x), but as you can see for relatively small inserts (almost no round trips) the batching of serialization costs (multiple chunks in one go) and parallelization is what matters. For a couple of round trips the network cost is negligible.
Future plans
I want to do a similar optimization in the read path, decoupling execution threads from async threads doing the fetch requests. This will make pre-fetching easier to implement, via using the machinery exposed in
ManagedAsyncTaskQueue.Then potentially we can think for both paths to use a similar cost model to what pdet implemented for parquet async (fetch optimal sizes based on network bandwidth).
cc: @lnkuiper