Description
What would you like help with?
- Why is `executemany()` much slower than `adbc_ingest()`?
- What is the best and most performant way to pass data with a complex query/operation?
- Is there anything I am doing wrong?
I want to insert and update records in a table using the Python API of `adbc_driver_postgresql`. Let's say I have 10k rows:
import pyarrow as pa

a = pa.array(range(10000))
table = pa.Table.from_arrays(arrays=[a, a, a], names=['col1', 'col2', 'col3'])
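The examples below assume `test_table` already exists with three integer columns and, for the upserts, a primary key on "col1"; the DDL is not shown above, but a minimal sketch might look like this (the URI is a placeholder):

import adbc_driver_postgresql.dbapi

uri = "postgresql://user:password@localhost:5432/mydb"  # placeholder connection string

with adbc_driver_postgresql.dbapi.connect(uri) as conn:
    with conn.cursor() as cursor:
        # Matches the Arrow table above: three int64 columns, PK on "col1".
        cursor.execute(
            'CREATE TABLE IF NOT EXISTS test_table '
            '("col1" BIGINT PRIMARY KEY, "col2" BIGINT, "col3" BIGINT);'
        )
    conn.commit()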
I noticed that `executemany()` is much slower than `adbc_ingest()` for ingesting this data:
# ~0.1 s
with adbc_driver_postgresql.dbapi.connect(uri) as conn:
    with conn.cursor() as cursor:
        cursor.execute("TRUNCATE TABLE test_table;")
        cursor.adbc_ingest('test_table', table, mode="replace")
        cursor.execute('ALTER TABLE test_table ADD PRIMARY KEY ("col1");')
    conn.commit()
as compared to
# ~7.5 s
with adbc_driver_postgresql.dbapi.connect(uri) as conn:
    with conn.cursor() as cursor:
        cursor.execute("TRUNCATE TABLE test_table;")
        query = 'INSERT INTO test_table ("col1", "col2", "col3") VALUES ($1, $2, $3);'
        cursor.executemany(query, table)
    conn.commit()
I don't mind using `adbc_ingest()` to populate my database, but later in its lifecycle I need to upsert records and more. For example, I need to do something like:
# ~7.5 s
query = (
    'INSERT INTO test_table ("col1", "col2", "col3") VALUES ($1, $2, $3) '
    'ON CONFLICT ("col1") DO UPDATE SET "col2" = EXCLUDED."col2", "col3" = 0;'
)
with adbc_driver_postgresql.dbapi.connect(uri) as conn:
    with conn.cursor() as cursor:
        cursor.executemany(query, table)
    conn.commit()
which is too slow. Apparently `executemany()` is extremely inefficient for this task. What is the cause of such poor performance? What is the bottleneck?
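As a rough check (a sketch, assuming positional `$1`-style binding works the same for `execute()` as for `executemany()`): if an explicit per-row loop takes about as long, the bottleneck is likely per-row statement execution rather than data conversion.

import time

# Drive the same INSERT one row at a time; timing comparable to executemany()
# would suggest per-row round trips dominate the cost.
rows = table.to_pylist()
start = time.perf_counter()
with adbc_driver_postgresql.dbapi.connect(uri) as conn:
    with conn.cursor() as cursor:
        cursor.execute("TRUNCATE TABLE test_table;")
        for row in rows:
            cursor.execute(
                'INSERT INTO test_table ("col1", "col2", "col3") VALUES ($1, $2, $3);',
                (row['col1'], row['col2'], row['col3']),
            )
    conn.commit()
print(f"row-by-row execute(): {time.perf_counter() - start:.1f} s")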
The same outcome can be achieved much faster by first ingesting the data into a staging table and then having Postgres run the more complex operation from that table rather than from the input:
# ~0.2 s
with adbc_driver_postgresql.dbapi.connect(uri) as conn:
    with conn.cursor() as cursor:
        cursor.adbc_ingest('test_table2', table, mode="replace")
        query = (
            'INSERT INTO test_table ("col1", "col2", "col3")\n'
            'SELECT "col1", "col2", "col3" FROM test_table2\n'
            'ON CONFLICT ("col1") DO UPDATE SET "col2" = EXCLUDED."col2", "col3" = 0;'
        )
        cursor.execute(query)
    conn.commit()
This approach gives reasonable performance, but is this how one is supposed to do it? Is there anything that can easily be improved? I do not know much about Postgres's backend operation and the optimisations it applies during ingestion, but I suspect that it is not best practice to create staging tables (which are not even TEMPORARY) when we just want to stream data.
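For reference, a sketch of the same pattern with an actual TEMPORARY table, assuming the `temporary=True` option of `adbc_ingest()` is available in the installed ADBC version (the table name `staging` is a placeholder):

with adbc_driver_postgresql.dbapi.connect(uri) as conn:
    with conn.cursor() as cursor:
        # Session-local TEMPORARY table, dropped automatically when the session ends.
        cursor.adbc_ingest('staging', table, mode="create", temporary=True)
        cursor.execute(
            'INSERT INTO test_table ("col1", "col2", "col3")\n'
            'SELECT "col1", "col2", "col3" FROM staging\n'
            'ON CONFLICT ("col1") DO UPDATE SET "col2" = EXCLUDED."col2", "col3" = 0;'
        )
    conn.commit()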