Merged
4 changes: 4 additions & 0 deletions authors.json
@@ -6,5 +6,9 @@
"ian-cook": {
"name": "Ian Cook",
"avatar": "https://columnar.tech/people/thumbnails/ian_cook.jpg"
},
"bryce-mecum": {
"name": "Bryce Mecum",
"avatar": "https://columnar.tech/people/thumbnails/bryce_mecum.jpg"
}
}
Binary file added data/penguins/penguins.parquet
Binary file not shown.
344 changes: 344 additions & 0 deletions notebooks/moving-data-between-databases-with-adbc.ipynb
@@ -0,0 +1,344 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "intro-cell",
"metadata": {},
"source": [
"Connecting to databases with ADBC is great, but what do you do when you want to get your data out of one database and into another?\n",
"\n",
"The usual approach is to use your database's built-in export to CSV or some other format, but now you have new problems: this is slow, not necessarily type-safe, and you still have to figure out how to get the export into your other database.\n",
"\n",
"Newer databases like [DuckDB](https://duckdb.org) partially improve this situation with extensions, but that approach requires a DuckDB-specific extension for your source database, and it only moves your data into DuckDB.\n",
"\n",
"It turns out that ADBC solves this problem very well! If your source and target databases both have ADBC drivers, you can move data between them efficiently, without ever materializing the full dataset in memory.\n",
"\n",
"The key aspect of ADBC that makes this efficient is that all your data can move through what's called an Arrow [RecordBatchReader](https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatchReader.html), which is a data structure for reading chunks of a potentially larger-than-memory result set. ADBC drivers can work directly with `RecordBatchReader` when pulling data from one database and inserting into another, and only one chunk of your data needs to be materialized in memory at any given time.\n",
"\n",
"This notebook demonstrates the pattern end-to-end: a Parquet file is loaded into a [Flight SQL](https://arrow.apache.org/docs/format/FlightSql.html) compatible database ([GizmoSQL](https://gizmodata.com/gizmosql)), then streamed into [DuckDB](https://duckdb.org/), all through Apache Arrow's `RecordBatchReader` so the dataset is never fully materialized in memory. We use GizmoSQL as the Flight SQL server because it's fast and easy to get running locally, but the same pattern should work with any pair of databases that have ADBC drivers.\n",
"\n",
"Requirements:\n",
"\n",
"- Python 3\n",
"- Docker"
]
},
{
"cell_type": "markdown",
"id": "setup-header",
"metadata": {},
"source": [
"## Setup"
]
},
{
"cell_type": "markdown",
"id": "setup-pip-desc",
"metadata": {},
"source": [
"Install the required dependencies:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "setup-pip",
"metadata": {},
"outputs": [],
"source": [
"%pip install -q adbc-driver-manager pyarrow dbc"
]
},
{
"cell_type": "markdown",
"id": "setup-flightsql-desc",
"metadata": {},
"source": [
"Install the Flight SQL ADBC driver with [dbc](https://columnar.tech/dbc):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "setup-flightsql",
"metadata": {},
"outputs": [],
"source": [
"!dbc install -q flightsql"
]
},
{
"cell_type": "markdown",
"id": "setup-duckdb-desc",
"metadata": {},
"source": [
"Install the DuckDB ADBC driver:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "setup-duckdb",
"metadata": {},
"outputs": [],
"source": [
"!dbc install -q duckdb"
]
},
{
"cell_type": "markdown",
"id": "setup-docker-desc",
"metadata": {},
"source": [
"If you don't already have a GizmoSQL instance running, start one with Docker:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "setup-docker",
"metadata": {},
"outputs": [],
"source": [
"!docker run -d --rm -it --init -p 31337:31337 --name gizmosql \\\n",
" -e TLS_ENABLED=0 \\\n",
" -e GIZMOSQL_PASSWORD=gizmosql_password \\\n",
" --pull always gizmodata/gizmosql:latest-slim"
]
},
{
"cell_type": "markdown",
"id": "setup-imports-desc",
"metadata": {},
"source": [
"Import the required modules:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "setup-imports",
"metadata": {},
"outputs": [],
"source": [
"import pyarrow.parquet as pq\n",
"from adbc_driver_manager import dbapi"
]
},
{
"cell_type": "markdown",
"id": "parquet-header",
"metadata": {},
"source": [
"## Load the Parquet File"
]
},
{
"cell_type": "markdown",
"id": "parquet-desc",
"metadata": {},
"source": [
"The next few steps are just setup.\n",
"\n",
"First, load an example Parquet file as a PyArrow Table. For this example, we'll use [Palmer Penguins](https://allisonhorst.github.io/palmerpenguins/):"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "parquet-open",
"metadata": {},
"outputs": [],
"source": [
"penguins = pq.read_table(\"../data/penguins/penguins.parquet\")"
]
},
{
"cell_type": "markdown",
"id": "gizmosql-header",
"metadata": {},
"source": [
"## Connect"
]
},
{
"cell_type": "markdown",
"id": "gizmosql-desc",
"metadata": {},
"source": [
"Open connections to both GizmoSQL and DuckDB. Keeping them as plain variables (instead of context managers) lets both connections stay open across cells:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "gizmosql-ingest",
"metadata": {},
"outputs": [],
"source": [
"con_gizmosql = dbapi.connect(\n",
" driver=\"flightsql\",\n",
" db_kwargs={\n",
" \"uri\": \"grpc+tcp://localhost:31337\",\n",
" \"username\": \"gizmosql_user\",\n",
" \"password\": \"gizmosql_password\",\n",
" },\n",
")\n",
"con_duckdb = dbapi.connect(driver=\"duckdb\")"
]
},
{
"cell_type": "markdown",
"id": "duckdb-header",
"metadata": {},
"source": [
"## Ingest into GizmoSQL"
]
},
{
"cell_type": "markdown",
"id": "duckdb-desc",
"metadata": {},
"source": [
"Ingest the table into GizmoSQL:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "duckdb-ingest",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"GizmoSQL: 344 rows\n"
]
}
],
"source": [
"cur_gizmosql = con_gizmosql.cursor()\n",
"cur_gizmosql.adbc_ingest(\"penguins\", penguins, mode=\"create\")\n",
"cur_gizmosql.execute(\"SELECT count(1) FROM penguins\")\n",
"print(f\"GizmoSQL: {cur_gizmosql.fetchone()[0]:,} rows\")"
]
},
{
"cell_type": "markdown",
"id": "89360aba",
"metadata": {},
"source": [
"## Transfer to DuckDB"
]
},
{
"cell_type": "markdown",
"id": "5be16869",
"metadata": {},
"source": [
"Execute a `SELECT` on GizmoSQL and pipe the result directly into DuckDB via `fetch_record_batch()`. This returns a lazy `RecordBatchReader` that streams record batches from GizmoSQL as DuckDB consumes them, so the result set is never fully materialized:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "6b7dbf2b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"DuckDB: 344 rows\n"
]
}
],
"source": [
"cur_gizmosql.execute(\"SELECT * FROM penguins\")\n",
"cur_duckdb = con_duckdb.cursor()\n",
"cur_duckdb.adbc_ingest(\"penguins\", cur_gizmosql.fetch_record_batch(), mode=\"create\")\n",
"cur_duckdb.execute(\"SELECT count(1) FROM penguins\")\n",
"print(f\"DuckDB: {cur_duckdb.fetchone()[0]:,} rows\")"
]
},
{
"cell_type": "markdown",
"id": "d9e9e7af",
"metadata": {},
"source": [
"## Takeaways\n",
"\n",
"While the above example moved only 344 rows between our two databases, it illustrates a point that holds at any scale: ADBC is a fast, uniform API not only for connecting to databases but also for connecting databases to each other. Without a standard interface like ADBC, every pair of databases you wanted to move data between would need its own solution. With ADBC, you only need two ADBC drivers and a little bit of code in Python (or your language of choice)."
]
},
{
"cell_type": "markdown",
"id": "cleanup-header",
"metadata": {},
"source": [
"## Cleanup"
]
},
{
"cell_type": "markdown",
"id": "cleanup-docker-desc",
"metadata": {},
"source": [
"Close the cursors and connections:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "cleanup-docker",
"metadata": {},
"outputs": [],
"source": [
"cur_gizmosql.close()\n",
"cur_duckdb.close()\n",
"con_gizmosql.close()\n",
"con_duckdb.close()"
]
},
{
"cell_type": "markdown",
"id": "3165ee53",
"metadata": {},
"source": [
"Stop the GizmoSQL container:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f50933a2",
"metadata": {},
"outputs": [],
"source": [
"!docker stop gizmosql"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.14.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
8 changes: 8 additions & 0 deletions registry.json
@@ -134,5 +134,13 @@
"authors": ["ian-cook"],
"description": "Define reusable, named connection configurations in TOML files and use them to connect to databases with ADBC, just like ODBC DSNs.",
"categories": ["Database Connections"]
},
{
"title": "Move Data Between Databases Efficiently with ADBC",
"path": "notebooks/moving-data-between-databases-with-adbc.ipynb",
"date": "2026-04-29",
"authors": ["bryce-mecum"],
"description": "Move data between a Flight SQL database and DuckDB using ADBC and Apache Arrow's RecordBatchReader, without materializing the full dataset in memory.",
"categories": ["Data Loading"]
}
]