-
Notifications
You must be signed in to change notification settings - Fork 67
Description
DuckDB Cache Does Not Release Database Connections
Description
PyAirbyte's DuckDBCache
maintains an open connection to the DuckDB database file after source.read()
completes, preventing other tools and processes from accessing the same database file.
Impact
This creates a significant limitation when using DuckDB's file-based databases:
- Connection Conflicts: DuckDB does not allow multiple connections to the same database file with different configurations, resulting in
ConnectionException: Can't open a connection to same database file with a different configuration
- Blocked Downstream Access: After PyAirbyte writes cached data, other tools (Ibis, DuckDB CLI, pandas, etc.) cannot read the cached data without encountering connection errors
- Workflow Limitations: Users cannot easily inspect cached data or integrate PyAirbyte's cache with other data processing tools in the same workflow
Reproduction Steps
Here's a test workflow to reproduce and validate fixes:
-
Test script reads from source-faker into DuckDB cache
import airbyte as ab source = ab.get_source("source-faker", config={"count": 100}) cache = ab.caches.DuckDBCache(db_path="test.duckdb") source.select_all_streams() result = source.read(cache=cache)
-
Test script attempts to close the connection (doesn't work as of now)
# Currently no way to explicitly close connection del result del cache # Connection still held open
-
Test script uses
uvx harlequin
to start a new process to read from the DBuvx harlequin test.duckdb
Expected: Harlequin successfully connects and displays cached data
Actual: Connection fails withConnectionException
The test passes when step 3 succeeds without connection errors.
Expected Behavior
The cache should release database connections after source.read()
completes, allowing other processes to access the cached data. Ideally:
- Connections should be closed when the cache object is deleted or goes out of scope
- Users should have an explicit method to close cache connections (e.g.,
cache.close()
) - Context manager support (e.g.,
with ab.caches.DuckDBCache(...) as cache:
) would provide clean connection lifecycle management
Current Workaround
Currently, there's no clean way to release the connection without terminating the Python process entirely.
Why This Matters for DuckDB Specifically
Unlike traditional client-server databases, DuckDB's file-based architecture requires careful connection management. The database file is locked by the first connection, and subsequent connections must use compatible configurations. This design optimizes for single-process analytical workloads but requires proper connection lifecycle management in multi-tool scenarios.
Context
Reported by @aaronsteers while building a Kedro+Ibis integration POC that uses PyAirbyte for data extraction and Ibis for transformations on the same DuckDB database.
Related Code
airbyte/caches/duckdb.py
- DuckDBCache implementationairbyte/_processors/sql/duckdb.py
- DuckDBSqlProcessor connection management