Skip to content

DuckDB Cache Does Not Release Database Connections #807

@devin-ai-integration

Description

@devin-ai-integration

DuckDB Cache Does Not Release Database Connections

Description

PyAirbyte's DuckDBCache maintains an open connection to the DuckDB database file after source.read() completes, preventing other tools and processes from accessing the same database file.

Impact

This creates a significant limitation when using DuckDB's file-based databases:

  1. Connection Conflicts: DuckDB does not allow multiple connections to the same database file with different configurations, resulting in ConnectionException: Can't open a connection to same database file with a different configuration
  2. Blocked Downstream Access: After PyAirbyte writes cached data, other tools (Ibis, DuckDB CLI, pandas, etc.) cannot read the cached data without encountering connection errors
  3. Workflow Limitations: Users cannot easily inspect cached data or integrate PyAirbyte's cache with other data processing tools in the same workflow

Reproduction Steps

Here's a test workflow to reproduce and validate fixes:

  1. Test script reads from source-faker into DuckDB cache

    import airbyte as ab
    
    source = ab.get_source("source-faker", config={"count": 100})
    cache = ab.caches.DuckDBCache(db_path="test.duckdb")
    source.select_all_streams()
    result = source.read(cache=cache)
  2. Test script attempts to close the connection (doesn't work as of now)

    # Currently no way to explicitly close connection
    del result
    del cache
    # Connection still held open
  3. Test script uses uvx harlequin to start a new process to read from the DB

    uvx harlequin test.duckdb

    Expected: Harlequin successfully connects and displays cached data
    Actual: Connection fails with ConnectionException

The test passes when step 3 succeeds without connection errors.

Expected Behavior

The cache should release database connections after source.read() completes, allowing other processes to access the cached data. Ideally:

  • Connections should be closed when the cache object is deleted or goes out of scope
  • Users should have an explicit method to close cache connections (e.g., cache.close())
  • Context manager support (e.g., with ab.caches.DuckDBCache(...) as cache:) would provide clean connection lifecycle management

Current Workaround

Currently, there's no clean way to release the connection without terminating the Python process entirely.

Why This Matters for DuckDB Specifically

Unlike traditional client-server databases, DuckDB's file-based architecture requires careful connection management. The database file is locked by the first connection, and subsequent connections must use compatible configurations. This design optimizes for single-process analytical workloads but requires proper connection lifecycle management in multi-tool scenarios.

Context

Reported by @aaronsteers while building a Kedro+Ibis integration POC that uses PyAirbyte for data extraction and Ibis for transformations on the same DuckDB database.

Related Code

  • airbyte/caches/duckdb.py - DuckDBCache implementation
  • airbyte/_processors/sql/duckdb.py - DuckDBSqlProcessor connection management

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions