@aaronsteers aaronsteers commented Sep 30, 2025

fix(cache): Properly dispose DuckDB connections to prevent file locking

Summary

Resolves issue #807, where PyAirbyte's DuckDB cache did not properly release database connections, preventing external tools from accessing the same database file.

Key Changes:

  • Modified SqlConfig to cache SQLAlchemy engines instead of creating new ones each time, and added dispose_engine() method for proper cleanup
  • Added close() method to CacheBase that disposes all SQLAlchemy engines from processors and backends
  • Implemented context manager protocol (__enter__/__exit__) for clean resource management
  • Added __del__ method for cleanup during garbage collection
  • Added comprehensive tests covering explicit close(), context manager, and garbage collection cleanup scenarios

The fix ensures that when a cache object is cleaned up (via close(), context manager, or garbage collection), all underlying SQLAlchemy engines are properly disposed, releasing file locks on DuckDB databases.
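For illustration, the cleanup lifecycle described above can be sketched as follows. This is a simplified stand-in, not the actual CacheBase implementation; engine disposal is reduced to a flag:

```python
import contextlib


class CacheCleanupSketch:
    """Simplified sketch of the close()/context-manager/__del__ lifecycle."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        # Idempotent: calling twice is safe.
        if self.closed:
            return
        self.closed = True  # real impl: dispose SQLAlchemy engines here

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return None  # falsy: never suppress the caller's exceptions

    def __del__(self) -> None:
        # Cleanup during garbage collection must never raise.
        with contextlib.suppress(Exception):
            self.close()
```

Whichever path fires first (explicit close(), context-manager exit, or GC), the same idempotent close() runs once.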

Review & Testing Checklist for Human

⚠️ Risk Level: Medium - Engine caching behavioral change

  • Test original issue scenario end-to-end: Create a DuckDB cache, write data, clean up, then verify external tools (like uvx harlequin test.duckdb) can access the database file without errors
  • Verify no regressions in existing DuckDB cache functionality: Run existing tests and spot-check that normal cache operations still work correctly
  • Test concurrent cache usage: Verify that engine caching doesn't cause issues when multiple processors or threads use the same cache simultaneously
  • Validate context manager pattern: Test usage as "with DuckDBCache() as cache:" to ensure proper cleanup occurs on exit

Test Plan Recommendation

import airbyte as ab

# 1. Test the original issue is fixed
with ab.caches.DuckDBCache(db_path="test.duckdb") as cache:
    source = ab.get_source("source-faker", config={"count": 100})
    source.select_all_streams()
    source.read(cache=cache)

# 2. Verify external tool can now access the file
# Run: uvx harlequin test.duckdb
# Should open without "Can't open a connection to same database file" error

Notes

  • Engine caching change: SqlConfig.get_sql_engine() now returns cached engines instead of creating new ones each time - this is a significant behavioral change that could affect concurrent usage patterns
  • Error suppression: Cleanup methods use contextlib.suppress(Exception) to prevent cleanup failures from breaking application flow, but this could potentially hide real issues during development
  • Python 3.10 compatibility: Added fallback import for Self type hint to support Python 3.10
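
On the Python 3.10 point: one common shape for such a fallback is a try/except import (note that the linter run below flagged this, and a later commit in this PR standardized on importing from typing_extensions directly to satisfy MyPy). A sketch of the pattern, with a hypothetical Builder class for illustration:

```python
try:
    from typing import Self  # Python 3.11+
except ImportError:  # Python 3.10: typing.Self does not exist yet
    from typing_extensions import Self


class Builder:
    """Hypothetical example: Self keeps chained calls typed correctly."""

    def set_name(self, name: str) -> Self:
        self.name = name
        return self
```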

Link to Devin run: https://app.devin.ai/sessions/9333803942584401a53d0f92dcbafa45
Requested by: @aaronsteers

Summary by CodeRabbit

  • New Features

    • Caches support context-manager usage plus a deterministic close/dispose lifecycle for automatic resource cleanup.
    • SQL engine is cached per configuration with an explicit dispose method to release engine resources.
  • Bug Fixes

    • Prevents lingering database connections/locks after cache operations, improving stability (e.g., DuckDB).
  • Tests

    • Added integration tests verifying cleanup via explicit close, context-manager exit, and garbage-collection disposal.

- Add close() method to CacheBase to dispose all SQLAlchemy engines
- Cache SQLAlchemy engine in SqlConfig and add dispose_engine() method
- Implement context manager protocol (__enter__/__exit__) for clean resource management
- Add __del__ for cleanup when cache is garbage collected
- Add comprehensive tests demonstrating connection cleanup works

Fixes #807

Co-Authored-By: AJ Steers <[email protected]>

Original prompt from AJ Steers
@Devin - Can you try to resolve this issue? <https://github.com/airbytehq/PyAirbyte/issues/807>
Thread URL: https://airbytehq-team.slack.com/archives/D089P0UPVT4/p1759267249353239?thread_ts=1759267249.353239


🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring


👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Testing This PyAirbyte Version

You can test this version of PyAirbyte using the following:

# Run PyAirbyte CLI from this branch:
uvx --from 'git+https://github.com/airbytehq/PyAirbyte.git@devin/1759272656-fix-duckdb-connection-leak' pyairbyte --help

# Install PyAirbyte from this branch for development:
pip install 'git+https://github.com/airbytehq/PyAirbyte.git@devin/1759272656-fix-duckdb-connection-leak'

Helpful Resources

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /fix-pr - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test-pr - Runs tests with the updated PyAirbyte

Community Support

Questions? Join the #pyairbyte channel in our Slack workspace.



coderabbitai bot commented Sep 30, 2025

📝 Walkthrough

Adds lifecycle/resource-management to CacheBase (close, context-manager, destructor), caches and exposes disposal for the SqlConfig SQL engine, declares SqlConfig-typed attributes on backend bases, and adds integration tests verifying DuckDB connection cleanup via close, context manager exit, and GC-triggered del.

Changes

Cohort / File(s) / Summary

  • Cache resource management — airbyte/caches/base.py: Adds close(), __enter__(), __exit__(), and __del__() to CacheBase for idempotent cleanup; imports contextlib, Self, and TracebackType to support context manager and destructor signatures.
  • SQL engine caching & disposal — airbyte/shared/sql_processor.py: Adds a private cached _engine on SqlConfig, refactors get_sql_engine() to lazily create/cache the engine, and adds dispose_engine() to dispose and clear the cached engine.
  • Backend type annotations — airbyte/caches/_catalog_backend.py, airbyte/caches/_state_backend_base.py: Adds class-level _sql_config: SqlConfig type-annotated attributes to CatalogBackendBase and StateBackendBase (TYPE_CHECKING import added); no runtime behavior change.
  • Integration tests for DuckDB cleanup — tests/integration_tests/test_duckdb_cache.py: Adds tests covering cleanup via explicit close(), context manager exit, and __del__/GC; imports duckdb and pytest.
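
The lazy engine cache plus explicit disposal can be sketched dependency-free, using a stub in place of a real SQLAlchemy Engine (names simplified from the PR):

```python
class _StubEngine:
    """Stand-in for a SQLAlchemy Engine (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.disposed = False

    def dispose(self) -> None:
        # A real Engine.dispose() closes all pooled connections,
        # which is what releases the DuckDB file lock.
        self.disposed = True


class SqlConfigSketch:
    """Lazily create and cache one engine; dispose and clear on request."""

    def __init__(self) -> None:
        self._engine: _StubEngine | None = None

    def get_sql_engine(self) -> _StubEngine:
        if self._engine is None:  # create on first access...
            self._engine = _StubEngine()
        return self._engine  # ...then reuse the same instance

    def dispose_engine(self) -> None:
        if self._engine is not None:
            self._engine.dispose()
            self._engine = None  # next access builds a fresh engine
```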

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Test
  participant Cache as CacheBase / DuckDBCache
  participant SQL as SqlConfig
  participant DB as DuckDB Engine

  rect rgb(240,248,255)
    note over Test,Cache: Usage phase
    Test->>Cache: create or use (with ...)
    Cache->>SQL: get_sql_engine()
    alt engine not cached
      SQL->>DB: create_engine(...)
      SQL-->>SQL: cache engine
    else engine cached
      SQL-->>Cache: return cached engine
    end
    Test->>Cache: perform read/write
  end

  rect rgb(245,255,240)
    note over Cache,DB: Cleanup phase
    Test->>Cache: close() / __exit__() / GC -> __del__()
    Cache->>SQL: dispose_engine()
    SQL->>DB: engine.dispose()
    SQL-->>SQL: clear cached engine
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • bleonard — wdyt?

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title Check: ✅ Passed. The title succinctly and clearly conveys the primary change—disposing DuckDB connections to prevent file locking—and follows the conventional commit style without unnecessary detail.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, above the required threshold of 80.00%.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
airbyte/caches/base.py (1)

129-131: Consider checking for disposal method existence.

Currently you're checking if _read_processor exists and is not None, but you're not verifying that sql_config.dispose_engine exists before calling it. While it should always exist based on the SqlConfig base class, defensive programming might suggest checking. Wdyt?

 if hasattr(self, "_read_processor") and self._read_processor is not None:
     with contextlib.suppress(Exception):
-        self._read_processor.sql_config.dispose_engine()
+        if hasattr(self._read_processor.sql_config, "dispose_engine"):
+            self._read_processor.sql_config.dispose_engine()

Similar pattern for lines 134-136.

tests/integration_tests/test_duckdb_cache.py (2)

103-103: The explicit ensure-schema call is redundant but harmless.

Line 103 explicitly calls _ensure_schema_exists(), but line 105's CREATE TABLE IF NOT EXISTS already ensures the schema exists as a side effect of the cache initialization in line 101. The explicit call doesn't hurt, but it's not necessary. Wdyt about removing it for clarity?

     cache = DuckDBCache(db_path=db_path)
 
-    cache.processor._ensure_schema_exists()
-
     cache.processor._execute_sql(
         "CREATE TABLE IF NOT EXISTS test_table (id INTEGER, name VARCHAR)"
     )

86-161: Consider adding a test for concurrent cleanup robustness.

All three tests verify cleanup works in happy-path scenarios. Given the PR mentions "potential concurrent-usage implications," consider adding a test that exercises concurrent access or verifies cleanup is safe when engines are in use. Wdyt?

Example test sketch:

def test_duckdb_cleanup_with_active_connection(tmp_path):
    """Test that cleanup is safe even with active queries."""
    import duckdb
    
    db_path = tmp_path / "test_concurrent.duckdb"
    cache = DuckDBCache(db_path=db_path)
    
    # Get a connection but don't close it yet
    conn = cache.get_sql_engine().connect()
    
    # Close cache while connection is still active
    cache.close()
    
    # Connection should still work (engine.dispose() doesn't force-close)
    # but new connections should be rejected or create fresh engines
    conn.close()
    
    # Verify external access works after cleanup
    ext_conn = duckdb.connect(str(db_path))
    ext_conn.close()
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c64e73d and fe8316d.

📒 Files selected for processing (3)
  • airbyte/caches/base.py (3 hunks)
  • airbyte/shared/sql_processor.py (2 hunks)
  • tests/integration_tests/test_duckdb_cache.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
airbyte/shared/sql_processor.py (5)
airbyte/_processors/sql/snowflake.py (2)
  • get_sql_alchemy_url (134-149)
  • get_sql_alchemy_connect_args (112-116)
airbyte/_processors/sql/bigquery.py (1)
  • get_sql_alchemy_url (65-81)
airbyte/caches/motherduck.py (1)
  • get_sql_alchemy_url (49-63)
airbyte/_processors/sql/postgres.py (1)
  • get_sql_alchemy_url (29-33)
airbyte/_processors/sql/duckdb.py (1)
  • get_sql_alchemy_url (41-50)
tests/integration_tests/test_duckdb_cache.py (3)
airbyte/caches/duckdb.py (1)
  • DuckDBCache (43-54)
airbyte/caches/base.py (2)
  • processor (193-195)
  • close (119-139)
airbyte/shared/sql_processor.py (2)
  • _ensure_schema_exists (519-544)
  • _execute_sql (818-833)
airbyte/caches/base.py (1)
airbyte/shared/sql_processor.py (2)
  • sql_config (385-387)
  • dispose_engine (152-156)
🪛 GitHub Actions: Run Linters
airbyte/caches/base.py

[error] 12-12: Command 'poetry run mypy .' failed: Module 'typing' has no attribute 'Self'. Use 'from typing_extensions import Self' instead.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Pytest (All, Python 3.10, Windows)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.11, Windows)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (No Creds)
🔇 Additional comments (10)
airbyte/caches/base.py (3)

119-140: LGTM! Comprehensive cleanup with proper safety measures.

The close() method properly disposes engines from the read processor, both backends, and the cache itself. Using contextlib.suppress(Exception) ensures cleanup continues even if individual disposal operations fail. The idempotent design allows safe repeated calls.


141-153: LGTM! Context manager protocol correctly implemented.

The context manager implementation follows the standard protocol: __enter__ returns self, and __exit__ delegates cleanup to close(). The __exit__ signature correctly accepts exception info but doesn't suppress exceptions (returns None implicitly), which is the expected behavior.
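
The point about __exit__ not suppressing exceptions can be seen in a tiny standalone example (hypothetical class, for illustration): a falsy return lets any in-flight exception keep propagating while cleanup still runs.

```python
class Managed:
    def __init__(self) -> None:
        self.cleaned_up = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.cleaned_up = True
        return None  # falsy -> the exception (if any) keeps propagating


m = Managed()
try:
    with m:
        raise ValueError("boom")
except ValueError:
    pass  # exception propagated out of the with-block...
assert m.cleaned_up  # ...yet cleanup still ran
```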


154-158: LGTM! Destructor safely handles cleanup during garbage collection.

The __del__ method wraps close() in contextlib.suppress(Exception) to prevent exceptions during interpreter shutdown. This is the correct pattern for cleanup in destructors.

airbyte/shared/sql_processor.py (4)

95-96: LGTM! Private engine cache attribute properly declared.

The _engine attribute is correctly typed as Engine | None and initialized to None, following the pattern for lazy initialization.


138-150: LGTM! Engine caching correctly implemented.

The lazy initialization pattern is clean: create and cache the engine on first access, then return the cached instance on subsequent calls. The engine configuration properly includes schema translation, debug echo, and connect args.


152-156: LGTM! Engine disposal properly releases resources.

The dispose_engine() method correctly calls engine.dispose() to release all connections in the pool, then clears the cache to ensure a fresh engine on next access.


138-150: Consider guarding get_sql_engine initialization with a lock

I didn’t find any direct multi-threaded calls to SqlConfig, but SnowflakeConfig spawns threads via ThreadPoolExecutor. If a shared SqlConfig hits get_sql_engine simultaneously, you could end up creating multiple engines. Could you confirm whether SqlConfig instances are shared across threads and, if so, wrap _engine initialization in a threading.Lock? wdyt?
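
If a lock is warranted, the usual shape is double-checked lazy initialization. A hypothetical sketch (not the PR's code; the stand-in object replaces create_engine(...)):

```python
import threading


class LockedEngineCache:
    """Sketch: double-checked lazy init guarded by a threading.Lock."""

    def __init__(self) -> None:
        self._engine: object | None = None
        self._lock = threading.Lock()

    def get_sql_engine(self) -> object:
        if self._engine is None:  # fast path, no lock on the hot path
            with self._lock:
                if self._engine is None:  # re-check under the lock to avoid
                    # two threads both seeing None and building two engines
                    self._engine = object()  # stands in for create_engine(...)
        return self._engine
```

Concurrent callers then all receive the same engine instance.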

tests/integration_tests/test_duckdb_cache.py (3)

86-120: LGTM! Explicit cleanup test properly verifies connection release.

This test correctly exercises the explicit close() path:

  1. Creates cache and writes data
  2. Calls close() to release connections
  3. Opens new connection to verify data persists and file isn't locked

The schema qualification (main.test_table) is correct for DuckDB.


122-138: LGTM! Context manager test properly verifies exit cleanup.

This test correctly exercises the context manager protocol:

  1. Uses cache within with block
  2. Writes data
  3. Exits context (triggers __exit__, which calls close())
  4. Verifies external connection succeeds

140-161: LGTM! Garbage collection test verifies del cleanup.

This test correctly exercises the __del__ path:

  1. Creates cache in nested function scope (no persistent reference)
  2. Function exits, making cache eligible for GC
  3. Forces garbage collection with gc.collect()
  4. Verifies external connection succeeds

The nested function scope is a good pattern to ensure the cache reference is truly released.
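
The shape of that test can be reduced to a toy example (hypothetical class, for illustration): the only reference lives inside a function frame, so when the frame exits the object becomes collectible and __del__ runs.

```python
import gc


class TrackedCache:
    """Toy object that records when __del__ has run."""

    deleted = False

    def __del__(self) -> None:
        TrackedCache.deleted = True


def use_cache() -> None:
    cache = TrackedCache()  # reference never escapes this frame
    assert cache is not None


use_cache()   # frame exits; in CPython the refcount drops to zero here
gc.collect()  # belt-and-braces: force a collection pass, as the test does
assert TrackedCache.deleted
```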


github-actions bot commented Sep 30, 2025

PyTest Results (Fast Tests Only, No Creds)

304 tests  +3   304 ✅ +3   4m 17s ⏱️ -12s
  1 suites ±0     0 💤 ±0 
  1 files   ±0     0 ❌ ±0 

Results for commit c1bd2f8. ± Comparison against base commit c64e73d.

♻️ This comment has been updated with latest results.


github-actions bot commented Sep 30, 2025

PyTest Results (Full)

367 tests  +3   351 ✅ +3   19m 43s ⏱️ - 6m 8s
  1 suites ±0    16 💤 ±0 
  1 files   ±0     0 ❌ ±0 

Results for commit c1bd2f8. ± Comparison against base commit c64e73d.

♻️ This comment has been updated with latest results.

- Use typing_extensions.Self directly instead of try-except pattern for MyPy compatibility
- Add noqa comment for PLR0904 (too many public methods) since adding close() pushed class over limit
- Let Ruff auto-fix import ordering

Co-Authored-By: AJ Steers <[email protected]>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fe8316d and 7b8979a.

📒 Files selected for processing (1)
  • airbyte/caches/base.py (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
airbyte/caches/base.py (1)
airbyte/shared/sql_processor.py (3)
  • SqlConfig (86-167)
  • sql_config (385-387)
  • dispose_engine (152-156)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Windows)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.11, Windows)
  • GitHub Check: Pytest (No Creds)
  • GitHub Check: Pytest (Fast)

- Remove contextlib.suppress from close() to let callers know about failures
- Keep exception suppression only in __del__ for safe garbage collection
- Add docstring explaining exception propagation behavior

Addresses code review feedback from coderabbitai[bot]

Co-Authored-By: AJ Steers <[email protected]>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
airbyte/caches/base.py (1)

131-134: Guard backend disposal when attributes might be missing.

If construction bails out before _catalog_backend or _state_backend are set, an explicit close() will raise AttributeError, which breaks the “idempotent and safe” contract and leaks the underlying engine until GC kicks in. Could we pivot to getattr(..., None) (or similar) so we only attempt dispose_engine() on objects that actually exist, wdyt?

-        for backend in [self._catalog_backend, self._state_backend]:
-            if backend is not None and hasattr(backend, "_sql_config"):
+        for backend in (
+            getattr(self, "_catalog_backend", None),
+            getattr(self, "_state_backend", None),
+        ):
+            if backend is not None and hasattr(backend, "_sql_config"):
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7b8979a and e4c0290.

📒 Files selected for processing (1)
  • airbyte/caches/base.py (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
airbyte/caches/base.py (2)
airbyte/shared/sql_processor.py (3)
  • SqlConfig (86-167)
  • sql_config (385-387)
  • dispose_engine (152-156)
airbyte/_writers/base.py (1)
  • AirbyteWriterInterface (20-88)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Pytest (All, Python 3.11, Windows)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Windows)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (No Creds)
  • GitHub Check: Pytest (Fast)

- Add _sql_config declaration to CatalogBackendBase and StateBackendBase
- Refactor close() to access backends individually for better type inference
- This makes the code cleaner and more Pythonic while maintaining identical behavior

All tests pass and MyPy/Ruff checks pass.

Co-Authored-By: AJ Steers <[email protected]>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
airbyte/caches/_state_backend_base.py (1)

28-29: Scope _sql_config to SqlStateBackend?
Only SqlStateBackend consumes SqlConfig, so consider moving its declaration (or type annotation) into that subclass—or make it optional on the base if you expect non-SQL implementations. WDYT?

airbyte/caches/_catalog_backend.py (1)

58-59: Consider moving _sql_config to the SQL‐specific subclass
The _sql_config: SqlConfig annotation on CatalogBackendBase currently only applies to SqlCatalogBackend, so relocating it avoids implying that future non‐SQL backends must provide this field. WDYT?

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e4c0290 and c1bd2f8.

📒 Files selected for processing (3)
  • airbyte/caches/_catalog_backend.py (1 hunks)
  • airbyte/caches/_state_backend_base.py (2 hunks)
  • airbyte/caches/base.py (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • airbyte/caches/base.py
🧰 Additional context used
🧬 Code graph analysis (2)
airbyte/caches/_catalog_backend.py (1)
airbyte/shared/sql_processor.py (1)
  • SqlConfig (86-167)
airbyte/caches/_state_backend_base.py (1)
airbyte/shared/sql_processor.py (1)
  • SqlConfig (86-167)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (No Creds)
  • GitHub Check: Pytest (All, Python 3.11, Windows)
  • GitHub Check: Pytest (All, Python 3.10, Windows)

@aaronsteers aaronsteers merged commit f5f0f47 into main Oct 1, 2025
22 checks passed
@aaronsteers aaronsteers deleted the devin/1759272656-fix-duckdb-connection-leak branch October 1, 2025 01:22