@aaronsteers aaronsteers commented Jul 31, 2025

Summary by CodeRabbit

  • New Features

    • Added the ability to list cached datasets and view metadata about the default cache.
    • Introduced support for running safe, read-only SQL queries against the default cache, with detailed error reporting.
    • Enabled running SQL queries with record limits directly on caches.
  • Bug Fixes

    • Improved resource management to ensure caches are properly closed after use.

Important

Auto-merge enabled.

This PR is set to merge automatically when all requirements are met.

@Copilot Copilot AI review requested due to automatic review settings July 31, 2025 19:02

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Testing This PyAirbyte Version

You can test this version of PyAirbyte using the following:

# Run PyAirbyte CLI from this branch:
uvx --from 'git+https://github.com/airbytehq/PyAirbyte.git@aj/feat/add-cache-ops-to-mcp' pyairbyte --help

# Install PyAirbyte from this branch for development:
pip install 'git+https://github.com/airbytehq/PyAirbyte.git@aj/feat/add-cache-ops-to-mcp'

Helpful Resources

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /fix-pr - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test-pr - Runs tests with the updated PyAirbyte

Community Support

Questions? Join the #pyairbyte channel in our Slack workspace.

@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds new MCP (Model Context Protocol) tools for working with cached data in the Airbyte system. The changes introduce functionality to inspect and describe cached datasets stored in the default DuckDB cache.

  • Adds two new functions to query cached data: list_cached_datasets() and describe_default_cache()
  • Introduces a CachedDatasetInfo data model to structure cached dataset information
  • Registers the new tools with the FastMCP application and reorders the tool registration list
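
As a rough illustration of the shape described in the bullets above, here is a minimal sketch. It uses a standard-library dataclass as a stand-in for the Pydantic model, and the field names (`stream_name`, `table_name`) and the `stream_to_table` argument are assumptions made for this example, not details taken from the PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachedDatasetInfo:
    """Stand-in for the PR's Pydantic model; both fields are assumed."""

    stream_name: str
    table_name: str


def list_cached_datasets(stream_to_table: dict[str, str]) -> list[CachedDatasetInfo]:
    """Build one info record per cached stream.

    `stream_to_table` is a hypothetical mapping that stands in for whatever
    the real cache object exposes.
    """
    return [
        CachedDatasetInfo(stream_name=name, table_name=table)
        for name, table in stream_to_table.items()
    ]


datasets = list_cached_datasets({"users": "users", "orders": "orders"})
```

A structured record per dataset, rather than a bare list of names, leaves room to add more metadata later without changing the tool's return type.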

coderabbitai bot commented Jul 31, 2025

Walkthrough

This change introduces new cache inspection and SQL query execution tools to the local operations module, including safe SQL validation and detailed error reporting. It augments the cache base class with a method for executing SQL queries and updates cache management practices to ensure proper closure. Minor test cleanup is also performed.
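
To make the idea of "safe SQL validation" concrete, here is one possible heuristic. It is only a sketch: the function name `is_probably_read_only`, the allowed prefixes, and the blocked keywords are all assumptions for illustration and may differ from what `_is_safe_sql` actually does in the PR.

```python
import re

# Statement types treated as read-only in this sketch (an assumption).
_READ_ONLY_PREFIXES = ("select", "with", "show", "describe", "explain")

# Keywords that suggest a write or schema change (also an assumption).
_WRITE_KEYWORDS = re.compile(
    r"\b(insert|update|delete|drop|alter|create|truncate|attach|copy)\b",
    re.IGNORECASE,
)


def is_probably_read_only(sql: str) -> bool:
    """Conservatively decide whether `sql` looks like one read-only statement."""
    statement = sql.strip().rstrip(";").strip()
    if not statement or ";" in statement:
        # Empty input, or more than one statement: reject.
        return False
    if not statement.lower().startswith(_READ_ONLY_PREFIXES):
        return False
    # Reject anything that mentions a write keyword, even inside a CTE.
    return _WRITE_KEYWORDS.search(statement) is None
```

This is a keyword check, not a SQL parser, so it errs on the side of rejecting queries (for example, a SELECT whose string literal contains the word "drop"). A check like this is best paired with a read-only database connection where the engine supports one.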

Changes

| Cohort / File(s) | Change Summary |
|---|---|
| Local Ops Enhancements (airbyte/mcp/_local_ops.py) | Adds CachedDatasetInfo Pydantic model; introduces list_cached_streams, describe_default_cache, _is_safe_sql, and run_sql_query functions; updates sync_source_to_cache for explicit cache closure; modifies register_local_ops_tools to register new tools and reorder registration. |
| Cache Base SQL Support (airbyte/caches/base.py) | Adds run_sql_query method to CacheBase for executing SQL queries with error handling and result limiting. |
| Test Cleanup (tests/integration_tests/test_duckdb_cache.py) | Adds a blank line after imports; no functional changes. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant FastMCP App
    participant LocalOps
    participant CacheBase

    User->>FastMCP App: Call run_sql_query(sql_query, max_records)
    FastMCP App->>LocalOps: run_sql_query(sql_query, max_records)
    LocalOps->>CacheBase: run_sql_query(sql_query, max_records)
    CacheBase-->>LocalOps: Query results or error
    LocalOps-->>FastMCP App: Results or error details
    FastMCP App-->>User: Results or error details

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Would you like me to elaborate with some examples of the SQL validation rules or error handling scenarios, or does this overview provide enough clarity for your review, wdyt?


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 57679e7 and e7e8af9.

📒 Files selected for processing (2)
  • airbyte/mcp/_local_ops.py (3 hunks)
  • tests/integration_tests/test_duckdb_cache.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • tests/integration_tests/test_duckdb_cache.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • airbyte/mcp/_local_ops.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Pytest (All, Python 3.11, Windows)
  • GitHub Check: Pytest (All, Python 3.10, Windows)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
airbyte/mcp/_local_ops.py (2)

281-290: Clean Pydantic model design!

Love the simple, focused approach with room for future expansion. The TODO comment needs an issue link to satisfy the linter though - could you add one? Something like # TODO(#123): add later: wdyt?


301-309: Great cache metadata function!

Really useful for debugging and inspection. The implementation is solid. One tiny thought - would it be worth making the return type a bit more specific with a TypedDict, or is the flexibility of dict[str, Any] preferred here? Either way works, just curious about your preference!
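
For reference, the two return-type options weighed in this comment could look roughly like the sketch below. The key names (`cache_type`, `schema_name`, `stream_names`) and the sample values are hypothetical and are not taken from the PR.

```python
from typing import Any, TypedDict


class CacheDescription(TypedDict):
    """Hypothetical shape; the keys are illustrative, not taken from the PR."""

    cache_type: str
    schema_name: str
    stream_names: list[str]


def describe_cache_typed() -> CacheDescription:
    # A type checker can verify that every key is present and correctly typed.
    return {
        "cache_type": "DuckDBCache",
        "schema_name": "main",
        "stream_names": ["users"],
    }


def describe_cache_loose() -> dict[str, Any]:
    # Same runtime value; the checker knows nothing about the keys.
    return {
        "cache_type": "DuckDBCache",
        "schema_name": "main",
        "stream_names": ["users"],
    }
```

Both functions return a plain dict at runtime, so the choice only affects static checking and editor completion.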

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bfed5ef and f74d5fe.

📒 Files selected for processing (1)
  • airbyte/mcp/_local_ops.py (2 hunks)
🧰 Additional context used
🪛 GitHub Actions: Run Linters
airbyte/mcp/_local_ops.py

[warning] 15-15: Ruff TC001: Move application import airbyte.caches.duckdb.DuckDBCache into a type-checking block.


[warning] 286-286: Ruff TD003: Missing issue link on the line following this TODO.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Pytest (All, Python 3.11, Windows)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Windows)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (No Creds)
🔇 Additional comments (4)
airbyte/mcp/_local_ops.py (4)

12-12: LGTM on the BaseModel import!

This import is necessary for the new CachedDatasetInfo model. Nice and clean!


15-15: Import looks correct, but let's double-check the linter warning?

The pipeline shows a warning about moving this import to a TYPE_CHECKING block, but you're actually using DuckDBCache at runtime in list_cached_datasets(). The linter might be confused here - wdyt? Should we keep it as is since it's used at runtime?
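
The distinction at stake can be shown with a small standard-library sketch. `Fraction` stands in here for `DuckDBCache`, and the function names are made up for the example: a name imported only under `TYPE_CHECKING` is available to annotations but is not bound at runtime.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by type checkers; this import never executes at runtime.
    from fractions import Fraction


def annotate_only(value: "Fraction") -> str:
    # Fine: the quoted annotation is never evaluated at runtime.
    return str(value)


def construct_at_runtime() -> "Fraction":
    # Fails when called: the name `Fraction` was never bound at runtime.
    return Fraction(1, 2)


try:
    construct_at_runtime()
    runtime_use_ok = True
except NameError:
    runtime_use_ok = False
```

If the class is instantiated or otherwise referenced in executed code, the import has to stay outside the `TYPE_CHECKING` block, which is the situation this comment describes.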


292-298: Excellent implementation of list_cached_datasets!

Clean list comprehension and proper use of the cache interface. The function design aligns perfectly with the PR objectives for cached data tools. Well done!


316-322: Perfect tool registration updates!

Both new cache tools are properly registered and the reordering looks intentional. The registration follows the existing pattern nicely - everything looks great here!

github-actions bot commented Jul 31, 2025

PyTest Results (Fast Tests Only, No Creds)

256 tests  ±0   256 ✅ ±0   3m 38s ⏱️ +18s
  1 suites ±0     0 💤 ±0 
  1 files   ±0     0 ❌ ±0 

Results for commit e7e8af9. ± Comparison against base commit bfed5ef.

♻️ This comment has been updated with latest results.

github-actions bot commented Jul 31, 2025

PyTest Results (Full)

318 tests  ±0   304 ✅ ±0   17m 32s ⏱️ + 1m 15s
  1 suites ±0    14 💤 ±0 
  1 files   ±0     0 ❌ ±0 

Results for commit e7e8af9. ± Comparison against base commit bfed5ef.

♻️ This comment has been updated with latest results.

@aaronsteers aaronsteers enabled auto-merge (squash) July 31, 2025 20:40

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
airbyte/caches/base.py (1)

156-202: Solid SQL query execution with good error handling!

I really like the implementation here - proper connection context management, specific SQLAlchemy exception handling, and the distinction between row-returning vs non-row-returning queries. The use of strict=True in the zip is a nice safety touch too.

One small question: would it be worth adding a docstring note about which types of SQL statements this is intended for vs when to use the processor directly, wdyt? The current docstring mentions "single DML statements" but some users might not be familiar with that terminology.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f74d5fe and 05bcdc3.

📒 Files selected for processing (4)
  • airbyte/caches/base.py (3 hunks)
  • airbyte/mcp/_local_ops.py (2 hunks)
  • airbyte/shared/sql_processor.py (1 hunks)
  • tests/integration_tests/test_duckdb_cache.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
  • airbyte/shared/sql_processor.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • airbyte/mcp/_local_ops.py
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: in `examples/run_perf_test_reads.py`, the code for setting up snowflake configuration in `get_cache`...
Learnt from: aaronsteers
PR: airbytehq/PyAirbyte#415
File: examples/run_perf_test_reads.py:117-127
Timestamp: 2024-10-09T19:21:45.994Z
Learning: In `examples/run_perf_test_reads.py`, the code for setting up Snowflake configuration in `get_cache` and `get_destination` cannot be refactored into a shared helper function because there are differences between them.

Applied to files:

  • tests/integration_tests/test_duckdb_cache.py
📚 Learning: test fixtures in the pyairbyte project do not need to align with real docker repositories....
Learnt from: aaronsteers
PR: airbytehq/PyAirbyte#347
File: tests/integration_tests/fixtures/registry.json:48-48
Timestamp: 2024-08-31T01:20:08.405Z
Learning: Test fixtures in the PyAirbyte project do not need to align with real Docker repositories.

Applied to files:

  • tests/integration_tests/test_duckdb_cache.py
📚 Learning: the `bigquerycache.get_arrow_dataset` method should have a docstring that correctly states the reaso...
Learnt from: aaronsteers
PR: airbytehq/PyAirbyte#281
File: airbyte/caches/bigquery.py:40-43
Timestamp: 2024-10-08T15:34:31.026Z
Learning: The `BigQueryCache.get_arrow_dataset` method should have a docstring that correctly states the reason for the `NotImplementedError` as BigQuery not supporting `to_arrow`, instead of incorrectly mentioning `pd.read_sql_table`.

Applied to files:

  • airbyte/caches/base.py
🧬 Code Graph Analysis (1)
tests/integration_tests/test_duckdb_cache.py (5)
airbyte/_util/venv_util.py (1)
  • get_bin_dir (15-20)
airbyte/caches/util.py (1)
  • new_local_cache (39-77)
tests/integration_tests/test_source_faker_integration.py (1)
  • duckdb_cache (87-100)
airbyte/caches/base.py (2)
  • close (148-154)
  • processor (144-146)
airbyte/shared/sql_processor.py (1)
  • close (778-789)
🪛 GitHub Actions: Run Linters
airbyte/caches/base.py

[error] 1-1: Ruff formatting check failed. File would be reformatted. Run 'ruff format' to fix code style issues.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Windows)
  • GitHub Check: Pytest (All, Python 3.11, Windows)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (No Creds)
🔇 Additional comments (3)
tests/integration_tests/test_duckdb_cache.py (2)

17-18: LGTM on the import reordering!

Nice organizational improvement moving the airbyte import below pytest.


86-95: Excellent test coverage for the new close functionality!

I like how this test verifies both the cache-level close() method and the underlying processor's close() method, plus checks that multiple calls are idempotent. This gives good confidence that the resource cleanup works as expected. The test is simple but thorough - well done!

airbyte/caches/base.py (1)

148-154: Clean and simple delegation pattern!

The close() method provides a nice convenience interface at the cache level while properly delegating to the underlying processor. This aligns well with the test coverage I saw in the other file.
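
A minimal sketch of the delegation and idempotency described in these comments follows. The `Processor` and `Cache` classes are hypothetical stand-ins, not the PR's `CacheBase` or SQL processor.

```python
class Processor:
    """Hypothetical stand-in for the SQL processor that owns the connection."""

    def __init__(self) -> None:
        self.is_open = True
        self.close_calls = 0

    def close(self) -> None:
        # Safe to call repeatedly: closing an already-closed processor
        # does nothing beyond incrementing the call counter.
        self.close_calls += 1
        self.is_open = False


class Cache:
    """Hypothetical stand-in for a cache that delegates cleanup to its processor."""

    def __init__(self) -> None:
        self.processor = Processor()

    def close(self) -> None:
        self.processor.close()


cache = Cache()
try:
    pass  # Work with the cache here.
finally:
    cache.close()

cache.close()  # A second call must not raise.
cache.processor.close()  # Closing the processor directly is also fine.
```

Calling `close()` in a `finally` block is one way to get the "properly closed after use" behavior mentioned in the PR summary, and idempotency means a caller never has to track whether cleanup already ran.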

aaronsteers commented Jul 31, 2025

/fix-pr

Auto-Fix Job Info

This job attempts to auto-fix any linting or formatting issues. If any fixes are made,
those changes will be automatically committed and pushed back to the PR.
(This job requires that the PR author has "Allow edits from maintainers" enabled.)

PR auto-fix job started... Check job output.

✅ Changes applied successfully.

aaronsteers commented Jul 31, 2025

/fix-pr

Auto-Fix Job Info

This job attempts to auto-fix any linting or formatting issues. If any fixes are made,
those changes will be automatically committed and pushed back to the PR.
(This job requires that the PR author has "Allow edits from maintainers" enabled.)

PR auto-fix job started... Check job output.

🟦 Job completed successfully (no changes).

@aaronsteers aaronsteers merged commit 8629d3e into main Jul 31, 2025
22 checks passed
@aaronsteers aaronsteers deleted the aj/feat/add-cache-ops-to-mcp branch July 31, 2025 21:54