Skip to content

Conversation

@aaronsteers
Copy link
Contributor

@aaronsteers aaronsteers commented Nov 12, 2025

feat(registry): Add KeyError validation for externalDocumentationUrls

Summary

Adds validation to raise clear PyAirbyteInputError exceptions when externalDocumentationUrls entries are missing required title or url fields, instead of allowing KeyError to propagate. This applies to both manifest.yaml parsing and registry metadata parsing.

Changes:

  • Modified ApiDocsUrl.from_manifest_dict() to wrap dict access in try-except and raise PyAirbyteInputError with clear message
  • Modified _extract_docs_from_registry() with same validation pattern
  • Added 2 unit tests verifying exception behavior for missing fields
  • Updated docstrings to document the new exception

Related PRs:

Review & Testing Checklist for Human

This is a yellow risk change (behavior change with potential production impact):

  • Verify this won't break existing connectors: Check if any connectors in production have malformed externalDocumentationUrls metadata (missing title/url fields). This change will cause those to fail instead of being silently skipped.
  • Test with source-faker: Manually test get_api_docs_urls("source-faker") to ensure it still works correctly with valid metadata
  • Consider error message clarity: The error messages are simple ("missing required field 'title'") - verify this provides enough context for debugging. May want to include connector name or manifest URL in future iteration.

Notes

Summary by CodeRabbit

  • New Features

    • Documentation ingestion now reads connector metadata (metadata.yaml) in addition to registry entries.
  • Bug Fixes

    • Stronger validation with clearer errors for missing documentation fields.
    • Connector presence is validated before fetching docs; registry and metadata docs are returned together (duplicates may appear).
  • Tests

    • Added tests ensuring errors for incomplete or malformed metadata.
  • Chores

    • Renamed public function for retrieving connector documentation URLs for clearer intent.

- Add validation in ApiDocsUrl.from_manifest_dict to raise PyAirbyteInputError when 'title' or 'url' fields are missing
- Add validation in _extract_docs_from_registry for registry entries
- Add unit tests for missing field scenarios
- Error messages clearly indicate which required field is missing

Co-Authored-By: AJ Steers <[email protected]>
@devin-ai-integration
Copy link
Contributor

Original prompt from AJ Steers
Received message in Slack channel #ask-devin-ai:

@Devin - In the PyAirbyte MCP server we need a new tool called 'Get API Docs URLs' in the registry domain:

For a request regarding specific a connector id (source-foo or destination-foo), the tool should return a set of docs urls relevant to the connector's upstream API. In place of a canonical connector name the tool should also accept a string representing the API name (i.e. should accept "source-facebook-marketing" OR "Facebook Marketing API").

If a canonical connector ID is provided, the tool should return any links stored in the connector's manifest.yaml file, pulling from the <http://connectors.airbyte.com|connectors.airbyte.com> registry endpoint. (You may need to search a bit for the exact url patterns.)
1. I don't think this exists yet in the metadata.yaml files, so please plan for the docs list to exist in this format roughly:
2. In `data.docs` entry, we'll have map of docs title or friendly name to the url. (Search for any precedent that already exists for providing docs urls in manifest.yaml - and propose an approach you think makes sense. I care less about a canonical "type" of docs URL, so long as there is space for both a title and url. E.g. a yaml entry could be "- Shopify API Deprecations Page": <https://docs.shopify.com/api/something/something>, or else the same info split out as - {title: "...", url: "...", type: "api-deprecation-docs"})
Let's also make a plan for a parallel tool called "Find API Docs URLs" - and for this we may need to use a websearch and we _might_ need to either inject an MCP prompt or else leverage LLM Sampling: <https://gofastmcp.com/clients/sampling>. Sampling is helpful (I think) in cases where the MCP server doesn't have its own API key for LLM processing. I don't have experience with that feature in MCP servers, so we'll need to evaluate carefully if this approach makes sense.

Our approach will be in two phases: the first phase will do the first tool. You will need limi... (563 chars truncated...)

@devin-ai-integration
Copy link
Contributor

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@github-actions
Copy link

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Testing This PyAirbyte Version

You can test this version of PyAirbyte using the following:

# Run PyAirbyte CLI from this branch:
uvx --from 'git+https://github.com/airbytehq/PyAirbyte.git@devin/1762978720-add-keyerror-validation' pyairbyte --help

# Install PyAirbyte from this branch for development:
pip install 'git+https://github.com/airbytehq/PyAirbyte.git@devin/1762978720-add-keyerror-validation'

Helpful Resources

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /fix-pr - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test-pr - Runs tests with the updated PyAirbyte

Community Support

Questions? Join the #pyairbyte channel in our Slack workspace.

📝 Edit this welcome message.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 12, 2025

📝 Walkthrough

Walkthrough

Replaces manifest-based ingestion with metadata-based ingestion: adds metadata.yaml URL fetching and parsing, new ApiDocsUrl.from_metadata_* helpers, stricter validation that raises PyAirbyteInternalError for missing required fields, renames public API to get_connector_docs_urls, and updates tests and URL helpers accordingly.

Changes

Cohort / File(s) Summary
Registry: metadata docs parsing & validation
airbyte/registry.py
Replaces manifest-based logic with metadata-based endpoints (_DEFAULT_METADATA_URL, _metadata_url_for, _fetch_metadata_dict); adds ApiDocsUrl.from_metadata_docs_list and ApiDocsUrl.from_metadata_dict; validates required metadata fields and raises PyAirbyteInternalError on parse errors; registry-origin docs marked with doc_type="internal"; combined results returned without deduplication.
Public API wrapper / rename
airbyte/mcp/connector_registry.py
Public function renamed from get_api_docs_urlsget_connector_docs_urls; now delegates to airbyte.registry.get_connector_docs_urls (aliased internally); updated docstrings and tests targets; URL helper names updated to metadata variants.
Tests: metadata model + API rename
tests/unit_tests/test_mcp_connector_registry.py
Tests updated to use metadata fixture shapes and new function names (get_connector_docs_urls, _metadata_url_for, _fetch_metadata_dict, from_metadata_dict); added negative tests asserting PyAirbyteInternalError when title or url are missing; adapted expectations for metadata.yaml paths and externalDocumentationUrls keys.

Sequence Diagram(s)

sequenceDiagram
  participant Caller as Caller
  participant MCP as mcp.connector_registry
  participant Registry as airbyte.registry
  participant ApiDocs as ApiDocsUrl

  Caller->>MCP: get_connector_docs_urls(connector_name)
  MCP->>Registry: _get_connector_docs_urls(connector_name)
  Registry->>Registry: check connector exists (dockers)
  alt connector not found
    Registry-->>Caller: raise AirbyteConnectorNotRegisteredError
  else connector found
    Registry->>Registry: _metadata_url_for(connector_name)
    Registry->>Registry: _fetch_metadata_dict(metadata_url)
    Registry->>Registry: ApiDocsUrl.from_metadata_dict(metadata_dict)
    Registry->>ApiDocs: ApiDocsUrl.from_metadata_docs_list(docs, source, context="Registry")
    alt missing required fields
      ApiDocs-->>Registry: raise PyAirbyteInternalError
      Registry-->>Caller: propagate PyAirbyteInternalError
    else valid metadata
      ApiDocs-->>Registry: return list[ApiDocsUrl] (registry entries tagged doc_type="internal")
      Registry-->>Caller: return combined list[ApiDocsUrl]
    end
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Mixed new parsing logic and validation, public API rename, and test updates.
  • Pay special attention to:
    • airbyte/registry.py — metadata URL construction, 404 behavior in _fetch_metadata_dict, and consistent raising/typing of PyAirbyteInternalError.
    • ApiDocsUrl.from_metadata_* implementations — required-field checks and error messages.
    • airbyte/mcp/connector_registry.py — correct delegation to the new registry function and preserved error semantics for missing connectors.
    • tests/unit_tests/test_mcp_connector_registry.py — negative tests precision and fixtures matching actual metadata schema.

Would you like a short deprecation shim for the old get_api_docs_urls that calls get_connector_docs_urls with a warning to ease migration, wdyt?

Possibly related PRs

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Title check ⚠️ Warning The PR title claims to add KeyError validation, but the actual changes reveal a major refactor replacing manifest-based ingestion with metadata-based ingestion and renaming the public API from get_connector_api_docs_urls to get_connector_docs_urls. Update the title to reflect the primary change (e.g., 'refactor(registry): Migrate from manifest to metadata-based documentation ingestion') or clarify if this PR should focus solely on validation. The current title undersells the scope.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch devin/1762978720-add-keyerror-validation

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link

github-actions bot commented Nov 12, 2025

PyTest Results (Fast Tests Only, No Creds)

321 tests  +1   321 ✅ +1   5m 54s ⏱️ ±0s
  1 suites ±0     0 💤 ±0 
  1 files   ±0     0 ❌ ±0 

Results for commit 8af8085. ± Comparison against base commit 7eb746b.

This pull request removes 8 and adds 9 tests. Note that renamed tests count towards both.
tests.unit_tests.test_mcp_connector_registry.TestApiDocsUrlFromManifestDict ‑ test_empty_manifest
tests.unit_tests.test_mcp_connector_registry.TestApiDocsUrlFromManifestDict ‑ test_manifest_with_external_docs_no_type
tests.unit_tests.test_mcp_connector_registry.TestApiDocsUrlFromManifestDict ‑ test_manifest_with_external_docs_urls
tests.unit_tests.test_mcp_connector_registry.TestFetchManifestDict ‑ test_fetch_manifest_dict
tests.unit_tests.test_mcp_connector_registry.TestFetchManifestDict ‑ test_manifest_not_found
tests.unit_tests.test_mcp_connector_registry.TestGetApiDocsUrls ‑ test_connector_not_found
tests.unit_tests.test_mcp_connector_registry.TestGetApiDocsUrls ‑ test_deduplication_of_urls
tests.unit_tests.test_mcp_connector_registry.TestManifestUrlFor ‑ test_manifest_url_for
tests.unit_tests.test_mcp_connector_registry.TestApiDocsUrlFromMetadataDict ‑ test_empty_metadata
tests.unit_tests.test_mcp_connector_registry.TestApiDocsUrlFromMetadataDict ‑ test_metadata_missing_title_raises_error
tests.unit_tests.test_mcp_connector_registry.TestApiDocsUrlFromMetadataDict ‑ test_metadata_missing_url_raises_error
tests.unit_tests.test_mcp_connector_registry.TestApiDocsUrlFromMetadataDict ‑ test_metadata_with_external_docs_no_type
tests.unit_tests.test_mcp_connector_registry.TestApiDocsUrlFromMetadataDict ‑ test_metadata_with_external_docs_urls
tests.unit_tests.test_mcp_connector_registry.TestFetchMetadataDict ‑ test_fetch_metadata_dict
tests.unit_tests.test_mcp_connector_registry.TestFetchMetadataDict ‑ test_metadata_not_found
tests.unit_tests.test_mcp_connector_registry.TestGetConnectorDocsUrls ‑ test_connector_not_found
tests.unit_tests.test_mcp_connector_registry.TestMetadataUrlFor ‑ test_metadata_url_for

♻️ This comment has been updated with latest results.

- Rename get_connector_api_docs_urls to get_connector_docs_urls in registry.py
- Rename MCP tool from get_api_docs_urls to get_connector_docs_urls
- Update internal Airbyte docs to use doc_type='internal' instead of default 'other'
- Update all tests to use new function names
- This is a breaking change but acceptable since the feature was just released today

Co-Authored-By: AJ Steers <[email protected]>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
airbyte/registry.py (1)

416-423: Add test coverage for the new doc_type="internal" value.

The change at line 421 introduces a new doc_type="internal" value for registry documentationUrl entries. This semantic distinction makes sense (internal Airbyte docs vs. external API docs), but the code path at lines 416-423 lacks test coverage.

Currently, tests verify other doc_type values like "api_reference", "api_release_history", and "api_deprecations" in TestApiDocsUrlFromManifestDict, but the registry documentationUrl path is untested. Consider adding a test case to TestApiDocsUrlFromManifestDict that asserts doc_type=="internal" when ApiDocsUrl is constructed with the registry source—or verify if this path is tested indirectly elsewhere. Wdyt?

🧹 Nitpick comments (2)
airbyte/registry.py (2)

333-347: Solid validation implementation!

The try/except pattern correctly catches KeyError on required fields and raises a clear PyAirbyteInputError. The exception chaining with from e preserves the stack trace nicely.

One thought: the error message could be even more helpful with additional context like f"Manifest parsing error for {manifest_data.get('metadata', {}).get('name', 'unknown')}: missing required field {e}". But I see from the PR objectives that improving error messages is noted for future iteration, so this is totally fine as-is. wdyt about deferring that enhancement?


428-442: Validation logic mirrors manifest parsing nicely!

The try/except pattern here is consistent with the manifest parsing logic (lines 333-347), which is great for maintainability.

However, I noticed from the PR objectives that "registry parsing path uses the same validation pattern but isn't explicitly covered by new tests." The new tests (test_manifest_missing_title_raises_error, test_manifest_missing_url_raises_error) only cover the manifest path via ApiDocsUrl.from_manifest_dict().

Should we add equivalent tests for _extract_docs_from_registry() to ensure both validation paths are covered? wdyt?

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9cf2b90 and 12de9c8.

📒 Files selected for processing (3)
  • airbyte/mcp/connector_registry.py (3 hunks)
  • airbyte/registry.py (3 hunks)
  • tests/unit_tests/test_mcp_connector_registry.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/unit_tests/test_mcp_connector_registry.py
🧰 Additional context used
🧬 Code graph analysis (2)
airbyte/registry.py (2)
airbyte/exceptions.py (1)
  • PyAirbyteInputError (201-210)
airbyte/mcp/connector_registry.py (1)
  • get_connector_docs_urls (183-203)
airbyte/mcp/connector_registry.py (1)
airbyte/registry.py (1)
  • get_connector_docs_urls (447-489)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Pytest (All, Python 3.11, Windows)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Windows)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (5)
airbyte/mcp/connector_registry.py (3)

27-27: Clean import aliasing!

The underscore-prefixed alias makes it clear this is for internal routing. Nice pattern for keeping the MCP tool layer independent from registry naming.


183-199: LGTM on the rename and docstring updates!

The shift from "API documentation" to just "documentation" better reflects that this retrieves all connector documentation URLs (Airbyte docs + external docs), not just API-specific docs.


200-203: Should validation errors be caught here too?

The new validation logic in _get_connector_docs_urls() raises PyAirbyteInputError when required fields are missing (lines 345-347, 440-442 in registry.py). Currently, only AirbyteConnectorNotRegisteredError is caught and converted to a friendly "Connector not found." message.

Should PyAirbyteInputError also be caught here to return a user-friendly message like "Invalid connector documentation format" instead of letting it propagate? Or is it intentional to let validation errors bubble up to expose data quality issues? wdyt?

airbyte/registry.py (2)

323-326: Good documentation of the new exception!

Documenting the PyAirbyteInputError in the docstring helps API consumers understand when validation will fail.


397-399: Consistent documentation across parsing functions!

Good to see the same error documentation pattern applied to both from_manifest_dict() and _extract_docs_from_registry().

devin-ai-integration bot and others added 3 commits November 12, 2025 20:35
…ication_of_urls

- Updated error messages to include the malformed doc dict for better debugging
- Fixed test_deduplication_of_urls to properly mock get_available_connectors
- Test now correctly verifies URL deduplication functionality

Co-Authored-By: AJ Steers <[email protected]>
- Removed deduplication logic from get_connector_docs_urls function
- Removed test_deduplication_of_urls test
- Updated docstring to remove mention of deduplication

The deduplication feature was previously introduced in PR #858 but is being
removed per product direction. Duplicate URLs from registry and manifest
sources will now be preserved in the returned list.

Co-Authored-By: AJ Steers <[email protected]>
@devin-ai-integration
Copy link
Contributor

Removed URL deduplication logic and the associated test per user request.

The deduplication feature was previously introduced in PR #858 but is being removed per product direction. Duplicate URLs from registry and manifest sources will now be preserved in the returned list.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
airbyte/registry.py (1)

398-442: Add tests for registry parsing error handling to match manifest coverage

The test coverage gap is valid—while ApiDocsUrl.from_manifest_dict() has dedicated tests for missing 'title' and 'url' fields, the parallel registry parsing path in _extract_docs_from_registry() (lines 440-442) lacks equivalent error tests. Both raise PyAirbyteInputError consistently, but only manifest errors are verified.

Would you be open to adding tests like test_registry_missing_title_raises_error and test_registry_missing_url_raises_error that mock the registry response with malformed externalDocumentationUrls entries? This would ensure both paths have parity in error coverage and catch regressions on either side.

On the exception type point: the pattern mirrors your manifest implementation, so if that was reviewed and approved, this should be consistent. But if there's concern about PyAirbyteInputError being misleading for external data failures (since users can't fix registry issues), that's worth flagging separately—wdyt?

♻️ Duplicate comments (2)
airbyte/registry.py (2)

447-458: Update the docstring to match the function rename?

Line 447 renames the function to get_connector_docs_urls, but line 448's docstring still says "Get API documentation URLs for a connector." A past review comment suggested updating this to "Get documentation URLs for a connector" (or otherwise remove "API") for consistency with the new name. wdyt?

Apply this diff:

 def get_connector_docs_urls(connector_name: str) -> list[ApiDocsUrl]:
-    """Get API documentation URLs for a connector.
+    """Get documentation URLs for a connector.
 
-    This function retrieves documentation URLs for a connector's upstream API from multiple sources:
+    This function retrieves documentation URLs for a connector from multiple sources:

Based on past review comment at lines 447-448.


333-347: Reconsider the exception type for malformed manifest and registry data—should they be PyAirbyteInternalError instead?

The manifest and registry data parsing errors (lines 345 and 440) currently raise PyAirbyteInputError, but that exception is defined as "input provided to PyAirbyte did not match expected validation rules." However, manifest data and registry data are external data sources—not user inputs—so raising PyAirbyteInputError is semantically incorrect.

These are data quality issues external to the library, which aligns better with PyAirbyteInternalError ("An internal error occurred in PyAirbyte"). This is also consistent with line 215, which already uses PyAirbyteInternalError for a registry loading failure. Would it make sense to switch both parsing errors to use PyAirbyteInternalError for consistency and accuracy?

🧹 Nitpick comments (1)
tests/unit_tests/test_mcp_connector_registry.py (1)

131-168: Nice test coverage for manifest parsing errors, but consider adding registry path tests?

These tests properly verify that PyAirbyteInputError is raised when manifest entries are missing required fields. However, the registry parsing path in _extract_docs_from_registry() (lines 428-442 in registry.py) uses the same validation pattern but isn't explicitly covered by tests. Would it be valuable to add similar error tests for the registry path? wdyt?

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 12de9c8 and 60e43b4.

📒 Files selected for processing (2)
  • airbyte/registry.py (5 hunks)
  • tests/unit_tests/test_mcp_connector_registry.py (2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2024-10-06T23:44:31.534Z
Learnt from: aaronsteers
Repo: airbytehq/PyAirbyte PR: 411
File: airbyte/cli.py:111-160
Timestamp: 2024-10-06T23:44:31.534Z
Learning: In PyAirbyte, error messages in functions like `_resolve_source_job` in `airbyte/cli.py` are designed to decouple the message text from dynamic values, following a structlog-inspired design. Dynamic values are provided via parameters like `input_value`. This approach helps avoid including PII in the message strings, which may be used in telemetry.

Applied to files:

  • airbyte/registry.py
📚 Learning: 2024-10-08T15:34:31.026Z
Learnt from: aaronsteers
Repo: airbytehq/PyAirbyte PR: 347
File: tests/integration_tests/fixtures/registry.json:48-48
Timestamp: 2024-10-08T15:34:31.026Z
Learning: Test fixtures in the PyAirbyte project do not need to align with real Docker repositories.

Applied to files:

  • tests/unit_tests/test_mcp_connector_registry.py
🧬 Code graph analysis (2)
airbyte/registry.py (2)
airbyte/exceptions.py (1)
  • PyAirbyteInputError (201-210)
airbyte/mcp/connector_registry.py (1)
  • get_connector_docs_urls (183-203)
tests/unit_tests/test_mcp_connector_registry.py (3)
airbyte/registry.py (3)
  • get_connector_docs_urls (447-482)
  • ApiDocsUrl (303-349)
  • from_manifest_dict (315-349)
airbyte/mcp/connector_registry.py (1)
  • get_connector_docs_urls (183-203)
airbyte/exceptions.py (2)
  • PyAirbyteInputError (201-210)
  • AirbyteConnectorNotRegisteredError (288-297)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Pytest (All, Python 3.11, Windows)
  • GitHub Check: Pytest (All, Python 3.10, Windows)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (No Creds)
  • GitHub Check: Pytest (Fast)
🔇 Additional comments (3)
airbyte/registry.py (2)

421-421: Good addition of doc_type="internal" for registry documentation!

Tagging the Airbyte Documentation entry with doc_type="internal" helps distinguish it from external API docs. Nice touch!


482-482: LGTM on removing deduplication!

Returning the combined list directly simplifies the logic. If deduplication was needed, it could be handled by the caller.

tests/unit_tests/test_mcp_connector_registry.py (1)

170-184: LGTM on test renames and updates!

The test class rename and updated patch target/function calls correctly reflect the API change from get_api_docs_urls to get_connector_docs_urls.

devin-ai-integration bot and others added 2 commits November 12, 2025 21:14
…nalError

Address PR feedback from r2519730767: Registry/manifest parsing errors should
raise PyAirbyteInternalError instead of PyAirbyteInputError because malformed
registry/manifest data is not a user input issue - it's an internal data issue
that users cannot resolve themselves.

Changes:
- Updated ApiDocsUrl.from_manifest_dict() to raise PyAirbyteInternalError
- Updated _extract_docs_from_registry() to raise PyAirbyteInternalError
- Updated docstrings to reflect the new exception type
- Updated unit tests to expect PyAirbyteInternalError

Ref: #863 (comment)
Co-Authored-By: AJ Steers <[email protected]>
Address PR feedback from 2519832304: The code in _extract_docs_from_registry()
was duplicating the logic from from_manifest_dict(). Refactored to use a shared
from_docs_list() classmethod that both functions can call.

Changes:
- Added ApiDocsUrl.from_docs_list() classmethod with source and context parameters
- Refactored from_manifest_dict() to delegate to from_docs_list()
- Refactored _extract_docs_from_registry() to use from_docs_list()
- Updated docstring to remove 'API' from 'Get API documentation URLs' per CodeRabbit suggestion

This makes the code more DRY and easier to maintain.

Ref: #863 (comment)
Co-Authored-By: AJ Steers <[email protected]>
@devin-ai-integration
Copy link
Contributor

Addressed feedback from r2519730767: Changed exception type from PyAirbyteInputError to PyAirbyteInternalError because malformed registry/manifest data is not a user input issue - it's an internal data issue that users cannot resolve themselves.

Changes:

  • Updated ApiDocsUrl.from_manifest_dict() to raise PyAirbyteInternalError
  • Updated _extract_docs_from_registry() to raise PyAirbyteInternalError
  • Updated docstrings to reflect the new exception type
  • Updated unit tests to expect PyAirbyteInternalError

All tests pass locally ✅

@devin-ai-integration
Copy link
Contributor

Addressed feedback from 2519832304: Eliminated code duplication by adding a shared from_docs_list() classmethod.

Changes:

  • Added ApiDocsUrl.from_docs_list() classmethod with source and context parameters
  • Refactored from_manifest_dict() to delegate to from_docs_list()
  • Refactored _extract_docs_from_registry() to use from_docs_list() instead of manually constructing ApiDocsUrl objects
  • Updated docstring to remove "API" from "Get API documentation URLs" per CodeRabbit suggestion

This makes the code more DRY and easier to maintain. All tests pass locally ✅

@github-actions
Copy link

github-actions bot commented Nov 12, 2025

PyTest Results (Full)

390 tests  +1   374 ✅ +2   24m 41s ⏱️ - 5m 43s
  1 suites ±0    16 💤 ±0 
  1 files   ±0     0 ❌  - 1 

Results for commit 8af8085. ± Comparison against base commit 7eb746b.

This pull request removes 8 and adds 9 tests. Note that renamed tests count towards both.
tests.unit_tests.test_mcp_connector_registry.TestApiDocsUrlFromManifestDict ‑ test_empty_manifest
tests.unit_tests.test_mcp_connector_registry.TestApiDocsUrlFromManifestDict ‑ test_manifest_with_external_docs_no_type
tests.unit_tests.test_mcp_connector_registry.TestApiDocsUrlFromManifestDict ‑ test_manifest_with_external_docs_urls
tests.unit_tests.test_mcp_connector_registry.TestFetchManifestDict ‑ test_fetch_manifest_dict
tests.unit_tests.test_mcp_connector_registry.TestFetchManifestDict ‑ test_manifest_not_found
tests.unit_tests.test_mcp_connector_registry.TestGetApiDocsUrls ‑ test_connector_not_found
tests.unit_tests.test_mcp_connector_registry.TestGetApiDocsUrls ‑ test_deduplication_of_urls
tests.unit_tests.test_mcp_connector_registry.TestManifestUrlFor ‑ test_manifest_url_for
tests.unit_tests.test_mcp_connector_registry.TestApiDocsUrlFromMetadataDict ‑ test_empty_metadata
tests.unit_tests.test_mcp_connector_registry.TestApiDocsUrlFromMetadataDict ‑ test_metadata_missing_title_raises_error
tests.unit_tests.test_mcp_connector_registry.TestApiDocsUrlFromMetadataDict ‑ test_metadata_missing_url_raises_error
tests.unit_tests.test_mcp_connector_registry.TestApiDocsUrlFromMetadataDict ‑ test_metadata_with_external_docs_no_type
tests.unit_tests.test_mcp_connector_registry.TestApiDocsUrlFromMetadataDict ‑ test_metadata_with_external_docs_urls
tests.unit_tests.test_mcp_connector_registry.TestFetchMetadataDict ‑ test_fetch_metadata_dict
tests.unit_tests.test_mcp_connector_registry.TestFetchMetadataDict ‑ test_metadata_not_found
tests.unit_tests.test_mcp_connector_registry.TestGetConnectorDocsUrls ‑ test_connector_not_found
tests.unit_tests.test_mcp_connector_registry.TestMetadataUrlFor ‑ test_metadata_url_for

♻️ This comment has been updated with latest results.

This commit fixes a fundamental conflation between manifest.yaml and
metadata.yaml files throughout the codebase:

- manifest.yaml: Declarative connector manifest (behavior/config)
- metadata.yaml: Connector metadata that the registry caches/exposes

Changes made:
- Renamed _DEFAULT_MANIFEST_URL to _DEFAULT_METADATA_URL
- Updated URL pattern from manifest.yaml to metadata.yaml
- Renamed _manifest_url_for() to _metadata_url_for()
- Renamed _fetch_manifest_dict() to _fetch_metadata_dict()
- Renamed ApiDocsUrl.from_manifest_dict() to from_metadata_dict()
- Renamed from_docs_list() to from_metadata_docs_list() (made public)
- Updated context parameter from 'Manifest' to 'Metadata'
- Updated source parameter from 'data_external_docs' to 'metadata_external_docs'
- Updated all test classes, methods, and error message expectations
- Updated MCP connector_registry.py imports and docstrings

The code now correctly parses metadata.yaml files (and registry caches
of those metadata.yaml files), not manifest.yaml files.

Co-Authored-By: AJ Steers <[email protected]>
@devin-ai-integration
Copy link
Contributor

I've addressed the manifest/metadata conflation issue you identified in #863 (comment).

The Issue

The code was conflating two different file types:

  • manifest.yaml: Declarative connector manifest (behavior/config) - we should NOT be parsing this
  • metadata.yaml: Connector metadata that the registry caches/exposes - we SHOULD be parsing this

Both the registry JSON and metadata.yaml files contain the same metadata structure (including externalDocumentationUrls), so the code should treat them both as "metadata" sources.

Changes Made

Systematically renamed all "manifest" references to "metadata" throughout the codebase:

  1. Constants & URLs: _DEFAULT_MANIFEST_URL_DEFAULT_METADATA_URL, updated URL to point to metadata.yaml instead of manifest.yaml
  2. Functions: _manifest_url_for()_metadata_url_for(), _fetch_manifest_dict()_fetch_metadata_dict()
  3. Class methods: ApiDocsUrl.from_manifest_dict()from_metadata_dict(), from_docs_list()from_metadata_docs_list() (made public to fix lint)
  4. Parameters: Updated context from "Manifest" to "Metadata", source from "data_external_docs" to "metadata_external_docs"
  5. Tests: Renamed all test classes, methods, and updated error message expectations
  6. MCP module: Updated imports and docstrings

Verification

✅ All 9 unit tests passing
✅ All ruff format and lint checks passing
✅ Error messages now correctly say "Metadata parsing error" instead of "Manifest parsing error"

The code now correctly parses metadata.yaml files (and registry caches of those metadata.yaml files), not manifest.yaml files.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 60e43b4 and 8af8085.

📒 Files selected for processing (3)
  • airbyte/mcp/connector_registry.py (4 hunks)
  • airbyte/registry.py (5 hunks)
  • tests/unit_tests/test_mcp_connector_registry.py (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • airbyte/mcp/connector_registry.py
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2024-10-08T15:34:31.026Z
Learnt from: aaronsteers
Repo: airbytehq/PyAirbyte PR: 347
File: tests/integration_tests/fixtures/registry.json:48-48
Timestamp: 2024-10-08T15:34:31.026Z
Learning: Test fixtures in the PyAirbyte project do not need to align with real Docker repositories.

Applied to files:

  • tests/unit_tests/test_mcp_connector_registry.py
📚 Learning: 2024-10-06T23:44:31.534Z
Learnt from: aaronsteers
Repo: airbytehq/PyAirbyte PR: 411
File: airbyte/cli.py:111-160
Timestamp: 2024-10-06T23:44:31.534Z
Learning: In PyAirbyte, error messages in functions like `_resolve_source_job` in `airbyte/cli.py` are designed to decouple the message text from dynamic values, following a structlog-inspired design. Dynamic values are provided via parameters like `input_value`. This approach helps avoid including PII in the message strings, which may be used in telemetry.

Applied to files:

  • airbyte/registry.py
🧬 Code graph analysis (2)
tests/unit_tests/test_mcp_connector_registry.py (3)
airbyte/mcp/connector_registry.py (1)
  • get_connector_docs_urls (183-203)
airbyte/registry.py (5)
  • get_connector_docs_urls (459-494)
  • ApiDocsUrl (303-371)
  • _fetch_metadata_dict (389-408)
  • _metadata_url_for (374-386)
  • from_metadata_dict (351-371)
airbyte/exceptions.py (2)
  • PyAirbyteInternalError (190-194)
  • AirbyteConnectorNotRegisteredError (288-297)
airbyte/registry.py (2)
airbyte/exceptions.py (1)
  • PyAirbyteInternalError (190-194)
airbyte/mcp/connector_registry.py (1)
  • get_connector_docs_urls (183-203)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Pytest (All, Python 3.11, Windows)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Windows)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (No Creds)
🔇 Additional comments (10)
tests/unit_tests/test_mcp_connector_registry.py (3)

8-27: LGTM! Clean test updates for the metadata migration.

The pytest import and renamed test class correctly reflect the shift from manifest-based to metadata-based API. The test expectations now check for "metadata.yaml" instead of "manifest.yaml", which aligns with the underlying implementation changes.


30-61: LGTM! Comprehensive metadata fetching tests.

The tests correctly verify both the 404 case (returns empty dict) and successful parsing of metadata.yaml content.


131-168: LGTM! Excellent validation coverage.

These new tests verify that missing required fields ('title' and 'url') correctly raise PyAirbyteInternalError with descriptive error messages. This addresses the core objective of the PR—preventing unhandled KeyError propagation.

airbyte/registry.py (7)

43-45: Note: This PR's scope extends beyond the title suggests.

The PR title mentions "Add KeyError validation" but the changes include a complete rename from manifest-based to metadata-based ingestion (renamed constants, functions, and URL paths from manifest.yaml to metadata.yaml). This is fine if intentional, but wanted to flag it since the broader refactoring wasn't mentioned in the PR objectives. The validation additions are well-implemented, though!


315-348: LGTM! Solid error handling with helpful context.

The try/except wrapper correctly catches KeyError and converts it to PyAirbyteInternalError with a descriptive message that includes the malformed doc dict. This makes debugging much easier when validation fails.

The context parameter is a nice touch—it allows different callers (Metadata vs Registry parsing) to provide specific context in error messages.


350-371: LGTM! Clean separation of concerns.

The method correctly extracts documentation URLs from the metadata structure and delegates the validation to from_metadata_docs_list. The defensive isinstance checks ensure graceful handling of missing or malformed metadata sections.


374-386: LGTM! Clean rename with updated docstring.

The function correctly references metadata.yaml in the docstring and uses the renamed _DEFAULT_METADATA_URL constant.


389-408: LGTM! Appropriate 404 handling.

Returning an empty dict on 404 is the right approach here, since not all connectors may have a metadata.yaml file yet. Other HTTP errors are properly raised.


411-456: Good validation reuse, but could we add test coverage for registry errors?

The function now correctly uses from_metadata_docs_list with context="Registry" to ensure consistent error handling. The docstring also documents that it raises PyAirbyteInternalError for missing fields.

However, as noted in the PR objectives, the registry parsing path isn't explicitly tested for error cases. While the underlying from_metadata_docs_list is tested via the metadata path, it would be valuable to add a test that verifies registry entries with malformed externalDocumentationUrls also raise the appropriate error. wdyt?


459-494: LGTM! Clean aggregation from multiple sources.

The function correctly fetches documentation URLs from both the registry and the connector's metadata.yaml file, then combines them. The updated docstring clearly documents both sources. Returning the combined list without deduplication seems intentional—different sources providing the same URL could be useful context for callers.

Comment on lines 173 to 184
def test_connector_not_found(self) -> None:
"""Test handling when connector is not found."""
with patch(
"airbyte.mcp.connector_registry.get_connector_api_docs_urls"
"airbyte.mcp.connector_registry.get_connector_docs_urls"
) as mock_get_docs:
mock_get_docs.side_effect = exc.AirbyteConnectorNotRegisteredError(
connector_name="nonexistent-connector",
context={},
)

result = get_api_docs_urls("nonexistent-connector")
result = get_connector_docs_urls("nonexistent-connector")
assert result == "Connector not found."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Could we verify the patching approach here?

The test patches "airbyte.mcp.connector_registry.get_connector_docs_urls" to raise an exception, but I'm seeing that this is also the function imported at line 11 and called at line 183.

Looking at the MCP wrapper implementation, it seems like this might be testing the wrong layer—the wrapper catches exceptions from the underlying _get_connector_docs_urls call. Should this test patch "airbyte.mcp.connector_registry._get_connector_docs_urls" instead to ensure the exception handling path is actually exercised?

I noticed a past comment mentioned fixing this test, so it might be working as-is, but wanted to double-check the approach. wdyt?


🏁 Script executed:

#!/bin/bash
# Verify the import structure and patching target in the MCP module

# Check how _get_connector_docs_urls is imported in the MCP module
rg -n "import.*_get_connector_docs_urls|from.*import.*_get_connector_docs_urls" --type=py

# Check the implementation of get_connector_docs_urls in the MCP module
ast-grep --pattern $'def get_connector_docs_urls($$$) {
  $$$
}'

Length of output: 117



---

🏁 Script executed:

```shell
#!/bin/bash
# Find the wrapper function definition in the MCP module
ast-grep --pattern $'def get_connector_docs_urls($_) {
  $$$
}' airbyte/mcp/connector_registry.py

# Also check the broader context around line 27 to see the wrapper
rg -n "def get_connector_docs_urls" --type=py -A 15

Length of output: 2220



---

🏁 Script executed:

```shell
#!/bin/bash
# Get the full wrapper implementation to see exception handling
sed -n '183,220p' airbyte/mcp/connector_registry.py

Length of output: 1118


Patch _get_connector_docs_urls instead to properly test the wrapper's exception handling.

The test currently patches the wrapper function itself, which bypasses the exception handling logic inside it. Since the wrapper at line 201 catches AirbyteConnectorNotRegisteredError from _get_connector_docs_urls and returns the fallback string, the test should patch the underlying function (_get_connector_docs_urls imported at line 27) to properly exercise that exception handling path. Otherwise, the test verifies only that the mock was called, not that the wrapper's error handling works.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants