@aaronsteers aaronsteers commented Aug 8, 2025

feat: Switch connector feature index to CSV output with improved filtering

Summary

This PR updates the connector feature index script with significant changes to output format and filtering logic based on user feedback:

  1. Output format change: Switched from JSON to CSV format with "FeatureUsage" and "ConnectorName" columns
  2. Improved filtering: Added requirement for at least one lowercase character to exclude all-caps matches like 'A13V1IB3VIYZZH'
  3. Script rename: bin/index_build.py → bin/build_connector_feature_index.py
  4. Output location: connector_component_index.json → generated/connector-feature-index.csv
  5. Added poe task: New "build" task in pyproject.toml to invoke the script

The script processes 478 manifest.yaml files and now generates 14,010 feature-connector pairs, one CSV row per pair. The previous JSON output contained 2,271 unique class names; with the lowercase filter the count is 2,084. The CSV is sorted by feature name first, then by connector name, making it easier to scan and maintain.
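As a rough illustration of the output format described above, the following sketch writes a feature-to-connectors mapping as sorted CSV rows. The function name `write_feature_index` and the sample data are hypothetical and are not taken from the script in this PR; only the "FeatureUsage" and "ConnectorName" column names and the sort order come from the description.

```python
import csv
import io


def write_feature_index(feature_to_connectors, out):
    """Write one (FeatureUsage, ConnectorName) row per pair to a text stream.

    Hypothetical helper: rows are sorted by feature name, then connector name.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["FeatureUsage", "ConnectorName"])
    rows = sorted(
        (feature, connector)
        for feature, connectors in feature_to_connectors.items()
        for connector in connectors
    )
    writer.writerows(rows)
    return len(rows)


# Made-up sample data, for illustration only.
sample = {
    "DeclarativeSource": {"source-b", "source-a"},
    "CursorPagination": {"source-a"},
}
buffer = io.StringIO()
row_count = write_feature_index(sample, buffer)
print(buffer.getvalue())
```

Writing to a stream rather than a fixed path keeps the sketch easy to check in memory; a real script would presumably open the CSV file under its output directory instead.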

Review & Testing Checklist for Human

  • End-to-end testing: Run poe build and verify it completes without errors and generates the CSV file
  • CSV output verification: Open generated/connector-feature-index.csv and spot-check that the data looks correct with proper headers and sorting
  • Filtering validation: Manually verify that all-caps identifiers (like Amazon marketplace IDs) are excluded while legitimate class names like "DeclarativeSource" are included
  • Class name extraction accuracy: Sample a few manifest.yaml files and manually verify that the regex patterns are correctly extracting class names without major false positives/negatives
  • Data integrity: Compare a subset of results with the previous JSON output to ensure no legitimate class names were lost in the format conversion

Recommended test plan: Run the script, examine the CSV output structure, and manually verify 5-10 connectors' extracted class names against their actual manifest.yaml files.


Diagram

%%{ init : { "theme" : "default" }}%%
graph TD
    PyProject["pyproject.toml<br/>poe build task"]:::minor-edit --> Script
    Script["bin/build_connector_feature_index.py<br/>Main script logic"]:::major-edit
    Script --> TempRepo["Temporary airbyte repo<br/>Shallow clone"]:::context
    TempRepo --> Manifests["478 manifest.yaml files<br/>Source connectors"]:::context
    Manifests --> Extraction["Class name extraction<br/>Regex + filtering"]:::major-edit
    Extraction --> CSVOutput["generated/connector-feature-index.csv<br/>14,010 rows"]:::major-edit
    
    
    subgraph Legend
        L1[Major Edit]:::major-edit
        L2[Minor Edit]:::minor-edit  
        L3[Context/No Edit]:::context
    end

    classDef major-edit fill:#90EE90
    classDef minor-edit fill:#87CEEB
    classDef context fill:#FFFFFF

Notes

  • The filtering change reduced unique class names from 2,271 to 2,084, indicating the lowercase requirement successfully filtered out unwanted all-caps matches
  • CI checks mostly passed (10/11) with only docs preview failing (likely unrelated)
  • The CSV format creates a flattened structure that may be easier for analysis tools to consume
  • Session requested by: AJ Steers (@aaronsteers)
  • Devin session: https://app.devin.ai/sessions/dbb4d0c14cf34055b67b00dfa8f5386c

…mponent index

- Creates searchable index mapping class names to connectors that use them
- Shallow-checkouts airbytehq/airbyte repo to temp directory
- Scans all manifest.yaml files in airbyte-integrations/connectors/source-*/
- Extracts ClassName-formatted identifiers using regex patterns
- Filters out common false positives (HTTP methods, acronyms, etc.)
- Generates JSON index with 2,271+ unique class names from 478+ connectors
- Provides summary statistics and usage examples
- Enables discovery of connectors using specific features/components

Co-Authored-By: AJ Steers <[email protected]>
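The scan-and-extract steps listed in this commit message could look roughly like the sketch below. It is not the code from the PR: the regex `CANDIDATE_PATTERN` and the function `build_index` are assumptions made for illustration, and the demo runs against a throwaway directory tree rather than a real checkout. No filtering is applied here, so all-caps tokens such as HTTP methods still appear in the result.

```python
import re
import tempfile
from collections import defaultdict
from pathlib import Path

# Assumed pattern: a capital letter followed by letters or digits.
# The actual regex used by the script is not shown in this conversation.
CANDIDATE_PATTERN = re.compile(r"\b[A-Z][A-Za-z0-9]+\b")


def build_index(repo_root: Path) -> dict[str, set[str]]:
    """Map each candidate class name to the connectors whose manifest uses it."""
    index = defaultdict(set)
    connectors_dir = repo_root / "airbyte-integrations" / "connectors"
    # Only source connectors with a top-level manifest.yaml are scanned.
    for manifest in sorted(connectors_dir.glob("source-*/manifest.yaml")):
        connector = manifest.parent.name
        text = manifest.read_text(encoding="utf-8")
        for name in CANDIDATE_PATTERN.findall(text):
            index[name].add(connector)
    return dict(index)


# Demo on a temporary directory with made-up manifests.
manifests = {
    "source-a": "type: DeclarativeSource\nrequester:\n  type: HttpRequester\n  http_method: GET\n",
    "source-b": "type: DeclarativeSource\n",
    "destination-x": "type: DeclarativeSource\n",  # skipped: not source-*
}
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for connector, body in manifests.items():
        folder = root / "airbyte-integrations" / "connectors" / connector
        folder.mkdir(parents=True)
        (folder / "manifest.yaml").write_text(body, encoding="utf-8")
    demo_index = build_index(root)
```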
@Copilot Copilot AI review requested due to automatic review settings August 8, 2025 21:47

Original prompt from AJ Steers
@Devin - For the connector builder MCP project, could you create a bin/index_build.py script that does the following:
1. Create a list of all class names (cased like ClassName) in the manifest connector json schema yaml file. (In the CDK master branch. See tool that gets this already in the builder.)
2. Shallow-checkout the airbytehq/airbyte repo to a temp directory.
3. Create a word cloud dataset of all class names (cased like ClassName) within all manifest.yaml files in the airbyte repo, specifically if they exactly match the glob: airbyte-integrations/connectors/source-{something}/manifest.yaml
4. We want a mapping of all class names possible - mapped to all connectors that use them.
An alternative approach is to skip step 1 and just get presumed class names from regex against the manifest.yaml files - this might be easier actually.

Extra results in the index that we don't end up using won't hurt us. Basically we're creating a searchable index of connectors using specific features or components.
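For the shallow-checkout step in the prompt, one possible approach is to shell out to `git clone --depth 1`. The sketch below is an assumption about how that could be done, not the script's actual code: the function names and the `AIRBYTE_REPO_URL` constant are hypothetical, and nothing is cloned when the snippet is loaded.

```python
import subprocess
from pathlib import Path

# Assumed clone URL for the airbytehq/airbyte repository.
AIRBYTE_REPO_URL = "https://github.com/airbytehq/airbyte.git"


def shallow_clone_command(dest: Path, url: str = AIRBYTE_REPO_URL) -> list[str]:
    """Build the git command for a depth-1 clone into `dest`."""
    return ["git", "clone", "--depth", "1", url, str(dest)]


def shallow_clone(dest: Path, url: str = AIRBYTE_REPO_URL) -> None:
    """Run the shallow clone; raises CalledProcessError if git fails."""
    subprocess.run(shallow_clone_command(dest, url), check=True)


# Only the command is built here; call shallow_clone() to actually clone.
command = shallow_clone_command(Path("airbyte-checkout"))
print(" ".join(command))
```

In practice `dest` would likely be a path inside `tempfile.TemporaryDirectory()`, so the checkout is discarded once the index has been built.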


@Copilot Copilot AI left a comment

Copilot wasn't able to review any files in this pull request.


🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

@github-actions github-actions bot added the enhancement New feature or request label Aug 8, 2025

github-actions bot commented Aug 8, 2025

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Testing This Branch via MCP

To test the changes in this specific branch with an MCP client like Claude Desktop, use the following configuration:

{
  "mcpServers": {
    "connector-builder-mcp-dev": {
      "command": "uvx",
      "args": ["--from", "git+https://github.com/airbytehq/connector-builder-mcp.git@devin/1723158344-index-build-script", "connector-builder-mcp"]
    }
  }
}

Testing This Branch via CLI

You can test this version of the MCP Server using the following CLI snippet:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/connector-builder-mcp.git@devin/1723158344-index-build-script#egg=airbyte-connector-builder-mcp' --help

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poe <command> - Runs any poe command in the uv virtual environment


github-actions bot commented Aug 8, 2025

PyTest Results (Fast)

0 tests  ±0   0 ✅ ±0   0s ⏱️ ±0s
0 suites ±0   0 💤 ±0 
0 files   ±0   0 ❌ ±0 

Results for commit 82ff059. ± Comparison against base commit 620ad19.

♻️ This comment has been updated with latest results.

devin-ai-integration bot and others added 2 commits August 9, 2025 00:10
- Rename bin/index_build.py to bin/build_connector_feature_index.py
- Update output path from connector_component_index.json to generated/connector-feature-index.json
- Add poe task 'build' to invoke the script
- Create generated/ directory for output files
- Verify script works correctly with new configuration

Co-Authored-By: AJ Steers <[email protected]>

github-actions bot commented Aug 9, 2025

PyTest Results (Full)

0 tests  ±0   0 ✅ ±0   0s ⏱️ ±0s
0 suites ±0   0 💤 ±0 
0 files   ±0   0 ❌ ±0 

Results for commit 82ff059. ± Comparison against base commit 620ad19.

♻️ This comment has been updated with latest results.

- Add lowercase character requirement to exclude all-caps matches like 'A13V1IB3VIYZZH'
- Switch from JSON to CSV format with 'FeatureUsage' and 'ConnectorName' columns
- Sort output by feature name first, then by connector name
- Update output filename to connector-feature-index.csv
- Reduce unique class names from 2,271 to 2,084 with better filtering

Co-Authored-By: AJ Steers <[email protected]>
class_names = extract_class_names_from_yaml(yaml_content)

filtered_class_names = set()
false_positives = {

Review comment from @aaronsteers:

Devin, just dynamically exclude any keywords that are ALLCAPS or alllower.

devin-ai-integration bot and others added 4 commits August 9, 2025 00:46
…iltering

- Remove hardcoded false_positives set with specific keywords
- Use dynamic filtering to exclude ALLCAPS and alllower strings
- Keep mixed-case class names like 'DeclarativeSource'
- Simplifies maintenance and improves filtering robustness

Addresses GitHub comment from @aaronsteers

Co-Authored-By: AJ Steers <[email protected]>
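The dynamic filter described in this commit can be expressed as a small predicate. This is a minimal sketch under the assumption that "mixed-case" means at least one lowercase and at least one uppercase letter; the name `is_feature_name` is hypothetical and the script's actual check may differ.

```python
def is_feature_name(candidate: str) -> bool:
    """Keep mixed-case identifiers; drop ALLCAPS and alllower strings."""
    has_lower = any(ch.islower() for ch in candidate)
    has_upper = any(ch.isupper() for ch in candidate)
    return has_lower and has_upper


# Made-up candidates, including the all-caps ID mentioned in the PR summary.
candidates = ["DeclarativeSource", "GET", "A13V1IB3VIYZZH", "stream", "HttpRequester"]
kept = [c for c in candidates if is_feature_name(c)]
print(kept)
```

Compared with a hardcoded set of false positives, a rule like this needs no upkeep when new all-caps tokens show up in manifests.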
…ucture

- Move CSV output to connector_builder_mcp/resources/generated/
- Add find_connectors_by_feature() MCP tool for exact feature matching
- Tool accepts comma-separated features and returns connectors with ALL features
- Update build script to output to new location
- Remove old JSON output file
- Add csv import for proper CSV file handling

Co-Authored-By: AJ Steers <[email protected]>
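The lookup behaviour described for find_connectors_by_feature() (comma-separated input, connectors that have ALL listed features) might be sketched as follows. The function below is a stand-in with an assumed name and signature, not the MCP tool from this PR; it reads CSV text with the "FeatureUsage" and "ConnectorName" columns and uses made-up sample rows.

```python
import csv
import io


def connectors_with_all_features(csv_text: str, features: str) -> list[str]:
    """Return connectors that use every feature in a comma-separated list.

    Hypothetical stand-in for the tool described above; exact-match only.
    """
    wanted = {f.strip() for f in features.split(",") if f.strip()}
    if not wanted:
        return []
    found: dict[str, set[str]] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["FeatureUsage"] in wanted:
            found.setdefault(row["ConnectorName"], set()).add(row["FeatureUsage"])
    # A connector qualifies only if it matched every requested feature.
    return sorted(name for name, feats in found.items() if feats == wanted)


SAMPLE_CSV = """FeatureUsage,ConnectorName
CursorPagination,source-a
DeclarativeSource,source-a
DeclarativeSource,source-b
"""

both = connectors_with_all_features(SAMPLE_CSV, "DeclarativeSource, CursorPagination")
single = connectors_with_all_features(SAMPLE_CSV, "DeclarativeSource")
print(both, single)
```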
- Delete generated/connector-feature-index.json (replaced by CSV)
- Delete generated/connector-feature-index.csv (moved to connector_builder_mcp/resources/generated/)

Co-Authored-By: AJ Steers <[email protected]>
pyproject.toml Outdated
[tool.deptry]
ignore = ["DEP002"]

[tool.poe.tasks]

Review comment from @aaronsteers:

The Poe commands should be in the dedicated tasks file

- Move build task from pyproject.toml to poe_tasks.toml per @aaronsteers feedback
- Verified task still works correctly with 'poe build'
- Addresses GitHub PR comment about using dedicated tasks file

Co-Authored-By: AJ Steers <[email protected]>
@aaronsteers aaronsteers merged commit 2510724 into main Aug 9, 2025
13 checks passed
@aaronsteers aaronsteers deleted the devin/1723158344-index-build-script branch August 9, 2025 02:12