Conversation

aaronsteers
Contributor

feat: Add OpenAI evals integration for connector readiness evaluation

Summary

This PR integrates the OpenAI evals framework into the poe build-connector task to automatically evaluate connector readiness reports against predefined golden examples. The integration focuses on stream enumeration and record count validation, as requested.

Key Components:

  • ConnectorReadinessEvaluator: Parses markdown readiness reports and evaluates them against golden examples using weighted scoring (40% stream enumeration, 40% record counts, 20% warnings); a scoring sketch follows this list
  • Automatic Workflow Integration: Hooks into mark_job_success/mark_job_failed to trigger evaluation and display summary results
  • CLI Management: airbyte-connector-evals command for manual evaluation management
  • Golden Examples: Pre-defined baselines for Rick & Morty API and JSONPlaceholder API
  • Comprehensive Testing: 12 test cases covering parsing, evaluation, and workflow integration
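
A minimal sketch of how the 40/40/20 weighting could be combined into a single score (ComponentScores and combined_score are illustrative names, not the PR's actual implementation):

from dataclasses import dataclass

@dataclass
class ComponentScores:
    """Per-component scores in the 0.0-1.0 range (illustrative)."""
    stream_enumeration: float
    record_counts: float
    warnings: float

def combined_score(scores: ComponentScores) -> float:
    """Combine component scores using the 40/40/20 weighting described above."""
    return (
        0.40 * scores.stream_enumeration
        + 0.40 * scores.record_counts
        + 0.20 * scores.warnings
    )

# Example: all streams found, 90% of expected records, no unexpected warnings.
print(combined_score(ComponentScores(1.0, 0.9, 1.0)))  # 0.96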

Workflow Integration:
When poe build-connector completes, the system automatically finds the generated readiness report, determines the appropriate golden example, runs evaluation, and displays a formatted summary with pass/fail criteria and component scores.
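
A rough sketch of what a non-blocking post-build hook along these lines could look like (on_build_complete and the injected evaluate callable are assumptions for illustration, not the PR's actual API):

import logging
from typing import Callable

logger = logging.getLogger(__name__)

def on_build_complete(
    api_name: str,
    report_path: str,
    evaluate: Callable[[str, str], dict] | None = None,
) -> None:
    """Hypothetical post-build hook: run the readiness evaluation (if available)
    without ever failing the main workflow."""
    if evaluate is None:
        logger.info("Evaluation unavailable (e.g. no OpenAI API key); skipping")
        return
    try:
        result = evaluate(report_path, api_name)
        # The real integration prints a formatted summary here with pass/fail
        # criteria and the per-component scores.
        logger.info("Readiness evaluation for %s: %s", api_name, result)
    except Exception:  # evaluation failures must not break the build
        logger.warning("Readiness evaluation failed; continuing build", exc_info=True)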

Review & Testing Checklist for Human

Risk Level: 🟡 Medium - New functionality with regex parsing and workflow hooks

  • Report Parsing Logic: Verify the regex patterns in parse_readiness_report() correctly extract stream names and record counts from various readiness report formats (test with actual generated reports); a parsing sketch follows this checklist
  • Golden Example Matching: Test the determine_golden_example() logic with different API names and report content to ensure proper golden example selection
  • Workflow Integration Safety: Confirm that the automatic evaluation hooks don't break existing mark_job_success/mark_job_failed functionality, especially when OpenAI API is unavailable
  • CLI Commands: Test the new airbyte-connector-evals CLI commands work correctly: list-golden, create-test-data, evaluate
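
For reference while reviewing, a minimal example of the kind of regex extraction being described (the table format and record counts below are guessed sample data, not necessarily the exact format parse_readiness_report() handles):

import re

SAMPLE_REPORT = """
## Streams

| Stream | Records |
|--------|---------|
| characters | 826 |
| episodes | 51 |
"""

def parse_stream_counts(markdown: str) -> dict[str, int]:
    """Extract stream-name -> record-count pairs from markdown table rows
    of the form `| name | 123 |` (illustrative format only)."""
    pattern = re.compile(r"^\|\s*([A-Za-z0-9_]+)\s*\|\s*(\d+)\s*\|\s*$", re.MULTILINE)
    return {name: int(count) for name, count in pattern.findall(markdown)}

print(parse_stream_counts(SAMPLE_REPORT))  # {'characters': 826, 'episodes': 51}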

Recommended Test Plan:

  1. Run poe build-connector "Rick and Morty API" and verify automatic evaluation appears in output
  2. Test CLI: uv run python -m connector_builder_agents.src.eval_cli list-golden
  3. Generate a readiness report manually and test evaluation: airbyte-connector-evals evaluate --api-name "Rick and Morty API"
  4. Verify graceful handling when OpenAI API key is missing

Notes

  • The integration is designed to be non-blocking: evaluation failures won't break the main workflow
  • Only two golden examples are currently defined (Rick & Morty, JSONPlaceholder); more can be added as needed
  • Evaluation results are saved to eval_results/ directory for later analysis
  • Requested by: @aaronsteers in Slack channel #ask-devin-ai
  • Devin Session: https://app.devin.ai/sessions/b5d18021c71c404f9d00d3eba16a9d74

- Add ConnectorReadinessEvaluator class for parsing and evaluating readiness reports
- Integrate automatic evaluation into mark_job_success/mark_job_failed workflow hooks
- Define golden examples for Rick & Morty and JSONPlaceholder APIs
- Add CLI commands for manual eval management (airbyte-connector-evals)
- Include comprehensive test suite for evals functionality
- Focus on stream enumeration and record count validation as specified

Co-Authored-By: AJ Steers <[email protected]>

Original prompt from AJ Steers
Received message in Slack channel #ask-devin-ai:

@Devin - We're wanting to use evals for the connector-builder-mcp tool, and specifically for the new `poe build-connector` task, which wraps the tools in a multi-agent workflow. Can you advise how we can plug this into an "evals" framework like the one described here? As of now, the eval _only_ needs to look at the output of the "connector readiness check" - which is a markdown file enumerating the streams and number of records per stream. We will define the correct answers that should be in this report (a successful report text for instance), and the eval should grade the result based on that golden example.
Thread URL: https://airbytehq-team.slack.com/archives/C08BHPUMEPJ/p1758217363818419?thread_ts=1758217363.818419


🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

github-actions bot added the enhancement (New feature or request) label on Sep 21, 2025

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Testing This Branch via MCP

To test the changes in this specific branch with an MCP client like Claude Desktop, use the following configuration:

{
  "mcpServers": {
    "connector-builder-mcp-dev": {
      "command": "uvx",
      "args": ["--from", "git+https://github.com/airbytehq/connector-builder-mcp.git@devin/1758424597-add-openai-evals-integration", "connector-builder-mcp"]
    }
  }
}

Testing This Branch via CLI

You can test this version of the MCP Server using the following CLI snippet:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/connector-builder-mcp.git@devin/1758424597-add-openai-evals-integration#egg=airbyte-connector-builder-mcp' --help

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poe <command> - Runs any poe command in the uv virtual environment
  • /poe build-connector prompt="Star Wars API" - Run the connector builder using the Star Wars API.



github-actions bot commented Sep 21, 2025

PyTest Results (Fast)

0 tests  ±0   0 ✅ ±0   0s ⏱️ ±0s
0 suites ±0   0 💤 ±0 
0 files   ±0   0 ❌ ±0 

Results for commit 9550981. ± Comparison against base commit abdc1fc.

♻️ This comment has been updated with latest results.


github-actions bot commented Sep 21, 2025

PyTest Results (Full)

0 tests   0 ✅  0s ⏱️
0 suites  0 💤
0 files    0 ❌

Results for commit 9550981.

♻️ This comment has been updated with latest results.

devin-ai-integration bot and others added 5 commits September 21, 2025 15:33
- Fix ruff lint error I001 for import block organization
- All evals integration tests pass locally

Co-Authored-By: AJ Steers <[email protected]>
- Wrap agents import in try/catch to handle CI environments without openai-agents
- Replace hardcoded stream names with generic parsing logic
- Update tests to work with generic stream naming approach
- Resolves pytest failures in CI

Co-Authored-By: AJ Steers <[email protected]>
- Wrap AsyncOpenAI import in try/catch to handle CI without openai package
- Add null check in initialize_models to prevent errors when OpenAI unavailable
- Fix import sorting issues identified by ruff lint check
- Resolves ModuleNotFoundError in CI pytest and lint failures

Co-Authored-By: AJ Steers <[email protected]>
- Wrap openai import in try/catch to handle CI environments without openai package
- Update ConnectorReadinessEvaluator constructor to handle None openai gracefully
- Fix import sorting issue identified by ruff
- Resolves remaining ModuleNotFoundError in CI pytest failures

Co-Authored-By: AJ Steers <[email protected]>
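
The pattern these commits describe, making the openai and openai-agents imports optional so that minimal CI environments can still import the module, generally looks something like this (a generic sketch with an assumed build_client helper, not the exact code in this PR):

try:
    from openai import AsyncOpenAI
except ImportError:  # openai not installed, e.g. in a minimal CI environment
    AsyncOpenAI = None  # type: ignore[assignment]

def build_client(api_key: str | None):
    """Return an AsyncOpenAI client, or None when the package or API key is missing."""
    if AsyncOpenAI is None or not api_key:
        return None
    return AsyncOpenAI(api_key=api_key)
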
@aaronsteers aaronsteers marked this pull request as draft September 24, 2025 03:43
Development

Successfully merging this pull request may close these issues.

🧪 Tests: Implement an evals framework to measure success and performance
