Conversation

aaronsteers
Contributor

feat: Add OpenAI evals integration for connector readiness evaluation

Summary

This PR integrates the OpenAI evals framework into the poe build-connector task to automatically evaluate connector readiness reports against predefined golden examples. The integration focuses on stream enumeration and record count validation, as requested.

Key Components:

  • ConnectorReadinessEvaluator: Parses markdown readiness reports and evaluates them against golden examples using weighted scoring (40% stream enumeration, 40% record counts, 20% warnings); a scoring sketch follows this list
  • Automatic Workflow Integration: Hooks into mark_job_success/mark_job_failed to trigger evaluation and display summary results
  • CLI Management: airbyte-connector-evals command for manual evaluation management
  • Golden Examples: Pre-defined baselines for Rick & Morty API and JSONPlaceholder API
  • Comprehensive Testing: 12 test cases covering parsing, evaluation, and workflow integration
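
A minimal sketch of how the 40/40/20 weighting could be combined into a single score (ComponentScores and combined_score are illustrative names, not the PR's actual implementation):

from dataclasses import dataclass

@dataclass
class ComponentScores:
    """Per-component scores in the 0.0-1.0 range (illustrative)."""
    stream_enumeration: float
    record_counts: float
    warnings: float

def combined_score(scores: ComponentScores) -> float:
    """Combine component scores using the 40/40/20 weighting described above."""
    return (
        0.40 * scores.stream_enumeration
        + 0.40 * scores.record_counts
        + 0.20 * scores.warnings
    )

# Example: all streams found, 90% of expected records, no unexpected warnings.
print(combined_score(ComponentScores(1.0, 0.9, 1.0)))  # 0.96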

Workflow Integration:
When poe build-connector completes, the system automatically finds the generated readiness report, determines the appropriate golden example, runs evaluation, and displays a formatted summary with pass/fail criteria and component scores.
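
A rough sketch of what a non-blocking post-build hook along these lines could look like (on_build_complete and the injected evaluate callable are assumptions for illustration, not the PR's actual API):

import logging
from typing import Callable

logger = logging.getLogger(__name__)

def on_build_complete(
    api_name: str,
    report_path: str,
    evaluate: Callable[[str, str], dict] | None = None,
) -> None:
    """Hypothetical post-build hook: run the readiness evaluation (if available)
    without ever failing the main workflow."""
    if evaluate is None:
        logger.info("Evaluation unavailable (e.g. no OpenAI API key); skipping")
        return
    try:
        result = evaluate(report_path, api_name)
        # The real integration prints a formatted summary here with pass/fail
        # criteria and the per-component scores.
        logger.info("Readiness evaluation for %s: %s", api_name, result)
    except Exception:  # evaluation failures must not break the build
        logger.warning("Readiness evaluation failed; continuing build", exc_info=True)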

Review & Testing Checklist for Human

Risk Level: 🟡 Medium - New functionality with regex parsing and workflow hooks

  • Report Parsing Logic: Verify the regex patterns in parse_readiness_report() correctly extract stream names and record counts from various readiness report formats (test with actual generated reports); a parsing sketch follows this checklist
  • Golden Example Matching: Test the determine_golden_example() logic with different API names and report content to ensure proper golden example selection
  • Workflow Integration Safety: Confirm that the automatic evaluation hooks don't break existing mark_job_success/mark_job_failed functionality, especially when OpenAI API is unavailable
  • CLI Commands: Test the new airbyte-connector-evals CLI commands work correctly: list-golden, create-test-data, evaluate
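
For reference while reviewing, a minimal example of the kind of regex extraction being described (the table format and record counts below are guessed sample data, not necessarily the exact format parse_readiness_report() handles):

import re

SAMPLE_REPORT = """
## Streams

| Stream | Records |
|--------|---------|
| characters | 826 |
| episodes | 51 |
"""

def parse_stream_counts(markdown: str) -> dict[str, int]:
    """Extract stream-name -> record-count pairs from markdown table rows
    of the form `| name | 123 |` (illustrative format only)."""
    pattern = re.compile(r"^\|\s*([A-Za-z0-9_]+)\s*\|\s*(\d+)\s*\|\s*$", re.MULTILINE)
    return {name: int(count) for name, count in pattern.findall(markdown)}

print(parse_stream_counts(SAMPLE_REPORT))  # {'characters': 826, 'episodes': 51}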

Recommended Test Plan:

  1. Run poe build-connector "Rick and Morty API" and verify automatic evaluation appears in output
  2. Test CLI: uv run python -m connector_builder_agents.src.eval_cli list-golden
  3. Generate a readiness report manually and test evaluation: airbyte-connector-evals evaluate --api-name "Rick and Morty API"
  4. Verify graceful handling when OpenAI API key is missing

Notes

  • The integration is designed to be non-blocking: evaluation failures won't break the main workflow
  • Only two golden examples are currently defined (Rick & Morty, JSONPlaceholder); more can be added as needed
  • Evaluation results are saved to eval_results/ directory for later analysis
  • Requested by: @aaronsteers in Slack channel #ask-devin-ai
  • Devin Session: https://app.devin.ai/sessions/b5d18021c71c404f9d00d3eba16a9d74

- Add ConnectorReadinessEvaluator class for parsing and evaluating readiness reports
- Integrate automatic evaluation into mark_job_success/mark_job_failed workflow hooks
- Define golden examples for Rick & Morty and JSONPlaceholder APIs
- Add CLI commands for manual eval management (airbyte-connector-evals)
- Include comprehensive test suite for evals functionality
- Focus on stream enumeration and record count validation as specified

Co-Authored-By: AJ Steers <[email protected]>

Original prompt from AJ Steers
Received message in Slack channel #ask-devin-ai:

@Devin - We're wanting to use evals for the connector-builder-mcp tool, and specifically for the new `poe build-connector` task, which wraps the tools in a multi-agent workflow. Can you advise how we can plug this into an "evals" framework like the one described here? As of now, the eval _only_ needs to look at the output of the "connector readiness check" - which is a markdown file enumerating the streams and number of records per stream. We will define the correct answers that should be in this report (a successful report text for instance), and the eval should grade the result based on that golden example.
Thread URL: https://airbytehq-team.slack.com/archives/C08BHPUMEPJ/p1758217363818419?thread_ts=1758217363.818419


🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

github-actions bot added the enhancement (New feature or request) label on Sep 21, 2025

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Testing This Branch via MCP

To test the changes in this specific branch with an MCP client like Claude Desktop, use the following configuration:

{
  "mcpServers": {
    "connector-builder-mcp-dev": {
      "command": "uvx",
      "args": ["--from", "git+https://github.com/airbytehq/connector-builder-mcp.git@devin/1758424597-add-openai-evals-integration", "connector-builder-mcp"]
    }
  }
}

Testing This Branch via CLI

You can test this version of the MCP Server using the following CLI snippet:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/connector-builder-mcp.git@devin/1758424597-add-openai-evals-integration#egg=airbyte-connector-builder-mcp' --help

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poe <command> - Runs any poe command in the uv virtual environment
  • /poe build-connector prompt="Star Wars API" - Run the connector builder using the Star Wars API.



github-actions bot commented Sep 21, 2025

PyTest Results (Fast)

0 tests  ±0   0 ✅ ±0   0s ⏱️ ±0s
0 suites ±0   0 💤 ±0 
0 files   ±0   0 ❌ ±0 

Results for commit 9550981. ± Comparison against base commit abdc1fc.

♻️ This comment has been updated with latest results.


github-actions bot commented Sep 21, 2025

PyTest Results (Full)

0 tests   0 ✅  0s ⏱️
0 suites  0 💤
0 files    0 ❌

Results for commit 9550981.

♻️ This comment has been updated with latest results.

devin-ai-integration bot and others added 5 commits September 21, 2025 15:33
- Fix ruff lint error I001 for import block organization
- All evals integration tests pass locally

Co-Authored-By: AJ Steers <[email protected]>
- Wrap agents import in try/catch to handle CI environments without openai-agents
- Replace hardcoded stream names with generic parsing logic
- Update tests to work with generic stream naming approach
- Resolves pytest failures in CI

Co-Authored-By: AJ Steers <[email protected]>
- Wrap AsyncOpenAI import in try/catch to handle CI without openai package
- Add null check in initialize_models to prevent errors when OpenAI unavailable
- Fix import sorting issues identified by ruff lint check
- Resolves ModuleNotFoundError in CI pytest and lint failures

Co-Authored-By: AJ Steers <[email protected]>
- Wrap openai import in try/catch to handle CI environments without openai package
- Update ConnectorReadinessEvaluator constructor to handle None openai gracefully
- Fix import sorting issue identified by ruff
- Resolves remaining ModuleNotFoundError in CI pytest failures

Co-Authored-By: AJ Steers <[email protected]>
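
The pattern these commits describe, making the openai and openai-agents imports optional so that minimal CI environments can still import the module, generally looks something like this (a generic sketch with an assumed build_client helper, not the exact code in this PR):

try:
    from openai import AsyncOpenAI
except ImportError:  # openai not installed, e.g. in a minimal CI environment
    AsyncOpenAI = None  # type: ignore[assignment]

def build_client(api_key: str | None):
    """Return an AsyncOpenAI client, or None when the package or API key is missing."""
    if AsyncOpenAI is None or not api_key:
        return None
    return AsyncOpenAI(api_key=api_key)
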
@aaronsteers aaronsteers marked this pull request as draft September 24, 2025 03:43
Development

Successfully merging this pull request may close these issues.

🧪 Tests: Implement an evals framework to measure success and performance
