feat: Add OpenAI evals integration for connector readiness evaluation #87
- Add ConnectorReadinessEvaluator class for parsing and evaluating readiness reports
- Integrate automatic evaluation into mark_job_success/mark_job_failed workflow hooks
- Define golden examples for Rick & Morty and JSONPlaceholder APIs
- Add CLI commands for manual eval management (airbyte-connector-evals)
- Include comprehensive test suite for evals functionality
- Focus on stream enumeration and record count validation as specified

Co-Authored-By: AJ Steers <[email protected]>
👋 Greetings, Airbyte Team Member! Here are some helpful tips and reminders for your convenience.

Testing This Branch via MCP

To test the changes in this specific branch with an MCP client like Claude Desktop, use the following configuration:

```json
{
  "mcpServers": {
    "connector-builder-mcp-dev": {
      "command": "uvx",
      "args": ["--from", "git+https://github.com/airbytehq/connector-builder-mcp.git@devin/1758424597-add-openai-evals-integration", "connector-builder-mcp"]
    }
  }
}
```

Testing This Branch via CLI

You can test this version of the MCP Server using the following CLI snippet:

```shell
# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/connector-builder-mcp.git@devin/1758424597-add-openai-evals-integration#egg=airbyte-connector-builder-mcp' --help
```

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:
|
PyTest Results (Full): 0 tests, 0 ✅, 0s ⏱️. Results for commit 9550981.
- Fix ruff lint error I001 for import block organization
- All evals integration tests pass locally

Co-Authored-By: AJ Steers <[email protected]>

- Wrap agents import in try/catch to handle CI environments without openai-agents
- Replace hardcoded stream names with generic parsing logic
- Update tests to work with generic stream naming approach
- Resolves pytest failures in CI

Co-Authored-By: AJ Steers <[email protected]>

- Wrap AsyncOpenAI import in try/catch to handle CI without openai package
- Add null check in initialize_models to prevent errors when OpenAI unavailable
- Fix import sorting issues identified by ruff lint check
- Resolves ModuleNotFoundError in CI pytest and lint failures

Co-Authored-By: AJ Steers <[email protected]>

Co-Authored-By: AJ Steers <[email protected]>

- Wrap openai import in try/catch to handle CI environments without openai package
- Update ConnectorReadinessEvaluator constructor to handle None openai gracefully
- Fix import sorting issue identified by ruff
- Resolves remaining ModuleNotFoundError in CI pytest failures

Co-Authored-By: AJ Steers <[email protected]>
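The guarded-import pattern these commits describe can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's actual code; the `openai_available` attribute and `can_use_llm` method are hypothetical names introduced here for clarity.

```python
# Minimal sketch of the try/except import guard described in the commits.
# The openai package may be missing in CI, so the import must not be fatal.
try:
    import openai
except ImportError:  # e.g. a minimal CI environment without openai installed
    openai = None


class ConnectorReadinessEvaluator:
    """Evaluator that degrades gracefully when the OpenAI SDK is absent."""

    def __init__(self) -> None:
        # Hypothetical flag; the real constructor "handles None openai gracefully".
        self.openai_available = openai is not None

    def can_use_llm(self) -> bool:
        return self.openai_available
```

The same shape applies to the `agents` and `AsyncOpenAI` imports mentioned above: each optional dependency is bound to `None` on failure and null-checked before use.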
feat: Add OpenAI evals integration for connector readiness evaluation
Summary

This PR implements OpenAI evals framework integration for the `poe build-connector` task to automatically evaluate connector readiness reports against predefined golden examples. The integration focuses on stream enumeration and record count validation as requested.

Key Components:
- `ConnectorReadinessEvaluator`: Parses markdown readiness reports and evaluates them against golden examples using weighted scoring (40% stream enumeration, 40% record counts, 20% warnings)
- Hooks into `mark_job_success`/`mark_job_failed` to trigger evaluation and display summary results
- `airbyte-connector-evals` command for manual evaluation management

Workflow Integration:
When `poe build-connector` completes, the system automatically finds the generated readiness report, determines the appropriate golden example, runs evaluation, and displays a formatted summary with pass/fail criteria and component scores.

Review & Testing Checklist for Human
Risk Level: 🟡 Medium - New functionality with regex parsing and workflow hooks
- Verify `parse_readiness_report()` correctly extracts stream names and record counts from various readiness report formats (test with actual generated reports)
- Test `determine_golden_example()` logic with different API names and report content to ensure proper golden example selection
- Verify `mark_job_success`/`mark_job_failed` functionality, especially when the OpenAI API is unavailable
- Confirm the `airbyte-connector-evals` CLI commands work correctly: `list-golden`, `create-test-data`, `evaluate`
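Since the checklist flags the regex parsing as the main risk, here is a rough sketch of what report parsing of this kind might look like. The report line format assumed below (`- <stream>: <N> records`) is hypothetical; the PR does not show the actual readiness report layout.

```python
import re


def parse_readiness_report(markdown: str) -> dict[str, int]:
    """Extract per-stream record counts from a markdown readiness report.

    Hypothetical sketch: assumes report lines shaped like
    "- characters: 826 records". The real format may differ.
    """
    counts: dict[str, int] = {}
    for match in re.finditer(r"-\s*(\w+):\s*(\d+)\s+records", markdown):
        counts[match.group(1)] = int(match.group(2))
    return counts


report = "## Streams\n- characters: 826 records\n- episodes: 51 records\n"
print(parse_readiness_report(report))  # {'characters': 826, 'episodes': 51}
```

Brittleness against format drift is exactly why the checklist asks reviewers to run this against actually generated reports rather than synthetic fixtures.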
Recommended Test Plan:
1. Run `poe build-connector "Rick and Morty API"` and verify automatic evaluation appears in output
2. Run `uv run python -m connector_builder_agents.src.eval_cli list-golden`
3. Run `airbyte-connector-evals evaluate --api-name "Rick and Morty API"`
Notes
- Evaluation results are stored in the `eval_results/` directory for later analysis
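The weighted scoring described in the summary (40% stream enumeration, 40% record counts, 20% warnings) could be combined as in the sketch below; the component key names and the convention that each component score lies in [0, 1] are assumptions, not taken from the PR's code.

```python
# Weights from the PR description: 40% streams, 40% record counts, 20% warnings.
WEIGHTS = {"streams": 0.4, "record_counts": 0.4, "warnings": 0.2}


def overall_score(component_scores: dict[str, float]) -> float:
    """Combine per-component scores (each assumed in [0, 1]) into a weighted total.

    Missing components count as 0.0, so an absent check drags the score down.
    """
    return sum(WEIGHTS[name] * component_scores.get(name, 0.0) for name in WEIGHTS)


example = {"streams": 1.0, "record_counts": 0.5, "warnings": 1.0}
print(round(overall_score(example), 2))  # 0.8
```

A pass/fail criterion would then be a threshold on this total, alongside the per-component scores shown in the formatted summary.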