feat: add evals using arize phoenix #91

pedroslopez · 2025-09-23T03:43:48Z

Evaluation framework and experiment automation:

Added a new phoenix_run.py module to automate connector builder evaluations using the Phoenix framework, including experiment orchestration, dataset management, and evaluator integration.
Added dataset.py to load and manage evaluation datasets from a YAML config, including Phoenix dataset creation and hashing for versioning.
Introduced evaluators.py with LLM-based readiness and stream coverage evaluators.
Added task.py to define the connector build task for experiments, including artifact collection and result formatting.
Added helpers.py for reading artifacts from the workspace directory.
Created a YAML dataset listing connectors and expected streams for evaluation in data/connectors.yaml.
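
To illustrate the dataset hashing and stream coverage ideas above, here is a minimal sketch. It is not the code from this PR: the `CONNECTORS` structure, the field names `name` and `expected_streams`, and both function names are assumptions, and the coverage check is written as a plain set comparison rather than whatever evaluators.py actually does.

```python
import hashlib
import json

# Hypothetical parsed contents of data/connectors.yaml.
# The real schema is not shown in this PR, so these field names are assumptions.
CONNECTORS = [
    {
        "name": "source-rick-and-morty",
        "expected_streams": ["characters", "episodes", "locations"],
    },
]


def dataset_hash(connectors: list[dict]) -> str:
    """Return a short, stable hash of the dataset contents for versioning."""
    # Serialize with sorted keys so dict key order does not change the hash.
    canonical = json.dumps(connectors, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:8]


def stream_coverage(expected: list[str], found: list[str]) -> float:
    """Return the fraction of expected streams present in the built connector."""
    if not expected:
        return 1.0  # nothing was expected, so nothing is missing
    expected_set = {name.lower() for name in expected}
    found_set = {name.lower() for name in found}
    return len(expected_set & found_set) / len(expected_set)


coverage = stream_coverage(
    CONNECTORS[0]["expected_streams"], ["characters", "episodes"]
)
print(f"dataset version: {dataset_hash(CONNECTORS)}, coverage: {coverage:.2f}")
```

Hashing the canonical serialization means a dataset name can carry the hash as a suffix, so editing the YAML yields a new dataset version instead of silently reusing an old one.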

Agent build pipeline refactor:

Refactored run_connector_build and run_manager_developer_build in run.py to return lists of RunResult objects instead of None, enabling collection and evaluation of build results. Both functions now handle errors gracefully and return partial results if interrupted.
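
The "return partial results" behavior can be sketched roughly as follows. This is an illustration under assumptions, not the code in run.py: `RunResult` here is a hypothetical stand-in dataclass, and `run_phases` is an invented helper name.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunResult:
    """Hypothetical stand-in for the RunResult objects returned by run.py."""

    phase: str
    output: str


def run_phases(phases: list[tuple[str, Callable[[], str]]]) -> list[RunResult]:
    """Run each phase in order and return whatever results were collected."""
    results: list[RunResult] = []
    try:
        for name, step in phases:
            results.append(RunResult(phase=name, output=step()))
    except KeyboardInterrupt:
        # Interrupted by the user: keep the results gathered so far.
        print("Build interrupted; returning partial results.")
    except Exception as exc:
        # A phase failed: report it and still return earlier results.
        print(f"Build failed: {exc}; returning partial results.")
    return results


def _failing_step() -> str:
    raise RuntimeError("boom")


partial = run_phases(
    [("plan", lambda: "ok"), ("build", _failing_step), ("test", lambda: "skipped")]
)
print([result.phase for result in partial])
```

Returning a list in every case lets the eval task inspect whatever the build produced, even when a later phase fails.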

Dependency and CLI updates:

Added required dependencies for Phoenix evaluation, pandas, YAML, and OpenInference instrumentation in pyproject.toml.
Added a new CLI task run-evals in poe_tasks.toml to run the evaluation workflow.

github-actions · 2025-09-23T03:44:01Z

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Testing This Branch via MCP

To test the changes in this specific branch with an MCP client like Claude Desktop, use the following configuration:

{
  "mcpServers": {
    "connector-builder-mcp-dev": {
      "command": "uvx",
      "args": ["--from", "git+https://github.com/airbytehq/connector-builder-mcp.git@pedro/evals", "connector-builder-mcp"]
    }
  }
}

Testing This Branch via CLI

You can test this version of the MCP Server using the following CLI snippet:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/connector-builder-mcp.git@pedro/evals#egg=airbyte-connector-builder-mcp' --help

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

/autofix - Fixes most formatting and linting issues
/poe <command> - Runs any poe command in the uv virtual environment
/poe build-connector prompt="Star Wars API" - Run the connector builder using the Star Wars API.

github-actions · 2025-09-23T03:45:38Z

PyTest Results (Full)

0 tests 0 ✅ 0s ⏱️
0 suites 0 💤
0 files 0 ❌

Results for commit 8866540.

♻️ This comment has been updated with latest results.

github-actions · 2025-09-23T03:45:46Z

PyTest Results (Fast)

0 tests ±0 0 ✅ ±0 0s ⏱️ ±0s
0 suites ±0 0 💤 ±0
0 files ±0 0 ❌ ±0

Results for commit 8866540. ± Comparison against base commit 6624d45.

♻️ This comment has been updated with latest results.

# Conflicts: # connector_builder_agents/src/run.py

connector_builder_agents/src/evals/data/connectors.yaml

aaronsteers

Everything here looks good to me - especially for a first iteration, I think this is strong. 💪

Approving for merge when ready - or re-request my review if you make substantive changes that need another pair of eyes. Thanks!

aaronsteers · 2025-09-26T20:09:10Z

connector_builder_agents/src/evals/__init__.py

Not a blocker for the PR, but we can use this space for eval module-level docs. Specifically, the docstring we add here will get rendered in the autogenerated pdoc docs.
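
As a sketch of that suggestion: the docstring text below is invented for illustration and is not the actual content of `__init__.py`. The snippet uses the standard-library `ast` module only to show that the first string literal in a module is what gets treated as its docstring, which is the text a documentation generator would pick up.

```python
import ast

# Hypothetical contents for connector_builder_agents/src/evals/__init__.py.
# The wording is an assumption made for this example.
INIT_SOURCE = '''"""Evaluation framework for the connector builder agents.

Run the evals with the `run-evals` poe task.
"""
'''

# The first string literal in a module becomes its docstring.
module_doc = ast.get_docstring(ast.parse(INIT_SOURCE))
print(module_doc.splitlines()[0])
```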

pedroslopez added 3 commits September 23, 2025 16:06

chore: refactor to remove global state

8c5f444

format/lint

4c2f11e

wip

83fa599

pedroslopez force-pushed the pedro/evals branch from 31ab486 to 83fa599 Compare September 23, 2025 20:10

pedroslopez changed the base branch from main to pedro/no-globals September 23, 2025 20:10

pedroslopez added 5 commits September 23, 2025 16:33

update to work with multiple

95e1f18

wip, phoenix

af0f496

expected streams eval

ab94fb4

cleanup

1fbb187

add rick/morty, proper parsing of output keys

60ce721

pedroslopez changed the title ~~wip~~ feat: add evals using arize phoenix Sep 26, 2025

add metadata, trim output

a1292be

github-actions bot added the enhancement New feature or request label Sep 26, 2025

pedroslopez added 6 commits September 26, 2025 01:11

set eval model metadata correctly

ac5f477

logger

5b35082

Merge branch 'main' into pedro/no-globals

295aa4d

# Conflicts: # connector_builder_agents/src/run.py

Merge branch 'pedro/no-globals' into pedro/evals

81e91bc

cleanup

0dc7a20

boop

8b06985

aaronsteers reviewed Sep 26, 2025

connector_builder_agents/src/evals/data/connectors.yaml (resolved)

aaronsteers approved these changes Sep 26, 2025

pedroslopez marked this pull request as ready for review September 26, 2025 22:24

Base automatically changed from pedro/no-globals to main September 30, 2025 03:31

Merge branch 'main' into pedro/evals

8866540

# Conflicts: # connector_builder_agents/src/run.py

pedroslopez merged commit 0dacc73 into main Sep 30, 2025
15 checks passed

pedroslopez deleted the pedro/evals branch September 30, 2025 16:02

aaronsteers linked an issue Sep 30, 2025 that may be closed by this pull request

🧪 Tests: Implement an evals framework to measure success and performance #68

Closed


2 participants
