Conversation

@pedroslopez pedroslopez commented Oct 8, 2025

This pull request enhances the evaluation CLI and dataset management by allowing users to specify a custom dataset prefix, improving experiment lookup logic, and making the code more robust when searching for prior experiments. The main focus is on increasing flexibility for dataset naming and ensuring reliable retrieval of previous experiment results, even when datasets change due to test set updates.

Summary by CodeRabbit

  • New Features

    • Added a --dataset-prefix CLI option (default: "builder-connectors") used when creating evaluation datasets.
    • Dataset names now include the chosen prefix and indicate when connector filtering is applied.
    • Prior experiments can be discovered across datasets that share the same prefix when none exist locally.
  • Bug Fixes

    • More resilient prior-experiment lookup with per-dataset error handling to skip failed fetches.
  • Documentation

    • CLI help updated to explain the dataset-prefix option.

coderabbitai bot commented Oct 8, 2025

📝 Walkthrough

Walkthrough

Adds a dataset_prefix CLI option forwarded into the eval runtime and Phoenix dataset creation; renames connector filtering parameter to filtered_connectors; dataset naming now includes the prefix; prior-experiment lookup falls back to datasets sharing the same prefix when none exist on the current dataset.

Changes

Cohort / File(s) Summary of Changes
CLI: dataset prefix arg
connector_builder_agents/src/evals/cli.py
Adds --dataset-prefix (dataset_prefix, default "builder-connectors"); logs the prefix and forwards it to the evaluation runner (run_evals_main).
Dataset creation & filtering
connector_builder_agents/src/evals/dataset.py
Renames connectors → filtered_connectors; get_dataset_with_hash(filtered_connectors: ...) updated; get_or_create_phoenix_dataset(filtered_connectors: ..., *, dataset_prefix: str) builds the dataset name as "filtered-{dataset_prefix}-{hash}" when filtered, or "{dataset_prefix}-{hash}" otherwise; attaches is_filtered/filtered_connectors metadata and updates docstrings/logs.
Entrypoint propagation
connector_builder_agents/src/evals/phoenix_run.py
async def main(connectors: list[str] | None, ...) gains a keyword-only dataset_prefix parameter, which is forwarded to get_or_create_phoenix_dataset.
Prior-experiment cross-dataset fallback
connector_builder_agents/src/evals/summary.py
find_prior_experiment extended to, when no priors on current dataset, derive a dataset prefix, list datasets with that prefix (excluding filtered datasets), aggregate experiments across matches with per-dataset error handling, and resiliently fetch full prior experiments (try/except per prior).
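The fallback described above can be sketched as follows. This is a minimal illustration of the control flow, not the actual implementation: the injected callables list_datasets and get_experiments are hypothetical stand-ins for Phoenix client calls.

```python
import logging

logger = logging.getLogger(__name__)


def find_prior_experiments_across_datasets(
    current_dataset_name: str,
    current_dataset_id: str,
    list_datasets,      # () -> list[dict] with "id" and "name" keys
    get_experiments,    # (dataset_id) -> list of experiments; may raise
) -> list:
    """Aggregate prior experiments from sibling datasets sharing the prefix."""
    # Derive the prefix by dropping the trailing "-<hash>" segment.
    dataset_prefix = current_dataset_name.rsplit("-", 1)[0]
    priors = []
    for ds in list_datasets():
        name = ds.get("name", "")
        # Skip the current dataset and any filtered datasets.
        if ds.get("id") == current_dataset_id:
            continue
        if not name.startswith(dataset_prefix + "-") or name.startswith("filtered-"):
            continue
        try:
            priors.extend(get_experiments(ds["id"]))
        except Exception as exc:
            # Per-dataset error handling: skip failed fetches, keep searching.
            logger.warning("Skipping dataset %s: %s", ds.get("id"), exc)
    return priors
```

Under this sketch, a failure to fetch one dataset's experiments only logs a warning, so priors from the remaining datasets are still collected.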

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant U as User
  participant CLI as evals/cli.py
  participant PR as phoenix_run.main
  participant DS as dataset.get_or_create_phoenix_dataset
  participant PX as Phoenix API

  U->>CLI: run --dataset-prefix=<prefix> [--connectors ...]
  CLI->>PR: main(connectors, dataset_prefix=<prefix>)
  PR->>DS: get_or_create_phoenix_dataset(filtered_connectors=connectors, dataset_prefix=<prefix>)
  DS->>PX: find/create dataset "<prefix>-<hash>" or "filtered-<prefix>-<hash>"
  PX-->>DS: Dataset
  DS-->>PR: Dataset
  PR-->>U: Run evaluations using dataset
sequenceDiagram
  autonumber
  participant SUM as summary.find_prior_experiment
  participant PX as Phoenix API

  SUM->>PX: Get current dataset + experiments
  alt Experiments with eval runs exist
    PX-->>SUM: Experiments
  else No experiments found
    SUM->>SUM: Derive prefix from dataset name (before last "-")
    SUM->>PX: List datasets starting with prefix
    PX-->>SUM: Matching datasets
    loop For each matching dataset
      SUM->>PX: Get experiments (try/except)
      PX-->>SUM: Experiments or error
    end
    SUM->>SUM: Aggregate experiments with eval runs
  end
  SUM-->>SUM: Return selected prior experiment or None

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • aaronsteers

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title clearly and concisely describes the core enhancement of enabling prior experiment lookup across datasets, directly reflecting the main summary.py changes without extraneous detail, making it specific and understandable at a glance.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 00e97b0 and 7873fa5.

📒 Files selected for processing (2)
  • connector_builder_agents/src/evals/dataset.py (3 hunks)
  • connector_builder_agents/src/evals/summary.py (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Test Connector Build (JSONPlaceholder)
  • GitHub Check: Test Connector Build (PokemonTGG)
🔇 Additional comments (5)
connector_builder_agents/src/evals/dataset.py (2)

18-23: LGTM: Clear parameter rename.

The rename from connectors to filtered_connectors improves clarity and accurately describes the parameter's purpose.


63-78: LGTM: Well-structured dataset naming and prefix support.

The addition of dataset_prefix as a keyword-only parameter and the conditional prefixing logic for filtered datasets (filtered-{dataset_prefix}-{dataset_hash} vs {dataset_prefix}-{dataset_hash}) provides clear differentiation between dataset types and enables the cross-dataset prior discovery feature.
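The conditional naming can be captured in a small pure function. build_dataset_name is a hypothetical helper written here for illustration; the real code builds the string inline in get_or_create_phoenix_dataset.

```python
def build_dataset_name(dataset_prefix: str, dataset_hash: str, filtered: bool) -> str:
    """Mirror the conditional prefixing described above."""
    if filtered:
        return f"filtered-{dataset_prefix}-{dataset_hash}"
    return f"{dataset_prefix}-{dataset_hash}"
```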

connector_builder_agents/src/evals/summary.py (3)

74-93: LGTM: Cross-dataset discovery with appropriate filtered dataset handling.

The fallback to cross-dataset search when no prior experiments exist on the current dataset addresses the issue of dataset recreation when test sets change. The logic correctly skips filtered datasets (line 88-89) since they target specific connector subsets and cross-dataset comparison would be inappropriate. The prefix extraction using rsplit("-", 1)[0] works correctly since it splits from the right, preserving multi-part prefixes (e.g., "my-cool-prefix-abc123" → "my-cool-prefix").
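The right-split behavior is easy to verify directly:

```python
# rsplit("-", 1) splits exactly once from the right, so only the trailing
# hash segment is removed and multi-part prefixes survive intact.
assert "my-cool-prefix-abc123".rsplit("-", 1)[0] == "my-cool-prefix"
assert "builder-connectors-abc123".rsplit("-", 1)[0] == "builder-connectors"
```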


100-125: LGTM: Robust dataset filtering and experiment aggregation.

The filtering logic correctly identifies datasets sharing the same prefix while excluding:

  1. The current dataset (ds.get("id") != dataset_id)
  2. Filtered datasets (implicitly, since they start with "filtered-" not {dataset_prefix}-)

The per-dataset error handling (lines 115-120) ensures the aggregation continues even if some datasets are inaccessible, with appropriate warning logs.


139-148: LGTM: Resilient prior experiment retrieval.

Wrapping individual prior experiment fetches in try-except blocks prevents a single fetch failure from blocking the entire prior discovery process. The warning logs provide visibility while allowing graceful degradation.
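The per-item try/except pattern can be sketched as below; get_experiment is a hypothetical stand-in for the Phoenix client call, not the real API.

```python
import logging

logger = logging.getLogger(__name__)


def fetch_prior_experiments(prior_ids, get_experiment):
    """Fetch each prior experiment, skipping any whose fetch fails."""
    fetched = []
    for prior_id in prior_ids:
        try:
            fetched.append(get_experiment(prior_id))
        except Exception as exc:
            # Log and continue: one bad fetch should not block discovery.
            logger.warning("Could not fetch prior experiment %s: %s", prior_id, exc)
    return fetched
```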


Comment @coderabbitai help to get the list of available commands and usage tips.

github-actions bot commented Oct 8, 2025

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Testing This Branch via MCP

To test the changes in this specific branch with an MCP client like Claude Desktop, use the following configuration:

{
  "mcpServers": {
    "connector-builder-mcp-dev": {
      "command": "uvx",
      "args": ["--from", "git+https://github.com/airbytehq/connector-builder-mcp.git@pedro/dataset-workaround", "connector-builder-mcp"]
    }
  }
}

Testing This Branch via CLI

You can test this version of the MCP Server using the following CLI snippet:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/connector-builder-mcp.git@pedro/dataset-workaround#egg=airbyte-connector-builder-mcp' --help

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poe <command> - Runs any poe command in the uv virtual environment
  • /poe build-connector prompt="Star Wars API" - Run the connector builder using the Star Wars API.


github-actions bot commented Oct 8, 2025

PyTest Results (Fast)

0 tests  ±0   0 ✅ ±0   0s ⏱️ ±0s
0 suites ±0   0 💤 ±0 
0 files   ±0   0 ❌ ±0 

Results for commit 7873fa5. ± Comparison against base commit 56e7f4a.

♻️ This comment has been updated with latest results.

@pedroslopez pedroslopez changed the title from "Pedro/dataset workaround" to "feat: find prior experiment run across datasets" Oct 8, 2025
@pedroslopez pedroslopez marked this pull request as ready for review October 8, 2025 23:41
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 56e7f4a and b7fef0d.

📒 Files selected for processing (4)
  • connector_builder_agents/src/evals/cli.py (3 hunks)
  • connector_builder_agents/src/evals/dataset.py (1 hunks)
  • connector_builder_agents/src/evals/phoenix_run.py (1 hunks)
  • connector_builder_agents/src/evals/summary.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
connector_builder_agents/src/evals/phoenix_run.py (2)
connector_builder_agents/src/evals/cli.py (1)
  • main (72-116)
connector_builder_agents/src/evals/dataset.py (1)
  • get_or_create_phoenix_dataset (63-94)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: Build PokemonTGG Connector
  • GitHub Check: Build Hubspot Connector
  • GitHub Check: Build JSONPlaceholder Connector
  • GitHub Check: Test Connector Build (PokemonTGG)
  • GitHub Check: Test Connector Build (JSONPlaceholder)
  • GitHub Check: Test Connector Build (PokemonTGG)
  • GitHub Check: Test Connector Build (JSONPlaceholder)
🔇 Additional comments (5)
connector_builder_agents/src/evals/dataset.py (1)

63-65: LGTM! Clean parameterization of dataset prefix.

The function signature change properly introduces dataset_prefix as a keyword-only parameter, and the implementation consistently replaces the hardcoded prefix with the dynamic parameter. The docstring is updated appropriately.

Also applies to: 70-70, 77-77, 79-79

connector_builder_agents/src/evals/cli.py (1)

8-8: LGTM! Proper CLI argument integration.

The new --dataset-prefix argument is well-integrated with appropriate help text, sensible default, and proper propagation to the evaluation runner. The added logging provides visibility into the configuration being used.

Also applies to: 39-39, 44-45, 96-101
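As a minimal sketch of such an option using argparse (the option name and default come from the PR description; the surrounding parser setup and help text are illustrative, not the actual CLI code):

```python
import argparse

parser = argparse.ArgumentParser(description="Run connector-builder evals")
parser.add_argument(
    "--dataset-prefix",
    dest="dataset_prefix",
    default="builder-connectors",
    help="Prefix used when naming the Phoenix evaluation dataset.",
)

# Default applies when the flag is omitted; an explicit value overrides it.
args = parser.parse_args(["--dataset-prefix", "my-prefix"])
```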

connector_builder_agents/src/evals/phoenix_run.py (1)

41-41: LGTM! Parameter properly threaded through.

The dataset_prefix parameter is correctly added to the function signature and passed to the dataset creation function, maintaining the keyword-only parameter pattern.

Also applies to: 48-48
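The keyword-only threading pattern can be sketched with stubs; the bodies below are placeholders standing in for the real dataset creation in dataset.py, and the "<hash>" literal marks where the real code inserts a content hash.

```python
import asyncio


def get_or_create_phoenix_dataset(filtered_connectors=None, *, dataset_prefix):
    # Stub: the real function creates or fetches a Phoenix dataset.
    return {"name": f"{dataset_prefix}-<hash>", "filtered": bool(filtered_connectors)}


async def main(connectors=None, *, dataset_prefix="builder-connectors"):
    # dataset_prefix is accepted keyword-only and forwarded unchanged.
    return get_or_create_phoenix_dataset(
        filtered_connectors=connectors, dataset_prefix=dataset_prefix
    )


dataset = asyncio.run(main(["source-pokeapi"], dataset_prefix="custom"))
```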

connector_builder_agents/src/evals/summary.py (2)

110-116: Good error handling for robustness.

The try-except blocks around individual dataset and experiment fetches ensure the cross-dataset search continues even when some datasets or experiments are inaccessible. The warning logs provide visibility into failures without breaking the overall flow.

Also applies to: 134-143


124-126: Clear early return improves readability.

The explicit check and early return when no prior experiments are found (after cross-dataset search) makes the control flow easier to follow.

@github-actions github-actions bot added the enhancement New feature or request label Oct 9, 2025
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b7fef0d and fa8fe32.

📒 Files selected for processing (3)
  • connector_builder_agents/src/evals/dataset.py (4 hunks)
  • connector_builder_agents/src/evals/phoenix_run.py (1 hunks)
  • connector_builder_agents/src/evals/summary.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
connector_builder_agents/src/evals/phoenix_run.py (2)
connector_builder_agents/src/evals/cli.py (1)
  • main (72-116)
connector_builder_agents/src/evals/dataset.py (1)
  • get_or_create_phoenix_dataset (63-98)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Test Connector Build (JSONPlaceholder)
  • GitHub Check: Pytest (Fast)

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
connector_builder_agents/src/evals/dataset.py (1)

63-65: Consider adding input validation for dataset_prefix.

The dataset_prefix parameter is well-designed as keyword-only, but there's no validation to ensure it:

  • Doesn't contain problematic characters (e.g., spaces, special chars)
  • Doesn't end with a dash (which could create dataset names like prefix--hash)
  • Isn't empty

Invalid prefixes could cause issues with the dataset matching logic in summary.py (lines 100-101) that assumes the format {prefix}-{hash}.

Apply this diff to add validation:

 def get_or_create_phoenix_dataset(
     filtered_connectors: list[str] | None = None, *, dataset_prefix: str
 ) -> Dataset:
     """Get or create a Phoenix dataset for the evals config.
 
     Args:
         filtered_connectors: Optional list of connector names to filter by.
         dataset_prefix: Prefix for the dataset name.
     """
+    # Validate dataset_prefix
+    if not dataset_prefix or not dataset_prefix.strip():
+        raise ValueError("dataset_prefix cannot be empty")
+    if dataset_prefix.endswith("-"):
+        raise ValueError("dataset_prefix cannot end with a dash")
+    if not dataset_prefix.replace("-", "").replace("_", "").isalnum():
+        raise ValueError("dataset_prefix can only contain alphanumeric characters, dashes, and underscores")
+    
     dataframe, dataset_hash = get_dataset_with_hash(filtered_connectors=filtered_connectors)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fa79ae1 and 00e97b0.

📒 Files selected for processing (2)
  • connector_builder_agents/src/evals/dataset.py (3 hunks)
  • connector_builder_agents/src/evals/summary.py (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Test Connector Build (PokemonTGG)
  • GitHub Check: Test Connector Build (JSONPlaceholder)
🔇 Additional comments (9)
connector_builder_agents/src/evals/summary.py (5)

42-43: LGTM!

The docstring accurately describes the new fallback behavior for cross-dataset search.


93-102: Prefix matching logic appears sound.

The filter ds.get("name", "").startswith(dataset_prefix + "-") correctly matches datasets with the extracted prefix followed by a dash, and excludes the current dataset.


108-122: LGTM! Good defensive programming.

The error handling ensures the search continues even if individual datasets cannot be accessed, and the logging provides visibility into failures.


136-145: LGTM! Improved resilience.

Wrapping each get_experiment call in a try/except block ensures that failures to fetch individual experiments don't prevent finding other prior experiments.


87-91: Ignore incorrect prefix parsing concern
The Phoenix dataset name is always constructed as {dataset_prefix}-{hash} (filtering only alters the hash), so current_dataset_name.rsplit("-", 1)[0] reliably recovers the original dataset_prefix.

Likely an incorrect or invalid review comment.

connector_builder_agents/src/evals/dataset.py (4)

18-18: LGTM! More descriptive parameter name.

Renaming connectors to filtered_connectors better communicates the parameter's purpose.


22-22: LGTM! Consistent parameter usage.

All references correctly updated to use filtered_connectors throughout the function, including in conditionals, dataframe operations, logging, and error messages.

Also applies to: 35-46


72-72: LGTM! Consistent with parameter rename.

The call correctly passes filtered_connectors using keyword argument syntax.


74-74: LGTM! Enables customizable dataset naming.

The dataset name construction now uses the provided dataset_prefix instead of the hardcoded "builder-connectors", enabling the flexibility described in the PR objectives.

@pedroslopez pedroslopez merged commit e5f25e6 into main Oct 10, 2025
16 checks passed
@pedroslopez pedroslopez deleted the pedro/dataset-workaround branch October 10, 2025 16:18