chore(evals): Unified LLM-based connector evaluation #140
base: main
Conversation
- Consolidates readiness_eval and streams_eval into a single unified_eval
- Uses LLM for all evaluation criteria (readiness + streams)
- Reduces code complexity and simplifies maintenance
- Easier to extend: new criteria via prompt edits vs code changes
- Maintains backward compatibility with Phoenix reporting

Benefits:
- Single consistent evaluation pattern
- No manual YAML parsing or set operations
- Natural language explanations in structured output
- Simpler to add new evaluation dimensions

Technical changes:
- evaluators.py: Replaced 2 evaluators with unified_eval function
- phoenix_run.py: Updated imports and evaluator list
- Uses structured LLM output with READINESS/STREAMS format
- Returns dict with separate scores for Phoenix compatibility

Co-Authored-By: AJ Steers <[email protected]>
Original prompt from AJ Steers

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:
📝 Walkthrough

Replaces separate readiness and streams evaluators with a single LLM-driven unified evaluator.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
actor Runner as phoenix_run.py
participant Eval as evaluators.unified_eval
participant LLM as UNIFIED_EVAL_MODEL
participant OTel as Logging/OTel
Runner->>Eval: unified_eval(expected, output)
Eval->>LLM: Single prompt with readiness_report, manifest, expected_streams
LLM-->>Eval: Text response with READINESS and STREAMS fields
Eval->>OTel: Tag readiness_score, streams_score, expected_streams
Eval-->>Runner: { "readiness": 0|1, "streams": 0.0–1.0 }
rect rgba(240,250,255,0.6)
note over Eval,LLM: Single LLM call replaces separate readiness/streams flows
end
alt Missing artifacts or parse error
Eval->>OTel: Warn and tag error fields
Eval-->>Runner: { "readiness": 0, "streams": 0.0 }
    end
```
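For orientation, here is a condensed Python sketch of the flow the diagram describes, pieced together from the diffs quoted later in this review. The prompt text, the artifact lookup on `output`, and the `unified_eval_sketch` name are illustrative assumptions, not the verbatim `evaluators.py` implementation.

```python
import json
import logging

import pandas as pd
from phoenix.evals import OpenAIModel, llm_classify

logger = logging.getLogger(__name__)

UNIFIED_EVAL_MODEL = "gpt-4o"  # default discussed (and questioned) later in this review

# Illustrative stand-in for UNIFIED_EVAL_TEMPLATE; the real prompt lives in evaluators.py.
UNIFIED_EVAL_TEMPLATE = """\
Readiness report:
{readiness_report}

Manifest YAML:
{manifest}

Expected streams (JSON list): {expected_streams}

Respond with exactly two lines:
READINESS: <PASSED|FAILED>
STREAMS: <fraction of expected streams present, 0.0-1.0>
"""


def unified_eval_sketch(expected: dict, output: dict) -> dict[str, float]:
    """Score readiness (0/1) and stream coverage (0.0-1.0) with a single LLM call."""
    scores = {"readiness": 0.0, "streams": 0.0}
    try:
        expected_obj = json.loads(expected.get("expected", "{}"))
        eval_df = llm_classify(
            model=OpenAIModel(model=UNIFIED_EVAL_MODEL),
            data=pd.DataFrame(
                [
                    {
                        "readiness_report": output.get("readiness_report", "Not available"),
                        "manifest": output.get("manifest", "Not available"),
                        "expected_streams": json.dumps(expected_obj.get("expected_streams", [])),
                    }
                ]
            ),
            template=UNIFIED_EVAL_TEMPLATE,  # raw template; llm_classify fills the placeholders
            rails=None,
            provide_explanation=True,
        )
        response_text = str(eval_df["label"][0])

        # Parse the two-line READINESS/STREAMS response into separate scores.
        for line in response_text.splitlines():
            key, _, value = line.strip().partition(":")
            if key.strip().upper() == "READINESS":
                scores["readiness"] = 1.0 if value.strip().upper() == "PASSED" else 0.0
            elif key.strip().upper() == "STREAMS":
                try:
                    scores["streams"] = max(0.0, min(1.0, float(value)))
                except ValueError:
                    logger.warning("Could not parse streams score from: %s", value)
    except Exception:
        # Any missing artifact or parse failure falls back to zero scores.
        logger.exception("unified_eval failed; returning fallback scores")
        return {"readiness": 0.0, "streams": 0.0}
    return scores
```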
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs
Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
👋 Welcome to the Airbyte Connector Builder MCP!

Thank you for your contribution! Here are some helpful tips and reminders for your convenience.

Testing This Branch via MCP

To test the changes in this specific branch with an MCP client like Claude Desktop, use the following configuration:

```json
{
  "mcpServers": {
    "connector-builder-mcp-dev": {
      "command": "uvx",
      "args": ["--from", "git+https://github.com/airbytehq/connector-builder-mcp.git@devin/1760319110-llm-unified-eval", "connector-builder-mcp"]
    }
  }
}
```

Testing This Branch via CLI

You can test this version of the MCP Server using the following CLI snippet:

```bash
# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/connector-builder-mcp.git@devin/1760319110-llm-unified-eval#egg=airbyte-connector-builder-mcp' --help
```

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

AI Builder Evaluations

AI builder evaluations run automatically under the following conditions:

A set of standardized evaluations also run on a schedule (Mon/Wed/Fri at midnight UTC) and can be manually triggered via workflow dispatch.

Helpful Resources

If you have any questions, feel free to ask in the PR comments or join our Slack community.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
connector_builder_agents/src/evals/phoenix_run.py (1)
54-70: Support multi-metric evaluator output

- Verify that Phoenix’s run_experiment accepts and persists all keys from a dict-returning unified_eval.
- In connector_builder_agents/src/evals/summary.py (generate_markdown_summary, around lines 613–631), replace the single-score extraction with iteration over all key/value pairs in eval_run.result (a sketch follows below).
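A minimal sketch of that iteration, assuming eval_run.result holds the evaluator's dict and eval_run.name its label (both names inferred from the comment above, not verified against summary.py):

```python
def format_eval_scores(eval_run) -> list[str]:
    """Render one summary line per metric instead of a single score."""
    result = eval_run.result
    if not isinstance(result, dict):
        result = {"score": result}  # keep backward compatibility with single-score evaluators
    return [f"- {eval_run.name} / {metric}: {value}" for metric, value in sorted(result.items())]
```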
🧹 Nitpick comments (6)
connector_builder_agents/src/evals/phoenix_run.py (2)
59-74: Close AsyncClient to avoid resource leaks

Use the async context manager to ensure proper cleanup.

```diff
-    client = AsyncClient()
-    logger.info(f"Starting experiment: {experiment_name}")
-    experiment = await client.experiments.run_experiment(
+    async with AsyncClient() as client:
+        logger.info(f"Starting experiment: {experiment_name}")
+        experiment = await client.experiments.run_experiment(
            dataset=dataset,
            task=run_connector_build_task,
            evaluators=evaluators,
            experiment_name=experiment_name,
            experiment_metadata={
                "developer_model": EVAL_DEVELOPER_MODEL,
                "manager_model": EVAL_MANAGER_MODEL,
                "unified_eval_model": UNIFIED_EVAL_MODEL,
            },
            timeout=1800,
-    )
+        )
```
85-85: Avoid sys.exit inside async context

Raise SystemExit to allow upstream callers to handle shutdown cleanly.

```diff
-    sys.exit(1)
+    raise SystemExit(1)
```

connector_builder_agents/src/evals/evaluators.py (4)
114-116: Handle both dict and JSON-string forms of expected and guard bad JSON

Current code assumes expected["expected"] is a JSON string. Make it robust to dict or invalid JSON.

```diff
-    expected_obj = json.loads(expected.get("expected", "{}"))
-    expected_streams = expected_obj.get("expected_streams", [])
+    exp = expected.get("expected", {})
+    if isinstance(exp, str):
+        try:
+            expected_obj = json.loads(exp)
+        except json.JSONDecodeError:
+            logger.warning("Invalid JSON in 'expected'; defaulting to empty dict")
+            expected_obj = {}
+    elif isinstance(exp, dict):
+        expected_obj = exp
+    else:
+        expected_obj = {}
+    expected_streams = expected_obj.get("expected_streams") or []
```
126-147: Align llm_classify usage: pass templated prompt, not pre-formatted string

llm_classify typically expands placeholders from the DataFrame. Remove pre-formatting to avoid duplication and keep the API idiomatic.

```diff
-    prompt = UNIFIED_EVAL_TEMPLATE.format(
-        readiness_report=readiness_report,
-        manifest=manifest,
-        expected_streams=json.dumps(expected_streams),
-    )
-
     try:
         eval_df = llm_classify(
             model=OpenAIModel(model=UNIFIED_EVAL_MODEL),
             data=pd.DataFrame(
                 [
                     {
                         "readiness_report": readiness_report,
                         "manifest": manifest,
                         "expected_streams": json.dumps(expected_streams),
                     }
                 ]
             ),
-            template=prompt,
+            template=UNIFIED_EVAL_TEMPLATE,
             rails=None,
             provide_explanation=True,
         )
```
151-168: Be defensive when extracting the LLM response text

Assuming a "label" column may break. Fall back to common alternatives and fail fast if none exist.

```diff
-    response_text = eval_df["label"][0]
+    # Try common columns in order of likelihood
+    if "label" in eval_df.columns:
+        response_text = str(eval_df.at[0, "label"])
+    elif "output" in eval_df.columns:
+        response_text = str(eval_df.at[0, "output"])
+    elif "response" in eval_df.columns:
+        response_text = str(eval_df.at[0, "response"])
+    else:
+        logger.warning(f"No expected text column in eval_df; columns={list(eval_df.columns)}")
+        return {"readiness": 0.0, "streams": 0.0}
```
17-17: Make the model configurable via env with a safe default

Allows switching models without code changes.

```diff
+import os
@@
-UNIFIED_EVAL_MODEL = "gpt-4o"
+UNIFIED_EVAL_MODEL = os.getenv("UNIFIED_EVAL_MODEL", "gpt-4o")
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- connector_builder_agents/src/evals/evaluators.py (3 hunks)
- connector_builder_agents/src/evals/phoenix_run.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
connector_builder_agents/src/evals/phoenix_run.py (1)
- connector_builder_agents/src/evals/evaluators.py (1): unified_eval (89-180)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pytest (Fast)
🔇 Additional comments (1)
connector_builder_agents/src/evals/evaluators.py (1)
121-125: OTel attribute array compatibility

expected_streams is a list; most exporters support array attributes, but some don’t. If you observe drops, stringify.

Option if needed:

```diff
-    span.set_attribute("expected_streams", expected_streams)
+    span.set_attribute("expected_streams", expected_streams)  # or ",".join(expected_streams)
```
- Enhanced module-level docstring with detailed description
- Added docstring for UNIFIED_EVAL_MODEL constant
- Expanded unified_eval function docstring with:
  - Detailed description of evaluation approach
  - Comprehensive Args and Returns documentation
  - Usage example
- Addresses CodeRabbit feedback on insufficient docstring coverage

Co-Authored-By: AJ Steers <[email protected]>
Actionable comments posted: 1
🧹 Nitpick comments (3)
connector_builder_agents/src/evals/evaluators.py (3)
134-141: Consider validating artifact content.

The function checks if artifacts are missing ("Not available") but doesn't validate that they contain meaningful content (e.g., non-empty strings). An empty readiness report or manifest could lead to uninformative LLM evaluations that default to 0.0 scores without clear diagnostic information.

Consider adding validation after line 141:

```python
if readiness_report == "Not available":
    logger.warning("No readiness report found")
elif not readiness_report.strip():
    logger.warning("Readiness report is empty")

if manifest == "Not available":
    logger.warning("No manifest found")
elif not manifest.strip():
    logger.warning("Manifest is empty")
```
192-197: Silent clamping may hide LLM format issues.

When the LLM returns a streams score outside the 0.0-1.0 range, it's silently clamped (line 194). While this prevents invalid scores, it also hides potential issues where the LLM isn't following the expected format (e.g., returning a percentage like "67" instead of "0.67").

Consider logging when clamping occurs to aid debugging:

```diff
 try:
     streams_score = float(streams_value)
+    if streams_score < 0.0 or streams_score > 1.0:
+        logger.warning(f"Streams score {streams_score} outside valid range [0.0, 1.0], clamping")
     streams_score = max(0.0, min(1.0, streams_score))
 except ValueError:
     logger.warning(f"Could not parse streams score from: {streams_value}")
     streams_score = 0.0
```
26-27: Pin GPT-4o to a specific version

Using the generic "gpt-4o" model may introduce breaking changes when OpenAI rolls out updates. Fetch the current model list (for example, via openai api models.list) and replace the constant with a version-suffixed identifier (e.g., "gpt-4o-2024-08-06").
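A minimal way to act on this note (and the earlier env-override nitpick) might look like the following; the dated identifier is only an example and should be checked against the current model list:

```python
import os

# Pin a dated snapshot for reproducible evals, but allow an override via env var.
UNIFIED_EVAL_MODEL = os.getenv("UNIFIED_EVAL_MODEL", "gpt-4o-2024-08-06")
```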
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)

- connector_builder_agents/src/evals/evaluators.py (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pytest (Fast)
🔇 Additional comments (4)
connector_builder_agents/src/evals/evaluators.py (4)
29-96: Verify LLM accuracy for YAML stream extraction.

The prompt instructs the LLM to parse YAML and extract stream names from the streams: section. This replaces the previous deterministic YAML parsing approach (using the yaml library and set intersection) with an LLM-based approach, which may be less reliable for structured data extraction.

While this simplifies the codebase, LLMs can misparse edge cases such as:
- Complex YAML with anchors, aliases, or multiline strings
- Streams with special characters or unusual formatting
- Malformed YAML that should fail parsing
Consider validating the accuracy of stream extraction by running the full evaluation suite and comparing results against the previous programmatic approach. If accuracy degrades significantly, consider a hybrid approach: use programmatic YAML parsing for stream extraction while keeping the LLM-based readiness evaluation.
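If the hybrid route is taken, the deterministic half could look roughly like this sketch; it assumes the manifest declares a top-level streams: list of entries with a name key, and it is not the original streams_eval code:

```python
import yaml


def deterministic_streams_score(manifest_yaml: str, expected_streams: list[str]) -> float:
    """Fraction of expected stream names declared in the manifest's streams: section."""
    if not expected_streams:
        return 0.0
    try:
        manifest = yaml.safe_load(manifest_yaml) or {}
    except yaml.YAMLError:
        return 0.0  # malformed YAML scores zero instead of guessing
    declared = {
        stream.get("name")
        for stream in manifest.get("streams", [])
        if isinstance(stream, dict)
    }
    return len(declared & set(expected_streams)) / len(expected_streams)
```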
162-176: LGTM: Well-structured LLM invocation.

The use of llm_classify with a DataFrame input and the structured prompt template follows Phoenix best practices. The provide_explanation=True parameter is helpful for debugging evaluation decisions.
185-190: LGTM: Robust response parsing with defaults.

The line-by-line parsing approach is resilient to variations in LLM output formatting (e.g., extra whitespace, additional lines). Using defaults of 0.0 for both scores when parsing fails is a sensible defensive strategy.
207-209: LGTM: Appropriate exception handling.

Catching all exceptions and returning default scores ensures the evaluation doesn't crash, while exc_info=True provides sufficient debugging information. This is appropriate for an evaluation framework where individual evaluator failures shouldn't halt the entire evaluation run.
Phoenix's llm_classify expects an unformatted template with placeholders that it fills from DataFrame columns. Pre-formatting the template caused 'Missing template variable' errors. Now passing raw template directly. Also added None check for response_text to handle edge cases gracefully. Co-Authored-By: AJ Steers <[email protected]>
Actionable comments posted: 0
🧹 Nitpick comments (1)
connector_builder_agents/src/evals/evaluators.py (1)
183-195: Consider logging when LLM output format is unexpected.

The parsing logic safely defaults to 0.0 when the LLM response doesn't contain "READINESS:" or "STREAMS:" lines, but doesn't explicitly log this case. Adding a log message when neither field is successfully parsed would help diagnose LLM output issues during debugging.

For example, after the parsing loop:

```python
if readiness_score == 0.0 and streams_score == 0.0:
    logger.warning(
        f"LLM response did not match expected format. Response: {response_text[:200]}"
    )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)

- connector_builder_agents/src/evals/evaluators.py (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pytest (Fast)
🔇 Additional comments (2)
connector_builder_agents/src/evals/evaluators.py (2)
134-141: Verify design intent for missing artifact handling.

The code logs warnings when artifacts are missing but continues evaluation, passing "Not available" strings to the LLM template. If both artifacts are missing, the evaluation is essentially meaningless and returning {"readiness": 0.0, "streams": 0.0} immediately would be more explicit. Consider whether this early-exit behavior is preferred, or if the current approach of letting the LLM interpret partial data is intentional.
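If the early-exit behavior is preferred, the guard could be as small as the following sketch (variable names follow the snippets quoted in this review):

```python
if readiness_report == "Not available" and manifest == "Not available":
    logger.warning("Both readiness report and manifest are missing; skipping LLM evaluation")
    return {"readiness": 0.0, "streams": 0.0}
```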
99-207: Unified approach successfully simplifies evaluation architecture.

The consolidation of readiness and streams evaluators into a single LLM-based function achieves the stated objectives: reduces code complexity, eliminates the YAML parsing dependency, and maintains clear separation of concerns via the returned dict structure. The implementation includes appropriate error handling, logging, and OpenTelemetry instrumentation.
Addresses CodeRabbit feedback to gracefully handle JSONDecodeError when parsing the expected criteria. Returns fallback scores (0.0, 0.0) with error logging if the JSON is malformed, preventing crashes. Co-Authored-By: AJ Steers <[email protected]>
Actionable comments posted: 2
```python
try:
    expected_obj = json.loads(expected.get("expected", "{}"))
except json.JSONDecodeError as e:
    logger.error(f"Failed to parse expected JSON: {e}", exc_info=True)
    return {"readiness": 0.0, "streams": 0.0}
expected_streams = expected_obj.get("expected_streams", [])
```
Guard against expected being missing or non-string.

expected.get("expected", "{}") assumes expected is a dict with a JSON string. If expected is None, or if the value is None/non-string (common when upstream omits the payload), this will raise AttributeError/TypeError and crash the evaluator before we can fall back to zeros. We should defensively coerce expected to {} and catch TypeError (and ValueError) alongside JSONDecodeError.
```diff
-    try:
-        expected_obj = json.loads(expected.get("expected", "{}"))
-    except json.JSONDecodeError as e:
+    expected_payload = (expected or {}).get("expected", "{}")
+    try:
+        expected_obj = json.loads(expected_payload)
+    except (json.JSONDecodeError, TypeError, ValueError) as e:
         logger.error(f"Failed to parse expected JSON: {e}", exc_info=True)
         return {"readiness": 0.0, "streams": 0.0}
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-try:
-    expected_obj = json.loads(expected.get("expected", "{}"))
-except json.JSONDecodeError as e:
-    logger.error(f"Failed to parse expected JSON: {e}", exc_info=True)
-    return {"readiness": 0.0, "streams": 0.0}
-expected_streams = expected_obj.get("expected_streams", [])
+expected_payload = (expected or {}).get("expected", "{}")
+try:
+    expected_obj = json.loads(expected_payload)
+except (json.JSONDecodeError, TypeError, ValueError) as e:
+    logger.error(f"Failed to parse expected JSON: {e}", exc_info=True)
+    return {"readiness": 0.0, "streams": 0.0}
+expected_streams = expected_obj.get("expected_streams", [])
```
```python
for line in response_text.strip().split("\n"):
    line = line.strip()
    if line.startswith("READINESS:"):
        readiness_value = line.split(":", 1)[1].strip().upper()
        readiness_score = 1.0 if readiness_value == "PASSED" else 0.0
    elif line.startswith("STREAMS:"):
        streams_value = line.split(":", 1)[1].strip()
        try:
            streams_score = float(streams_value)
            streams_score = max(0.0, min(1.0, streams_score))
        except ValueError:
            logger.warning(f"Could not parse streams score from: {streams_value}")
            streams_score = 0.0
```
Make the READINESS/STREAMS parsing case-insensitive.

LLMs often capitalize only the first letter ("Readiness:", "Streams:") despite the prompt. With the current strict startswith("READINESS:")/startswith("STREAMS:") checks, those valid responses will be ignored and both scores remain 0.0, corrupting the evaluation output. Please normalize the prefix (e.g., compare on line.upper().startswith("READINESS:")) before extracting the value.
```diff
-    for line in response_text.strip().split("\n"):
-        line = line.strip()
-        if line.startswith("READINESS:"):
+    for line in response_text.strip().split("\n"):
+        line = line.strip()
+        upper_line = line.upper()
+        if upper_line.startswith("READINESS:"):
             readiness_value = line.split(":", 1)[1].strip().upper()
             readiness_score = 1.0 if readiness_value == "PASSED" else 0.0
-        elif line.startswith("STREAMS:"):
+        elif upper_line.startswith("STREAMS:"):
             streams_value = line.split(":", 1)[1].strip()
```
That makes the parser resilient to benign casing variations while still enforcing the format for scoring.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-for line in response_text.strip().split("\n"):
-    line = line.strip()
-    if line.startswith("READINESS:"):
-        readiness_value = line.split(":", 1)[1].strip().upper()
-        readiness_score = 1.0 if readiness_value == "PASSED" else 0.0
-    elif line.startswith("STREAMS:"):
-        streams_value = line.split(":", 1)[1].strip()
-        try:
-            streams_score = float(streams_value)
-            streams_score = max(0.0, min(1.0, streams_score))
-        except ValueError:
-            logger.warning(f"Could not parse streams score from: {streams_value}")
-            streams_score = 0.0
+for line in response_text.strip().split("\n"):
+    line = line.strip()
+    upper_line = line.upper()
+    if upper_line.startswith("READINESS:"):
+        readiness_value = line.split(":", 1)[1].strip().upper()
+        readiness_score = 1.0 if readiness_value == "PASSED" else 0.0
+    elif upper_line.startswith("STREAMS:"):
+        streams_value = line.split(":", 1)[1].strip()
+        try:
+            streams_score = float(streams_value)
+            streams_score = max(0.0, min(1.0, streams_score))
+        except ValueError:
+            logger.warning(f"Could not parse streams score from: {streams_value}")
+            streams_score = 0.0
```
🤖 Prompt for AI Agents
In connector_builder_agents/src/evals/evaluators.py around lines 187 to 200, the
parser only matches "READINESS:" and "STREAMS:" exactly which misses valid
variants like "Readiness:" or "streams:"; make the prefix checks
case-insensitive by normalizing the line (e.g., compute an uppercased copy or
use line.lower()) before checking startswith, then extract the value from the
original line (or split using the same normalized index) and proceed with the
existing parsing and error handling so both READINESS and STREAMS are detected
regardless of casing.
refactor: Unified LLM-based connector evaluation
Summary
Replaced the hybrid evaluation approach (1 LLM + 1 programmatic evaluator) with a single unified LLM-based evaluator to reduce complexity and improve maintainability.
Before:
- readiness_eval: LLM-based binary classifier (PASSED/FAILED)
- streams_eval: Programmatic YAML parsing + set intersection for stream matching

After:
- unified_eval: Single LLM evaluator handling both readiness and stream matching (wiring sketch below)
- Returns {"readiness": 0.0-1.0, "streams": 0.0-1.0}
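In phoenix_run.py the wiring change amounts to roughly the following sketch (import path inferred from the file layout; other run_experiment arguments unchanged):

```python
from connector_builder_agents.src.evals.evaluators import unified_eval

# Before: evaluators = [readiness_eval, streams_eval]  (one int score + one float score)
# After: a single dict-returning evaluator carries both scores.
evaluators = [unified_eval]
```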
Key Changes:
- evaluators.py: Single function replaces two separate evaluators
- phoenix_run.py: Uses unified evaluator in evaluators list
- Removed the yaml import (no longer needed)

Review & Testing Checklist for Human
- Run full eval suite and verify LLM output format: The new evaluator parses LLM responses expecting the exact format READINESS: <PASSED|FAILED> and STREAMS: <float>. If the LLM deviates, scores default to 0.0. Test with poe evals run and check that scores are reasonable.
- Validate stream counting accuracy: The old streams_eval was 100% deterministic (exact YAML parsing + set intersection). The new approach asks the LLM to count streams from YAML, which could have variance or errors. Compare results on a few connectors to ensure acceptable accuracy.
- Verify Phoenix compatibility: Changed from two evaluators (returning int/float) to one evaluator (returning a dict). Confirm that Phoenix experiment tracking and summary generation (src/evals/summary.py) correctly handle the dict return type and display both scores.

Notes
Local testing limitation: Only unit tests were verified locally (all passed). The actual evaluation flow with Phoenix couldn't be tested due to Phoenix server not running locally. CI should have proper Phoenix infrastructure.
Extensibility benefit: Adding new evaluation criteria now only requires editing the prompt template, not writing new Python functions and deploying code.
Requested by: @aaronsteers
Devin session: https://app.devin.ai/sessions/1552f02352e44ad6835b6a3b6b1577da
Summary by CodeRabbit