
Conversation


@aaronsteers aaronsteers commented Oct 13, 2025

refactor: Unified LLM-based connector evaluation

Summary

Replaced the hybrid evaluation approach (1 LLM + 1 programmatic evaluator) with a single unified LLM-based evaluator to reduce complexity and improve maintainability.

Before:

  • readiness_eval: LLM-based binary classifier (PASSED/FAILED)
  • streams_eval: Programmatic YAML parsing + set intersection for stream matching
  • 105 lines, manual parsing logic, yaml dependency

After:

  • unified_eval: Single LLM evaluator handling both readiness and stream matching
  • ~70 lines (-33% code), removed yaml parsing and set operations
  • Returns dict with separate scores: {"readiness": 0.0-1.0, "streams": 0.0-1.0}

Key Changes:

  1. Consolidated evaluators.py: Single function replaces two separate evaluators
  2. Updated phoenix_run.py: Uses unified evaluator in evaluators list
  3. Removed yaml import - no longer needed
  4. LLM now handles stream counting via structured prompt instead of programmatic logic
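
For orientation, here is a minimal contract sketch of the consolidated evaluator, using the names discussed in this PR; the field names are taken from the review discussion and may differ slightly, and the body is a placeholder rather than the actual implementation:

# Contract sketch only -- the real implementation lives in
# connector_builder_agents/src/evals/evaluators.py.
import json


def unified_eval(expected: dict, output: dict) -> dict:
    """Score readiness and stream coverage in a single LLM pass."""
    # Artifacts produced by the build task, plus the expected criteria.
    readiness_report = output.get("readiness_report", "Not available")
    manifest = output.get("manifest", "Not available")
    expected_streams = json.loads(expected.get("expected", "{}")).get("expected_streams", [])

    # The real code builds one structured prompt from the three values above,
    # calls the eval LLM, and parses its READINESS/STREAMS lines into scores.
    # Placeholder return showing the shape that Phoenix records:
    return {"readiness": 0.0, "streams": 0.0}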

Review & Testing Checklist for Human

⚠️ Risk Level: YELLOW - 3 critical items to verify

  • Run full eval suite and verify LLM output format: The new evaluator parses LLM responses expecting the exact format READINESS: <PASSED|FAILED> and STREAMS: <float>. If the LLM deviates, scores default to 0.0. Test with poe evals run and check that scores are reasonable (a format example follows this checklist).

  • Validate stream counting accuracy: The old streams_eval was 100% deterministic (exact YAML parsing + set intersection). The new approach asks the LLM to count streams from YAML, which could have variance or errors. Compare results on a few connectors to ensure acceptable accuracy.

  • Verify Phoenix compatibility: Changed from two evaluators (returning int/float) to one evaluator (returning dict). Confirm that Phoenix experiment tracking and summary generation (src/evals/summary.py) correctly handle the dict return type and display both scores.
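
For reference, a well-formed response and its expected mapping (the values here are made up for illustration):

sample_response = "READINESS: PASSED\nSTREAMS: 0.67"  # hypothetical LLM output
# Per the documented parsing rules this should yield:
expected_result = {"readiness": 1.0, "streams": 0.67}
# while a malformed reply (e.g. "Looks good!") falls back to {"readiness": 0.0, "streams": 0.0}.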

Notes

  • Local testing limitation: Only unit tests were verified locally (all passed). The actual evaluation flow with Phoenix couldn't be tested due to Phoenix server not running locally. CI should have proper Phoenix infrastructure.

  • Extensibility benefit: Adding new evaluation criteria now only requires editing the prompt template, not writing new Python functions and deploying code.

  • Requested by: @aaronsteers

  • Devin session: https://app.devin.ai/sessions/1552f02352e44ad6835b6a3b6b1577da

Summary by CodeRabbit

  • New Features
    • Single-pass evaluation returns both readiness and stream-match scores together.
  • Refactor
    • Consolidated separate evaluators into one unified LLM-driven evaluation flow and updated experiment metadata to reference the unified approach.
  • Bug Fixes
    • More defensive handling when artifacts or inputs are missing, preventing silent failures.
  • Chores
    • Expanded logging and telemetry tagging for both scores and expected streams.

- Consolidates readiness_eval and streams_eval into single unified_eval
- Uses LLM for all evaluation criteria (readiness + streams)
- Reduces code complexity and simplifies maintenance
- Easier to extend: new criteria via prompt edits vs code changes
- Maintains backward compatibility with Phoenix reporting

Benefits:
- Single consistent evaluation pattern
- No manual YAML parsing or set operations
- Natural language explanations in structured output
- Simpler to add new evaluation dimensions

Technical changes:
- evaluators.py: Replaced 2 evaluators with unified_eval function
- phoenix_run.py: Updated imports and evaluator list
- Uses structured LLM output with READINESS/STREAMS format
- Returns dict with separate scores for Phoenix compatibility

Co-Authored-By: AJ Steers <[email protected]>

Original prompt from AJ Steers
@Devin - Review the eval strategy of connector-builder-mcp. Let's consider replacing the programmatic approach with an llm-based approach. E.g. instead of us parsing the connector readiness report and other artifacts, and having code to programmatically compare those artifacts with the expected outputs, what if we just give the llm the expected output criteria, the connector readiness report and artifacts, and a specific prompt something like "evaluate and give a score equal to the correct values divided by the number of expected values"
Thread URL: https://airbytehq-team.slack.com/archives/D089P0UPVT4/p1760243507791809?thread_ts=1760243507.791809


🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring


coderabbitai bot commented Oct 13, 2025

📝 Walkthrough

Replaces separate readiness and streams evaluators with a single LLM-driven unified_eval(expected: dict, output: dict) -> dict, updates phoenix_run imports/metadata to use the unified evaluator/model, and centralizes prompt/response parsing, logging, and OpenTelemetry tagging for both scores.

Changes

  • Unified evaluator — connector_builder_agents/src/evals/evaluators.py
    Removes readiness_eval and streams_eval; adds unified_eval(expected: dict, output: dict) -> dict, UNIFIED_EVAL_MODEL, and UNIFIED_EVAL_TEMPLATE; centralizes single-LLM prompt/response parsing to produce both readiness and streams scores; adds defensive logging, OTel span tagging, input validation, and an updated module docstring/public interface.
  • Phoenix run integration — connector_builder_agents/src/evals/phoenix_run.py
    Replaces imports of READINESS_EVAL_MODEL, readiness_eval, and streams_eval with UNIFIED_EVAL_MODEL and unified_eval; uses only unified_eval in evaluator selection; updates the experiment metadata key from readiness_eval_model to unified_eval_model and its value to UNIFIED_EVAL_MODEL.
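
A condensed sketch of the resulting wiring in phoenix_run.py, based on the import and metadata changes summarized above; other arguments to the experiment run are elided, and EVAL_DEVELOPER_MODEL / EVAL_MANAGER_MODEL are assumed to be defined elsewhere in that module:

# Wiring sketch only -- mirrors the changes summarized above.
from .evaluators import UNIFIED_EVAL_MODEL, unified_eval

evaluators = [unified_eval]  # previously: [readiness_eval, streams_eval]

experiment_metadata = {
    "developer_model": EVAL_DEVELOPER_MODEL,
    "manager_model": EVAL_MANAGER_MODEL,
    "unified_eval_model": UNIFIED_EVAL_MODEL,  # key renamed from readiness_eval_model
}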

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Runner as phoenix_run.py
  participant Eval as evaluators.unified_eval
  participant LLM as UNIFIED_EVAL_MODEL
  participant OTel as Logging/OTel

  Runner->>Eval: unified_eval(expected, output)
  Eval->>LLM: Single prompt with readiness_report, manifest, expected_streams
  LLM-->>Eval: Text response with READINESS and STREAMS fields
  Eval->>OTel: Tag readiness_score, streams_score, expected_streams
  Eval-->>Runner: { "readiness": 0|1, "streams": 0.0–1.0 }

  rect rgba(240,250,255,0.6)
    note over Eval,LLM: Single LLM call replaces separate readiness/streams flows
  end

  alt Missing artifacts or parse error
    Eval->>OTel: Warn and tag error fields
    Eval-->>Runner: { "readiness": 0, "streams": 0.0 }
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

  • Description Check — ✅ Passed: Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title Check — ✅ Passed: The title succinctly captures the main change by indicating the shift to a unified LLM-based connector evaluation in the evals module, uses the conventional “chore(evals)” prefix to signal the scope, and avoids extraneous or vague language.
  • Docstring Coverage — ✅ Passed: Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.

Comment @coderabbitai help to get the list of available commands and usage tips.


👋 Welcome to the Airbyte Connector Builder MCP!

Thank you for your contribution! Here are some helpful tips and reminders for your convenience.

Testing This Branch via MCP

To test the changes in this specific branch with an MCP client like Claude Desktop, use the following configuration:

{
  "mcpServers": {
    "connector-builder-mcp-dev": {
      "command": "uvx",
      "args": ["--from", "git+https://github.com/airbytehq/connector-builder-mcp.git@devin/1760319110-llm-unified-eval", "connector-builder-mcp"]
    }
  }
}

Testing This Branch via CLI

You can test this version of the MCP Server using the following CLI snippet:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/connector-builder-mcp.git@devin/1760319110-llm-unified-eval#egg=airbyte-connector-builder-mcp' --help

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /build-connector - Builds the default connector on-demand using the AI builder
  • /build-connector prompt="<your prompt>" - Builds a connector on-demand using the AI builder
  • /poe <command> - Runs any poe command in the uv virtual environment

AI Builder Evaluations

AI builder evaluations run automatically under the following conditions:

  • When a PR is marked as "ready for review"
  • When a PR is reopened

A set of standardized evaluations also run on a schedule (Mon/Wed/Fri at midnight UTC) and can be manually triggered via workflow dispatch.

Helpful Resources

If you have any questions, feel free to ask in the PR comments or join our Slack community.



github-actions bot commented Oct 13, 2025

PyTest Results (Fast)

0 tests  ±0   0 ✅ ±0   0s ⏱️ ±0s
0 suites ±0   0 💤 ±0 
0 files   ±0   0 ❌ ±0 

Results for commit 781cd4a. ± Comparison against base commit 45b1b8c.

♻️ This comment has been updated with latest results.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
connector_builder_agents/src/evals/phoenix_run.py (1)

54-70: Support multi-metric evaluator output

  • Verify that Phoenix’s run_experiment accepts and persists all keys from a dict-returning unified_eval.
  • In connector_builder_agents/src/evals/summary.py (generate_markdown_summary, around lines 613–631), replace the single-score extraction with iteration over all key/value pairs in eval_run.result.
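
A sketch of the iteration being suggested for generate_markdown_summary, assuming eval_run.result carries the dict returned by unified_eval (the attribute name follows the comment above; everything else is illustrative):

def render_eval_scores(eval_run) -> list[str]:
    """One markdown table row per metric, instead of a single hard-coded score."""
    result = eval_run.result or {}  # e.g. {"readiness": 1.0, "streams": 0.75}
    return [f"| {name} | {score:.2f} |" for name, score in result.items()]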
🧹 Nitpick comments (6)
connector_builder_agents/src/evals/phoenix_run.py (2)

59-74: Close AsyncClient to avoid resource leaks

Use the async context manager to ensure proper cleanup.

-        client = AsyncClient()
-        logger.info(f"Starting experiment: {experiment_name}")
-        experiment = await client.experiments.run_experiment(
+        async with AsyncClient() as client:
+            logger.info(f"Starting experiment: {experiment_name}")
+            experiment = await client.experiments.run_experiment(
             dataset=dataset,
             task=run_connector_build_task,
             evaluators=evaluators,
             experiment_name=experiment_name,
             experiment_metadata={
                 "developer_model": EVAL_DEVELOPER_MODEL,
                 "manager_model": EVAL_MANAGER_MODEL,
                 "unified_eval_model": UNIFIED_EVAL_MODEL,
             },
             timeout=1800,
-        )
+            )

85-85: Avoid sys.exit inside async context

Raise SystemExit to allow upstream callers to handle shutdown cleanly.

-            sys.exit(1)
+            raise SystemExit(1)
connector_builder_agents/src/evals/evaluators.py (4)

114-116: Handle both dict and JSON-string forms of expected and guard bad JSON

Current code assumes expected["expected"] is a JSON string. Make it robust to dict or invalid JSON.

-    expected_obj = json.loads(expected.get("expected", "{}"))
-    expected_streams = expected_obj.get("expected_streams", [])
+    exp = expected.get("expected", {})
+    if isinstance(exp, str):
+        try:
+            expected_obj = json.loads(exp)
+        except json.JSONDecodeError:
+            logger.warning("Invalid JSON in 'expected'; defaulting to empty dict")
+            expected_obj = {}
+    elif isinstance(exp, dict):
+        expected_obj = exp
+    else:
+        expected_obj = {}
+    expected_streams = expected_obj.get("expected_streams") or []

126-147: Align llm_classify usage: pass templated prompt, not pre-formatted string

llm_classify typically expands placeholders from the DataFrame. Remove pre-formatting to avoid duplication and keep the API idiomatic.

-    prompt = UNIFIED_EVAL_TEMPLATE.format(
-        readiness_report=readiness_report,
-        manifest=manifest,
-        expected_streams=json.dumps(expected_streams),
-    )
-
     try:
         eval_df = llm_classify(
             model=OpenAIModel(model=UNIFIED_EVAL_MODEL),
             data=pd.DataFrame(
                 [
                     {
                         "readiness_report": readiness_report,
                         "manifest": manifest,
                         "expected_streams": json.dumps(expected_streams),
                     }
                 ]
             ),
-            template=prompt,
+            template=UNIFIED_EVAL_TEMPLATE,
             rails=None,
             provide_explanation=True,
         )

151-168: Be defensive when extracting the LLM response text

Assuming a "label" column may break. Fall back to common alternatives and fail fast if none exist.

-        response_text = eval_df["label"][0]
+        # Try common columns in order of likelihood
+        if "label" in eval_df.columns:
+            response_text = str(eval_df.at[0, "label"])
+        elif "output" in eval_df.columns:
+            response_text = str(eval_df.at[0, "output"])
+        elif "response" in eval_df.columns:
+            response_text = str(eval_df.at[0, "response"])
+        else:
+            logger.warning(f"No expected text column in eval_df; columns={list(eval_df.columns)}")
+            return {"readiness": 0.0, "streams": 0.0}

17-17: Make the model configurable via env with a safe default

Allows switching models without code changes.

+import os
@@
-UNIFIED_EVAL_MODEL = "gpt-4o"
+UNIFIED_EVAL_MODEL = os.getenv("UNIFIED_EVAL_MODEL", "gpt-4o")
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 45b1b8c and f73f14c.

📒 Files selected for processing (2)
  • connector_builder_agents/src/evals/evaluators.py (3 hunks)
  • connector_builder_agents/src/evals/phoenix_run.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
connector_builder_agents/src/evals/phoenix_run.py (1)
connector_builder_agents/src/evals/evaluators.py (1)
  • unified_eval (89-180)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pytest (Fast)
🔇 Additional comments (1)
connector_builder_agents/src/evals/evaluators.py (1)

121-125: OTel attribute array compatibility

expected_streams is a list; most exporters support array attributes, but some don’t. If you observe drops, stringify.

Option if needed:

-    span.set_attribute("expected_streams", expected_streams)
+    span.set_attribute("expected_streams", expected_streams)  # or ",".join(expected_streams)

@devin-ai-integration devin-ai-integration bot changed the title from "refactor: Unified LLM-based connector evaluation" to "chore(evals): Unified LLM-based connector evaluation" on Oct 13, 2025
- Enhanced module-level docstring with detailed description
- Added docstring for UNIFIED_EVAL_MODEL constant
- Expanded unified_eval function docstring with:
  - Detailed description of evaluation approach
  - Comprehensive Args and Returns documentation
  - Usage example
- Addresses CodeRabbit feedback on insufficient docstring coverage

Co-Authored-By: AJ Steers <[email protected]>
@github-actions github-actions bot added the chore label Oct 13, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
connector_builder_agents/src/evals/evaluators.py (3)

134-141: Consider validating artifact content.

The function checks if artifacts are missing ("Not available") but doesn't validate that they contain meaningful content (e.g., non-empty strings). An empty readiness report or manifest could lead to uninformative LLM evaluations that default to 0.0 scores without clear diagnostic information.

Consider adding validation after line 141:

if readiness_report == "Not available":
    logger.warning("No readiness report found")
elif not readiness_report.strip():
    logger.warning("Readiness report is empty")

if manifest == "Not available":
    logger.warning("No manifest found")
elif not manifest.strip():
    logger.warning("Manifest is empty")

192-197: Silent clamping may hide LLM format issues.

When the LLM returns a streams score outside the 0.0-1.0 range, it's silently clamped (line 194). While this prevents invalid scores, it also hides potential issues where the LLM isn't following the expected format (e.g., returning a percentage like "67" instead of "0.67").

Consider logging when clamping occurs to aid debugging:

                 try:
                     streams_score = float(streams_value)
+                    if streams_score < 0.0 or streams_score > 1.0:
+                        logger.warning(f"Streams score {streams_score} outside valid range [0.0, 1.0], clamping")
                     streams_score = max(0.0, min(1.0, streams_score))
                 except ValueError:
                     logger.warning(f"Could not parse streams score from: {streams_value}")
                     streams_score = 0.0

26-27: Pin GPT-4o to a specific version
Using the generic "gpt-4o" model may introduce breaking changes when OpenAI rolls out updates. Fetch the current model list (for example, via openai api models.list) and replace the constant with a version-suffixed identifier (e.g., "gpt-4o-2024-08-06").

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f73f14c and 3aca2dd.

📒 Files selected for processing (1)
  • connector_builder_agents/src/evals/evaluators.py (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pytest (Fast)
🔇 Additional comments (4)
connector_builder_agents/src/evals/evaluators.py (4)

29-96: Verify LLM accuracy for YAML stream extraction.

The prompt instructs the LLM to parse YAML and extract stream names from the streams: section. This replaces the previous deterministic YAML parsing approach (using the yaml library and set intersection) with an LLM-based approach, which may be less reliable for structured data extraction.

While this simplifies the codebase, LLMs can misparse edge cases such as:

  • Complex YAML with anchors, aliases, or multiline strings
  • Streams with special characters or unusual formatting
  • Malformed YAML that should fail parsing

Consider validating the accuracy of stream extraction by running the full evaluation suite and comparing results against the previous programmatic approach. If accuracy degrades significantly, consider a hybrid approach: use programmatic YAML parsing for stream extraction while keeping the LLM-based readiness evaluation.
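
One way to run the suggested comparison is a small deterministic cross-check alongside the LLM score, assuming the manifest is YAML with a top-level streams list whose entries have name fields (a sketch for validation purposes, not part of this PR):

import yaml


def programmatic_streams_score(manifest_text: str, expected_streams: list[str]) -> float:
    """Deterministic baseline: fraction of expected streams present in the manifest."""
    try:
        manifest = yaml.safe_load(manifest_text) or {}
    except yaml.YAMLError:
        return 0.0
    actual = {s.get("name") for s in manifest.get("streams", []) if isinstance(s, dict)}
    if not expected_streams:
        return 0.0
    return len(actual & set(expected_streams)) / len(expected_streams)

# Compare this value against the LLM's "streams" score to spot large discrepancies.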


162-176: LGTM: Well-structured LLM invocation.

The use of llm_classify with a DataFrame input and the structured prompt template follows Phoenix best practices. The provide_explanation=True parameter is helpful for debugging evaluation decisions.


185-190: LGTM: Robust response parsing with defaults.

The line-by-line parsing approach is resilient to variations in LLM output formatting (e.g., extra whitespace, additional lines). Using defaults of 0.0 for both scores when parsing fails is a sensible defensive strategy.


207-209: LGTM: Appropriate exception handling.

Catching all exceptions and returning default scores ensures the evaluation doesn't crash, while exc_info=True provides sufficient debugging information. This is appropriate for an evaluation framework where individual evaluator failures shouldn't halt the entire evaluation run.

Phoenix's llm_classify expects an unformatted template with placeholders
that it fills from DataFrame columns. Pre-formatting the template caused
'Missing template variable' errors. Now passing raw template directly.

Also added None check for response_text to handle edge cases gracefully.

Co-Authored-By: AJ Steers <[email protected]>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
connector_builder_agents/src/evals/evaluators.py (1)

183-195: Consider logging when LLM output format is unexpected.

The parsing logic safely defaults to 0.0 when the LLM response doesn't contain "READINESS:" or "STREAMS:" lines, but doesn't explicitly log this case. Adding a log message when neither field is successfully parsed would help diagnose LLM output issues during debugging.

For example, after the parsing loop:

        if readiness_score == 0.0 and streams_score == 0.0:
            logger.warning(
                f"LLM response did not match expected format. Response: {response_text[:200]}"
            )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3aca2dd and 2105bea.

📒 Files selected for processing (1)
  • connector_builder_agents/src/evals/evaluators.py (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pytest (Fast)
🔇 Additional comments (2)
connector_builder_agents/src/evals/evaluators.py (2)

134-141: Verify design intent for missing artifact handling.

The code logs warnings when artifacts are missing but continues evaluation, passing "Not available" strings to the LLM template. If both artifacts are missing, the evaluation is essentially meaningless and returning {"readiness": 0.0, "streams": 0.0} immediately would be more explicit. Consider whether this early-exit behavior is preferred, or if the current approach of letting the LLM interpret partial data is intentional.


99-207: Unified approach successfully simplifies evaluation architecture.

The consolidation of readiness and streams evaluators into a single LLM-based function achieves the stated objectives: reduces code complexity, eliminates YAML parsing dependency, and maintains clear separation of concerns via the returned dict structure. The implementation includes appropriate error handling, logging, and OpenTelemetry instrumentation.
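
For reference, the span-tagging pattern referred to here generally looks like the following; the attribute names come from the walkthrough, while readiness_score, streams_score, and expected_streams are assumed to be in scope in the surrounding evaluator (a sketch, not the actual code):

from opentelemetry import trace

# Tag the current span so both scores are visible in tracing backends.
span = trace.get_current_span()
span.set_attribute("readiness_score", readiness_score)
span.set_attribute("streams_score", streams_score)
# Some exporters drop array attributes; joining the list keeps it a scalar.
span.set_attribute("expected_streams", ",".join(expected_streams))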

Addresses CodeRabbit feedback to gracefully handle JSONDecodeError
when parsing the expected criteria. Returns fallback scores (0.0, 0.0)
with error logging if the JSON is malformed, preventing crashes.

Co-Authored-By: AJ Steers <[email protected]>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2105bea and 781cd4a.

📒 Files selected for processing (1)
  • connector_builder_agents/src/evals/evaluators.py (3 hunks)

Comment on lines +143 to +148
    try:
        expected_obj = json.loads(expected.get("expected", "{}"))
    except json.JSONDecodeError as e:
        logger.error(f"Failed to parse expected JSON: {e}", exc_info=True)
        return {"readiness": 0.0, "streams": 0.0}
    expected_streams = expected_obj.get("expected_streams", [])


⚠️ Potential issue | 🔴 Critical

Guard against expected being missing or non-string.

expected.get("expected", "{}") assumes expected is a dict with a JSON string. If expected is None, or if the value is None/non-string (common when upstream omits the payload), this will raise AttributeError/TypeError and crash the evaluator before we can fall back to zeros. We should defensively coerce expected to {} and catch TypeError (and ValueError) alongside JSONDecodeError.

-    try:
-        expected_obj = json.loads(expected.get("expected", "{}"))
-    except json.JSONDecodeError as e:
+    expected_payload = (expected or {}).get("expected", "{}")
+    try:
+        expected_obj = json.loads(expected_payload)
+    except (json.JSONDecodeError, TypeError, ValueError) as e:
         logger.error(f"Failed to parse expected JSON: {e}", exc_info=True)
         return {"readiness": 0.0, "streams": 0.0}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    try:
-        expected_obj = json.loads(expected.get("expected", "{}"))
-    except json.JSONDecodeError as e:
-        logger.error(f"Failed to parse expected JSON: {e}", exc_info=True)
-        return {"readiness": 0.0, "streams": 0.0}
-    expected_streams = expected_obj.get("expected_streams", [])
+    expected_payload = (expected or {}).get("expected", "{}")
+    try:
+        expected_obj = json.loads(expected_payload)
+    except (json.JSONDecodeError, TypeError, ValueError) as e:
+        logger.error(f"Failed to parse expected JSON: {e}", exc_info=True)
+        return {"readiness": 0.0, "streams": 0.0}
+    expected_streams = expected_obj.get("expected_streams", [])

Comment on lines +187 to +200
        for line in response_text.strip().split("\n"):
            line = line.strip()
            if line.startswith("READINESS:"):
                readiness_value = line.split(":", 1)[1].strip().upper()
                readiness_score = 1.0 if readiness_value == "PASSED" else 0.0
            elif line.startswith("STREAMS:"):
                streams_value = line.split(":", 1)[1].strip()
                try:
                    streams_score = float(streams_value)
                    streams_score = max(0.0, min(1.0, streams_score))
                except ValueError:
                    logger.warning(f"Could not parse streams score from: {streams_value}")
                    streams_score = 0.0



⚠️ Potential issue | 🟠 Major

Make the READINESS/STREAMS parsing case-insensitive.

LLMs often capitalize only the first letter ("Readiness:", "Streams:") despite the prompt. With the current strict startswith("READINESS:")/"STREAMS:" checks, those valid responses will be ignored and both scores remain 0.0, corrupting the evaluation output. Please normalize the prefix (e.g., compare on line.upper().startswith("READINESS:")) before extracting the value.

-        for line in response_text.strip().split("\n"):
-            line = line.strip()
-            if line.startswith("READINESS:"):
+        for line in response_text.strip().split("\n"):
+            line = line.strip()
+            upper_line = line.upper()
+            if upper_line.startswith("READINESS:"):
                 readiness_value = line.split(":", 1)[1].strip().upper()
                 readiness_score = 1.0 if readiness_value == "PASSED" else 0.0
-            elif line.startswith("STREAMS:"):
+            elif upper_line.startswith("STREAMS:"):
                 streams_value = line.split(":", 1)[1].strip()

That makes the parser resilient to benign casing variations while still enforcing the format for scoring.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-        for line in response_text.strip().split("\n"):
-            line = line.strip()
-            if line.startswith("READINESS:"):
-                readiness_value = line.split(":", 1)[1].strip().upper()
-                readiness_score = 1.0 if readiness_value == "PASSED" else 0.0
-            elif line.startswith("STREAMS:"):
-                streams_value = line.split(":", 1)[1].strip()
-                try:
-                    streams_score = float(streams_value)
-                    streams_score = max(0.0, min(1.0, streams_score))
-                except ValueError:
-                    logger.warning(f"Could not parse streams score from: {streams_value}")
-                    streams_score = 0.0
+        for line in response_text.strip().split("\n"):
+            line = line.strip()
+            upper_line = line.upper()
+            if upper_line.startswith("READINESS:"):
+                readiness_value = line.split(":", 1)[1].strip().upper()
+                readiness_score = 1.0 if readiness_value == "PASSED" else 0.0
+            elif upper_line.startswith("STREAMS:"):
+                streams_value = line.split(":", 1)[1].strip()
+                try:
+                    streams_score = float(streams_value)
+                    streams_score = max(0.0, min(1.0, streams_score))
+                except ValueError:
+                    logger.warning(f"Could not parse streams score from: {streams_value}")
+                    streams_score = 0.0
🤖 Prompt for AI Agents
In connector_builder_agents/src/evals/evaluators.py around lines 187 to 200, the
parser only matches "READINESS:" and "STREAMS:" exactly which misses valid
variants like "Readiness:" or "streams:"; make the prefix checks
case-insensitive by normalizing the line (e.g., compute an uppercased copy or
use line.lower()) before checking startswith, then extract the value from the
original line (or split using the same normalized index) and proceed with the
existing parsing and error handling so both READINESS and STREAMS are detected
regardless of casing.

