chore(evals): Unified LLM-based connector evaluation #140
base: main
Conversation
- Consolidates readiness_eval and streams_eval into a single unified_eval
- Uses LLM for all evaluation criteria (readiness + streams)
- Reduces code complexity and simplifies maintenance
- Easier to extend: new criteria via prompt edits vs code changes
- Maintains backward compatibility with Phoenix reporting

Benefits:
- Single consistent evaluation pattern
- No manual YAML parsing or set operations
- Natural language explanations in structured output
- Simpler to add new evaluation dimensions

Technical changes:
- evaluators.py: Replaced 2 evaluators with unified_eval function
- phoenix_run.py: Updated imports and evaluator list
- Uses structured LLM output with READINESS/STREAMS format
- Returns dict with separate scores for Phoenix compatibility

Co-Authored-By: AJ Steers <[email protected]>
Original prompt from AJ Steers

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:
📝 Walkthrough

Replaces separate readiness and streams evaluators with a single LLM-driven unified evaluator.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
actor Runner as phoenix_run.py
participant Eval as evaluators.unified_eval
participant LLM as UNIFIED_EVAL_MODEL
participant OTel as Logging/OTel
Runner->>Eval: unified_eval(expected, output)
Eval->>LLM: Single prompt with readiness_report, manifest, expected_streams
LLM-->>Eval: Text response with READINESS and STREAMS fields
Eval->>OTel: Tag readiness_score, streams_score, expected_streams
Eval-->>Runner: { "readiness": 0|1, "streams": 0.0–1.0 }
rect rgba(240,250,255,0.6)
note over Eval,LLM: Single LLM call replaces separate readiness/streams flows
end
alt Missing artifacts or parse error
Eval->>OTel: Warn and tag error fields
Eval-->>Runner: { "readiness": 0, "streams": 0.0 }
    end
```
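For orientation, here is a condensed Python sketch of the flow the diagram describes, pieced together from the diffs quoted later in this review. The prompt text, the artifact lookup on `output`, and the `unified_eval_sketch` name are illustrative assumptions, not the verbatim `evaluators.py` implementation.

```python
import json
import logging

import pandas as pd
from phoenix.evals import OpenAIModel, llm_classify

logger = logging.getLogger(__name__)

UNIFIED_EVAL_MODEL = "gpt-4o"  # default discussed (and questioned) later in this review

# Illustrative stand-in for UNIFIED_EVAL_TEMPLATE; the real prompt lives in evaluators.py.
UNIFIED_EVAL_TEMPLATE = """\
Readiness report:
{readiness_report}

Manifest YAML:
{manifest}

Expected streams (JSON list): {expected_streams}

Respond with exactly two lines:
READINESS: <PASSED|FAILED>
STREAMS: <fraction of expected streams present, 0.0-1.0>
"""


def unified_eval_sketch(expected: dict, output: dict) -> dict[str, float]:
    """Score readiness (0/1) and stream coverage (0.0-1.0) with a single LLM call."""
    scores = {"readiness": 0.0, "streams": 0.0}
    try:
        expected_obj = json.loads(expected.get("expected", "{}"))
        eval_df = llm_classify(
            model=OpenAIModel(model=UNIFIED_EVAL_MODEL),
            data=pd.DataFrame(
                [
                    {
                        "readiness_report": output.get("readiness_report", "Not available"),
                        "manifest": output.get("manifest", "Not available"),
                        "expected_streams": json.dumps(expected_obj.get("expected_streams", [])),
                    }
                ]
            ),
            template=UNIFIED_EVAL_TEMPLATE,  # raw template; llm_classify fills the placeholders
            rails=None,
            provide_explanation=True,
        )
        response_text = str(eval_df["label"][0])

        # Parse the two-line READINESS/STREAMS response into separate scores.
        for line in response_text.splitlines():
            key, _, value = line.strip().partition(":")
            if key.strip().upper() == "READINESS":
                scores["readiness"] = 1.0 if value.strip().upper() == "PASSED" else 0.0
            elif key.strip().upper() == "STREAMS":
                try:
                    scores["streams"] = max(0.0, min(1.0, float(value)))
                except ValueError:
                    logger.warning("Could not parse streams score from: %s", value)
    except Exception:
        # Any missing artifact or parse failure falls back to zero scores.
        logger.exception("unified_eval failed; returning fallback scores")
        return {"readiness": 0.0, "streams": 0.0}
    return scores
```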
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs
Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
👋 Welcome to the Airbyte Connector Builder MCP!

Thank you for your contribution! Here are some helpful tips and reminders for your convenience.

Testing This Branch via MCP

To test the changes in this specific branch with an MCP client like Claude Desktop, use the following configuration:

```json
{
  "mcpServers": {
    "connector-builder-mcp-dev": {
      "command": "uvx",
      "args": ["--from", "git+https://github.com/airbytehq/connector-builder-mcp.git@devin/1760319110-llm-unified-eval", "connector-builder-mcp"]
    }
  }
}
```

Testing This Branch via CLI

You can test this version of the MCP Server using the following CLI snippet:

```bash
# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/connector-builder-mcp.git@devin/1760319110-llm-unified-eval#egg=airbyte-connector-builder-mcp' --help
```

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

AI Builder Evaluations

AI builder evaluations run automatically under the following conditions:

A set of standardized evaluations also run on a schedule (Mon/Wed/Fri at midnight UTC) and can be manually triggered via workflow dispatch.

Helpful Resources

If you have any questions, feel free to ask in the PR comments or join our Slack community.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
connector_builder_agents/src/evals/phoenix_run.py (1)
54-70: Support multi-metric evaluator output

- Verify that Phoenix’s run_experiment accepts and persists all keys from a dict-returning unified_eval.
- In connector_builder_agents/src/evals/summary.py (generate_markdown_summary, around lines 613–631), replace the single-score extraction with iteration over all key/value pairs in eval_run.result (a sketch follows below).
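A minimal sketch of that iteration, assuming eval_run.result holds the evaluator's dict and eval_run.name its label (both names inferred from the comment above, not verified against summary.py):

```python
def format_eval_scores(eval_run) -> list[str]:
    """Render one summary line per metric instead of a single score."""
    result = eval_run.result
    if not isinstance(result, dict):
        result = {"score": result}  # keep backward compatibility with single-score evaluators
    return [f"- {eval_run.name} / {metric}: {value}" for metric, value in sorted(result.items())]
```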
🧹 Nitpick comments (6)
connector_builder_agents/src/evals/phoenix_run.py (2)
59-74: Close AsyncClient to avoid resource leaks

Use the async context manager to ensure proper cleanup.

```diff
-    client = AsyncClient()
-    logger.info(f"Starting experiment: {experiment_name}")
-    experiment = await client.experiments.run_experiment(
+    async with AsyncClient() as client:
+        logger.info(f"Starting experiment: {experiment_name}")
+        experiment = await client.experiments.run_experiment(
            dataset=dataset,
            task=run_connector_build_task,
            evaluators=evaluators,
            experiment_name=experiment_name,
            experiment_metadata={
                "developer_model": EVAL_DEVELOPER_MODEL,
                "manager_model": EVAL_MANAGER_MODEL,
                "unified_eval_model": UNIFIED_EVAL_MODEL,
            },
            timeout=1800,
-    )
+        )
```
85-85: Avoid sys.exit inside async context

Raise SystemExit to allow upstream callers to handle shutdown cleanly.

```diff
-    sys.exit(1)
+    raise SystemExit(1)
```

connector_builder_agents/src/evals/evaluators.py (4)
114-116: Handle both dict and JSON-string forms of expected and guard bad JSON

Current code assumes expected["expected"] is a JSON string. Make it robust to dict or invalid JSON.

```diff
-    expected_obj = json.loads(expected.get("expected", "{}"))
-    expected_streams = expected_obj.get("expected_streams", [])
+    exp = expected.get("expected", {})
+    if isinstance(exp, str):
+        try:
+            expected_obj = json.loads(exp)
+        except json.JSONDecodeError:
+            logger.warning("Invalid JSON in 'expected'; defaulting to empty dict")
+            expected_obj = {}
+    elif isinstance(exp, dict):
+        expected_obj = exp
+    else:
+        expected_obj = {}
+    expected_streams = expected_obj.get("expected_streams") or []
```
126-147: Align llm_classify usage: pass templated prompt, not pre-formatted string

llm_classify typically expands placeholders from the DataFrame. Remove pre-formatting to avoid duplication and keep the API idiomatic.

```diff
-    prompt = UNIFIED_EVAL_TEMPLATE.format(
-        readiness_report=readiness_report,
-        manifest=manifest,
-        expected_streams=json.dumps(expected_streams),
-    )
-
     try:
         eval_df = llm_classify(
             model=OpenAIModel(model=UNIFIED_EVAL_MODEL),
             data=pd.DataFrame(
                 [
                     {
                         "readiness_report": readiness_report,
                         "manifest": manifest,
                         "expected_streams": json.dumps(expected_streams),
                     }
                 ]
             ),
-            template=prompt,
+            template=UNIFIED_EVAL_TEMPLATE,
             rails=None,
             provide_explanation=True,
         )
```
151-168: Be defensive when extracting the LLM response text

Assuming a "label" column may break. Fall back to common alternatives and fail fast if none exist.

```diff
-    response_text = eval_df["label"][0]
+    # Try common columns in order of likelihood
+    if "label" in eval_df.columns:
+        response_text = str(eval_df.at[0, "label"])
+    elif "output" in eval_df.columns:
+        response_text = str(eval_df.at[0, "output"])
+    elif "response" in eval_df.columns:
+        response_text = str(eval_df.at[0, "response"])
+    else:
+        logger.warning(f"No expected text column in eval_df; columns={list(eval_df.columns)}")
+        return {"readiness": 0.0, "streams": 0.0}
```
17-17: Make the model configurable via env with a safe default

Allows switching models without code changes.

```diff
+import os
@@
-UNIFIED_EVAL_MODEL = "gpt-4o"
+UNIFIED_EVAL_MODEL = os.getenv("UNIFIED_EVAL_MODEL", "gpt-4o")
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- connector_builder_agents/src/evals/evaluators.py (3 hunks)
- connector_builder_agents/src/evals/phoenix_run.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
connector_builder_agents/src/evals/phoenix_run.py (1)
- connector_builder_agents/src/evals/evaluators.py (1): unified_eval (89-180)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pytest (Fast)
🔇 Additional comments (1)
connector_builder_agents/src/evals/evaluators.py (1)
121-125: OTel attribute array compatibility

expected_streams is a list; most exporters support array attributes, but some don’t. If you observe drops, stringify.

Option if needed:

```diff
-    span.set_attribute("expected_streams", expected_streams)
+    span.set_attribute("expected_streams", expected_streams)  # or ",".join(expected_streams)
```
- Enhanced module-level docstring with detailed description
- Added docstring for UNIFIED_EVAL_MODEL constant
- Expanded unified_eval function docstring with:
  - Detailed description of evaluation approach
  - Comprehensive Args and Returns documentation
  - Usage example
- Addresses CodeRabbit feedback on insufficient docstring coverage

Co-Authored-By: AJ Steers <[email protected]>
Actionable comments posted: 1
🧹 Nitpick comments (3)
connector_builder_agents/src/evals/evaluators.py (3)
134-141: Consider validating artifact content.

The function checks if artifacts are missing ("Not available") but doesn't validate that they contain meaningful content (e.g., non-empty strings). An empty readiness report or manifest could lead to uninformative LLM evaluations that default to 0.0 scores without clear diagnostic information.

Consider adding validation after line 141:

```python
if readiness_report == "Not available":
    logger.warning("No readiness report found")
elif not readiness_report.strip():
    logger.warning("Readiness report is empty")

if manifest == "Not available":
    logger.warning("No manifest found")
elif not manifest.strip():
    logger.warning("Manifest is empty")
```
192-197: Silent clamping may hide LLM format issues.

When the LLM returns a streams score outside the 0.0-1.0 range, it's silently clamped (line 194). While this prevents invalid scores, it also hides potential issues where the LLM isn't following the expected format (e.g., returning a percentage like "67" instead of "0.67").

Consider logging when clamping occurs to aid debugging:

```diff
 try:
     streams_score = float(streams_value)
+    if streams_score < 0.0 or streams_score > 1.0:
+        logger.warning(f"Streams score {streams_score} outside valid range [0.0, 1.0], clamping")
     streams_score = max(0.0, min(1.0, streams_score))
 except ValueError:
     logger.warning(f"Could not parse streams score from: {streams_value}")
     streams_score = 0.0
```
26-27: Pin GPT-4o to a specific version

Using the generic "gpt-4o" model may introduce breaking changes when OpenAI rolls out updates. Fetch the current model list (for example, via openai api models.list) and replace the constant with a version-suffixed identifier (e.g., "gpt-4o-2024-08-06").
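A minimal way to act on this note (and the earlier env-override nitpick) might look like the following; the dated identifier is only an example and should be checked against the current model list:

```python
import os

# Pin a dated snapshot for reproducible evals, but allow an override via env var.
UNIFIED_EVAL_MODEL = os.getenv("UNIFIED_EVAL_MODEL", "gpt-4o-2024-08-06")
```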
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)

- connector_builder_agents/src/evals/evaluators.py (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pytest (Fast)
🔇 Additional comments (4)
connector_builder_agents/src/evals/evaluators.py (4)
29-96: Verify LLM accuracy for YAML stream extraction.

The prompt instructs the LLM to parse YAML and extract stream names from the streams: section. This replaces the previous deterministic YAML parsing approach (using the yaml library and set intersection) with an LLM-based approach, which may be less reliable for structured data extraction.

While this simplifies the codebase, LLMs can misparse edge cases such as:
- Complex YAML with anchors, aliases, or multiline strings
- Streams with special characters or unusual formatting
- Malformed YAML that should fail parsing
Consider validating the accuracy of stream extraction by running the full evaluation suite and comparing results against the previous programmatic approach. If accuracy degrades significantly, consider a hybrid approach: use programmatic YAML parsing for stream extraction while keeping the LLM-based readiness evaluation.
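If the hybrid route is taken, the deterministic half could look roughly like this sketch; it assumes the manifest declares a top-level streams: list of entries with a name key, and it is not the original streams_eval code:

```python
import yaml


def deterministic_streams_score(manifest_yaml: str, expected_streams: list[str]) -> float:
    """Fraction of expected stream names declared in the manifest's streams: section."""
    if not expected_streams:
        return 0.0
    try:
        manifest = yaml.safe_load(manifest_yaml) or {}
    except yaml.YAMLError:
        return 0.0  # malformed YAML scores zero instead of guessing
    declared = {
        stream.get("name")
        for stream in manifest.get("streams", [])
        if isinstance(stream, dict)
    }
    return len(declared & set(expected_streams)) / len(expected_streams)
```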
162-176: LGTM: Well-structured LLM invocation.

The use of llm_classify with a DataFrame input and the structured prompt template follows Phoenix best practices. The provide_explanation=True parameter is helpful for debugging evaluation decisions.
185-190: LGTM: Robust response parsing with defaults.

The line-by-line parsing approach is resilient to variations in LLM output formatting (e.g., extra whitespace, additional lines). Using defaults of 0.0 for both scores when parsing fails is a sensible defensive strategy.
207-209: LGTM: Appropriate exception handling.

Catching all exceptions and returning default scores ensures the evaluation doesn't crash, while exc_info=True provides sufficient debugging information. This is appropriate for an evaluation framework where individual evaluator failures shouldn't halt the entire evaluation run.
Phoenix's llm_classify expects an unformatted template with placeholders that it fills from DataFrame columns. Pre-formatting the template caused 'Missing template variable' errors. Now passing raw template directly. Also added None check for response_text to handle edge cases gracefully. Co-Authored-By: AJ Steers <[email protected]>
Actionable comments posted: 0
🧹 Nitpick comments (1)
connector_builder_agents/src/evals/evaluators.py (1)
183-195: Consider logging when LLM output format is unexpected.

The parsing logic safely defaults to 0.0 when the LLM response doesn't contain "READINESS:" or "STREAMS:" lines, but doesn't explicitly log this case. Adding a log message when neither field is successfully parsed would help diagnose LLM output issues during debugging.

For example, after the parsing loop:

```python
if readiness_score == 0.0 and streams_score == 0.0:
    logger.warning(
        f"LLM response did not match expected format. Response: {response_text[:200]}"
    )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)

- connector_builder_agents/src/evals/evaluators.py (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pytest (Fast)
🔇 Additional comments (2)
connector_builder_agents/src/evals/evaluators.py (2)
134-141: Verify design intent for missing artifact handling.

The code logs warnings when artifacts are missing but continues evaluation, passing "Not available" strings to the LLM template. If both artifacts are missing, the evaluation is essentially meaningless and returning {"readiness": 0.0, "streams": 0.0} immediately would be more explicit. Consider whether this early-exit behavior is preferred, or if the current approach of letting the LLM interpret partial data is intentional.
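If the early-exit behavior is preferred, the guard could be as small as the following sketch (variable names follow the snippets quoted in this review):

```python
if readiness_report == "Not available" and manifest == "Not available":
    logger.warning("Both readiness report and manifest are missing; skipping LLM evaluation")
    return {"readiness": 0.0, "streams": 0.0}
```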
99-207: Unified approach successfully simplifies evaluation architecture.

The consolidation of readiness and streams evaluators into a single LLM-based function achieves the stated objectives: reduces code complexity, eliminates the YAML parsing dependency, and maintains clear separation of concerns via the returned dict structure. The implementation includes appropriate error handling, logging, and OpenTelemetry instrumentation.
Addresses CodeRabbit feedback to gracefully handle JSONDecodeError when parsing the expected criteria. Returns fallback scores (0.0, 0.0) with error logging if the JSON is malformed, preventing crashes. Co-Authored-By: AJ Steers <[email protected]>
Actionable comments posted: 2
```python
try:
    expected_obj = json.loads(expected.get("expected", "{}"))
except json.JSONDecodeError as e:
    logger.error(f"Failed to parse expected JSON: {e}", exc_info=True)
    return {"readiness": 0.0, "streams": 0.0}
expected_streams = expected_obj.get("expected_streams", [])
```
Guard against expected being missing or non-string.

expected.get("expected", "{}") assumes expected is a dict with a JSON string. If expected is None, or if the value is None/non-string (common when upstream omits the payload), this will raise AttributeError/TypeError and crash the evaluator before we can fall back to zeros. We should defensively coerce expected to {} and catch TypeError (and ValueError) alongside JSONDecodeError.
```diff
-    try:
-        expected_obj = json.loads(expected.get("expected", "{}"))
-    except json.JSONDecodeError as e:
+    expected_payload = (expected or {}).get("expected", "{}")
+    try:
+        expected_obj = json.loads(expected_payload)
+    except (json.JSONDecodeError, TypeError, ValueError) as e:
         logger.error(f"Failed to parse expected JSON: {e}", exc_info=True)
         return {"readiness": 0.0, "streams": 0.0}
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-try:
-    expected_obj = json.loads(expected.get("expected", "{}"))
-except json.JSONDecodeError as e:
-    logger.error(f"Failed to parse expected JSON: {e}", exc_info=True)
-    return {"readiness": 0.0, "streams": 0.0}
-expected_streams = expected_obj.get("expected_streams", [])
+expected_payload = (expected or {}).get("expected", "{}")
+try:
+    expected_obj = json.loads(expected_payload)
+except (json.JSONDecodeError, TypeError, ValueError) as e:
+    logger.error(f"Failed to parse expected JSON: {e}", exc_info=True)
+    return {"readiness": 0.0, "streams": 0.0}
+expected_streams = expected_obj.get("expected_streams", [])
```
```python
for line in response_text.strip().split("\n"):
    line = line.strip()
    if line.startswith("READINESS:"):
        readiness_value = line.split(":", 1)[1].strip().upper()
        readiness_score = 1.0 if readiness_value == "PASSED" else 0.0
    elif line.startswith("STREAMS:"):
        streams_value = line.split(":", 1)[1].strip()
        try:
            streams_score = float(streams_value)
            streams_score = max(0.0, min(1.0, streams_score))
        except ValueError:
            logger.warning(f"Could not parse streams score from: {streams_value}")
            streams_score = 0.0
```
Make the READINESS/STREAMS parsing case-insensitive.

LLMs often capitalize only the first letter ("Readiness:", "Streams:") despite the prompt. With the current strict startswith("READINESS:")/startswith("STREAMS:") checks, those valid responses will be ignored and both scores remain 0.0, corrupting the evaluation output. Please normalize the prefix (e.g., compare on line.upper().startswith("READINESS:")) before extracting the value.
```diff
-    for line in response_text.strip().split("\n"):
-        line = line.strip()
-        if line.startswith("READINESS:"):
+    for line in response_text.strip().split("\n"):
+        line = line.strip()
+        upper_line = line.upper()
+        if upper_line.startswith("READINESS:"):
             readiness_value = line.split(":", 1)[1].strip().upper()
             readiness_score = 1.0 if readiness_value == "PASSED" else 0.0
-        elif line.startswith("STREAMS:"):
+        elif upper_line.startswith("STREAMS:"):
             streams_value = line.split(":", 1)[1].strip()
```
That makes the parser resilient to benign casing variations while still enforcing the format for scoring.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-for line in response_text.strip().split("\n"):
-    line = line.strip()
-    if line.startswith("READINESS:"):
-        readiness_value = line.split(":", 1)[1].strip().upper()
-        readiness_score = 1.0 if readiness_value == "PASSED" else 0.0
-    elif line.startswith("STREAMS:"):
-        streams_value = line.split(":", 1)[1].strip()
-        try:
-            streams_score = float(streams_value)
-            streams_score = max(0.0, min(1.0, streams_score))
-        except ValueError:
-            logger.warning(f"Could not parse streams score from: {streams_value}")
-            streams_score = 0.0
+for line in response_text.strip().split("\n"):
+    line = line.strip()
+    upper_line = line.upper()
+    if upper_line.startswith("READINESS:"):
+        readiness_value = line.split(":", 1)[1].strip().upper()
+        readiness_score = 1.0 if readiness_value == "PASSED" else 0.0
+    elif upper_line.startswith("STREAMS:"):
+        streams_value = line.split(":", 1)[1].strip()
+        try:
+            streams_score = float(streams_value)
+            streams_score = max(0.0, min(1.0, streams_score))
+        except ValueError:
+            logger.warning(f"Could not parse streams score from: {streams_value}")
+            streams_score = 0.0
```
🤖 Prompt for AI Agents
In connector_builder_agents/src/evals/evaluators.py around lines 187 to 200, the
parser only matches "READINESS:" and "STREAMS:" exactly which misses valid
variants like "Readiness:" or "streams:"; make the prefix checks
case-insensitive by normalizing the line (e.g., compute an uppercased copy or
use line.lower()) before checking startswith, then extract the value from the
original line (or split using the same normalized index) and proceed with the
existing parsing and error handling so both READINESS and STREAMS are detected
regardless of casing.
refactor: Unified LLM-based connector evaluation
Summary
Replaced the hybrid evaluation approach (1 LLM + 1 programmatic evaluator) with a single unified LLM-based evaluator to reduce complexity and improve maintainability.
Before:
- readiness_eval: LLM-based binary classifier (PASSED/FAILED)
- streams_eval: Programmatic YAML parsing + set intersection for stream matching

After:
- unified_eval: Single LLM evaluator handling both readiness and stream matching (wiring sketch below)
- Returns {"readiness": 0.0-1.0, "streams": 0.0-1.0}
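In phoenix_run.py the wiring change amounts to roughly the following sketch (import path inferred from the file layout; other run_experiment arguments unchanged):

```python
from connector_builder_agents.src.evals.evaluators import unified_eval

# Before: evaluators = [readiness_eval, streams_eval]  (one int score + one float score)
# After: a single dict-returning evaluator carries both scores.
evaluators = [unified_eval]
```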
Key Changes:
- evaluators.py: Single function replaces two separate evaluators
- phoenix_run.py: Uses unified evaluator in evaluators list
- Removed the yaml import (no longer needed)

Review & Testing Checklist for Human
- Run full eval suite and verify LLM output format: The new evaluator parses LLM responses expecting the exact format READINESS: <PASSED|FAILED> and STREAMS: <float>. If the LLM deviates, scores default to 0.0. Test with poe evals run and check that scores are reasonable.
- Validate stream counting accuracy: The old streams_eval was 100% deterministic (exact YAML parsing + set intersection). The new approach asks the LLM to count streams from YAML, which could have variance or errors. Compare results on a few connectors to ensure acceptable accuracy.
- Verify Phoenix compatibility: Changed from two evaluators (returning int/float) to one evaluator (returning a dict). Confirm that Phoenix experiment tracking and summary generation (src/evals/summary.py) correctly handle the dict return type and display both scores.

Notes
Local testing limitation: Only unit tests were verified locally (all passed). The actual evaluation flow with Phoenix couldn't be tested due to Phoenix server not running locally. CI should have proper Phoenix infrastructure.
Extensibility benefit: Adding new evaluation criteria now only requires editing the prompt template, not writing new Python functions and deploying code.
Requested by: @aaronsteers
Devin session: https://app.devin.ai/sessions/1552f02352e44ad6835b6a3b6b1577da
Summary by CodeRabbit