feat: add AutoGen MCP agent to EvalHub pipeline (RHAIENG-4224)#93
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdds AutoGen MCP behavioral tests and fixtures, golden EvalHub fixtures, harness/tool-invocation parsing, MLflow client/adaptor updates, run-e2e and Containerfile changes, and supporting docs/config edits. ChangesAutoGen MCP Agent Testing & EvalHub Integration
sequenceDiagram
participant TestRunner as pytest (run_eval)
participant Agent as AutoGen MCP Agent (HTTP)
participant EvalHarness as evals.harness.runner
participant MLflow as mlflow
participant EvalHub as EvalHub Adapter
TestRunner->>Agent: send eval request (TaskConfig, stream=False)
Agent-->>TestRunner: return assistant response (choices / context / tool_invocations)
TestRunner->>EvalHarness: submit result for normalization
EvalHarness->>EvalHarness: _extract_tool_calls (choices → context → tool_invocations)
EvalHarness->>EvalHarness: _extract_response_text (choices[0].message.content or messages[])
EvalHarness->>MLflow: optionally enrich traces (MlflowClient())
EvalHarness->>EvalHub: write/submit eval job (eval-autogen-mcp-agent.yaml)
EvalHub-->>MLflow: run logging / experiment lookup
EvalHarness-->>TestRunner: return normalized TaskResult
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (2)
evals/evalhub_adapter/tests/run-e2e.sh (1)
555-568: 💤 Low valueReuse the JSON-escape pattern for
MLFLOW_TOKENinterpolation.The
MLFLOW_TOKENis interpolated as${MLFLOW_TOKEN}directly inside a single-quoted Python literal here, while step 4 (line 339) already establishes a saferMLFLOW_TOKEN_JSONescape. Tokens with single quotes or backslashes would break thispython3 -cblock. Same pattern would also harden the inlineMLFLOW_TRACKING_URIinterpolation.♻️ Proposed change
+ local mlflow_token_json mlflow_uri_json mlflow_exp_json ns_json + mlflow_token_json=$(python3 -c "import json,sys; sys.stdout.write(json.dumps(sys.argv[1]))" "${MLFLOW_TOKEN}") + mlflow_uri_json=$(python3 -c "import json,sys; sys.stdout.write(json.dumps(sys.argv[1]))" "${MLFLOW_TRACKING_URI}") + mlflow_exp_json=$(python3 -c "import json,sys; sys.stdout.write(json.dumps(sys.argv[1]))" "${MLFLOW_EXPERIMENT}") + ns_json=$(python3 -c "import json,sys; sys.stdout.write(json.dumps(sys.argv[1]))" "${OC_NAMESPACE}") local experiment_id experiment_id=$(python3 -c " import os, sys -os.environ.setdefault('MLFLOW_TRACKING_URI', '${MLFLOW_TRACKING_URI}') -os.environ.setdefault('MLFLOW_TRACKING_TOKEN', '${MLFLOW_TOKEN}') +os.environ.setdefault('MLFLOW_TRACKING_URI', ${mlflow_uri_json}) +os.environ.setdefault('MLFLOW_TRACKING_TOKEN', ${mlflow_token_json}) os.environ.setdefault('MLFLOW_TRACKING_INSECURE_TLS', 'true') -os.environ.setdefault('MLFLOW_WORKSPACE', '${OC_NAMESPACE}') +os.environ.setdefault('MLFLOW_WORKSPACE', ${ns_json}) import mlflow -exp = mlflow.MlflowClient().get_experiment_by_name('${MLFLOW_EXPERIMENT}') +exp = mlflow.MlflowClient().get_experiment_by_name(${mlflow_exp_json}) print(exp.experiment_id if exp else '') " 2>/dev/null || true)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@evals/evalhub_adapter/tests/run-e2e.sh` around lines 555 - 568, The python3 -c block that sets experiment_id interpolates ${MLFLOW_TOKEN} and ${MLFLOW_TRACKING_URI} directly into a single-quoted Python string (inside the experiment_id assignment), which can break when tokens contain single quotes or backslashes; update this block to reuse the existing JSON-escaped variables (e.g., MLFLOW_TOKEN_JSON and a similarly-escaped MLFLOW_TRACKING_URI_JSON) instead of raw ${MLFLOW_TOKEN}/${MLFLOW_TRACKING_URI} so the inline Python receives safe, escaped values; ensure you reference the same JSON-escape pattern used earlier and replace the interpolations inside the experiment_id python3 -c command accordingly.agents/autogen/mcp_agent/tests/behavioral/test_tool_usage.py (1)
19-19: 💤 Low valueImporting from
conftestis fragile.
from conftest import load_goldenonly works because pytest happens to put the test directory onsys.path. Since_tool_queries()runs at collection time, you can't use a fixture, but movingload_goldeninto a sibling helper module (e.g._golden.py) is more robust and decouples test data loading from pytest's conftest mechanism.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@agents/autogen/mcp_agent/tests/behavioral/test_tool_usage.py` at line 19, The test imports load_golden directly from conftest which is fragile at collection time; extract the load_golden helper into a sibling module named _golden (define a top-level function load_golden there), update the test and any collection-time code (e.g., the _tool_queries() usage) to import from _golden instead of conftest, and ensure the new module is located alongside the tests so the import is stable during collection.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@agents/autogen/mcp_agent/tests/behavioral/test_reliability.py`:
- Line 55: The long assignment to text_normalized exceeds the line-length lint
rule; split it into shorter statements by extracting the regex or the lowered
response into a separate variable and then call re.sub. For example, create a
variable like pattern = r"[\s,\u00a0\u2009\u202f]+" or lowered =
result.response.lower() and then do text_normalized = re.sub(pattern, "",
lowered) so the symbols to edit are the text_normalized assignment and its use
of re.sub/result.response.lower().
In `@agents/autogen/mcp_agent/tests/behavioral/test_tool_usage.py`:
- Line 53: The long-line ruff formatting error is caused by the single long
statement creating text_normalized in test_tool_usage.py; refactor the line by
breaking it into multiple shorter parts or running ruff format. Concretely,
split the statement that assigns text_normalized (the call to re.sub with the
regex r"[\s,\u00a0\u2009\u202f]+" and result.response.lower()) into two
lines—assign the pattern or the lowered response to a separate variable, then
call re.sub—so the function call and regex literal do not exceed the line length
(or simply run `ruff format` to apply the same fix).
In `@evals/harness/mlflow_client.py`:
- Around line 60-61: The code mutates os.environ["MLFLOW_TRACKING_URI"] in
_get_client which globally affects subprocesses and other instances; change this
to either call mlflow.set_tracking_uri(self.tracking_uri) before creating the
client or pass the tracking URI directly into MlflowClient (e.g.,
mlflow.MlflowClient(tracking_uri=self.tracking_uri)) and remove the os.environ
assignment so that self._client and mlflow.search_traces resolve the correct
server without globally altering process environment; update the lazy-init in
_get_client (and any places that rely on os.environ) to use self.tracking_uri
explicitly.
In `@pyproject.toml`:
- Line 17: The test-mlflow extra in pyproject.toml still pins mlflow>=2.0;
update the test-mlflow extras specification to require mlflow>=3.10.0 so it
matches the Containerfile and evals/evalhub_adapter README requirements and
ensures workspace-aware SDK features are installed when running uv pip install
.[evalhub,test-mlflow]; modify the test-mlflow entry (the extras key
"test-mlflow") to use "mlflow>=3.10.0".
---
Nitpick comments:
In `@agents/autogen/mcp_agent/tests/behavioral/test_tool_usage.py`:
- Line 19: The test imports load_golden directly from conftest which is fragile
at collection time; extract the load_golden helper into a sibling module named
_golden (define a top-level function load_golden there), update the test and any
collection-time code (e.g., the _tool_queries() usage) to import from _golden
instead of conftest, and ensure the new module is located alongside the tests so
the import is stable during collection.
In `@evals/evalhub_adapter/tests/run-e2e.sh`:
- Around line 555-568: The python3 -c block that sets experiment_id interpolates
${MLFLOW_TOKEN} and ${MLFLOW_TRACKING_URI} directly into a single-quoted Python
string (inside the experiment_id assignment), which can break when tokens
contain single quotes or backslashes; update this block to reuse the existing
JSON-escaped variables (e.g., MLFLOW_TOKEN_JSON and a similarly-escaped
MLFLOW_TRACKING_URI_JSON) instead of raw ${MLFLOW_TOKEN}/${MLFLOW_TRACKING_URI}
so the inline Python receives safe, escaped values; ensure you reference the
same JSON-escape pattern used earlier and replace the interpolations inside the
experiment_id python3 -c command accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 7213ef84-7ece-4452-a026-927fc7098a49
📒 Files selected for processing (18)
README.mdagents/autogen/mcp_agent/README.mdagents/autogen/mcp_agent/evalhub/tool_use.yamlagents/autogen/mcp_agent/tests/behavioral/conftest.pyagents/autogen/mcp_agent/tests/behavioral/fixtures/golden_queries.yamlagents/autogen/mcp_agent/tests/behavioral/test_cost_latency.pyagents/autogen/mcp_agent/tests/behavioral/test_reliability.pyagents/autogen/mcp_agent/tests/behavioral/test_response_quality.pyagents/autogen/mcp_agent/tests/behavioral/test_tool_usage.pydocs/adding-behavioral-tests.mdevals/evalhub_adapter/Containerfileevals/evalhub_adapter/README.mdevals/evalhub_adapter/tests/conftest.pyevals/evalhub_adapter/tests/run-e2e.shevals/harness/mlflow_client.pyevals/harness/runner.pypyproject.tomltests/behavioral/configs/thresholds.yaml
💤 Files with no reviewable changes (1)
- evals/evalhub_adapter/tests/conftest.py
de8a194 to
15748c9
Compare
- Add the AutoGen MCP agent (add/sub tools) as the 3rd agent in the EvalHub on-cluster eval pipeline - Add behavioral tests for tool usage, reliability, response quality, and latency - Upgrade MLflow to >=3.10.0 for workspace-aware SDK support; simplify experiment config so traces and eval metrics live in one experiment - Remove hand-rolled eval-hub-sdk stubs from test conftest; use the real package as a test dependency - Extend harness runner to extract tool_invocations[] from non-streaming responses Co-authored-by: Cursor <cursoragent@cursor.com>
15748c9 to
be33ff0
Compare
- Add chained add→sub query to exercise tool_sequence scorer - Update run-e2e.sh header to reflect all three agent profiles Co-authored-by: Cursor <cursoragent@cursor.com>
AutoGen's reflection loop returns 500 ("Reflect on tool use produced
no valid text response") when two tools are called in a single turn.
Revert the multi-tool query until the agent supports chained tool calls.
Tested: 11/11 behavioral tests pass against agentic-mcp cluster.
Co-authored-by: Cursor <cursoragent@cursor.com>
|
Claude Says:
@andrewdonheiser do you think we need to override this in the EvalHub adapter config params as well ? in the e2e script to make sure it will aligned with what we want ? |
|
apart from that I thinks this PR LGTM. |
Non-streaming responses include tool_invocations/tool_calls in the JSON body. Streaming relies on delta.tool_calls which not all agents emit (e.g. AutoGen uses a custom mcp.tool_usage SSE event). Defaulting to false ensures tool scorers work for all agents out of the box; jobs can still opt in to streaming via job parameters. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Good catch, Kamesh. You're right — the EvalHub adapter defaults I've changed the default to Verified against all three deployed agents:
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@agents/autogen/mcp_agent/tests/behavioral/fixtures/golden_queries.yaml`:
- Around line 30-31: Update the golden fixture so the prompt uses the exact MCP
tool name 'sub' instead of the generic phrase "subtract tool": modify the YAML
entry where the query string asks for subtraction (the 'query' key in the
golden_queries.yaml fixture) to read something like "Please use the `sub` tool
to find the difference between 1000000 and 734291" so the model is explicitly
directed to call the registered tool 'sub' which matches the expected_tools
array.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 96d6886b-3bc0-4dab-a480-57fdc1b4a991
📒 Files selected for processing (3)
agents/autogen/mcp_agent/evalhub/tool_use.yamlagents/autogen/mcp_agent/tests/behavioral/fixtures/golden_queries.yamlevals/evalhub_adapter/config.py
✅ Files skipped from review due to trivial changes (1)
- agents/autogen/mcp_agent/evalhub/tool_use.yaml
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Changes
agents/autogen/mcp_agent/evalhub/tool_use.yaml— 4 golden queries for add/sub toolsevals/evalhub_adapter/Containerfile— COPY autogen fixtures, bump mlflow pinevals/evalhub_adapter/tests/run-e2e.sh— autogen route discovery, health check, eval config, job submissionevals/evalhub_adapter/README.md— updated experiment design docsevals/harness/mlflow_client.py— env-based client init for workspace supportagents/autogen/mcp_agent/tests/behavioral/test_*.py— regex-based whitespace normalizationpyproject.toml— add eval-hub-sdk to test depsevals/evalhub_adapter/tests/conftest.py— remove SDK stubsTest plan
pytest evals/evalhub_adapter/tests/ -m unitpasses (68 tests)Made with Cursor