feat: add AutoGen MCP agent to EvalHub pipeline (RHAIENG-4224) #93

Merged
kami619 merged 5 commits into main from RHAIENG-4224/btest-autogen-mcp
May 13, 2026

@andrewdonheiser
Contributor

Summary

  • Adds the AutoGen MCP agent (add/sub tools) as the 3rd agent in the EvalHub on-cluster eval pipeline
  • Fixes unicode whitespace handling in behavioral test assertions (LLMs emit NBSP/thin-space in numeric responses)
  • Upgrades MLflow to >=3.10.0 for workspace-aware SDK support; simplifies experiment config so traces and eval metrics live in one experiment
  • Removes 130-line hand-rolled eval-hub-sdk stubs from test conftest; uses the real package (publicly available on PyPI) as a test dependency instead

Changes

  • agents/autogen/mcp_agent/evalhub/tool_use.yaml — 4 golden queries for add/sub tools
  • evals/evalhub_adapter/Containerfile — COPY autogen fixtures, bump mlflow pin
  • evals/evalhub_adapter/tests/run-e2e.sh — autogen route discovery, health check, eval config, job submission
  • evals/evalhub_adapter/README.md — updated experiment design docs
  • evals/harness/mlflow_client.py — env-based client init for workspace support
  • agents/autogen/mcp_agent/tests/behavioral/test_*.py — regex-based whitespace normalization
  • pyproject.toml — add eval-hub-sdk to test deps
  • evals/evalhub_adapter/tests/conftest.py — remove SDK stubs
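
As a rough illustration of the whitespace normalization mentioned above, here is a minimal sketch. It uses the regex pattern quoted in the review comments on this PR; the helper name `normalize_numeric_text` is hypothetical and is not taken from the test files.

```python
import re

# Pattern quoted in the review comments: whitespace, commas, NBSP (U+00A0),
# thin space (U+2009), and narrow no-break space (U+202F).
_SEPARATORS = re.compile(r"[\s,\u00a0\u2009\u202f]+")


def normalize_numeric_text(text: str) -> str:
    """Lowercase the text and drop whitespace and comma separators."""
    return _SEPARATORS.sub("", text.lower())


# A response that groups digits with a narrow no-break space still matches.
response = "The difference is 265\u202f709."
assert "265709" in normalize_numeric_text(response)
```

Stripping separators before the substring check lets an assertion on `265709` pass whether the model writes `265,709`, `265 709`, or a variant with a Unicode space.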

Test plan

  • pytest evals/evalhub_adapter/tests/ -m unit passes (68 tests)
  • Full E2E run against cluster with all 3 agents
  • Verify container builds with mlflow>=3.10.0

Made with Cursor

@coderabbitai
Contributor

coderabbitai Bot commented May 6, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: fa7a3946-e8fc-48e4-b649-5d46103c2bed

📥 Commits

Reviewing files that changed from the base of the PR and between 6d74274 and c124da7.

📒 Files selected for processing (1)
  • agents/autogen/mcp_agent/tests/behavioral/fixtures/golden_queries.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
  • agents/autogen/mcp_agent/tests/behavioral/fixtures/golden_queries.yaml

📝 Walkthrough

Walkthrough

Adds AutoGen MCP behavioral tests and fixtures, golden EvalHub fixtures, harness/tool-invocation parsing, MLflow client/adaptor updates, run-e2e and Containerfile changes, and supporting docs/config edits.

Changes

AutoGen MCP Agent Testing & EvalHub Integration

| Layer | File(s) | Summary |
|---|---|---|
| Project Config & Docs | `README.md`, `agents/autogen/mcp_agent/README.md`, `docs/adding-behavioral-tests.md`, `pyproject.toml`, `tests/behavioral/configs/thresholds.yaml` | Adds `AUTOGEN_MCP_AGENT_URL` docs, testing README, `autogen_mcp` pytest marker and thresholds, and test extras dependency. |
| Golden Data & EvalHub Fixture | `agents/autogen/mcp_agent/evalhub/tool_use.yaml`, `agents/autogen/mcp_agent/tests/behavioral/fixtures/golden_queries.yaml`, `evals/evalhub_adapter/Containerfile` | Adds EvalHub `tool_use.yaml` (four golden cases), golden queries fixture, and Containerfile updates to include `autogen_mcp` fixtures and bump the mlflow runtime constraint. |
| Test Fixtures | `agents/autogen/mcp_agent/tests/behavioral/conftest.py` | Adds pytest fixtures: `agent_url`, `http_client`, `eval_config`, `load_golden`, `known_tools`, `autogen_mcp_thresholds`, and `run_eval` (forces `stream=False`, optional MLflow trace enrichment). |
| Behavioral Tests | `agents/autogen/mcp_agent/tests/behavioral/*.py` | Adds tests for tool usage, adversarial/system leakage, hallucinated-tool detection, tool-call arg validity, greeting behavior, pass@k reliability, latency p95, and plan-coherence response quality. |
| Eval Harness & MLflow | `evals/harness/runner.py`, `evals/harness/mlflow_client.py`, `evals/evalhub_adapter/config.py` | Runner now extracts tool calls from choices/context/tool_invocations and normalizes args; response-text extraction improved; MLflow client now constructs `MlflowClient()` without setting tracking URI; `AgenticEvalParams.stream` default set to false. |
| EvalHub Adapter & run-e2e | `evals/evalhub_adapter/README.md`, `evals/evalhub_adapter/tests/run-e2e.sh`, `evals/evalhub_adapter/tests/conftest.py` | Adapter README standardizes MLflow params; run-e2e discovers `AUTOGEN_MCP_AGENT_ROUTE`, requires `tool_use.yaml`, resolves MLflow token, adjusts TLS defaults, generates/submits `eval-autogen-mcp-agent.yaml`, updates provider JSON; removed evalhub bootstrap stub from adapter tests. |

sequenceDiagram
  participant TestRunner as pytest (run_eval)
  participant Agent as AutoGen MCP Agent (HTTP)
  participant EvalHarness as evals.harness.runner
  participant MLflow as mlflow
  participant EvalHub as EvalHub Adapter

  TestRunner->>Agent: send eval request (TaskConfig, stream=False)
  Agent-->>TestRunner: return assistant response (choices / context / tool_invocations)
  TestRunner->>EvalHarness: submit result for normalization
  EvalHarness->>EvalHarness: _extract_tool_calls (choices → context → tool_invocations)
  EvalHarness->>EvalHarness: _extract_response_text (choices[0].message.content or messages[])
  EvalHarness->>MLflow: optionally enrich traces (MlflowClient())
  EvalHarness->>EvalHub: write/submit eval job (eval-autogen-mcp-agent.yaml)
  EvalHub-->>MLflow: run logging / experiment lookup
  EvalHarness-->>TestRunner: return normalized TaskResult
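
The fallback order in the diagram (choices, then context, then tool_invocations) could look roughly like the sketch below. This is not the code in `evals/harness/runner.py`. The `choices[0].message.tool_calls` layout follows the OpenAI chat-completions format, while the `context.tool_calls` key and the `name`/`arguments` fields inside `tool_invocations` are assumptions made for illustration.

```python
import json
from typing import Any


def _normalize_args(raw: Any) -> dict:
    """Return tool arguments as a dict, decoding JSON strings when needed."""
    if isinstance(raw, dict):
        return raw
    if isinstance(raw, str):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            return {}
        return parsed if isinstance(parsed, dict) else {}
    return {}


def extract_tool_calls(body: dict) -> list[dict]:
    """Try choices, then context, then tool_invocations; first non-empty wins."""
    choices = body.get("choices") or []
    message = choices[0].get("message", {}) if choices else {}
    from_choices = [
        {
            "name": call["function"]["name"],
            "arguments": _normalize_args(call["function"].get("arguments")),
        }
        for call in message.get("tool_calls") or []
    ]
    if from_choices:
        return from_choices

    # Assumed shape: entries carrying "name" and "arguments" keys.
    for source in (
        (body.get("context") or {}).get("tool_calls"),
        body.get("tool_invocations"),
    ):
        calls = [
            {"name": call["name"], "arguments": _normalize_args(call.get("arguments"))}
            for call in source or []
        ]
        if calls:
            return calls
    return []
```

Returning the first non-empty source keeps a single tool call from being counted twice when an agent reports it in more than one place.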

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and accurately summarizes the main objective: adding the AutoGen MCP agent to the EvalHub pipeline. |
| Description check | ✅ Passed | The description is well-related to the changeset, providing a structured summary of objectives, changes, and test plan that directly correspond to the files modified. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 95.83%, which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (2)
evals/evalhub_adapter/tests/run-e2e.sh (1)

555-568: 💤 Low value

Reuse the JSON-escape pattern for MLFLOW_TOKEN interpolation.

The MLFLOW_TOKEN is interpolated as ${MLFLOW_TOKEN} directly inside a single-quoted Python literal here, while step 4 (line 339) already establishes a safer MLFLOW_TOKEN_JSON escape. Tokens with single quotes or backslashes would break this python3 -c block. Same pattern would also harden the inline MLFLOW_TRACKING_URI interpolation.

♻️ Proposed change
+    local mlflow_token_json mlflow_uri_json mlflow_exp_json ns_json
+    mlflow_token_json=$(python3 -c "import json,sys; sys.stdout.write(json.dumps(sys.argv[1]))" "${MLFLOW_TOKEN}")
+    mlflow_uri_json=$(python3 -c "import json,sys; sys.stdout.write(json.dumps(sys.argv[1]))" "${MLFLOW_TRACKING_URI}")
+    mlflow_exp_json=$(python3 -c "import json,sys; sys.stdout.write(json.dumps(sys.argv[1]))" "${MLFLOW_EXPERIMENT}")
+    ns_json=$(python3 -c "import json,sys; sys.stdout.write(json.dumps(sys.argv[1]))" "${OC_NAMESPACE}")
     local experiment_id
     experiment_id=$(python3 -c "
 import os, sys
-os.environ.setdefault('MLFLOW_TRACKING_URI', '${MLFLOW_TRACKING_URI}')
-os.environ.setdefault('MLFLOW_TRACKING_TOKEN', '${MLFLOW_TOKEN}')
+os.environ.setdefault('MLFLOW_TRACKING_URI', ${mlflow_uri_json})
+os.environ.setdefault('MLFLOW_TRACKING_TOKEN', ${mlflow_token_json})
 os.environ.setdefault('MLFLOW_TRACKING_INSECURE_TLS', 'true')
-os.environ.setdefault('MLFLOW_WORKSPACE', '${OC_NAMESPACE}')
+os.environ.setdefault('MLFLOW_WORKSPACE', ${ns_json})
 import mlflow
-exp = mlflow.MlflowClient().get_experiment_by_name('${MLFLOW_EXPERIMENT}')
+exp = mlflow.MlflowClient().get_experiment_by_name(${mlflow_exp_json})
 print(exp.experiment_id if exp else '')
 " 2>/dev/null || true)
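
To see why the JSON-escape step matters, the sketch below compares naive single-quote wrapping with `json.dumps` for a value containing quotes and a backslash. The helper names are hypothetical, and the sample token is made up for illustration.

```python
import ast
import json


def to_python_literal(value: str) -> str:
    """Encode a string so it can be pasted into Python source as a literal."""
    return json.dumps(value)


def parses_as_literal(source: str) -> bool:
    """Report whether `source` is a valid Python literal expression."""
    try:
        ast.literal_eval(source)
    except (SyntaxError, ValueError):
        return False
    return True


token = "ab'c\\d\"e"  # contains a single quote, a backslash, and a double quote

# Naive single-quote wrapping breaks on this token.
assert not parses_as_literal(f"'{token}'")

# The JSON-encoded form parses and round-trips to the same value.
assert ast.literal_eval(to_python_literal(token)) == token
```

One caveat: with the default `ensure_ascii=True`, characters outside the Basic Multilingual Plane are written as surrogate-pair escapes, which Python reads back as two separate code points. Typical ASCII tokens and URLs are unaffected.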
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/evalhub_adapter/tests/run-e2e.sh` around lines 555 - 568, The python3
-c block that sets experiment_id interpolates ${MLFLOW_TOKEN} and
${MLFLOW_TRACKING_URI} directly into a single-quoted Python string (inside the
experiment_id assignment), which can break when tokens contain single quotes or
backslashes; update this block to reuse the existing JSON-escaped variables
(e.g., MLFLOW_TOKEN_JSON and a similarly-escaped MLFLOW_TRACKING_URI_JSON)
instead of raw ${MLFLOW_TOKEN}/${MLFLOW_TRACKING_URI} so the inline Python
receives safe, escaped values; ensure you reference the same JSON-escape pattern
used earlier and replace the interpolations inside the experiment_id python3 -c
command accordingly.
agents/autogen/mcp_agent/tests/behavioral/test_tool_usage.py (1)

19-19: 💤 Low value

Importing from conftest is fragile.

from conftest import load_golden only works because pytest happens to put the test directory on sys.path. Since _tool_queries() runs at collection time, you can't use a fixture, but moving load_golden into a sibling helper module (e.g. _golden.py) is more robust and decouples test data loading from pytest's conftest mechanism.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@agents/autogen/mcp_agent/tests/behavioral/test_tool_usage.py` at line 19, The
test imports load_golden directly from conftest which is fragile at collection
time; extract the load_golden helper into a sibling module named _golden (define
a top-level function load_golden there), update the test and any collection-time
code (e.g., the _tool_queries() usage) to import from _golden instead of
conftest, and ensure the new module is located alongside the tests so the import
is stable during collection.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@agents/autogen/mcp_agent/tests/behavioral/test_reliability.py`:
- Line 55: The long assignment to text_normalized exceeds the line-length lint
rule; split it into shorter statements by extracting the regex or the lowered
response into a separate variable and then call re.sub. For example, create a
variable like pattern = r"[\s,\u00a0\u2009\u202f]+" or lowered =
result.response.lower() and then do text_normalized = re.sub(pattern, "",
lowered) so the symbols to edit are the text_normalized assignment and its use
of re.sub/result.response.lower().

In `@agents/autogen/mcp_agent/tests/behavioral/test_tool_usage.py`:
- Line 53: The long-line ruff formatting error is caused by the single long
statement creating text_normalized in test_tool_usage.py; refactor the line by
breaking it into multiple shorter parts or running ruff format. Concretely,
split the statement that assigns text_normalized (the call to re.sub with the
regex r"[\s,\u00a0\u2009\u202f]+" and result.response.lower()) into two
lines—assign the pattern or the lowered response to a separate variable, then
call re.sub—so the function call and regex literal do not exceed the line length
(or simply run `ruff format` to apply the same fix).

In `@evals/harness/mlflow_client.py`:
- Around line 60-61: The code mutates os.environ["MLFLOW_TRACKING_URI"] in
_get_client which globally affects subprocesses and other instances; change this
to either call mlflow.set_tracking_uri(self.tracking_uri) before creating the
client or pass the tracking URI directly into MlflowClient (e.g.,
mlflow.MlflowClient(tracking_uri=self.tracking_uri)) and remove the os.environ
assignment so that self._client and mlflow.search_traces resolve the correct
server without globally altering process environment; update the lazy-init in
_get_client (and any places that rely on os.environ) to use self.tracking_uri
explicitly.

In `@pyproject.toml`:
- Line 17: The test-mlflow extra in pyproject.toml still pins mlflow>=2.0;
update the test-mlflow extras specification to require mlflow>=3.10.0 so it
matches the Containerfile and evals/evalhub_adapter README requirements and
ensures workspace-aware SDK features are installed when running uv pip install
.[evalhub,test-mlflow]; modify the test-mlflow entry (the extras key
"test-mlflow") to use "mlflow>=3.10.0".

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 7213ef84-7ece-4452-a026-927fc7098a49

📥 Commits

Reviewing files that changed from the base of the PR and between 49f1245 and 3b5c203.

📒 Files selected for processing (18)
  • README.md
  • agents/autogen/mcp_agent/README.md
  • agents/autogen/mcp_agent/evalhub/tool_use.yaml
  • agents/autogen/mcp_agent/tests/behavioral/conftest.py
  • agents/autogen/mcp_agent/tests/behavioral/fixtures/golden_queries.yaml
  • agents/autogen/mcp_agent/tests/behavioral/test_cost_latency.py
  • agents/autogen/mcp_agent/tests/behavioral/test_reliability.py
  • agents/autogen/mcp_agent/tests/behavioral/test_response_quality.py
  • agents/autogen/mcp_agent/tests/behavioral/test_tool_usage.py
  • docs/adding-behavioral-tests.md
  • evals/evalhub_adapter/Containerfile
  • evals/evalhub_adapter/README.md
  • evals/evalhub_adapter/tests/conftest.py
  • evals/evalhub_adapter/tests/run-e2e.sh
  • evals/harness/mlflow_client.py
  • evals/harness/runner.py
  • pyproject.toml
  • tests/behavioral/configs/thresholds.yaml
💤 Files with no reviewable changes (1)
  • evals/evalhub_adapter/tests/conftest.py

Comment thread agents/autogen/mcp_agent/tests/behavioral/test_reliability.py Outdated
Comment thread agents/autogen/mcp_agent/tests/behavioral/test_tool_usage.py Outdated
Comment thread evals/harness/mlflow_client.py Outdated
Comment thread pyproject.toml Outdated
@andrewdonheiser andrewdonheiser force-pushed the RHAIENG-4224/btest-autogen-mcp branch 2 times, most recently from de8a194 to 15748c9 Compare May 6, 2026 20:18
- Add the AutoGen MCP agent (add/sub tools) as the 3rd agent in the
  EvalHub on-cluster eval pipeline
- Add behavioral tests for tool usage, reliability, response quality,
  and latency
- Upgrade MLflow to >=3.10.0 for workspace-aware SDK support; simplify
  experiment config so traces and eval metrics live in one experiment
- Remove hand-rolled eval-hub-sdk stubs from test conftest; use the real
  package as a test dependency
- Extend harness runner to extract tool_invocations[] from non-streaming
  responses

Co-authored-by: Cursor <cursoragent@cursor.com>
@andrewdonheiser andrewdonheiser force-pushed the RHAIENG-4224/btest-autogen-mcp branch from 15748c9 to be33ff0 Compare May 6, 2026 20:28
Comment thread agents/autogen/mcp_agent/evalhub/tool_use.yaml
Comment thread evals/evalhub_adapter/tests/run-e2e.sh Outdated
andrewdonheiser and others added 2 commits May 8, 2026 13:04
- Add chained add→sub query to exercise tool_sequence scorer
- Update run-e2e.sh header to reflect all three agent profiles

Co-authored-by: Cursor <cursoragent@cursor.com>
AutoGen's reflection loop returns 500 ("Reflect on tool use produced
no valid text response") when two tools are called in a single turn.
Revert the multi-tool query until the agent supports chained tool calls.

Tested: 11/11 behavioral tests pass against agentic-mcp cluster.
Co-authored-by: Cursor <cursoragent@cursor.com>
@kami619
Contributor

kami619 commented May 11, 2026

Claude Says:

The AutoGen MCP eval config generated by the e2e script doesn't set stream: false. The adapter defaults to stream: true. The entire PR rationale is that tool_invocations[] is only available in non-streaming mode — so the EvalHub adapter path will fail to capture tool calls, causing all tool scorers to score 0.0.

@andrewdonheiser do you think we need to override this in the EvalHub adapter config params as well, or in the e2e script, to make sure it is aligned with what we want?

@kami619
Contributor

kami619 commented May 11, 2026

Apart from that, I think this PR LGTM.

Non-streaming responses include tool_invocations/tool_calls in the JSON
body. Streaming relies on delta.tool_calls which not all agents emit
(e.g. AutoGen uses a custom mcp.tool_usage SSE event). Defaulting to
false ensures tool scorers work for all agents out of the box; jobs can
still opt in to streaming via job parameters.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@andrewdonheiser
Contributor Author

Good catch, Kamesh. You're right — the EvalHub adapter defaults stream: True in AgenticEvalParams, and since the e2e script doesn't override it, AutoGen MCP evals would stream. The AutoGen agent emits tool data via a custom mcp.tool_usage SSE event that the runner's _run_streaming() doesn't parse, so tool scorers would see zero tool calls.

I've changed the default to stream: False in 6d74274. Non-streaming is the safer default for all agents — tool calls are reliably available in the JSON response body. Jobs can still opt in to streaming via job parameters if needed.

Verified against all three deployed agents:

  • AutoGen MCP: 11/11 passed
  • LangGraph React: 9 passed, 2 skipped
  • OpenAI Responses: pre-existing issue (empty responses), unrelated to this change
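
The default-off, opt-in behavior described above might be modeled as in the following sketch. `EvalParamsSketch` and `from_job_parameters` are hypothetical stand-ins, not the real `AgenticEvalParams` definition in `evals/evalhub_adapter/config.py`, and only the `stream` field is represented.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class EvalParamsSketch:
    """Hypothetical stand-in for AgenticEvalParams; only `stream` is modeled."""

    stream: bool = False  # non-streaming unless the job asks otherwise

    @classmethod
    def from_job_parameters(cls, params: Mapping[str, Any]) -> "EvalParamsSketch":
        # Only an explicit boolean True opts in to streaming.
        return cls(stream=params.get("stream") is True)
```

A job that omits `stream` stays non-streaming, so tool calls arrive in the JSON response body for every agent, while a job that sets `stream: true` keeps the streaming path available.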

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@agents/autogen/mcp_agent/tests/behavioral/fixtures/golden_queries.yaml`:
- Around line 30-31: Update the golden fixture so the prompt uses the exact MCP
tool name 'sub' instead of the generic phrase "subtract tool": modify the YAML
entry where the query string asks for subtraction (the 'query' key in the
golden_queries.yaml fixture) to read something like "Please use the `sub` tool
to find the difference between 1000000 and 734291" so the model is explicitly
directed to call the registered tool 'sub' which matches the expected_tools
array.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 96d6886b-3bc0-4dab-a480-57fdc1b4a991

📥 Commits

Reviewing files that changed from the base of the PR and between 421463c and 6d74274.

📒 Files selected for processing (3)
  • agents/autogen/mcp_agent/evalhub/tool_use.yaml
  • agents/autogen/mcp_agent/tests/behavioral/fixtures/golden_queries.yaml
  • evals/evalhub_adapter/config.py
✅ Files skipped from review due to trivial changes (1)
  • agents/autogen/mcp_agent/evalhub/tool_use.yaml

Comment thread agents/autogen/mcp_agent/tests/behavioral/fixtures/golden_queries.yaml Outdated
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread agents/autogen/mcp_agent/evalhub/tool_use.yaml
Contributor

@kami619 kami619 left a comment

lgtm

@kami619 kami619 merged commit b4080a9 into main May 13, 2026
6 checks passed
@kami619 kami619 deleted the RHAIENG-4224/btest-autogen-mcp branch May 13, 2026 11:35