feat: add AutoGen MCP agent to EvalHub pipeline (RHAIENG-4224) by andrewdonheiser · Pull Request #93 · red-hat-data-services/agentic-starter-kits

andrewdonheiser · 2026-05-06T17:57:10Z

Summary

Adds the AutoGen MCP agent (add/sub tools) as the 3rd agent in the EvalHub on-cluster eval pipeline
Fixes unicode whitespace handling in behavioral test assertions (LLMs emit NBSP/thin-space in numeric responses)
Upgrades MLflow to >=3.10.0 for workspace-aware SDK support; simplifies experiment config so traces and eval metrics live in one experiment
Removes 130-line hand-rolled eval-hub-sdk stubs from test conftest; uses the real package (publicly available on PyPI) as a test dependency instead
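
The whitespace fix in the second bullet can be sketched as follows. This is a minimal illustration, not the PR's test code: the pattern is the one quoted later in this review thread, while the helper name `normalize_numeric_text` and the sample strings are assumptions made for the example.

```python
import re

# Pattern quoted in the review thread: whitespace, commas, NBSP (U+00A0),
# thin space (U+2009) and narrow NBSP (U+202F). For str patterns Python's
# \s already matches those three code points; listing them documents intent.
_SEPARATORS = re.compile(r"[\s,\u00a0\u2009\u202f]+")


def normalize_numeric_text(text: str) -> str:
    """Lower-case the text and drop separators so digit groups become contiguous."""
    return _SEPARATORS.sub("", text.lower())


# Hypothetical model responses that format 265709 with different separators.
samples = [
    "The difference is 265,709.",
    "The difference is 265\u00a0709.",
    "The difference is 265\u202f709.",
]
for sample in samples:
    assert "265709" in normalize_numeric_text(sample)
```

Because every separator is removed, an assertion can look for the bare digit string regardless of how the model grouped the thousands.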

Changes

agents/autogen/mcp_agent/evalhub/tool_use.yaml — 4 golden queries for add/sub tools
evals/evalhub_adapter/Containerfile — COPY autogen fixtures, bump mlflow pin
evals/evalhub_adapter/tests/run-e2e.sh — autogen route discovery, health check, eval config, job submission
evals/evalhub_adapter/README.md — updated experiment design docs
evals/harness/mlflow_client.py — env-based client init for workspace support
agents/autogen/mcp_agent/tests/behavioral/test_*.py — regex-based whitespace normalization
pyproject.toml — add eval-hub-sdk to test deps
evals/evalhub_adapter/tests/conftest.py — remove SDK stubs

Test plan

pytest evals/evalhub_adapter/tests/ -m unit passes (68 tests)
Full E2E run against cluster with all 3 agents
Verify container builds with mlflow>=3.10.0

Made with Cursor

coderabbitai · 2026-05-06T18:02:07Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: fa7a3946-e8fc-48e4-b649-5d46103c2bed

📥 Commits

Reviewing files that changed from the base of the PR and between 6d74274 and c124da7.

📒 Files selected for processing (1)

agents/autogen/mcp_agent/tests/behavioral/fixtures/golden_queries.yaml

🚧 Files skipped from review as they are similar to previous changes (1)

agents/autogen/mcp_agent/tests/behavioral/fixtures/golden_queries.yaml

📝 Walkthrough

Adds AutoGen MCP behavioral tests and fixtures, golden EvalHub fixtures, harness/tool-invocation parsing, MLflow client/adaptor updates, run-e2e and Containerfile changes, and supporting docs/config edits.

Changes

AutoGen MCP Agent Testing & EvalHub Integration

Layer / File(s)	Summary
Project Config & Docs `README.md`, `agents/autogen/mcp_agent/README.md`, `docs/adding-behavioral-tests.md`, `pyproject.toml`, `tests/behavioral/configs/thresholds.yaml`	Adds `AUTOGEN_MCP_AGENT_URL` docs, testing README, `autogen_mcp` pytest marker and thresholds, and test extras dependency.
Golden Data & EvalHub Fixture `agents/autogen/mcp_agent/evalhub/tool_use.yaml`, `agents/autogen/mcp_agent/tests/behavioral/fixtures/golden_queries.yaml`, `evals/evalhub_adapter/Containerfile`	Adds EvalHub `tool_use.yaml` (four golden cases), golden queries fixture, and Containerfile updates to include autogen_mcp fixtures and bump mlflow runtime constraint.
Test Fixtures `agents/autogen/mcp_agent/tests/behavioral/conftest.py`	Adds pytest fixtures: `agent_url`, `http_client`, `eval_config`, `load_golden`, `known_tools`, `autogen_mcp_thresholds`, and `run_eval` (forces `stream=False`, optional MLflow trace enrichment).
Behavioral Tests `agents/autogen/mcp_agent/tests/behavioral/*.py`	Adds tests for tool usage, adversarial/system leakage, hallucinated-tool detection, tool-call arg validity, greeting behavior, pass@k reliability, latency p95, and plan-coherence response-quality.
Eval Harness & MLflow `evals/harness/runner.py`, `evals/harness/mlflow_client.py`, `evals/evalhub_adapter/config.py`	Runner now extracts tool calls from choices/context/tool_invocations and normalizes args; response-text extraction improved; MLflow client now constructs `MlflowClient()` without setting tracking URI; `AgenticEvalParams.stream` default set to false.
EvalHub Adapter & run-e2e `evals/evalhub_adapter/README.md`, `evals/evalhub_adapter/tests/run-e2e.sh`, `evals/evalhub_adapter/tests/conftest.py`	Adapter README standardizes MLflow params; run-e2e discovers `AUTOGEN_MCP_AGENT_ROUTE`, requires `tool_use.yaml`, resolves MLflow token, adjusts TLS defaults, generates/submits `eval-autogen-mcp-agent.yaml`, updates provider JSON; removed evalhub bootstrap stub from adapter tests.

sequenceDiagram
  participant TestRunner as pytest (run_eval)
  participant Agent as AutoGen MCP Agent (HTTP)
  participant EvalHarness as evals.harness.runner
  participant MLflow as mlflow
  participant EvalHub as EvalHub Adapter

  TestRunner->>Agent: send eval request (TaskConfig, stream=False)
  Agent-->>TestRunner: return assistant response (choices / context / tool_invocations)
  TestRunner->>EvalHarness: submit result for normalization
  EvalHarness->>EvalHarness: _extract_tool_calls (choices → context → tool_invocations)
  EvalHarness->>EvalHarness: _extract_response_text (choices[0].message.content or messages[])
  EvalHarness->>MLflow: optionally enrich traces (MlflowClient())
  EvalHarness->>EvalHub: write/submit eval job (eval-autogen-mcp-agent.yaml)
  EvalHub-->>MLflow: run logging / experiment lookup
  EvalHarness-->>TestRunner: return normalized TaskResult
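
The extraction order in the diagram (choices, then context, then tool_invocations) could look roughly like the sketch below. The response shapes are assumptions: `choices[0].message.tool_calls` follows the OpenAI chat format, while the layout of `context` and `tool_invocations`, the function names, and the "first non-empty source wins" rule are guesses for illustration and are not taken from `evals/harness/runner.py`.

```python
import json
from typing import Any


def normalize_args(raw: Any) -> dict:
    """Return tool arguments as a dict, parsing JSON strings when possible."""
    if isinstance(raw, dict):
        return raw
    if isinstance(raw, str):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            return {}
        return parsed if isinstance(parsed, dict) else {}
    return {}


def _from_openai_calls(calls: Any) -> list:
    """Convert OpenAI-style tool_calls entries into {name, args} dicts."""
    result = []
    for call in calls or []:
        fn = call.get("function", {})
        result.append({"name": fn.get("name"), "args": normalize_args(fn.get("arguments"))})
    return result


def extract_tool_calls(body: dict) -> list:
    """Try each assumed location in order and return the first non-empty result."""
    # 1. OpenAI-style tool calls on the first choice.
    choices = body.get("choices") or []
    if choices:
        calls = _from_openai_calls((choices[0].get("message") or {}).get("tool_calls"))
        if calls:
            return calls
    # 2. Tool calls attached to context messages (assumed shape).
    calls = []
    for message in body.get("context") or []:
        calls.extend(_from_openai_calls(message.get("tool_calls")))
    if calls:
        return calls
    # 3. Flat tool_invocations[] entries (assumed keys: name, args).
    return [
        {"name": inv.get("name"), "args": normalize_args(inv.get("args"))}
        for inv in body.get("tool_invocations") or []
    ]
```

With this layout a response that carries only `tool_invocations` still yields tool calls, which is the case the non-streaming default is meant to cover.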

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and accurately summarizes the main objective: adding the AutoGen MCP agent to the EvalHub pipeline.
Description check	✅ Passed	The description is well-related to the changeset, providing a structured summary of objectives, changes, and test plan that directly correspond to the files modified.
Docstring Coverage	✅ Passed	Docstring coverage is 95.83% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (2)

evals/evalhub_adapter/tests/run-e2e.sh (1)

555-568: 💤 Low value

Reuse the JSON-escape pattern for MLFLOW_TOKEN interpolation.

The MLFLOW_TOKEN is interpolated as ${MLFLOW_TOKEN} directly inside a single-quoted Python literal here, while step 4 (line 339) already establishes a safer MLFLOW_TOKEN_JSON escape. Tokens with single quotes or backslashes would break this python3 -c block. Same pattern would also harden the inline MLFLOW_TRACKING_URI interpolation.

♻️ Proposed change

+    local mlflow_token_json mlflow_uri_json mlflow_exp_json ns_json
+    mlflow_token_json=$(python3 -c "import json,sys; sys.stdout.write(json.dumps(sys.argv[1]))" "${MLFLOW_TOKEN}")
+    mlflow_uri_json=$(python3 -c "import json,sys; sys.stdout.write(json.dumps(sys.argv[1]))" "${MLFLOW_TRACKING_URI}")
+    mlflow_exp_json=$(python3 -c "import json,sys; sys.stdout.write(json.dumps(sys.argv[1]))" "${MLFLOW_EXPERIMENT}")
+    ns_json=$(python3 -c "import json,sys; sys.stdout.write(json.dumps(sys.argv[1]))" "${OC_NAMESPACE}")
     local experiment_id
     experiment_id=$(python3 -c "
 import os, sys
-os.environ.setdefault('MLFLOW_TRACKING_URI', '${MLFLOW_TRACKING_URI}')
-os.environ.setdefault('MLFLOW_TRACKING_TOKEN', '${MLFLOW_TOKEN}')
+os.environ.setdefault('MLFLOW_TRACKING_URI', ${mlflow_uri_json})
+os.environ.setdefault('MLFLOW_TRACKING_TOKEN', ${mlflow_token_json})
 os.environ.setdefault('MLFLOW_TRACKING_INSECURE_TLS', 'true')
-os.environ.setdefault('MLFLOW_WORKSPACE', '${OC_NAMESPACE}')
+os.environ.setdefault('MLFLOW_WORKSPACE', ${ns_json})
 import mlflow
-exp = mlflow.MlflowClient().get_experiment_by_name('${MLFLOW_EXPERIMENT}')
+exp = mlflow.MlflowClient().get_experiment_by_name(${mlflow_exp_json})
 print(exp.experiment_id if exp else '')
 " 2>/dev/null || true)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@evals/evalhub_adapter/tests/run-e2e.sh` around lines 555 - 568, The python3
-c block that sets experiment_id interpolates ${MLFLOW_TOKEN} and
${MLFLOW_TRACKING_URI} directly into a single-quoted Python string (inside the
experiment_id assignment), which can break when tokens contain single quotes or
backslashes; update this block to reuse the existing JSON-escaped variables
(e.g., MLFLOW_TOKEN_JSON and a similarly-escaped MLFLOW_TRACKING_URI_JSON)
instead of raw ${MLFLOW_TOKEN}/${MLFLOW_TRACKING_URI} so the inline Python
receives safe, escaped values; ensure you reference the same JSON-escape pattern
used earlier and replace the interpolations inside the experiment_id python3 -c
command accordingly.
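
The escaping concern above can be reproduced in isolation. The sketch below is illustrative only: the helper names and the sample token are made up, and it stands in for the shell-to-Python interpolation rather than reproducing `run-e2e.sh`. It relies on the fact that a JSON string literal produced by `json.dumps` is also a valid Python string literal for typical ASCII tokens.

```python
import json


def naive_snippet(token: str) -> str:
    """Build Python source by pasting the raw token between single quotes."""
    return f"value = '{token}'"


def escaped_snippet(token: str) -> str:
    """Build Python source from a JSON-encoded token (quoted and escaped)."""
    return f"value = {json.dumps(token)}"


def run_snippet(source: str) -> str:
    """Execute generated source and return the string it assigned to `value`."""
    namespace: dict = {}
    exec(source, namespace)
    return namespace["value"]


# Made-up token containing a single quote and a backslash.
token = "abc'def\\ghi"

try:
    run_snippet(naive_snippet(token))
    naive_ok = True
except SyntaxError:
    naive_ok = False

print("naive interpolation parsed:", naive_ok)
print("escaped round-trip intact:", run_snippet(escaped_snippet(token)) == token)
```

Passing the values through `sys.argv` or the environment instead of generating source text would avoid the quoting question altogether; the JSON-escape route is shown because it is the pattern the review refers to.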

agents/autogen/mcp_agent/tests/behavioral/test_tool_usage.py (1)

19-19: 💤 Low value

Importing from conftest is fragile.

from conftest import load_golden only works because pytest happens to put the test directory on sys.path. Since _tool_queries() runs at collection time, you can't use a fixture, but moving load_golden into a sibling helper module (e.g. _golden.py) is more robust and decouples test data loading from pytest's conftest mechanism.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@agents/autogen/mcp_agent/tests/behavioral/test_tool_usage.py` at line 19, The
test imports load_golden directly from conftest which is fragile at collection
time; extract the load_golden helper into a sibling module named _golden (define
a top-level function load_golden there), update the test and any collection-time
code (e.g., the _tool_queries() usage) to import from _golden instead of
conftest, and ensure the new module is located alongside the tests so the import
is stable during collection.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@agents/autogen/mcp_agent/tests/behavioral/test_reliability.py`:
- Line 55: The long assignment to text_normalized exceeds the line-length lint
rule; split it into shorter statements by extracting the regex or the lowered
response into a separate variable and then call re.sub. For example, create a
variable like pattern = r"[\s,\u00a0\u2009\u202f]+" or lowered =
result.response.lower() and then do text_normalized = re.sub(pattern, "",
lowered) so the symbols to edit are the text_normalized assignment and its use
of re.sub/result.response.lower().

In `@agents/autogen/mcp_agent/tests/behavioral/test_tool_usage.py`:
- Line 53: The long-line ruff formatting error is caused by the single long
statement creating text_normalized in test_tool_usage.py; refactor the line by
breaking it into multiple shorter parts or running ruff format. Concretely,
split the statement that assigns text_normalized (the call to re.sub with the
regex r"[\s,\u00a0\u2009\u202f]+" and result.response.lower()) into two
lines—assign the pattern or the lowered response to a separate variable, then
call re.sub—so the function call and regex literal do not exceed the line length
(or simply run `ruff format` to apply the same fix).

In `@evals/harness/mlflow_client.py`:
- Around line 60-61: The code mutates os.environ["MLFLOW_TRACKING_URI"] in
_get_client which globally affects subprocesses and other instances; change this
to either call mlflow.set_tracking_uri(self.tracking_uri) before creating the
client or pass the tracking URI directly into MlflowClient (e.g.,
mlflow.MlflowClient(tracking_uri=self.tracking_uri)) and remove the os.environ
assignment so that self._client and mlflow.search_traces resolve the correct
server without globally altering process environment; update the lazy-init in
_get_client (and any places that rely on os.environ) to use self.tracking_uri
explicitly.

In `@pyproject.toml`:
- Line 17: The test-mlflow extra in pyproject.toml still pins mlflow>=2.0;
update the test-mlflow extras specification to require mlflow>=3.10.0 so it
matches the Containerfile and evals/evalhub_adapter README requirements and
ensures workspace-aware SDK features are installed when running uv pip install
.[evalhub,test-mlflow]; modify the test-mlflow entry (the extras key
"test-mlflow") to use "mlflow>=3.10.0".

---
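
For the `evals/harness/mlflow_client.py` comment above, one way to follow the suggestion is to hand the tracking URI to the client constructor instead of writing it to `os.environ`. The class below is a hypothetical stand-in, not the repository's wrapper; only `mlflow.MlflowClient(tracking_uri=...)` is a real MLflow call, and it is imported lazily so constructing the wrapper has no side effects.

```python
from typing import Any, Optional


class TraceClient:
    """Hypothetical wrapper that keeps the tracking URI on the instance."""

    def __init__(self, tracking_uri: str) -> None:
        self.tracking_uri = tracking_uri
        self._client: Optional[Any] = None

    def _get_client(self) -> Any:
        """Create the MLflow client on first use without touching os.environ."""
        if self._client is None:
            import mlflow  # lazy import: only needed once a client is requested

            self._client = mlflow.MlflowClient(tracking_uri=self.tracking_uri)
        return self._client
```

The trade-off is that settings a deployment supplies through environment variables are then not picked up implicitly, which may matter for the workspace support this PR targets.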


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 7213ef84-7ece-4452-a026-927fc7098a49

📥 Commits

Reviewing files that changed from the base of the PR and between 49f1245 and 3b5c203.

📒 Files selected for processing (18)

README.md
agents/autogen/mcp_agent/README.md
agents/autogen/mcp_agent/evalhub/tool_use.yaml
agents/autogen/mcp_agent/tests/behavioral/conftest.py
agents/autogen/mcp_agent/tests/behavioral/fixtures/golden_queries.yaml
agents/autogen/mcp_agent/tests/behavioral/test_cost_latency.py
agents/autogen/mcp_agent/tests/behavioral/test_reliability.py
agents/autogen/mcp_agent/tests/behavioral/test_response_quality.py
agents/autogen/mcp_agent/tests/behavioral/test_tool_usage.py
docs/adding-behavioral-tests.md
evals/evalhub_adapter/Containerfile
evals/evalhub_adapter/README.md
evals/evalhub_adapter/tests/conftest.py
evals/evalhub_adapter/tests/run-e2e.sh
evals/harness/mlflow_client.py
evals/harness/runner.py
pyproject.toml
tests/behavioral/configs/thresholds.yaml

💤 Files with no reviewable changes (1)

evals/evalhub_adapter/tests/conftest.py

- Add the AutoGen MCP agent (add/sub tools) as the 3rd agent in the EvalHub on-cluster eval pipeline
- Add behavioral tests for tool usage, reliability, response quality, and latency
- Upgrade MLflow to >=3.10.0 for workspace-aware SDK support; simplify experiment config so traces and eval metrics live in one experiment
- Remove hand-rolled eval-hub-sdk stubs from test conftest; use the real package as a test dependency
- Extend harness runner to extract tool_invocations[] from non-streaming responses

Co-authored-by: Cursor <cursoragent@cursor.com>

- Add chained add→sub query to exercise tool_sequence scorer
- Update run-e2e.sh header to reflect all three agent profiles

Co-authored-by: Cursor <cursoragent@cursor.com>

AutoGen's reflection loop returns 500 ("Reflect on tool use produced no valid text response") when two tools are called in a single turn. Revert the multi-tool query until the agent supports chained tool calls.

Tested: 11/11 behavioral tests pass against agentic-mcp cluster.

Co-authored-by: Cursor <cursoragent@cursor.com>

kami619 · 2026-05-11T18:23:30Z

Claude Says:

The AutoGen MCP eval config generated by the e2e script doesn't set stream: false. The adapter defaults to stream: true. The entire PR rationale is that tool_invocations[] is only available in non-streaming mode — so the EvalHub adapter path will fail to capture tool calls, causing all tool scorers to score 0.0.

@andrewdonheiser do you think we need to override this in the EvalHub adapter config params as well, in the e2e script, to make sure it is aligned with what we want?

kami619 · 2026-05-11T18:26:29Z

Apart from that, I think this PR LGTM.

Non-streaming responses include tool_invocations/tool_calls in the JSON body. Streaming relies on delta.tool_calls, which not all agents emit (e.g. AutoGen uses a custom mcp.tool_usage SSE event). Defaulting to false ensures tool scorers work for all agents out of the box; jobs can still opt in to streaming via job parameters.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

andrewdonheiser · 2026-05-12T21:18:01Z

Good catch, Kamesh. You're right — the EvalHub adapter defaults stream: True in AgenticEvalParams, and since the e2e script doesn't override it, AutoGen MCP evals would stream. The AutoGen agent emits tool data via a custom mcp.tool_usage SSE event that the runner's _run_streaming() doesn't parse, so tool scorers would see zero tool calls.

I've changed the default to stream: False in 6d74274. Non-streaming is the safer default for all agents — tool calls are reliably available in the JSON response body. Jobs can still opt in to streaming via job parameters if needed.

Verified against all three deployed agents:

AutoGen MCP: 11/11 passed
LangGraph React: 9 passed, 2 skipped
OpenAI Responses: pre-existing issue (empty responses), unrelated to this change
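
The default change and the opt-in path described above can be sketched like this. It is an assumption-laden illustration: `EvalParams` and `params_from_job` are hypothetical names, and the real `AgenticEvalParams` in `evals/evalhub_adapter/config.py` may be defined differently (for example with a validation library rather than a dataclass).

```python
from dataclasses import dataclass, fields
from typing import Any, Mapping


@dataclass(frozen=True)
class EvalParams:
    """Hypothetical stand-in for the adapter's parameter object."""

    # Non-streaming by default so tool calls arrive in the JSON response body.
    stream: bool = False


def params_from_job(job_parameters: Mapping[str, Any]) -> EvalParams:
    """Build params from job parameters, ignoring keys this sketch does not model."""
    known = {f.name for f in fields(EvalParams)}
    return EvalParams(**{k: v for k, v in job_parameters.items() if k in known})


default_params = params_from_job({})
streaming_params = params_from_job({"stream": True})
print(default_params.stream, streaming_params.stream)
```

A job that needs streaming sets `stream` explicitly; everything else gets the non-streaming default without changes to the e2e script.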

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@agents/autogen/mcp_agent/tests/behavioral/fixtures/golden_queries.yaml`:
- Around line 30-31: Update the golden fixture so the prompt uses the exact MCP
tool name 'sub' instead of the generic phrase "subtract tool": modify the YAML
entry where the query string asks for subtraction (the 'query' key in the
golden_queries.yaml fixture) to read something like "Please use the `sub` tool
to find the difference between 1000000 and 734291" so the model is explicitly
directed to call the registered tool 'sub' which matches the expected_tools
array.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 96d6886b-3bc0-4dab-a480-57fdc1b4a991

📥 Commits

Reviewing files that changed from the base of the PR and between 421463c and 6d74274.

📒 Files selected for processing (3)

agents/autogen/mcp_agent/evalhub/tool_use.yaml
agents/autogen/mcp_agent/tests/behavioral/fixtures/golden_queries.yaml
evals/evalhub_adapter/config.py

✅ Files skipped from review due to trivial changes (1)

agents/autogen/mcp_agent/evalhub/tool_use.yaml

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

kami619

lgtm

github-actions Bot added area/autogen area/docs area/tests size/l labels May 6, 2026

coderabbitai Bot reviewed May 6, 2026

Comment thread agents/autogen/mcp_agent/tests/behavioral/test_reliability.py Outdated

Comment thread agents/autogen/mcp_agent/tests/behavioral/test_tool_usage.py Outdated

Comment thread evals/harness/mlflow_client.py Outdated

Comment thread pyproject.toml Outdated

andrewdonheiser force-pushed the RHAIENG-4224/btest-autogen-mcp branch 2 times, most recently from de8a194 to 15748c9 Compare May 6, 2026 20:18

andrewdonheiser force-pushed the RHAIENG-4224/btest-autogen-mcp branch from 15748c9 to be33ff0 Compare May 6, 2026 20:28

andrewdonheiser requested review from kami619 and sanafayyaz315 May 6, 2026 21:49

sanafayyaz315 reviewed May 7, 2026

Comment thread agents/autogen/mcp_agent/evalhub/tool_use.yaml

Comment thread evals/evalhub_adapter/tests/run-e2e.sh Outdated

andrewdonheiser and others added 2 commits May 8, 2026 13:04

fix: address PR review — add multi-tool query, fix e2e comment

421463c

- Add chained add→sub query to exercise tool_sequence scorer
- Update run-e2e.sh header to reflect all three agent profiles

Co-authored-by: Cursor <cursoragent@cursor.com>

coderabbitai Bot reviewed May 12, 2026

Comment thread agents/autogen/mcp_agent/tests/behavioral/fixtures/golden_queries.yaml Outdated

fix: use exact MCP tool name in golden query fixture

c124da7

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

kami619 reviewed May 13, 2026

Comment thread agents/autogen/mcp_agent/evalhub/tool_use.yaml

kami619 approved these changes May 13, 2026

kami619 merged commit b4080a9 into main May 13, 2026
6 checks passed

kami619 deleted the RHAIENG-4224/btest-autogen-mcp branch May 13, 2026 11:35

mpk-droid mentioned this pull request May 13, 2026

feat: add behavioral tests and EvalHub integration for CrewAI websearch agent #97

Merged

5 tasks
