feat(evalhub): EvalHub adapter — E2E validated, hardened, documented#82
Conversation
|
Warning Rate limit exceeded
To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (4)
📝 WalkthroughWalkthroughAdds an EvalHub agent evaluation adapter and harness: new adapter package, config/validation, benchmark registry and YAML fixtures, MLflow integration and client checks, container build and docs, comprehensive unit/integration tests plus an e2e script, and project metadata/gitignore updates to surface evals assets. ChangesAgentic EvalHub adapter + harness
Sequence DiagramsequenceDiagram
participant CLI as EvalHub CLI
participant Adapter as AgenticEvalAdapter
participant Agent as Agent Service
participant Scorer as Scoring subsystem
participant MLflow as MLflow
CLI->>Adapter: run_benchmark_job(config, callbacks)
Adapter->>Adapter: Load benchmark & queries
Adapter->>Adapter: Validate config (URLs, TLS, paths)
loop For each query
Adapter->>Agent: run_task(query, stream=True)
Agent-->>Adapter: SSE stream (tool calls, content deltas)
Adapter->>Adapter: Parse stream -> build trace, extract tool calls
Adapter->>Scorer: _run_scorer(name, QuerySpec, trace)
Scorer-->>Adapter: Score(value, passed, details)
Adapter->>Adapter: Record per-query scores & metadata
end
Adapter->>Adapter: Aggregate scores (mean, pass rate, min/max)
Adapter->>MLflow: _log_mlflow_run(experiment, metrics, artifacts)
MLflow-->>Adapter: run_id
Adapter->>Adapter: Enrich results with MLflow trace info
Adapter-->>CLI: JobResults(metrics, run_id, status)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Review rate limit: 0/1 reviews remaining, refill in 13 minutes.Comment |
|
Large PR detected (3576 lines changed) This PR exceeds 1200 lines of code changes (excluding lock files, generated content, and images). Large PRs are harder to review thoroughly and are more likely to introduce bugs. Consider splitting this PR into smaller, focused changes. |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@evals/evalhub_adapter/Containerfile`:
- Around line 27-31: The RUN pip install line installs unpinned runtime packages
(the RUN line installing "eval-hub-sdk[adapter]", httpx, mlflow, PyYAML), so pin
or bound their versions to ensure reproducible builds; update the RUN invocation
to install either exact versions (package==x.y.z) or bounded ranges
(package>=x.y,<x+1.y) for each of "eval-hub-sdk[adapter]", httpx, mlflow and
PyYAML, or reference a checked-in constraints.txt/requirements.txt file and use
pip install --no-cache -r constraints.txt to enforce deterministic dependency
resolution.
In `@evals/evalhub_adapter/tests/run-e2e.sh`:
- Around line 489-492: The current curl call that sets experiment_id omits
authentication flags and does not URL-encode the experiment name; update the
command that sets the experiment_id (the block that calls curl and assigns
experiment_id) to include the existing CURL_TLS_FLAG and to pass the
MLFLOW_TOKEN for auth (e.g., via -H "Authorization: Bearer ${MLFLOW_TOKEN}" or
similar), and URL-encode the experiment name when building the query parameter
so the lookup succeeds on secured clusters; also ensure the later URL generation
that falls back to the raw experiment name/experiment_id uses the URL-encoded
value rather than the unencoded variable.
- Around line 300-307: MLFLOW_AUTH_CHECK currently uses the low-privilege
"/api/3.0/mlflow/server-info" endpoint which can return non-200 for reasons
unrelated to auth; change the curl check to call a protected endpoint such as
"/api/2.0/mlflow/experiments/list" so the HTTP status reliably reflects
authorization (401/403 on token failure). Update the MLFLOW_AUTH_CHECK command
to hit "${MLFLOW_TRACKING_URI}/api/2.0/mlflow/experiments/list" with the same
Authorization header and error handling, and keep the existing conditional that
warns and prints the token refresh instructions when the status is not 200.
- Line 109: The JSONPath "contains" predicate is unsupported and returns empty
results; replace the failing jsonpath expression used to populate EVALHUB_ROUTE
with a command that lists route names and hosts (e.g., oc get route -n
"$OC_NAMESPACE" -o custom-columns=NAME:.metadata.name,HOST:.spec.host
--no-headers) and then grep/awk to pick the host for the route whose NAME
contains "eval"; do the same replacement for the commands that set REACT_ROUTE
and OPENAI_ROUTE so all three use the custom-columns + grep/awk pattern instead
of the jsonpath contains filter.
In `@evals/harness/mlflow_client.py`:
- Around line 92-105: The log messages around MLflow auth handling use two
different environment variable names (MLFLOW_TRACKING_TOKEN vs MLFLOW_TOKEN),
causing confusion; update the guidance in the JSONDecodeError/non-JSON branch
(the logger.error message that currently says "export MLFLOW_TOKEN=$(oc whoami
-t)...") to reference the same name used earlier (MLFLOW_TRACKING_TOKEN) so both
messages consistently instruct users to export MLFLOW_TRACKING_TOKEN=$(oc whoami
-t) and re-register the provider; verify both logger.error occurrences mention
the identical env var string.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 4487de3d-4e4b-4169-b08f-d8427c04b5e4
📒 Files selected for processing (25)
.gitignoreREADME.mdagents/langgraph/react_agent/evalhub/tool_use.yamlagents/vanilla_python/openai_responses_agent/evalhub/tool_use.yamldocs/adding-evalhub-agent-integration.mdevals/evalhub_adapter/Containerfileevals/evalhub_adapter/README.mdevals/evalhub_adapter/__init__.pyevals/evalhub_adapter/adapter.pyevals/evalhub_adapter/config.pyevals/evalhub_adapter/evaluations.pyevals/evalhub_adapter/tests/conftest.pyevals/evalhub_adapter/tests/run-e2e.shevals/evalhub_adapter/tests/test_adapter.pyevals/evalhub_adapter/tests/test_config_and_evaluations.pyevals/evalhub_adapter/tests/test_integration.pyevals/harness/__init__.pyevals/harness/mlflow_client.pyevals/harness/runner.pyevals/harness/scorers/__init__.pyevals/harness/scorers/latency.pyevals/harness/scorers/plan_coherence.pyevals/harness/scorers/safety.pyevals/harness/scorers/tool_sequence.pypyproject.toml
|
All 5 CodeRabbit findings addressed in dd91971:
Also added missing docstrings ( Note: The CodeQL "Analyze (actions)" failure is a pre-existing infrastructure issue (CodeQL cannot process GitHub Actions files in this repo) — not related to this PR's changes. |
dd91971 to
d5e9849
Compare
Add the EvalHub on-cluster adapter (evals/evalhub_adapter/) and shared eval harness (evals/harness/). Includes Containerfile, unit/integration tests, agent-specific eval fixtures, and walkthrough documentation. Made-with: Cursor
d5e9849 to
6ce2a58
Compare
sanafayyaz315
left a comment
There was a problem hiding this comment.
Looks good overall. Added a few comments
|
Test contract mismatch for .[test] installs evals/evalhub_adapter/README.md says uv pip install .[test] is enough for adapter tests, but two unit tests patch top-level mlflow symbols and fail when mlflow is not installed. Claude Reproduced this issue in a clean env: UV_PROJECT_ENVIRONMENT=/tmp/uv-pr82-test uv sync --extra test test_adapter.py Lines 353-360 README.md Lines 479-482 |
|
Just learning about the adapter. You might have already looked into this. @andrewdonheiser do you think the adapter would be better added to this repo(https://github.com/eval-hub/eval-hub-contrib)? @kami619 @sanafayyaz315 are you guys aware of this? Maybe we should ask the suggestion from one of the contributors of that repo? My reasoning is that, this adapter seems like it could benefit any agent and a larger audience than just agentic-starter-kit users? |
From my point of view, I think it is fine to land it here, as we need this for the QG7: Behavioral Evals., once this matures enough, we can make a contribution to the evalhub space directly. |
Thats a fair point. We will need them for QG7. In the future, we will need to find a way to have these in a centralized place and import for testing. |
I feel like either we will have to adjust the PR size rule or follow the suggestion. What do you guys think? In general, I am on the smaller PR team. Curious to hear everyone else's thoughts. |
100% on the small PR team, but I think with these initial bits being added, we would have to live through the pain a bit, otherwise we might succumb to the Agile papercuts. But we need to make a conscious effort once we have scaffolding in place, to ensure the atomic PRs become the norm. |
@mpk-droid I had the same thought of contributing it to the eval-hub-contrib repo. This adapter is agent-agnostic — the runner, scorers, and MLflow integration work with any agent that exposes /chat/completions. The only agent-specific piece is the fixture YAML files (golden queries and expected tools). That said, I think we should first validate it across the agents we have in agentic-starter-kits-v2 (e.g., LangGraph, OpenAI Responses) to prove it handles different agents reliably, and then contribute it to the contrib repo. |
the tests and documentation take up a lot of size. the adapter code itself is stripped down. I don't think splitting it up makes sense. |
Addresses PR review feedback: _validate_url now permits localhost hosts (localhost, 127.0.0.1, ::1, 0.0.0.0) when EVALHUB_ALLOW_LOCALHOST=true is set, following the same gating pattern as EVALHUB_ALLOW_INSECURE_TLS. Cloud metadata endpoints remain blocked regardless. Adds TODO for future auto-discovery of agent fixture dirs in the Containerfile. Co-authored-by: Cursor <cursoragent@cursor.com>
Two unit tests in test_adapter.py patch top-level mlflow symbols (@patch("mlflow.log_metric"), etc.) which requires mlflow to be in sys.modules at decoration time. Since mlflow lives in the test-mlflow extra, not test, these tests fail in a clean .[test]-only env. Add an mlflow stub in conftest.py following the same pattern used for evalhub. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@evals/evalhub_adapter/evaluations.py`:
- Around line 101-115: The code accepts expected_tools and expected_elements
without validating their types, which can let scalars or mixed lists slip
through; in the loop that builds QuerySpec (the block creating QuerySpec with
query=entry["query"], expected_tools=entry.get("expected_tools", []),
expected_elements=entry.get("expected_elements", [])), validate that both
entry.get("expected_tools") and entry.get("expected_elements") are either
missing/None or lists, and that every item in those lists is a string (or raise
a ValueError including the path and the query index i); if you prefer
strictness, coerce missing values to [] before constructing QuerySpec and reject
non-list or non-string items with a clear error referencing QuerySpec creation
and the offending field name.
In `@evals/evalhub_adapter/README.md`:
- Around line 146-187: The README templating step for provider-agentic.json
interpolates ${MLFLOW_TOKEN} into the runtime Env MLFLOW_TRACKING_TOKEN but
never exports MLFLOW_TOKEN beforehand; export MLFLOW_TOKEN (e.g., set
MLFLOW_TOKEN="$(oc whoami -t)" or equivalent) before creating
provider-agentic.json so the "MLFLOW_TRACKING_TOKEN" Env in
provider-agentic.json contains a valid token rather than an empty string, then
re-run the provider registration step that uses provider-agentic.json.
In `@evals/harness/mlflow_client.py`:
- Around line 71-86: verify_connection() calls self._get_client() but
_get_client() only resolves the experiment_id during initial client creation
(when self._client is None), so once _experiment_id is None it stays missing and
get_latest_trace() can never re-resolve it; modify _get_client() so the
experiment lookup/assignment (setting self._experiment_id using
self.experiment_name) is performed unconditionally (i.e., move the experiment
resolution outside the "if self._client is None" guard) or add a separate method
called from _get_client() to always re-check and set _experiment_id; ensure
verify_connection(), get_latest_trace(), and any callers rely on the updated
_experiment_id after subsequent _get_client() calls.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: e68680d4-10ac-4c42-adb8-28d45b330f5a
📒 Files selected for processing (25)
.gitignoreREADME.mdagents/langgraph/react_agent/evalhub/tool_use.yamlagents/vanilla_python/openai_responses_agent/evalhub/tool_use.yamldocs/adding-evalhub-agent-integration.mdevals/evalhub_adapter/Containerfileevals/evalhub_adapter/README.mdevals/evalhub_adapter/__init__.pyevals/evalhub_adapter/adapter.pyevals/evalhub_adapter/config.pyevals/evalhub_adapter/evaluations.pyevals/evalhub_adapter/tests/conftest.pyevals/evalhub_adapter/tests/run-e2e.shevals/evalhub_adapter/tests/test_adapter.pyevals/evalhub_adapter/tests/test_config_and_evaluations.pyevals/evalhub_adapter/tests/test_integration.pyevals/harness/__init__.pyevals/harness/mlflow_client.pyevals/harness/runner.pyevals/harness/scorers/__init__.pyevals/harness/scorers/latency.pyevals/harness/scorers/plan_coherence.pyevals/harness/scorers/safety.pyevals/harness/scorers/tool_sequence.pypyproject.toml
✅ Files skipped from review due to trivial changes (6)
- evals/harness/scorers/latency.py
- README.md
- evals/evalhub_adapter/init.py
- agents/langgraph/react_agent/evalhub/tool_use.yaml
- .gitignore
- evals/evalhub_adapter/Containerfile
🚧 Files skipped from review as they are similar to previous changes (2)
- agents/vanilla_python/openai_responses_agent/evalhub/tool_use.yaml
- pyproject.toml
- evaluations.py: validate query/expected_tools/expected_elements types in load_queries to fail fast on malformed fixtures - mlflow_client.py: re-resolve experiment_id on subsequent _get_client() calls so trace enrichment works when the experiment is created after verify_connection() - README.md: add MLFLOW_TOKEN export before provider JSON template; add language tag to fenced code block (MD040) Co-authored-by: Cursor <cursoragent@cursor.com>
|
E2E validated on RHOAI 3.4.0-ea.2 cluster — works end-to-end, but surfaced a server/SDK version mismatch that will affect users.
Suggestion: Either document the minimum server version requirement (0.3.0) in the runbook/README, or coordinate with the EvalHub team to ensure RHOAI ships 0.3.0+ before this adapter lands. The COMPATIBILITY.md in the eval-hub repo also needs updating — the only listed pair (0.1.0 + 0.1.0a8) no longer exists as published artifacts. Additional notes from the E2E run:
|
- run-e2e.sh: guard `evalhub providers list` with || echo to handle SDK 0.1.6 Pydantic crash on tags:null from built-in providers - run-e2e.sh: add EVALHUB_ALLOW_LOCALHOST to provider runtime Env so adapter pods can reach localhost-bound services during E2E - README.md: document minimum EvalHub server version 0.3.0 requirement (BYOF provider path fails on 0.2.0 shipped with RHOAI 3.4.0-ea) Co-authored-by: Cursor <cursoragent@cursor.com>
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@evals/evalhub_adapter/evaluations.py`:
- Around line 95-138: The loader currently treats a missing or falsy "queries"
field as an empty list (data.get("queries") or []), allowing invalid fixtures to
silently become zero-sample benchmarks; change the validation to require that
"queries" exists in the top-level mapping, is a list, and contains at least one
entry before iterating — e.g., check that "queries" in data,
isinstance(data["queries"], list), and len(data["queries"]) > 0, and if any
check fails raise a ValueError referencing the path; keep the subsequent
per-entry checks and still build the list of QuerySpec objects (QuerySpec,
queries, entry) as before.
In `@evals/evalhub_adapter/README.md`:
- Around line 139-143: The README is missing an export of OC_NAMESPACE before
using it to template provider-agentic.json, causing MLFLOW_WORKSPACE to be
empty; update the walkthrough to explicitly export OC_NAMESPACE (e.g., export
OC_NAMESPACE=<your-namespace>) prior to any templating steps and callouts that
populate MLFLOW_WORKSPACE, and ensure all other occurrences that reference
${OC_NAMESPACE} (including the provider-agentic.json templating and the later
walkthrough sections) are updated to rely on this exported variable so MLflow
receives a non-empty MLFLOW_WORKSPACE value.
In `@evals/evalhub_adapter/tests/run-e2e.sh`:
- Line 25: The current trap only removes WORK_DIR on EXIT and so if an error
occurs after provider registration the provider is leaked; add a dedicated
cleanup function (e.g., cleanup_provider) that deletes the registered provider
and ensure the trap calls that function on EXIT/ERR in addition to removing
WORK_DIR. Wire the trap replacement so it runs after the provider registration
step (replace or extend the existing trap 'rm -rf "${WORK_DIR}"' EXIT to call
cleanup_provider && rm -rf "${WORK_DIR}" or use trap 'on_exit' EXIT where
on_exit calls both provider deletion and rm -rf "${WORK_DIR}"), and make sure
the provider deletion logic references the same registration identifiers used
during registration so it always cleans up on failure paths.
- Around line 107-112: The get_route_contains function currently greps both NAME
and HOST columns which can match a hostname; update get_route_contains to only
match the NAME column by using awk to test $1 (route name) for the needle and
print $2 (host) when matched (e.g., replace the grep|head|awk pipeline with a
single awk that checks $1 ~ needle {print $2; exit}) so only route names are
considered; keep the existing behavior of returning nothing on no match.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 4cfb03a8-1b17-45da-ac98-fe80799a52c7
📒 Files selected for processing (6)
evals/evalhub_adapter/README.mdevals/evalhub_adapter/config.pyevals/evalhub_adapter/evaluations.pyevals/evalhub_adapter/tests/conftest.pyevals/evalhub_adapter/tests/run-e2e.shevals/harness/mlflow_client.py
✅ Files skipped from review due to trivial changes (1)
- evals/evalhub_adapter/config.py
- evaluations.py: fail fast when queries is missing/empty instead of silently producing a zero-sample benchmark - run-e2e.sh: trap-based cleanup deletes the provider on error paths; match only route names (not hosts) in get_route_contains - README.md: export OC_NAMESPACE before templating provider JSON Co-authored-by: Cursor <cursoragent@cursor.com>
Summary
Completes the EvalHub adapter migration (RHAIENG-4605) and validates it end-to-end against both implemented agents on the
agentic-mcpcluster.verify_connection()on startup, separatemlflow_trace_experiment_name, run-ID propagation to EvalHub results, richer run tagsrun-e2e.shscript that auto-discovers agent/EvalHub routes, builds/pushes the adapter image, registers the provider, submits jobs for both agents, polls for results, and cleans upinjection_payloads.yamlduplicates from agent dirs, removed unusedEXPOSE 8080from ContainerfileHow to test
The easiest way to validate is with the automated E2E script against both implemented agents (LangGraph
react_agentand vanilla Pythonopenai_responses_agent):The script handles image build, provider registration, job submission, polling, and cleanup. See the adapter README end-to-end walkthrough for the full manual flow and parameter reference.
Cluster credentials: login credentials for the
agentic-mcpcluster are stored in Bitwarden.Test plan
make testorpytest evals/evalhub_adapter/tests/test_adapter.py evals/evalhub_adapter/tests/test_config_and_evaluations.py)pytest evals/evalhub_adapter/tests/test_integration.py)run-e2e.shcompletes successfully against both agents onagentic-mcpevalhub eval resultsreturns non-nullmlflow_run_idMade with Cursor