feat(evalhub): EvalHub adapter — E2E validated, hardened, documented#82

Merged
andrewdonheiser merged 7 commits into
mainfrom
RHAIENG-4605/validate-evalhub-e2e
May 4, 2026

@andrewdonheiser
Contributor

Summary

Completes the EvalHub adapter migration (RHAIENG-4605) and validates it end-to-end against both implemented agents on the agentic-mcp cluster.

  • Adapter hardening: URL/param validation, MLflow connection verification, scorer exception handling, insecure-TLS gating, sanitized error messages
  • MLflow integration: verify_connection() on startup, separate mlflow_trace_experiment_name, run-ID propagation to EvalHub results, richer run tags
  • Comprehensive docs: full end-to-end walkthrough in the adapter README, new Adding an EvalHub Agent Integration guide
  • E2E automation: run-e2e.sh script that auto-discovers agent/EvalHub routes, builds/pushes the adapter image, registers the provider, submits jobs for both agents, polls for results, and cleans up
  • Test coverage: 50 unit tests + 11 integration tests covering the new validation, MLflow features, and error handling
  • Cleanup: removed stale injection_payloads.yaml duplicates from agent dirs, removed unused EXPOSE 8080 from Containerfile

How to test

The easiest way to validate is with the automated E2E script against both implemented agents (LangGraph react_agent and vanilla Python openai_responses_agent):

cd evals/evalhub_adapter/tests
REGISTRY_USER=<your-quay-user> OC_NAMESPACE=<your-namespace> ./run-e2e.sh

The script handles image build, provider registration, job submission, polling, and cleanup. See the adapter README end-to-end walkthrough for the full manual flow and parameter reference.

Cluster credentials: login credentials for the agentic-mcp cluster are stored in Bitwarden.

Test plan

  • Unit tests pass (make test or pytest evals/evalhub_adapter/tests/test_adapter.py evals/evalhub_adapter/tests/test_config_and_evaluations.py)
  • Integration tests pass (pytest evals/evalhub_adapter/tests/test_integration.py)
  • run-e2e.sh completes successfully against both agents on agentic-mcp
  • MLflow runs appear with correct metrics, tags, and run IDs
  • evalhub eval results returns non-null mlflow_run_id

Made with Cursor

@coderabbitai
Contributor

coderabbitai Bot commented Apr 28, 2026

Warning

Rate limit exceeded

@andrewdonheiser has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 13 minutes before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 30603b5e-abe9-40ee-bab5-948c146d850f

📥 Commits

Reviewing files that changed from the base of the PR and between 8426fe6 and 0a6d00a.

📒 Files selected for processing (4)
  • evals/evalhub_adapter/README.md
  • evals/evalhub_adapter/evaluations.py
  • evals/evalhub_adapter/tests/run-e2e.sh
  • evals/evalhub_adapter/tests/test_config_and_evaluations.py
📝 Walkthrough

Walkthrough

Adds an EvalHub agent evaluation adapter and harness: new adapter package, config/validation, benchmark registry and YAML fixtures, MLflow integration and client checks, container build and docs, comprehensive unit/integration tests plus an e2e script, and project metadata/gitignore updates to surface evals assets.

Changes

Agentic EvalHub adapter + harness

Layer / File(s) Summary
Benchmark data shape
agents/*/*/evalhub/tool_use.yaml, evals/evalhub_adapter/evaluations.py
New per-agent golden-query YAML fixtures and QuerySpec/BenchmarkDef dataclasses; benchmark registry includes "agentic-tool-use" and loader load_queries() with YAML validation.
Config & validation
evals/evalhub_adapter/config.py
Adds AgenticEvalParams dataclass, URL/host/TLS/fixtures validation (_validate_url), environment-gated allowances, and job_spec_to_task_config() mapping to internal TaskConfig.
Core adapter implementation
evals/evalhub_adapter/adapter.py, evals/evalhub_adapter/__init__.py
New AgenticEvalAdapter class with async/sync orchestration, SSE-driven task execution, scorer dispatch/aggregation, MLflow logging, phase callbacks, and a main() entry; package re-export added.
MLflow harness changes
evals/harness/mlflow_client.py
MLflowTraceClient.verify_connection() added; client lazy re-resolution/refetch of experiment id improved and targeted error logging added.
Container + docs
evals/evalhub_adapter/Containerfile, evals/evalhub_adapter/README.md, docs/adding-evalhub-agent-integration.md
Container build for adapter (copies fixtures, runtime-only deps, build-time fixture assertions); comprehensive README and integration guide describing JobSpec parameters, provider registration, MLflow integration, and build/push/register workflow.
Tests & test fixtures
evals/evalhub_adapter/tests/conftest.py, .../test_adapter.py, .../test_config_and_evaluations.py, .../test_integration.py
Adds pytest stubs for missing evalhub/mlflow, fixtures, and extensive unit + integration tests covering scorers, aggregation, config validation, YAML loading, SSE orchestration, MLflow logging, and adapter CLI behavior.
E2E automation
evals/evalhub_adapter/tests/run-e2e.sh
New Bash end-to-end script for OpenShift: preflight checks, build/push adapter image, register provider via EvalHub REST, submit eval runs, fetch results, surface MLflow run links, and cleanup.
Packaging & metadata
pyproject.toml, .gitignore, README.md
Adds evalhub optional dependency and pytest discovery/markers, expands evals/ documentation section, and updates .gitignore to exclude editor/evalhub artifacts.
Minor harness docstring
evals/harness/scorers/latency.py
Small docstring addition in LatencyTracker.__init__.

Sequence Diagram

sequenceDiagram
    participant CLI as EvalHub CLI
    participant Adapter as AgenticEvalAdapter
    participant Agent as Agent Service
    participant Scorer as Scoring subsystem
    participant MLflow as MLflow

    CLI->>Adapter: run_benchmark_job(config, callbacks)
    Adapter->>Adapter: Load benchmark & queries
    Adapter->>Adapter: Validate config (URLs, TLS, paths)
    loop For each query
        Adapter->>Agent: run_task(query, stream=True)
        Agent-->>Adapter: SSE stream (tool calls, content deltas)
        Adapter->>Adapter: Parse stream -> build trace, extract tool calls
        Adapter->>Scorer: _run_scorer(name, QuerySpec, trace)
        Scorer-->>Adapter: Score(value, passed, details)
        Adapter->>Adapter: Record per-query scores & metadata
    end
    Adapter->>Adapter: Aggregate scores (mean, pass rate, min/max)
    Adapter->>MLflow: _log_mlflow_run(experiment, metrics, artifacts)
    MLflow-->>Adapter: run_id
    Adapter->>Adapter: Enrich results with MLflow trace info
    Adapter-->>CLI: JobResults(metrics, run_id, status)
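The aggregation step in the diagram (mean, pass rate, min/max) can be sketched roughly as follows. This is a minimal illustration, not the adapter's code: the `Score` dataclass and the `aggregate` function are hypothetical names modeled on the `Score(value, passed, details)` message shown above.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Score:
    """Hypothetical per-query score, modeled on Score(value, passed, details)."""
    value: float
    passed: bool


def aggregate(scores: list[Score]) -> dict[str, float]:
    """Summarize per-query scores as mean, pass rate, min, and max."""
    if not scores:
        # An empty benchmark has no meaningful aggregate, so fail loudly.
        raise ValueError("cannot aggregate an empty score list")
    values = [s.value for s in scores]
    return {
        "mean": mean(values),
        "pass_rate": sum(s.passed for s in scores) / len(scores),
        "min": min(values),
        "max": max(values),
    }
```

Rejecting an empty list here is a design choice for the sketch; the real adapter may handle zero-query runs differently.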

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 77.24% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: completion of the EvalHub adapter with E2E validation, hardening, and documentation.
Description check ✅ Passed The description is directly related to the changeset, providing a detailed summary of the implementation, testing approach, and operational guidance for the EvalHub adapter feature.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@github-actions

Large PR detected (3576 lines changed)

This PR exceeds 1200 lines of code changes (excluding lock files, generated content, and images). Large PRs are harder to review thoroughly and are more likely to introduce bugs.

Consider splitting this PR into smaller, focused changes.

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@evals/evalhub_adapter/Containerfile`:
- Around line 27-31: The RUN pip install line installs unpinned runtime packages
(the RUN line installing "eval-hub-sdk[adapter]", httpx, mlflow, PyYAML), so pin
or bound their versions to ensure reproducible builds; update the RUN invocation
to install either exact versions (package==x.y.z) or bounded ranges
(package>=x.y,<x+1.y) for each of "eval-hub-sdk[adapter]", httpx, mlflow and
PyYAML, or reference a checked-in constraints.txt/requirements.txt file and use
pip install --no-cache -r constraints.txt to enforce deterministic dependency
resolution.

In `@evals/evalhub_adapter/tests/run-e2e.sh`:
- Around line 489-492: The current curl call that sets experiment_id omits
authentication flags and does not URL-encode the experiment name; update the
command that sets the experiment_id (the block that calls curl and assigns
experiment_id) to include the existing CURL_TLS_FLAG and to pass the
MLFLOW_TOKEN for auth (e.g., via -H "Authorization: Bearer ${MLFLOW_TOKEN}" or
similar), and URL-encode the experiment name when building the query parameter
so the lookup succeeds on secured clusters; also ensure the later URL generation
that falls back to the raw experiment name/experiment_id uses the URL-encoded
value rather than the unencoded variable.
- Around line 300-307: MLFLOW_AUTH_CHECK currently uses the low-privilege
"/api/3.0/mlflow/server-info" endpoint which can return non-200 for reasons
unrelated to auth; change the curl check to call a protected endpoint such as
"/api/2.0/mlflow/experiments/list" so the HTTP status reliably reflects
authorization (401/403 on token failure). Update the MLFLOW_AUTH_CHECK command
to hit "${MLFLOW_TRACKING_URI}/api/2.0/mlflow/experiments/list" with the same
Authorization header and error handling, and keep the existing conditional that
warns and prints the token refresh instructions when the status is not 200.
- Line 109: The JSONPath "contains" predicate is unsupported and returns empty
results; replace the failing jsonpath expression used to populate EVALHUB_ROUTE
with a command that lists route names and hosts (e.g., oc get route -n
"$OC_NAMESPACE" -o custom-columns=NAME:.metadata.name,HOST:.spec.host
--no-headers) and then grep/awk to pick the host for the route whose NAME
contains "eval"; do the same replacement for the commands that set REACT_ROUTE
and OPENAI_ROUTE so all three use the custom-columns + grep/awk pattern instead
of the jsonpath contains filter.

In `@evals/harness/mlflow_client.py`:
- Around line 92-105: The log messages around MLflow auth handling use two
different environment variable names (MLFLOW_TRACKING_TOKEN vs MLFLOW_TOKEN),
causing confusion; update the guidance in the JSONDecodeError/non-JSON branch
(the logger.error message that currently says "export MLFLOW_TOKEN=$(oc whoami
-t)...") to reference the same name used earlier (MLFLOW_TRACKING_TOKEN) so both
messages consistently instruct users to export MLFLOW_TRACKING_TOKEN=$(oc whoami
-t) and re-register the provider; verify both logger.error occurrences mention
the identical env var string.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 4487de3d-4e4b-4169-b08f-d8427c04b5e4

📥 Commits

Reviewing files that changed from the base of the PR and between 3893a2b and ceabb41.

📒 Files selected for processing (25)
  • .gitignore
  • README.md
  • agents/langgraph/react_agent/evalhub/tool_use.yaml
  • agents/vanilla_python/openai_responses_agent/evalhub/tool_use.yaml
  • docs/adding-evalhub-agent-integration.md
  • evals/evalhub_adapter/Containerfile
  • evals/evalhub_adapter/README.md
  • evals/evalhub_adapter/__init__.py
  • evals/evalhub_adapter/adapter.py
  • evals/evalhub_adapter/config.py
  • evals/evalhub_adapter/evaluations.py
  • evals/evalhub_adapter/tests/conftest.py
  • evals/evalhub_adapter/tests/run-e2e.sh
  • evals/evalhub_adapter/tests/test_adapter.py
  • evals/evalhub_adapter/tests/test_config_and_evaluations.py
  • evals/evalhub_adapter/tests/test_integration.py
  • evals/harness/__init__.py
  • evals/harness/mlflow_client.py
  • evals/harness/runner.py
  • evals/harness/scorers/__init__.py
  • evals/harness/scorers/latency.py
  • evals/harness/scorers/plan_coherence.py
  • evals/harness/scorers/safety.py
  • evals/harness/scorers/tool_sequence.py
  • pyproject.toml

Comment thread evals/evalhub_adapter/Containerfile Outdated
Comment thread evals/evalhub_adapter/tests/run-e2e.sh Outdated
Comment thread evals/evalhub_adapter/tests/run-e2e.sh
Comment thread evals/evalhub_adapter/tests/run-e2e.sh Outdated
Comment thread evals/harness/mlflow_client.py
@andrewdonheiser
Contributor Author

All 5 CodeRabbit findings addressed in dd91971:

  1. Containerfile deps pinned — bounded version ranges for eval-hub-sdk, httpx, mlflow, PyYAML
  2. JSONPath contains replaced — new get_route_contains() helper using oc get route -o custom-columns + grep/awk for all three route discovery fallbacks
  3. MLflow token validation endpoint — switched to protected /api/2.0/mlflow/experiments/list with separate 401/403 vs other error handling
  4. Experiment lookup hardened — added CURL_TLS_FLAG, Authorization: Bearer header, and urllib.parse.quote() URL-encoding
  5. Env var consistency — fixed MLFLOW_TOKEN → MLFLOW_TRACKING_TOKEN in mlflow_client.py error guidance

Also added missing docstrings (__post_init__, __init__ methods) toward 80% coverage.

Note: The CodeQL "Analyze (actions)" failure is a pre-existing infrastructure issue (CodeQL cannot process GitHub Actions files in this repo) — not related to this PR's changes.
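For readers unfamiliar with the route-discovery change in item 2, the name-contains match over `custom-columns` output can be illustrated in Python. This is not the `get_route_contains()` shell helper from the PR; `route_host_containing` is a hypothetical equivalent that assumes the two-column `NAME HOST` output produced with `--no-headers`.

```python
from typing import Optional


def route_host_containing(oc_output: str, needle: str) -> Optional[str]:
    """Return the host of the first route whose NAME column contains `needle`.

    Expects lines of the form "<name> <host>", as printed by
    `oc get route -o custom-columns=NAME:.metadata.name,HOST:.spec.host --no-headers`.
    """
    for line in oc_output.splitlines():
        columns = line.split()
        # Skip blank or malformed lines instead of raising.
        if len(columns) >= 2 and needle in columns[0]:
            return columns[1]
    return None
```

Returning `None` when nothing matches lets the caller decide whether to fall back to another discovery method or abort.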

@andrewdonheiser andrewdonheiser force-pushed the RHAIENG-4605/validate-evalhub-e2e branch from dd91971 to d5e9849 Compare April 29, 2026 17:08
Add the EvalHub on-cluster adapter (evals/evalhub_adapter/) and shared
eval harness (evals/harness/). Includes Containerfile, unit/integration
tests, agent-specific eval fixtures, and walkthrough documentation.

Made-with: Cursor
@andrewdonheiser andrewdonheiser force-pushed the RHAIENG-4605/validate-evalhub-e2e branch from d5e9849 to 6ce2a58 Compare April 29, 2026 17:59
Contributor

@sanafayyaz315 sanafayyaz315 left a comment

Looks good overall. Added a few comments

Comment thread evals/evalhub_adapter/config.py
Comment thread evals/evalhub_adapter/Containerfile
@kami619
Contributor

kami619 commented Apr 30, 2026

Test contract mismatch for .[test] installs: evals/evalhub_adapter/README.md says uv pip install .[test] is enough for adapter tests, but two unit tests patch top-level mlflow symbols and fail when mlflow is not installed.

Claude reproduced this issue in a clean env:

UV_PROJECT_ENVIRONMENT=/tmp/uv-pr82-test uv sync --extra test
UV_PROJECT_ENVIRONMENT=/tmp/uv-pr82-test uv run --extra test pytest evals/evalhub_adapter/tests -m unit -q
Result: 2 failed, 60 passed (ModuleNotFoundError: No module named 'mlflow')

test_adapter.py Lines 353-360

class TestLogMlflowRun:
    """Tests for MLflow run logging and run_id propagation."""
    @patch("mlflow.log_metric")
    @patch("mlflow.log_param")
    @patch("mlflow.start_run")
    @patch("mlflow.set_experiment")
    @patch("mlflow.set_tracking_uri")

README.md Lines 479-482

Tests stub `evalhub` imports if the package isn't installed (bootstrap
lives in `conftest.py` so it runs before any test module, regardless of
which files are selected). `uv pip install .[test]` alone is sufficient.

@mpk-droid
Contributor

mpk-droid commented Apr 30, 2026

Just learning about the adapter. You might have already looked into this. @andrewdonheiser do you think the adapter would be better added to this repo(https://github.com/eval-hub/eval-hub-contrib)?

@kami619 @sanafayyaz315 are you guys aware of this? Maybe we should ask the suggestion from one of the contributors of that repo?

My reasoning is that, this adapter seems like it could benefit any agent and a larger audience than just agentic-starter-kit users?

@kami619
Contributor

kami619 commented Apr 30, 2026

Just learning about the adapter. You might have already looked into this. @andrewdonheiser do you think the adapter would be better added to this repo(https://github.com/eval-hub/eval-hub-contrib)?

@kami619 @sanafayyaz315 are you guys aware of this? Maybe we should ask the suggestion from one of the contributors of that repo?

My reasoning is that, this adapter seems like it could benefit any agent and a larger audience than just agentic-starter-kit users?

From my point of view, it is fine to land it here, as we need this for QG7: Behavioral Evals. Once this matures enough, we can contribute it to the evalhub space directly.

@mpk-droid
Contributor

Just learning about the adapter. You might have already looked into this. @andrewdonheiser do you think the adapter would be better added to this repo(https://github.com/eval-hub/eval-hub-contrib)?
@kami619 @sanafayyaz315 are you guys aware of this? Maybe we should ask the suggestion from one of the contributors of that repo?
My reasoning is that, this adapter seems like it could benefit any agent and a larger audience than just agentic-starter-kit users?

From my point of view, it is fine to land it here, as we need this for QG7: Behavioral Evals. Once this matures enough, we can contribute it to the evalhub space directly.

That's a fair point. We will need them for QG7. In the future, we will need to find a way to keep these in a centralized place and import them for testing.

@mpk-droid
Contributor

Large PR detected (3576 lines changed)

This PR exceeds 1200 lines of code changes (excluding lock files, generated content, and images). Large PRs are harder to review thoroughly and are more likely to introduce bugs.

Consider splitting this PR into smaller, focused changes.

I feel like either we will have to adjust the PR size rule or follow the suggestion. What do you guys think? In general, I am on the smaller PR team. Curious to hear everyone else's thoughts.

@kami619
Contributor

kami619 commented Apr 30, 2026

Large PR detected (3576 lines changed)
This PR exceeds 1200 lines of code changes (excluding lock files, generated content, and images). Large PRs are harder to review thoroughly and are more likely to introduce bugs.
Consider splitting this PR into smaller, focused changes.

I feel like either we will have to adjust the PR size rule or follow the suggestion. What do you guys think? In general, I am on the smaller PR team. Curious to hear everyone else's thoughts.

100% on the small PR team, but I think with these initial bits being added, we would have to live through the pain a bit, otherwise we might succumb to the Agile papercuts. But we need to make a conscious effort once we have scaffolding in place, to ensure the atomic PRs become the norm.

@sanafayyaz315
Contributor

Just learning about the adapter. You might have already looked into this. @andrewdonheiser do you think the adapter would be better added to this repo(https://github.com/eval-hub/eval-hub-contrib)?

@kami619 @sanafayyaz315 are you guys aware of this? Maybe we should ask the suggestion from one of the contributors of that repo?

My reasoning is that, this adapter seems like it could benefit any agent and a larger audience than just agentic-starter-kit users?

@mpk-droid I had the same thought of contributing it to the eval-hub-contrib repo. This adapter is agent-agnostic — the runner, scorers, and MLflow integration work with any agent that exposes /chat/completions. The only agent-specific piece is the fixture YAML files (golden queries and expected tools).

That said, I think we should first validate it across the agents we have in agentic-starter-kits-v2 (e.g., LangGraph, OpenAI Responses) to prove it handles different agents reliably, and then contribute it to the contrib repo.

@andrewdonheiser
Contributor Author

andrewdonheiser commented May 1, 2026

Large PR detected (3576 lines changed)
This PR exceeds 1200 lines of code changes (excluding lock files, generated content, and images). Large PRs are harder to review thoroughly and are more likely to introduce bugs.
Consider splitting this PR into smaller, focused changes.

I feel like either we will have to adjust the PR size rule or follow the suggestion. What do you guys think? In general, I am on the smaller PR team. Curious to hear everyone else's thoughts.

The tests and documentation account for most of the size. The adapter code itself is stripped down. I don't think splitting it up makes sense.

andrewdonheiser and others added 2 commits May 1, 2026 10:39
Addresses PR review feedback: _validate_url now permits localhost hosts
(localhost, 127.0.0.1, ::1, 0.0.0.0) when EVALHUB_ALLOW_LOCALHOST=true
is set, following the same gating pattern as EVALHUB_ALLOW_INSECURE_TLS.
Cloud metadata endpoints remain blocked regardless. Adds TODO for future
auto-discovery of agent fixture dirs in the Containerfile.

Co-authored-by: Cursor <cursoragent@cursor.com>
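The gating described in the commit message above can be sketched as follows. This is an illustrative approximation, not the PR's `_validate_url`: the function name `validate_url`, the error messages, and the `METADATA_HOSTS` blocklist are assumptions (the PR does not list which metadata hosts it blocks). Only the localhost host list and the `EVALHUB_ALLOW_LOCALHOST` variable come from the commit message.

```python
import os
from urllib.parse import urlparse

# Hosts named in the commit message.
LOCALHOST_HOSTS = {"localhost", "127.0.0.1", "::1", "0.0.0.0"}
# Assumed blocklist for illustration only.
METADATA_HOSTS = {"169.254.169.254", "metadata.google.internal"}


def validate_url(url: str) -> str:
    """Return `url` if it passes the checks, otherwise raise ValueError."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError("URL scheme must be http or https")
    host = (parsed.hostname or "").lower()
    if not host:
        raise ValueError("URL must include a host")
    # Metadata endpoints stay blocked regardless of any environment flag.
    if host in METADATA_HOSTS:
        raise ValueError("cloud metadata endpoints are not allowed")
    allow_localhost = os.environ.get("EVALHUB_ALLOW_LOCALHOST", "").lower() == "true"
    if host in LOCALHOST_HOSTS and not allow_localhost:
        raise ValueError("localhost URLs require EVALHUB_ALLOW_LOCALHOST=true")
    return url
```

Checking the metadata blocklist before the localhost gate keeps the opt-in flag from ever widening access to those endpoints.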
Two unit tests in test_adapter.py patch top-level mlflow symbols
(@patch("mlflow.log_metric"), etc.) which requires mlflow to be in
sys.modules at decoration time. Since mlflow lives in the test-mlflow
extra, not test, these tests fail in a clean .[test]-only env. Add an
mlflow stub in conftest.py following the same pattern used for evalhub.

Co-authored-by: Cursor <cursoragent@cursor.com>
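A stub along the lines described in this commit might look like the sketch below. It is an assumption about the shape of the fix, not the actual `conftest.py` contents: `install_mlflow_stub` is a hypothetical name, and the stubbed attribute list simply mirrors the five symbols patched in `TestLogMlflowRun`.

```python
import importlib.util
import sys
import types
from unittest.mock import MagicMock


def install_mlflow_stub() -> None:
    """Register a placeholder `mlflow` module when the real one is unavailable."""
    if "mlflow" in sys.modules:
        return  # Already imported (real or stubbed); nothing to do.
    if importlib.util.find_spec("mlflow") is not None:
        return  # Real package is installed; let normal imports handle it.
    stub = types.ModuleType("mlflow")
    # Only the attributes the tests patch need to exist on the stub.
    for name in ("set_tracking_uri", "set_experiment", "start_run",
                 "log_param", "log_metric"):
        setattr(stub, name, MagicMock(name=f"mlflow.{name}"))
    sys.modules["mlflow"] = stub


# conftest.py is imported before test modules, so the stub is in place
# by the time @patch("mlflow.log_metric") resolves its target.
install_mlflow_stub()
```

Because the stub is installed only when `mlflow` cannot be found, environments with the `test-mlflow` extra still exercise the real package.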
@andrewdonheiser
Contributor Author

@kami619 Good catch — fixed in 373b0af. Added an mlflow stub in conftest.py following the same pattern used for the evalhub stub, so @patch("mlflow.log_metric") etc. resolve even when only .[test] is installed (without test-mlflow). All 68 unit tests pass in a .[test]-only environment now.

Co-authored-by: Cursor <cursoragent@cursor.com>
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@evals/evalhub_adapter/evaluations.py`:
- Around line 101-115: The code accepts expected_tools and expected_elements
without validating their types, which can let scalars or mixed lists slip
through; in the loop that builds QuerySpec (the block creating QuerySpec with
query=entry["query"], expected_tools=entry.get("expected_tools", []),
expected_elements=entry.get("expected_elements", [])), validate that both
entry.get("expected_tools") and entry.get("expected_elements") are either
missing/None or lists, and that every item in those lists is a string (or raise
a ValueError including the path and the query index i); if you prefer
strictness, coerce missing values to [] before constructing QuerySpec and reject
non-list or non-string items with a clear error referencing QuerySpec creation
and the offending field name.

In `@evals/evalhub_adapter/README.md`:
- Around line 146-187: The README templating step for provider-agentic.json
interpolates ${MLFLOW_TOKEN} into the runtime Env MLFLOW_TRACKING_TOKEN but
never exports MLFLOW_TOKEN beforehand; export MLFLOW_TOKEN (e.g., set
MLFLOW_TOKEN="$(oc whoami -t)" or equivalent) before creating
provider-agentic.json so the "MLFLOW_TRACKING_TOKEN" Env in
provider-agentic.json contains a valid token rather than an empty string, then
re-run the provider registration step that uses provider-agentic.json.

In `@evals/harness/mlflow_client.py`:
- Around line 71-86: verify_connection() calls self._get_client() but
_get_client() only resolves the experiment_id during initial client creation
(when self._client is None), so once _experiment_id is None it stays missing and
get_latest_trace() can never re-resolve it; modify _get_client() so the
experiment lookup/assignment (setting self._experiment_id using
self.experiment_name) is performed unconditionally (i.e., move the experiment
resolution outside the "if self._client is None" guard) or add a separate method
called from _get_client() to always re-check and set _experiment_id; ensure
verify_connection(), get_latest_trace(), and any callers rely on the updated
_experiment_id after subsequent _get_client() calls.
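The `mlflow_client.py` finding above amounts to moving the experiment lookup out of the one-time client creation guard. A minimal sketch of that pattern follows; `LazyExperimentClient` is a hypothetical stand-in for `MLflowTraceClient`, and the injected `client_factory` is an assumption made so the example runs without MLflow installed. The `get_experiment_by_name` call matches the method of the same name on MLflow's `MlflowClient`.

```python
from typing import Any, Callable, Optional


class LazyExperimentClient:
    """Hypothetical stand-in showing lazy client creation with id re-resolution."""

    def __init__(self, experiment_name: str, client_factory: Callable[[], Any]) -> None:
        self.experiment_name = experiment_name
        self._client_factory = client_factory
        self._client: Optional[Any] = None
        self._experiment_id: Optional[str] = None

    def _get_client(self) -> Any:
        if self._client is None:
            # The client itself is still created only once.
            self._client = self._client_factory()
        # Outside the guard: retry the lookup on every call until it succeeds,
        # so an experiment created after the first call is still picked up.
        if self._experiment_id is None:
            experiment = self._client.get_experiment_by_name(self.experiment_name)
            if experiment is not None:
                self._experiment_id = experiment.experiment_id
        return self._client
```

Once the id is resolved, later calls skip the lookup, so the extra request is only paid while the experiment does not yet exist.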
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: e68680d4-10ac-4c42-adb8-28d45b330f5a

📥 Commits

Reviewing files that changed from the base of the PR and between d5e9849 and c0e6cb5.

📒 Files selected for processing (25)
  • .gitignore
  • README.md
  • agents/langgraph/react_agent/evalhub/tool_use.yaml
  • agents/vanilla_python/openai_responses_agent/evalhub/tool_use.yaml
  • docs/adding-evalhub-agent-integration.md
  • evals/evalhub_adapter/Containerfile
  • evals/evalhub_adapter/README.md
  • evals/evalhub_adapter/__init__.py
  • evals/evalhub_adapter/adapter.py
  • evals/evalhub_adapter/config.py
  • evals/evalhub_adapter/evaluations.py
  • evals/evalhub_adapter/tests/conftest.py
  • evals/evalhub_adapter/tests/run-e2e.sh
  • evals/evalhub_adapter/tests/test_adapter.py
  • evals/evalhub_adapter/tests/test_config_and_evaluations.py
  • evals/evalhub_adapter/tests/test_integration.py
  • evals/harness/__init__.py
  • evals/harness/mlflow_client.py
  • evals/harness/runner.py
  • evals/harness/scorers/__init__.py
  • evals/harness/scorers/latency.py
  • evals/harness/scorers/plan_coherence.py
  • evals/harness/scorers/safety.py
  • evals/harness/scorers/tool_sequence.py
  • pyproject.toml
✅ Files skipped from review due to trivial changes (6)
  • evals/harness/scorers/latency.py
  • README.md
  • evals/evalhub_adapter/__init__.py
  • agents/langgraph/react_agent/evalhub/tool_use.yaml
  • .gitignore
  • evals/evalhub_adapter/Containerfile
🚧 Files skipped from review as they are similar to previous changes (2)
  • agents/vanilla_python/openai_responses_agent/evalhub/tool_use.yaml
  • pyproject.toml

Comment thread evals/evalhub_adapter/evaluations.py Outdated
Comment thread evals/evalhub_adapter/README.md Outdated
Comment thread evals/harness/mlflow_client.py
- evaluations.py: validate query/expected_tools/expected_elements types
  in load_queries to fail fast on malformed fixtures
- mlflow_client.py: re-resolve experiment_id on subsequent _get_client()
  calls so trace enrichment works when the experiment is created after
  verify_connection()
- README.md: add MLFLOW_TOKEN export before provider JSON template;
  add language tag to fenced code block (MD040)

Co-authored-by: Cursor <cursoragent@cursor.com>
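The fail-fast fixture validation mentioned for `evaluations.py` in this commit could be approximated as below. This is a sketch, not the PR's `load_queries`: `validate_query_entry` is a hypothetical helper, and the exact error wording is an assumption. It follows the review suggestion of treating missing fields as empty lists and including the path and query index in the error.

```python
from typing import Any


def validate_query_entry(entry: Any, index: int, path: str) -> dict[str, Any]:
    """Check one fixture entry and return it with list fields normalized."""
    if not isinstance(entry, dict):
        raise ValueError(f"{path}: queries[{index}] must be a mapping")
    query = entry.get("query")
    if not isinstance(query, str) or not query.strip():
        raise ValueError(f"{path}: queries[{index}].query must be a non-empty string")
    normalized: dict[str, Any] = {"query": query}
    for field in ("expected_tools", "expected_elements"):
        value = entry.get(field)
        if value is None:
            value = []  # Missing or null fields are treated as empty lists.
        # Reject scalars and mixed lists so scorers never see unexpected types.
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            raise ValueError(
                f"{path}: queries[{index}].{field} must be a list of strings"
            )
        normalized[field] = value
    return normalized
```

A loader would call this once per entry before constructing each `QuerySpec`, so a malformed fixture stops the run before any agent is queried.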
Comment thread evals/evalhub_adapter/tests/run-e2e.sh Outdated
@kami619
Contributor

kami619 commented May 1, 2026

E2E validated on RHOAI 3.4.0-ea.2 cluster — works end-to-end, but surfaced a server/SDK version mismatch that will affect users.

  • The E2E script ran successfully against both agents (react + openai-responses) with eval scores returned. However, the operator-shipped EvalHub image (odh-eval-hub-rhel9@sha256:c27bfe01...) is at server version 0.2.0 (reports 0.0.1 due to missing BUILD_NUMBER in the Red Hat build). This image is incompatible with the SDK >=0.1.4 that the adapter Containerfile and pyproject.toml specify.

  • What breaks: REST-registered providers (the BYOF path this adapter uses) fail at job execution time with provider not found. The server accepts the registration but the k8s runtime can't resolve it. This means the entire custom adapter flow doesn't work on the shipped image — only pre-installed ConfigMap providers (lm-eval-harness, Garak, etc.) function.

  • Workaround we used: Replaced the server image with quay.io/evalhub/evalhub:0.3.0 (scaled down the TrustyAI operator to prevent reconciliation). With 0.3.0, both jobs completed successfully.

Suggestion: Either document the minimum server version requirement (0.3.0) in the runbook/README, or coordinate with the EvalHub team to ensure RHOAI ships 0.3.0+ before this adapter lands. The COMPATIBILITY.md in the eval-hub repo also needs updating — the only listed pair (0.1.0 + 0.1.0a8) no longer exists as published artifacts.

Additional notes from the E2E run:

  • `mlflow_run_id` returned `null`; MLflow auth from adapter pods needs investigation (a token was passed, but MLflow's `/mlflow` subpath may need explicit configuration in the adapter)
  • `evalhub providers list` crashes on SDK 0.1.6 when built-in providers return `tags: null` (Pydantic validation error); fixed by the `|| true` guard in the script
  • The sidecar image config field changed between server versions (`service.eval_sidecar_image` in 0.2.0 vs `sidecar.sidecar_container.image` in 0.3.0), which is another reason to align on 0.3.0
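
One way to act on the minimum-version suggestion above is a fail-fast check before any job is submitted. The sketch below is an illustration only: `parse_version`, `server_is_supported`, and `MIN_SERVER_VERSION` are hypothetical names, and how the adapter would obtain the server's reported version string is not covered by this thread.

```python
import re

# Minimum EvalHub server version, taken from the E2E findings above.
MIN_SERVER_VERSION = (0, 3, 0)


def parse_version(text: str) -> tuple[int, int, int]:
    """Parse the leading MAJOR.MINOR.PATCH of a version string."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def server_is_supported(reported: str) -> bool:
    """Compare numerically, so that 0.10.0 ranks above 0.3.0."""
    return parse_version(reported) >= MIN_SERVER_VERSION
```

Because the operator-shipped image reports 0.0.1, a check based on the reported version would reject it. That is the safe outcome here, since that image also fails on the BYOF provider path.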

Comment thread evals/evalhub_adapter/tests/run-e2e.sh Outdated
- run-e2e.sh: guard `evalhub providers list` with || echo to handle
  SDK 0.1.6 Pydantic crash on tags:null from built-in providers
- run-e2e.sh: add EVALHUB_ALLOW_LOCALHOST to provider runtime Env so
  adapter pods can reach localhost-bound services during E2E
- README.md: document minimum EvalHub server version 0.3.0 requirement
  (BYOF provider path fails on 0.2.0 shipped with RHOAI 3.4.0-ea)

Co-authored-by: Cursor <cursoragent@cursor.com>
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@evals/evalhub_adapter/evaluations.py`:
- Around line 95-138: The loader currently treats a missing or falsy "queries"
field as an empty list (data.get("queries") or []), allowing invalid fixtures to
silently become zero-sample benchmarks; change the validation to require that
"queries" exists in the top-level mapping, is a list, and contains at least one
entry before iterating — e.g., check that "queries" in data,
isinstance(data["queries"], list), and len(data["queries"]) > 0, and if any
check fails raise a ValueError referencing the path; keep the subsequent
per-entry checks and still build the list of QuerySpec objects (QuerySpec,
queries, entry) as before.

In `@evals/evalhub_adapter/README.md`:
- Around line 139-143: The README is missing an export of OC_NAMESPACE before
using it to template provider-agentic.json, causing MLFLOW_WORKSPACE to be
empty; update the walkthrough to explicitly export OC_NAMESPACE (e.g., export
OC_NAMESPACE=<your-namespace>) prior to any templating steps and callouts that
populate MLFLOW_WORKSPACE, and ensure all other occurrences that reference
${OC_NAMESPACE} (including the provider-agentic.json templating and the later
walkthrough sections) are updated to rely on this exported variable so MLflow
receives a non-empty MLFLOW_WORKSPACE value.

In `@evals/evalhub_adapter/tests/run-e2e.sh`:
- Line 25: The current trap only removes WORK_DIR on EXIT and so if an error
occurs after provider registration the provider is leaked; add a dedicated
cleanup function (e.g., cleanup_provider) that deletes the registered provider
and ensure the trap calls that function on EXIT/ERR in addition to removing
WORK_DIR. Wire the trap replacement so it runs after the provider registration
step (replace or extend the existing trap 'rm -rf "${WORK_DIR}"' EXIT to call
cleanup_provider && rm -rf "${WORK_DIR}" or use trap 'on_exit' EXIT where
on_exit calls both provider deletion and rm -rf "${WORK_DIR}"), and make sure
the provider deletion logic references the same registration identifiers used
during registration so it always cleans up on failure paths.
- Around line 107-112: The get_route_contains function currently greps both NAME
and HOST columns which can match a hostname; update get_route_contains to only
match the NAME column by using awk to test $1 (route name) for the needle and
print $2 (host) when matched (e.g., replace the grep|head|awk pipeline with a
single awk that checks $1 ~ needle {print $2; exit}) so only route names are
considered; keep the existing behavior of returning nothing on no match.
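
The `evaluations.py` finding above (require a present, non-empty `queries` list, then validate each entry) could look roughly like the following. This is a sketch under assumptions, not the loader in this PR: it works on an already-parsed mapping instead of reading the fixture file, `parse_queries` and `_string_list` are hypothetical helpers, and the `QuerySpec` shown here only carries the three fields named in the commit messages.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class QuerySpec:
    # Hypothetical shape; the real QuerySpec may carry more fields.
    query: str
    expected_tools: list[str] = field(default_factory=list)
    expected_elements: list[str] = field(default_factory=list)


def _string_list(entry: dict, key: str, where: str) -> list[str]:
    """Return entry[key] (default: empty list) after checking its type."""
    value = entry.get(key, [])
    if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
        raise ValueError(f"{where}: '{key}' must be a list of strings")
    return value


def parse_queries(data: Any, path: str) -> list[QuerySpec]:
    """Validate parsed fixture data; `path` is used only in error messages."""
    if not isinstance(data, dict) or "queries" not in data:
        raise ValueError(f"{path}: missing top-level 'queries' key")
    raw = data["queries"]
    if not isinstance(raw, list) or len(raw) == 0:
        raise ValueError(f"{path}: 'queries' must be a non-empty list")

    specs: list[QuerySpec] = []
    for index, entry in enumerate(raw):
        where = f"{path}: queries[{index}]"
        if not isinstance(entry, dict):
            raise ValueError(f"{where}: entry must be a mapping")
        query = entry.get("query")
        if not isinstance(query, str) or not query.strip():
            raise ValueError(f"{where}: 'query' must be a non-empty string")
        specs.append(
            QuerySpec(
                query=query,
                expected_tools=_string_list(entry, "expected_tools", where),
                expected_elements=_string_list(entry, "expected_elements", where),
            )
        )
    return specs
```

Unlike `data.get("queries") or []`, this raises on a missing key, a `null` value, and an empty list alike, so a malformed fixture cannot become a zero-sample benchmark.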

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 4cfb03a8-1b17-45da-ac98-fe80799a52c7

📥 Commits

Reviewing files that changed from the base of the PR and between c0e6cb5 and 8426fe6.

📒 Files selected for processing (6)
  • evals/evalhub_adapter/README.md
  • evals/evalhub_adapter/config.py
  • evals/evalhub_adapter/evaluations.py
  • evals/evalhub_adapter/tests/conftest.py
  • evals/evalhub_adapter/tests/run-e2e.sh
  • evals/harness/mlflow_client.py
✅ Files skipped from review due to trivial changes (1)
  • evals/evalhub_adapter/config.py

Comment thread evals/evalhub_adapter/evaluations.py
Comment thread evals/evalhub_adapter/README.md
Comment thread evals/evalhub_adapter/tests/run-e2e.sh Outdated
Comment thread evals/evalhub_adapter/tests/run-e2e.sh
- evaluations.py: fail fast when queries is missing/empty instead of
  silently producing a zero-sample benchmark
- run-e2e.sh: trap-based cleanup deletes the provider on error paths;
  match only route names (not hosts) in get_route_contains
- README.md: export OC_NAMESPACE before templating provider JSON

Co-authored-by: Cursor <cursoragent@cursor.com>
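
The route-matching fix listed above is a shell change in `run-e2e.sh`; the same idea is rendered in Python below purely as an illustration. `route_host_by_name` is a hypothetical function, the input is assumed to be header-less `NAME HOST ...` rows such as `oc get routes --no-headers` would print, the hostnames are made up, and a substring test stands in for the regex match that the suggested awk expression would perform.

```python
from typing import Optional


def route_host_by_name(routes_output: str, needle: str) -> Optional[str]:
    """Return the host of the first route whose NAME contains `needle`."""
    for line in routes_output.splitlines():
        fields = line.split()
        # Test column 1 (route name) only; column 2 (host) is the result.
        if len(fields) >= 2 and needle in fields[0]:
            return fields[1]
    return None


# Made-up sample: the mlflow route's *host* also contains "evalhub".
sample = """\
react-agent   react-agent-demo.apps.example.com
mlflow        mlflow-evalhub.apps.example.com
evalhub       evalhub-demo.apps.example.com
"""

host = route_host_by_name(sample, "evalhub")
```

A whole-line match would stop at the `mlflow` row in this sample because its hostname contains the needle; restricting the test to the name column returns the `evalhub` route instead, and a needle that matches no name yields `None`.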
@andrewdonheiser andrewdonheiser merged commit cb7a4be into main May 4, 2026
6 checks passed
@andrewdonheiser andrewdonheiser deleted the RHAIENG-4605/validate-evalhub-e2e branch May 4, 2026 14:01
4 participants