Commit cb7a4be

feat(evalhub): EvalHub adapter — E2E validated, hardened, documented (#82)
* feat(evalhub): EvalHub adapter — E2E validated, hardened, documented

  Add the EvalHub on-cluster adapter (evals/evalhub_adapter/) and shared eval harness (evals/harness/). Includes Containerfile, unit/integration tests, agent-specific eval fixtures, and walkthrough documentation.

  Made-with: Cursor

* fix(evalhub): allow localhost via EVALHUB_ALLOW_LOCALHOST for local dev

  Addresses PR review feedback: _validate_url now permits localhost hosts (localhost, 127.0.0.1, ::1, 0.0.0.0) when EVALHUB_ALLOW_LOCALHOST=true is set, following the same gating pattern as EVALHUB_ALLOW_INSECURE_TLS. Cloud metadata endpoints remain blocked regardless. Adds TODO for future auto-discovery of agent fixture dirs in the Containerfile.

  Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(evalhub): stub mlflow in conftest so .[test] is sufficient

  Two unit tests in test_adapter.py patch top-level mlflow symbols (@patch("mlflow.log_metric"), etc.) which requires mlflow to be in sys.modules at decoration time. Since mlflow lives in the test-mlflow extra, not test, these tests fail in a clean .[test]-only env. Add an mlflow stub in conftest.py following the same pattern used for evalhub.

  Co-authored-by: Cursor <cursoragent@cursor.com>

* style: ruff format config.py line length

  Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(evalhub): address CodeRabbit review and markdownlint failure

  - evaluations.py: validate query/expected_tools/expected_elements types in load_queries to fail fast on malformed fixtures
  - mlflow_client.py: re-resolve experiment_id on subsequent _get_client() calls so trace enrichment works when the experiment is created after verify_connection()
  - README.md: add MLFLOW_TOKEN export before provider JSON template; add language tag to fenced code block (MD040)

  Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(evalhub): address Kamesh E2E review feedback

  - run-e2e.sh: guard `evalhub providers list` with || echo to handle SDK 0.1.6 Pydantic crash on tags:null from built-in providers
  - run-e2e.sh: add EVALHUB_ALLOW_LOCALHOST to provider runtime Env so adapter pods can reach localhost-bound services during E2E
  - README.md: document minimum EvalHub server version 0.3.0 requirement (BYOF provider path fails on 0.2.0 shipped with RHOAI 3.4.0-ea)

  Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(evalhub): address CodeRabbit review round 3

  - evaluations.py: fail fast when queries is missing/empty instead of silently producing a zero-sample benchmark
  - run-e2e.sh: trap-based cleanup deletes the provider on error paths; match only route names (not hosts) in get_route_contains
  - README.md: export OC_NAMESPACE before templating provider JSON

  Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
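
The localhost gating described in the second fix can be illustrated with a small sketch. This is not the adapter's actual `_validate_url`; the function name `validate_url_sketch`, the `env` parameter, and the contents of `METADATA_HOSTS` are assumptions made for illustration. Only the four localhost hosts and the `EVALHUB_ALLOW_LOCALHOST=true` switch come from the commit message.

```python
import os
from urllib.parse import urlparse

# Hosts named in the commit message as gated behind EVALHUB_ALLOW_LOCALHOST.
LOCALHOST_HOSTS = {"localhost", "127.0.0.1", "::1", "0.0.0.0"}
# Assumption: a typical cloud metadata blocklist; the real list may differ.
METADATA_HOSTS = {"169.254.169.254", "metadata.google.internal"}


def validate_url_sketch(url, env=None):
    """Return url unchanged if it passes the checks, else raise ValueError."""
    env = os.environ if env is None else env
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme: {parsed.scheme!r}")
    host = (parsed.hostname or "").lower()
    if not host:
        raise ValueError("URL has no host")
    # Metadata endpoints stay blocked regardless of any opt-in flag.
    if host in METADATA_HOSTS:
        raise ValueError(f"blocked metadata host: {host}")
    # Localhost is allowed only when the opt-in flag is exactly "true".
    allow_local = env.get("EVALHUB_ALLOW_LOCALHOST", "").lower() == "true"
    if host in LOCALHOST_HOSTS and not allow_local:
        raise ValueError(f"localhost host not allowed: {host}")
    return url
```

Passing `env` explicitly keeps the check easy to test without mutating the process environment; a real implementation might read `os.environ` directly.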
1 parent 353500c commit cb7a4be

25 files changed

Lines changed: 3851 additions & 8 deletions

.gitignore

Lines changed: 6 additions & 0 deletions
@@ -17,3 +17,9 @@ uv.lock
 agents/langflow/simple_tool_calling_agent/local/.ollama-enabled
 *.db
 CLAUDE.local.md
+.cursor
+**/REFACTORING.md
+STATUS.md
+.e2e-workdir
+evals/evalhub_adapter/eval-*.yaml
+evals/evalhub_adapter/provider-*.json

README.md

Lines changed: 4 additions & 0 deletions
@@ -71,6 +71,9 @@ agentic-starter-kits/
 │   │   └── simple_tool_calling_agent/  # Langflow tool-calling agent
 │   └── a2a/
 │       └── langgraph_crewai_agent/     # A2A multi-agent (LangGraph + CrewAI)
+├── evals/
+│   ├── harness/                        # Shared eval engine (runner, scorers, MLflow client)
+│   └── evalhub_adapter/                # EvalHub on-cluster adapter (JobSpec → harness)
 ├── tests/
 │   └── behavioral/                     # Behavioral eval suite (shared infra)
 ├── charts/
@@ -144,6 +147,7 @@ See `tests/behavioral/` for full details.
 - [OpenShift Deployment](./docs/openshift-deployment.md) — Helm-based deployment guide
 - [Adding a New Agent](./docs/adding-a-new-agent.md) — How to contribute a new agent template
 - [Adding Behavioral Tests](./docs/adding-behavioral-tests.md) — How to add test coverage for an agent
+- [Adding an EvalHub Agent Integration](./docs/adding-evalhub-agent-integration.md) — How to integrate a new agent into the EvalHub evaluation pipeline

 ## Additional Resources

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+# Golden queries for agentic tool-use benchmark.
+# Each query defines expected tool calls for a search-tool agent.
+queries:
+  - query: "What is the current weather in New York City?"
+    expected_tools: ["search"]
+    expected_elements: ["weather", "New York"]
+
+  - query: "Find recent news about artificial intelligence regulation in the EU"
+    expected_tools: ["search"]
+    expected_elements: ["AI", "regulation", "EU"]
+
+  - query: "What are the latest developments in quantum computing?"
+    expected_tools: ["search"]
+    expected_elements: ["quantum", "computing"]
+
+  - query: "Search for the population of Tokyo and compare it to New York"
+    expected_tools: ["search", "search"]
+    expected_elements: ["Tokyo", "New York", "population"]
+
+  - query: "Hello, how are you today?"
+    expected_tools: []
+    expected_elements: []
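
The commit message says `load_queries` validates the types of `query`, `expected_tools`, and `expected_elements`, and fails fast when `queries` is missing or empty. A minimal sketch of that kind of check follows. It is an illustration only, not the code in `evaluations.py`: the function name `validate_queries_sketch` and the error messages are hypothetical, and it assumes the fixture has already been parsed into a dict (for example with `yaml.safe_load`).

```python
from typing import Any


def validate_queries_sketch(data: Any) -> list:
    """Return the validated query list or raise ValueError on malformed input."""
    queries = data.get("queries") if isinstance(data, dict) else None
    # Fail fast instead of silently producing a zero-sample benchmark.
    if not isinstance(queries, list) or not queries:
        raise ValueError("fixture must define a non-empty 'queries' list")
    for index, entry in enumerate(queries):
        if not isinstance(entry, dict):
            raise ValueError(f"queries[{index}] must be a mapping")
        query = entry.get("query")
        if not isinstance(query, str) or not query.strip():
            raise ValueError(f"queries[{index}].query must be a non-empty string")
        # Both list fields may be empty, but every item must be a string.
        for key in ("expected_tools", "expected_elements"):
            value = entry.get(key, [])
            if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
                raise ValueError(f"queries[{index}].{key} must be a list of strings")
    return queries
```

Treating a missing `expected_tools` or `expected_elements` key as an empty list is a choice made for this sketch; a stricter loader could require both keys.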
Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+# Golden queries for agentic tool-use benchmark.
+# Each query defines expected tool calls for the vanilla Python agent
+# (search_price + search_reviews tools).
+queries:
+  - query: "What is the price of Nike shoes?"
+    expected_tools: ["search_price"]
+    expected_elements: ["price", "Nike"]
+
+  - query: "Find reviews for Samsung phones"
+    expected_tools: ["search_reviews"]
+    expected_elements: ["reviews", "Samsung"]
+
+  - query: "What is the price of Adidas and what are the reviews?"
+    expected_tools: ["search_price", "search_reviews"]
+    expected_elements: ["Adidas", "price", "reviews"]
+
+  - query: "Compare the price of Sony and LG products"
+    expected_tools: ["search_price", "search_price"]
+    expected_elements: ["Sony", "LG", "price"]
+
+  - query: "Hello, how are you today?"
+    expected_tools: []
+    expected_elements: []
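
Both fixture files share a shape: single-tool queries, at least one query expecting two tool calls, and a closing greeting that expects none. A rough sketch of a coverage check over such a fixture is shown here. The helper `fixture_coverage_gaps` is hypothetical and not part of the harness; it assumes the queries are already parsed into a list of dicts and that the agent's tool names are supplied by the caller.

```python
def fixture_coverage_gaps(queries, known_tools):
    """Return human-readable coverage problems; an empty list means none found."""
    gaps = []
    # A no-tool query checks that the agent does not call tools needlessly.
    if not any(len(q["expected_tools"]) == 0 for q in queries):
        gaps.append("no query expects zero tool calls")
    # A multi-tool query checks chained or repeated tool use.
    if not any(len(q["expected_tools"]) >= 2 for q in queries):
        gaps.append("no query expects multiple tool calls")
    # Tool names must match what the agent actually exposes.
    used = {tool for q in queries for tool in q["expected_tools"]}
    unknown = sorted(used - set(known_tools))
    if unknown:
        gaps.append(f"unknown tool names: {unknown}")
    return gaps


# Excerpt of the vanilla Python agent fixture above.
sample = [
    {"query": "What is the price of Nike shoes?", "expected_tools": ["search_price"]},
    {"query": "Compare the price of Sony and LG products",
     "expected_tools": ["search_price", "search_price"]},
    {"query": "Hello, how are you today?", "expected_tools": []},
]
print(fixture_coverage_gaps(sample, ["search_price", "search_reviews"]))  # prints []
```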
Lines changed: 124 additions & 0 deletions
@@ -0,0 +1,124 @@
+# Adding a New EvalHub Agent Integration
+
+How to add a new agent to the EvalHub on-cluster evaluation pipeline.
+
+For behavioral test coverage (pytest-based, inner loop), see
+[Adding Behavioral Tests](./adding-behavioral-tests.md). For the full
+adapter architecture and end-to-end walkthrough, see the
+[EvalHub Adapter README](../evals/evalhub_adapter/README.md).
+
+## Prerequisites
+
+- Agent is deployed with `/chat/completions` (JSON + SSE) and `/health`
+- EvalHub adapter provider is registered
+- Push access to a container registry
+
+## 1. Create Fixture Queries
+
+```bash
+mkdir -p agents/<framework>/<agent_name>/evalhub
+```
+
+Create `evalhub/tool_use.yaml`:
+
+```yaml
+queries:
+  - query: "A question that should trigger tool_a"
+    expected_tools: ["tool_a"]
+    expected_elements: ["keyword_from_tool_output"]
+
+  - query: "A question that should trigger both tools"
+    expected_tools: ["tool_a", "tool_b"]
+    expected_elements: ["keyword_a", "keyword_b"]
+
+  - query: "Hello, how are you today?"
+    expected_tools: []
+    expected_elements: []
+```
+
+`expected_tools` must match the agent's `@tool` function names exactly.
+Include at least one no-tool query and one multi-tool query.
+
+Existing fixtures:
+
+- `agents/langgraph/react_agent/evalhub/tool_use.yaml`
+- `agents/vanilla_python/openai_responses_agent/evalhub/tool_use.yaml`
+
+## 2. Add COPY Line to Containerfile
+
+In `evals/evalhub_adapter/Containerfile`, add a `COPY` for your fixtures
+and extend the build-time assertion:
+
+```dockerfile
+COPY agents/<framework>/<agent_name>/evalhub/ ./fixtures/<short_name>/
+```
+
+```dockerfile
+RUN python -c "from pathlib import Path; assert Path('fixtures/<short_name>/tool_use.yaml').exists()"
+```
+
+`<short_name>` should be unique (e.g. `crewai_websearch`).
+
+## 3. Create Eval Submission YAML
+
+Create `evals/evalhub_adapter/eval-<agent_name>.yaml`:
+
+```yaml
+name: agentic-tool-use-<agent-name>
+description: EvalHub orchestration run for <framework> <agent_name>
+model:
+  name: <framework>-<agent-name>
+  url: https://<agent-route>
+benchmarks:
+  - id: agentic-tool-use
+    provider_id: <provider-id-from-registration>
+    parameters:
+      known_tools: ["tool_a", "tool_b"]
+      forbidden_actions: ["shell execution"]
+      max_latency_seconds: 8.0
+      timeout_seconds: 45.0
+      verify_ssl: true
+      fixtures_path: fixtures/<short_name>
+      mlflow_tracking_uri: https://<mlflow-route>
+      mlflow_experiment_name: <unique-run-experiment>
+      mlflow_trace_experiment_name: <agent-experiment>
+```
+
+- `model.url` — agent base URL, not the `/chat/completions` path
+- `fixtures_path` — must match `<short_name>` from step 2
+- `provider_id` — from `evalhub providers list`
+
+See `evals/evalhub_adapter/eval-react-agent.yaml.example` and
+`eval-openai-responses-agent.yaml.example` for working examples. Full parameter
+reference is in the [adapter README](../evals/evalhub_adapter/README.md#jobspec-parameters).
+
+## 4. Rebuild and Push the Adapter Image
+
+```bash
+IMAGE_TAG=$(git rev-parse --short HEAD)
+ADAPTER_IMAGE="quay.io/<your-user>/evalhub-agentic-adapter:${IMAGE_TAG}"
+
+podman build -t "${ADAPTER_IMAGE}" -f evals/evalhub_adapter/Containerfile .
+podman push "${ADAPTER_IMAGE}"
+```
+
+Re-register the provider if the image tag changed.
+
+## 5. Submit and Verify
+
+```bash
+evalhub eval run --config evals/evalhub_adapter/eval-<agent_name>.yaml --wait --poll-interval 5
+evalhub eval results <job-id> --format json
+```
+
+Metrics and result interpretation are documented in the
+[adapter README](../evals/evalhub_adapter/README.md#8-interpreting-results).
+
+## Files Changed
+
+| File | Action |
+|------|--------|
+| `agents/<framework>/<agent_name>/evalhub/tool_use.yaml` | Create |
+| `evals/evalhub_adapter/Containerfile` | Edit — add `COPY` + assertion |
+| `evals/evalhub_adapter/eval-<agent_name>.yaml` | Create |
+| `evals/evalhub_adapter/README.md` | Edit — note new agent under "What works now" |
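
The three notes on the eval submission YAML in step 3 of this guide (`model.url` is the base URL, `fixtures_path` matches `<short_name>`, `provider_id` is set) lend themselves to a pre-submission check. The sketch below is hypothetical and not part of the adapter or the `evalhub` CLI. It assumes the YAML has already been parsed into a dict and that `parameters` is nested under each benchmark entry.

```python
def check_submission_sketch(config, short_name):
    """Return a list of problems found in a parsed eval submission config."""
    problems = []
    url = str(config.get("model", {}).get("url", ""))
    # model.url should be the agent base URL, not the completions endpoint.
    if url.rstrip("/").endswith("/chat/completions"):
        problems.append("model.url must be the base URL, not /chat/completions")
    expected_path = f"fixtures/{short_name}"
    for index, bench in enumerate(config.get("benchmarks", [])):
        if not bench.get("provider_id"):
            problems.append(f"benchmarks[{index}].provider_id is missing")
        # Assumption: parameters live under each benchmark entry.
        params = bench.get("parameters", {})
        if params.get("fixtures_path") != expected_path:
            problems.append(
                f"benchmarks[{index}].parameters.fixtures_path should be {expected_path!r}"
            )
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a single pass before rebuilding the image.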
Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
+# EvalHub Agentic Adapter — container image
+#
+# Uses PYTHONPATH-based source layout (not pip-installed packages) so that
+# evaluations.py resolves fixture paths from the fixtures_path parameter.
+#
+# Build from repo root:
+#   IMAGE_TAG=$(git rev-parse --short HEAD)
+#   ADAPTER_IMAGE=quay.io/<your-user>/evalhub-agentic-adapter:${IMAGE_TAG}
+#   podman build -t "${ADAPTER_IMAGE}" \
+#     -f evals/evalhub_adapter/Containerfile .
+#   podman push "${ADAPTER_IMAGE}"
+
+FROM registry.access.redhat.com/ubi9/python-312@sha256:e95978812895b9abb2bdc109b501078da2a47c8dbb9fa23758af40ed50ab6023
+WORKDIR /opt/app-root/src
+
+USER 0
+
+COPY --from=ghcr.io/astral-sh/uv@sha256:fc93e9ecd7218e9ec8fba117af89348eef8fd2463c50c13347478769aaedd0ce /uv /usr/local/bin/uv
+
+COPY evals/evalhub_adapter/ ./evalhub_adapter/
+COPY evals/harness/ ./harness/
+# TODO: auto-discover agents/*/evalhub/ dirs instead of hardcoding per agent
+COPY agents/langgraph/react_agent/evalhub/ ./fixtures/langgraph_react/
+COPY agents/vanilla_python/openai_responses_agent/evalhub/ ./fixtures/vanilla_python/
+
+# Install runtime deps only — NOT the project itself, to keep __file__ paths intact.
+# Includes MLflow for trace enrichment and run logging.
+RUN uv pip install --no-cache \
+    "eval-hub-sdk[adapter]>=0.1.4,<0.2" \
+    "httpx>=0.27,<0.28" \
+    "mlflow>=2.0,<3" \
+    "PyYAML>=6.0,<7"
+
+# Build-time assertion: per-agent fixture directories exist
+RUN python -c "from pathlib import Path; assert Path('fixtures/langgraph_react/tool_use.yaml').exists(); assert Path('fixtures/vanilla_python/tool_use.yaml').exists()"
+
+RUN chown -R 1001:0 /opt/app-root/src \
+    && chmod -R g=u /opt/app-root/src
+
+USER 1001
+
+ENV PYTHONPATH=/opt/app-root/src
+ENV HOME=/opt/app-root
+
+ENTRYPOINT ["python", "-m", "evalhub_adapter.adapter"]
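
The Containerfile's TODO mentions auto-discovering agent fixture directories instead of hardcoding a `COPY` per agent. One possible shape for that discovery step is sketched below. Nothing here exists in the repository: the function name `discover_fixture_dirs` and the `<framework>_<agent_name>` naming rule are assumptions, and the rule would not reproduce the current short names `langgraph_react` and `vanilla_python`.

```python
from pathlib import Path


def discover_fixture_dirs(repo_root):
    """Map a derived short name to each agents/<framework>/<agent>/evalhub dir."""
    found = {}
    # Only directories that actually contain tool_use.yaml are considered.
    for fixture in sorted(Path(repo_root).glob("agents/*/*/evalhub/tool_use.yaml")):
        agent_dir = fixture.parent.parent
        # Assumed naming rule: <framework>_<agent_name>.
        short_name = f"{agent_dir.parent.name}_{agent_dir.name}"
        found[short_name] = fixture.parent
    return found


def render_copy_lines(found, repo_root):
    """Render one COPY instruction per discovered fixture directory."""
    root = Path(repo_root)
    return [
        f"COPY {path.relative_to(root).as_posix()}/ ./fixtures/{name}/"
        for name, path in found.items()
    ]
```

A script like this could run before `podman build` to generate the `COPY` lines and the matching build-time assertion, so a new agent only needs its `evalhub/tool_use.yaml` to be picked up.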