Skip to content
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 6 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -17,3 +17,9 @@ uv.lock
agents/langflow/simple_tool_calling_agent/local/.ollama-enabled
*.db
CLAUDE.local.md
.cursor
**/REFACTORING.md
STATUS.md
.e2e-workdir
evals/evalhub_adapter/eval-*.yaml
evals/evalhub_adapter/provider-*.json
4 changes: 4 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -70,6 +70,9 @@ agentic-starter-kits/
│ │ └── simple_tool_calling_agent/ # Langflow tool-calling agent
│ └── a2a/
│ └── langgraph_crewai_agent/ # A2A multi-agent (LangGraph + CrewAI)
├── evals/
│ ├── harness/ # Shared eval engine (runner, scorers, MLflow client)
│ └── evalhub_adapter/ # EvalHub on-cluster adapter (JobSpec → harness)
├── tests/
│ └── behavioral/ # Behavioral eval suite (shared infra)
├── charts/
Expand Down Expand Up @@ -143,6 +146,7 @@ See `tests/behavioral/` for full details.
- [OpenShift Deployment](./docs/openshift-deployment.md) — Helm-based deployment guide
- [Adding a New Agent](./docs/adding-a-new-agent.md) — How to contribute a new agent template
- [Adding Behavioral Tests](./docs/adding-behavioral-tests.md) — How to add test coverage for an agent
- [Adding an EvalHub Agent Integration](./docs/adding-evalhub-agent-integration.md) — How to integrate a new agent into the EvalHub evaluation pipeline

## Additional Resources

Expand Down
22 changes: 22 additions & 0 deletions agents/langgraph/react_agent/evalhub/tool_use.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
# Golden queries for agentic tool-use benchmark.
# Each query defines expected tool calls for a search-tool agent.
queries:
- query: "What is the current weather in New York City?"
expected_tools: ["search"]
expected_elements: ["weather", "New York"]

- query: "Find recent news about artificial intelligence regulation in the EU"
expected_tools: ["search"]
expected_elements: ["AI", "regulation", "EU"]

- query: "What are the latest developments in quantum computing?"
expected_tools: ["search"]
expected_elements: ["quantum", "computing"]

- query: "Search for the population of Tokyo and compare it to New York"
expected_tools: ["search", "search"]
expected_elements: ["Tokyo", "New York", "population"]

- query: "Hello, how are you today?"
expected_tools: []
expected_elements: []
Original file line number Diff line number Diff line change
@@ -0,0 +1,23 @@
# Golden queries for agentic tool-use benchmark.
# Each query defines expected tool calls for the vanilla Python agent
# (search_price + search_reviews tools).
queries:
- query: "What is the price of Nike shoes?"
expected_tools: ["search_price"]
expected_elements: ["price", "Nike"]

- query: "Find reviews for Samsung phones"
expected_tools: ["search_reviews"]
expected_elements: ["reviews", "Samsung"]

- query: "What is the price of Adidas and what are the reviews?"
expected_tools: ["search_price", "search_reviews"]
expected_elements: ["Adidas", "price", "reviews"]

- query: "Compare the price of Sony and LG products"
expected_tools: ["search_price", "search_price"]
expected_elements: ["Sony", "LG", "price"]

- query: "Hello, how are you today?"
expected_tools: []
expected_elements: []
124 changes: 124 additions & 0 deletions docs/adding-evalhub-agent-integration.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,124 @@
# Adding a New EvalHub Agent Integration

How to add a new agent to the EvalHub on-cluster evaluation pipeline.

For behavioral test coverage (pytest-based, inner loop), see
[Adding Behavioral Tests](./adding-behavioral-tests.md). For the full
adapter architecture and end-to-end walkthrough, see the
[EvalHub Adapter README](../evals/evalhub_adapter/README.md).

## Prerequisites

- Agent is deployed with `/chat/completions` (JSON + SSE) and `/health`
- EvalHub adapter provider is registered
- Push access to a container registry

## 1. Create Fixture Queries

```bash
mkdir -p agents/<framework>/<agent_name>/evalhub
```

Create `evalhub/tool_use.yaml`:

```yaml
queries:
- query: "A question that should trigger tool_a"
expected_tools: ["tool_a"]
expected_elements: ["keyword_from_tool_output"]

- query: "A question that should trigger both tools"
expected_tools: ["tool_a", "tool_b"]
expected_elements: ["keyword_a", "keyword_b"]

- query: "Hello, how are you today?"
expected_tools: []
expected_elements: []
```

`expected_tools` must match the agent's `@tool` function names exactly.
Include at least one no-tool query and one multi-tool query.

Existing fixtures:

- `agents/langgraph/react_agent/evalhub/tool_use.yaml`
- `agents/vanilla_python/openai_responses_agent/evalhub/tool_use.yaml`

## 2. Add COPY Line to Containerfile

In `evals/evalhub_adapter/Containerfile`, add a `COPY` for your fixtures
and extend the build-time assertion:

```dockerfile
COPY agents/<framework>/<agent_name>/evalhub/ ./fixtures/<short_name>/
```

```dockerfile
RUN python -c "from pathlib import Path; assert Path('fixtures/<short_name>/tool_use.yaml').exists()"
```

`<short_name>` should be unique (e.g. `crewai_websearch`).

## 3. Create Eval Submission YAML

Create `evals/evalhub_adapter/eval-<agent_name>.yaml`:

```yaml
name: agentic-tool-use-<agent-name>
description: EvalHub orchestration run for <framework> <agent_name>
model:
name: <framework>-<agent-name>
url: https://<agent-route>
benchmarks:
- id: agentic-tool-use
provider_id: <provider-id-from-registration>
parameters:
known_tools: ["tool_a", "tool_b"]
forbidden_actions: ["shell execution"]
max_latency_seconds: 8.0
timeout_seconds: 45.0
verify_ssl: true
fixtures_path: fixtures/<short_name>
mlflow_tracking_uri: https://<mlflow-route>
mlflow_experiment_name: <unique-run-experiment>
mlflow_trace_experiment_name: <agent-experiment>
```

- `model.url` — agent base URL, not the `/chat/completions` path
- `fixtures_path` — must match `<short_name>` from step 2
- `provider_id` — from `evalhub providers list`

See `evals/evalhub_adapter/eval-react-agent.yaml.example` and
`eval-openai-responses-agent.yaml.example` for working examples. Full parameter
reference is in the [adapter README](../evals/evalhub_adapter/README.md#jobspec-parameters).

## 4. Rebuild and Push the Adapter Image

```bash
IMAGE_TAG=$(git rev-parse --short HEAD)
ADAPTER_IMAGE="quay.io/<your-user>/evalhub-agentic-adapter:${IMAGE_TAG}"

podman build -t "${ADAPTER_IMAGE}" -f evals/evalhub_adapter/Containerfile .
podman push "${ADAPTER_IMAGE}"
```

Re-register the provider if the image tag changed.

## 5. Submit and Verify

```bash
evalhub eval run --config evals/evalhub_adapter/eval-<agent_name>.yaml --wait --poll-interval 5
evalhub eval results <job-id> --format json
```

Metrics and result interpretation are documented in the
[adapter README](../evals/evalhub_adapter/README.md#8-interpreting-results).

## Files Changed

| File | Action |
|------|--------|
| `agents/<framework>/<agent_name>/evalhub/tool_use.yaml` | Create |
| `evals/evalhub_adapter/Containerfile` | Edit — add `COPY` + assertion |
| `evals/evalhub_adapter/eval-<agent_name>.yaml` | Create |
| `evals/evalhub_adapter/README.md` | Edit — note new agent under "What works now" |
44 changes: 44 additions & 0 deletions evals/evalhub_adapter/Containerfile
Original file line number Diff line number Diff line change
@@ -0,0 +1,44 @@
# EvalHub Agentic Adapter — container image
#
# Uses PYTHONPATH-based source layout (not pip-installed packages) so that
# evaluations.py resolves fixture paths from the fixtures_path parameter.
#
# Build from repo root:
# IMAGE_TAG=$(git rev-parse --short HEAD)
# ADAPTER_IMAGE=quay.io/<your-user>/evalhub-agentic-adapter:${IMAGE_TAG}
# podman build -t "${ADAPTER_IMAGE}" \
# -f evals/evalhub_adapter/Containerfile .
# podman push "${ADAPTER_IMAGE}"

FROM registry.access.redhat.com/ubi9/python-312@sha256:e95978812895b9abb2bdc109b501078da2a47c8dbb9fa23758af40ed50ab6023
WORKDIR /opt/app-root/src

USER 0

COPY --from=ghcr.io/astral-sh/uv@sha256:fc93e9ecd7218e9ec8fba117af89348eef8fd2463c50c13347478769aaedd0ce /uv /usr/local/bin/uv

COPY evals/evalhub_adapter/ ./evalhub_adapter/
COPY evals/harness/ ./harness/
COPY agents/langgraph/react_agent/evalhub/ ./fixtures/langgraph_react/
Comment thread
andrewdonheiser marked this conversation as resolved.
COPY agents/vanilla_python/openai_responses_agent/evalhub/ ./fixtures/vanilla_python/

# Install runtime deps only — NOT the project itself, to keep __file__ paths intact.
# Includes MLflow for trace enrichment and run logging.
RUN uv pip install --no-cache \
"eval-hub-sdk[adapter]>=0.1.4,<0.2" \
"httpx>=0.27,<0.28" \
"mlflow>=2.0,<3" \
"PyYAML>=6.0,<7"

# Build-time assertion: per-agent fixture directories exist
RUN python -c "from pathlib import Path; assert Path('fixtures/langgraph_react/tool_use.yaml').exists(); assert Path('fixtures/vanilla_python/tool_use.yaml').exists()"

RUN chown -R 1001:0 /opt/app-root/src \
&& chmod -R g=u /opt/app-root/src

USER 1001

ENV PYTHONPATH=/opt/app-root/src
ENV HOME=/opt/app-root

ENTRYPOINT ["python", "-m", "evalhub_adapter.adapter"]
Loading
Loading