| 1 | +# Adding a New EvalHub Agent Integration |
| 2 | + |
| 3 | +This guide describes how to add a new agent to the EvalHub on-cluster evaluation pipeline. |
| 4 | + |
| 5 | +For behavioral test coverage (pytest-based, inner loop), see |
| 6 | +[Adding Behavioral Tests](./adding-behavioral-tests.md). For the full |
| 7 | +adapter architecture and end-to-end walkthrough, see the |
| 8 | +[EvalHub Adapter README](../evals/evalhub_adapter/README.md). |
| 9 | + |
| 10 | +## Prerequisites |
| 11 | + |
| 12 | +- The agent is deployed and exposes `/chat/completions` (JSON + SSE) and `/health` endpoints |
| 13 | +- The EvalHub adapter provider is registered |
| 14 | +- You have push access to a container registry |
| 15 | + |
| 16 | +## 1. Create Fixture Queries |
| 17 | + |
| 18 | +```bash |
| 19 | +mkdir -p agents/<framework>/<agent_name>/evalhub |
| 20 | +``` |
| 21 | + |
| 22 | +Create `evalhub/tool_use.yaml`: |
| 23 | + |
| 24 | +```yaml |
| 25 | +queries: |
| 26 | + - query: "A question that should trigger tool_a" |
| 27 | + expected_tools: ["tool_a"] |
| 28 | + expected_elements: ["keyword_from_tool_output"] |
| 29 | + |
| 30 | + - query: "A question that should trigger both tools" |
| 31 | + expected_tools: ["tool_a", "tool_b"] |
| 32 | + expected_elements: ["keyword_a", "keyword_b"] |
| 33 | + |
| 34 | + - query: "Hello, how are you today?" |
| 35 | + expected_tools: [] |
| 36 | + expected_elements: [] |
| 37 | +``` |
| 38 | + |
| 39 | +`expected_tools` must match the agent's `@tool` function names exactly. |
| 40 | +Include at least one no-tool query and one multi-tool query. |
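The two rules above can be checked before building the image. The following is a minimal sketch, not part of the adapter: `check_fixture` is a hypothetical helper, and it assumes the `queries` list has already been loaded from `tool_use.yaml` into Python dictionaries (for example with a YAML parser) and that you pass in the agent's tool names yourself.

```python
from typing import Iterable


def check_fixture(queries: list[dict], known_tools: Iterable[str]) -> list[str]:
    """Return a list of problems; an empty list means the fixture passes."""
    known = set(known_tools)
    problems = []

    # Rule 1: every expected tool name must match a known tool exactly.
    for i, entry in enumerate(queries):
        unknown = [t for t in entry.get("expected_tools", []) if t not in known]
        if unknown:
            problems.append(f"query {i}: unknown tool name(s) {unknown}")

    # Rule 2: at least one no-tool query and one multi-tool query.
    if not any(len(q.get("expected_tools", [])) == 0 for q in queries):
        problems.append("no query with an empty expected_tools list")
    if not any(len(q.get("expected_tools", [])) >= 2 for q in queries):
        problems.append("no query that expects two or more tools")
    return problems


# Same shape as the `queries` list in the tool_use.yaml example above.
queries = [
    {"query": "A question that should trigger tool_a",
     "expected_tools": ["tool_a"],
     "expected_elements": ["keyword_from_tool_output"]},
    {"query": "A question that should trigger both tools",
     "expected_tools": ["tool_a", "tool_b"],
     "expected_elements": ["keyword_a", "keyword_b"]},
    {"query": "Hello, how are you today?",
     "expected_tools": [],
     "expected_elements": []},
]

print(check_fixture(queries, known_tools=["tool_a", "tool_b"]))  # []
```

The comparison is case-sensitive on purpose, so a fixture that lists `Tool_A` for a tool named `tool_a` is reported rather than silently accepted.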
| 41 | + |
| 42 | +Existing fixtures: |
| 43 | + |
| 44 | +- `agents/langgraph/react_agent/evalhub/tool_use.yaml` |
| 45 | +- `agents/vanilla_python/openai_responses_agent/evalhub/tool_use.yaml` |
| 46 | + |
| 47 | +## 2. Add COPY Line to Containerfile |
| 48 | + |
| 49 | +In `evals/evalhub_adapter/Containerfile`, add a `COPY` for your fixtures |
| 50 | +and extend the build-time assertion: |
| 51 | + |
| 52 | +```dockerfile |
| 53 | +COPY agents/<framework>/<agent_name>/evalhub/ ./fixtures/<short_name>/ |
| 54 | +``` |
| 55 | + |
| 56 | +```dockerfile |
| 57 | +RUN python -c "from pathlib import Path; assert Path('fixtures/<short_name>/tool_use.yaml').exists()" |
| 58 | +``` |
| 59 | + |
| 60 | +`<short_name>` should be unique among the directories under `./fixtures/` (e.g. `crewai_websearch`). |
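When several agents ship fixtures in the same image, the same existence check has to hold for each `<short_name>`. The sketch below illustrates that check outside the Containerfile; `missing_fixtures` is a hypothetical helper, `react_agent` is an assumed short name used only for the demo, and the directory tree is a temporary one rather than a real image.

```python
import tempfile
from pathlib import Path


def missing_fixtures(root: Path, short_names: list[str]) -> list[str]:
    """Return the short names whose fixtures/<short_name>/tool_use.yaml is absent."""
    return [
        name
        for name in short_names
        if not (root / "fixtures" / name / "tool_use.yaml").exists()
    ]


# Demo against a throwaway directory tree rather than a real image.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    present = root / "fixtures" / "react_agent"  # assumed short name
    present.mkdir(parents=True)
    (present / "tool_use.yaml").write_text("queries: []\n")

    result = missing_fixtures(root, ["react_agent", "crewai_websearch"])
    print(result)  # ['crewai_websearch']
```

An empty result corresponds to the build-time assertion passing for every listed fixture set.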
| 61 | + |
| 62 | +## 3. Create Eval Submission YAML |
| 63 | + |
| 64 | +Create `evals/evalhub_adapter/eval-<agent_name>.yaml`: |
| 65 | + |
| 66 | +```yaml |
| 67 | +name: agentic-tool-use-<agent-name> |
| 68 | +description: EvalHub orchestration run for <framework> <agent_name> |
| 69 | +model: |
| 70 | + name: <framework>-<agent-name> |
| 71 | + url: https://<agent-route> |
| 72 | +benchmarks: |
| 73 | + - id: agentic-tool-use |
| 74 | + provider_id: <provider-id-from-registration> |
| 75 | + parameters: |
| 76 | + known_tools: ["tool_a", "tool_b"] |
| 77 | + forbidden_actions: ["shell execution"] |
| 78 | + max_latency_seconds: 8.0 |
| 79 | + timeout_seconds: 45.0 |
| 80 | + verify_ssl: true |
| 81 | + fixtures_path: fixtures/<short_name> |
| 82 | + mlflow_tracking_uri: https://<mlflow-route> |
| 83 | + mlflow_experiment_name: <unique-run-experiment> |
| 84 | + mlflow_trace_experiment_name: <agent-experiment> |
| 85 | +``` |
| 86 | + |
| 87 | +- `model.url`: the agent's base URL, without the `/chat/completions` path |
| 88 | +- `fixtures_path`: must be `fixtures/<short_name>`, using the same `<short_name>` as in step 2 |
| 89 | +- `provider_id`: taken from the output of `evalhub providers list` |
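The first two constraints in the list above are easy to get wrong when copying an existing submission file. As a rough sketch, assuming the YAML has already been loaded into a dictionary, they can be checked like this; `check_submission` and the `agent.example.com` URL are illustrative assumptions, not part of EvalHub.

```python
def check_submission(config: dict, short_name: str) -> list[str]:
    """Return a list of problems found in a loaded eval submission config."""
    problems = []

    # model.url should be the base URL, not the chat completions endpoint.
    url = config["model"]["url"].rstrip("/")
    if url.endswith("/chat/completions"):
        problems.append("model.url should be the base URL, without /chat/completions")

    # fixtures_path must point at the directory copied in step 2.
    expected = f"fixtures/{short_name}"
    for bench in config["benchmarks"]:
        actual = bench["parameters"].get("fixtures_path")
        if actual != expected:
            problems.append(
                f"{bench['id']}: fixtures_path is {actual!r}, expected {expected!r}"
            )
    return problems


# Hypothetical config with one deliberate mistake in model.url.
config = {
    "model": {"url": "https://agent.example.com/chat/completions"},
    "benchmarks": [
        {
            "id": "agentic-tool-use",
            "parameters": {"fixtures_path": "fixtures/crewai_websearch"},
        }
    ],
}

for problem in check_submission(config, short_name="crewai_websearch"):
    print(problem)
```

Only the keys shown in the submission YAML above are read; the remaining parameters are left to the adapter to validate.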
| 90 | + |
| 91 | +See `evals/evalhub_adapter/eval-react-agent.yaml.example` and |
| 92 | +`eval-openai-responses-agent.yaml.example` for working examples. Full parameter |
| 93 | +reference is in the [adapter README](../evals/evalhub_adapter/README.md#jobspec-parameters). |
| 94 | + |
| 95 | +## 4. Rebuild and Push the Adapter Image |
| 96 | + |
| 97 | +```bash |
| 98 | +IMAGE_TAG=$(git rev-parse --short HEAD) |
| 99 | +ADAPTER_IMAGE="quay.io/<your-user>/evalhub-agentic-adapter:${IMAGE_TAG}" |
| 100 | + |
| 101 | +podman build -t "${ADAPTER_IMAGE}" -f evals/evalhub_adapter/Containerfile . |
| 102 | +podman push "${ADAPTER_IMAGE}" |
| 103 | +``` |
| 104 | + |
| 105 | +`IMAGE_TAG` is derived from the current commit, so it changes with each new commit. Re-register the provider whenever the image tag changes. |
| 106 | + |
| 107 | +## 5. Submit and Verify |
| 108 | + |
| 109 | +```bash |
| 110 | +evalhub eval run --config evals/evalhub_adapter/eval-<agent_name>.yaml --wait --poll-interval 5 |
| 111 | +evalhub eval results <job-id> --format json |
| 112 | +``` |
| 113 | + |
| 114 | +Metrics and result interpretation are documented in the |
| 115 | +[adapter README](../evals/evalhub_adapter/README.md#8-interpreting-results). |
| 116 | + |
| 117 | +## Files Changed |
| 118 | + |
| 119 | +| File | Action | |
| 120 | +|------|--------| |
| 121 | +| `agents/<framework>/<agent_name>/evalhub/tool_use.yaml` | Create | |
| 122 | +| `evals/evalhub_adapter/Containerfile` | Edit — add `COPY` + assertion | |
| 123 | +| `evals/evalhub_adapter/eval-<agent_name>.yaml` | Create | |
| 124 | +| `evals/evalhub_adapter/README.md` | Edit — note new agent under "What works now" | |