opendatahub-io
diff --git a/components/evaluation/README.md b/components/evaluation/README.md
Lines changed: 1 addition & 0 deletions
diff --git a/components/evaluation/evalhub_kserve/OWNERS b/components/evaluation/evalhub_kserve/OWNERS
Lines changed: 6 additions & 0 deletions
diff --git a/components/evaluation/evalhub_kserve/README.md b/components/evaluation/evalhub_kserve/README.md
Lines changed: 39 additions & 83 deletions
diff --git a/components/evaluation/evalhub_kserve/component.py b/components/evaluation/evalhub_kserve/component.py
Lines changed: 57 additions & 39 deletions
diff --git a/components/evaluation/evalhub_kserve/metadata.yaml b/components/evaluation/evalhub_kserve/metadata.yaml
Lines changed: 1 addition & 0 deletions
diff --git a/components/evaluation/evalhub_kserve/tests/test_component_unit.py b/components/evaluation/evalhub_kserve/tests/test_component_unit.py
Lines changed: 2 additions & 0 deletions
@@ -2,4 +2,5 @@
 
 This directory contains components in the **Evaluation** category:
 
+- [Evalhub Kserve](./evalhub_kserve/README.md): Evaluate a model via Eval Hub with a KServe InferenceService.
 - [Lm Eval](./lm_eval/README.md): A Universal LLM Evaluator component using EleutherAI's lm-evaluation-harness.
@@ -0,0 +1,6 @@
+approvers:
+  - briangallagher
+  - Fiona-Waters
+  - kramaranya
+  - MStokluska
+  - szaher
@@ -1,94 +1,42 @@
-# Eval Hub KServe Component
+# Evalhub Kserve ✨
 
-> **Stability: experimental** — This component is under active development and may change.
+> ⚠️ **Stability: experimental** — This asset is not yet stable and may change.
 
-## Overview
+## Overview 🧾
 
-A KFP component that evaluates a fine-tuned model using the
-[Eval Hub](https://github.com/opendatahub-io/eval-hub) service with a
-KServe InferenceService for model serving.
+Evaluate a model via Eval Hub with a KServe InferenceService.
 
-The component:
+Creates a KServe ServingRuntime + InferenceService (matching the RHOAI dashboard deployment pattern) to serve the fine-tuned model from the workspace PVC. The InferenceService URL is submitted to Eval Hub for benchmark evaluation. Both resources are cleaned up after completion.
 
-1. Creates a KServe **ServingRuntime** + **InferenceService** (matching the RHOAI
-   dashboard deployment pattern) to serve the fine-tuned model from the workspace PVC.
-2. Submits benchmark evaluation jobs to Eval Hub, pointing at the InferenceService URL.
-3. Polls for evaluation completion and collects results/metrics.
-4. Optionally logs metrics to **MLflow** (when `mlflow_experiment_name` is provided).
-5. Cleans up both KServe resources after evaluation (or on failure).
-
-Both KServe resources (ServingRuntime and InferenceService) are explicitly deleted
-in a `finally` block after evaluation completes or on failure.
-
-## Inputs
+## Inputs 📥
 
 | Parameter | Type | Default | Description |
 | --------- | ---- | ------- | ----------- |
-| `evalhub_url` | `str` | `""` | Eval Hub API endpoint URL. Empty = skip evaluation entirely. |
-| `benchmarks` | `list` | `[]` | Benchmark specs, e.g. `[{"id": "leaderboard_ifeval", "provider_id": "lm_evaluation_harness"}]`. |
-| `collection_id` | `str` | `""` | Eval Hub collection ID (overrides `benchmarks`). Available: `leaderboard-v2`, `safety-and-fairness-v1`, `toxicity-and-ethical-principles`. |
-| `pvc_mount_path` | `str` | `""` | Workspace PVC mount path (set by KFP workspace config). |
-| `model_artifact` | `dsl.Input[dsl.Model]` | `None` | KFP Model artifact from training step. |
-| `model_path` | `str` | `None` | HuggingFace model ID or local path (fallback if no artifact). |
-| `evalhub_tenant` | `str` | `""` | Eval Hub tenant / X-Tenant header. |
+| `output_metrics` | `dsl.Output[dsl.Metrics]` | `None` | KFP Metrics artifact for evaluation scores. |
+| `output_results` | `dsl.Output[dsl.Artifact]` | `None` | KFP Artifact for full evaluation results JSON. |
+| `evalhub_url` | `str` | `None` | Eval Hub API endpoint (empty = skip evaluation). |
+| `benchmarks` | `list` | `[]` | List of benchmark specs [{"provider_id": "...", "id": "..."}]. |
+| `collection_id` | `str` | `""` | Eval Hub collection ID (alternative to benchmarks list). |
+| `pvc_mount_path` | `str` | `""` | Workspace PVC mount path (triggers KFP PVC mount). |
+| `model_artifact` | `dsl.Input[dsl.Model]` | `None` | Model artifact from upstream training step. |
+| `model_path` | `str` | `None` | Local filesystem path to model directory (if no artifact). |
+| `evalhub_tenant` | `str` | `""` | Eval Hub tenant / namespace header (X-Tenant). |
 | `evalhub_auth_token` | `str` | `""` | Bearer token for Eval Hub auth. |
-| `evalhub_model_name` | `str` | `"finetuned-model"` | Display name for the model in Eval Hub. |
-| `base_model_name` | `str` | `""` | HF model ID for tokenizer resolution (e.g. `Qwen/Qwen2.5-1.5B-Instruct`). |
-| `evalhub_job_name` | `str` | `"pipeline-eval"` | Evaluation job name in Eval Hub. |
-| `evalhub_timeout` | `int` | `7200` | Max seconds to wait for evaluation. |
+| `evalhub_model_name` | `str` | `finetuned-model` | Display name for the model in Eval Hub. |
+| `base_model_name` | `str` | `""` | HF model ID for tokenizer resolution. |
+| `evalhub_job_name` | `str` | `pipeline-eval` | Evaluation job name in Eval Hub. |
+| `evalhub_timeout` | `int` | `7200` | Max seconds to wait for evaluation to complete. |
 | `evalhub_poll_interval` | `int` | `30` | Seconds between eval status polls. |
-| `mlflow_experiment_name` | `str` | `""` | MLflow experiment name. Non-empty enables MLflow tracking; empty disables it. |
+| `mlflow_experiment_name` | `str` | `""` | MLflow experiment name (non-empty enables MLflow). |
 | `gpu_count` | `int` | `1` | Number of GPUs for the InferenceService predictor. |
-| `memory` | `str` | `"8Gi"` | Pod memory request/limit for the predictor. |
-| `cpu` | `str` | `"2"` | CPU request/limit for the predictor. |
-| `runtime_image` | `str` | RHOAI vLLM image | Container image for the ServingRuntime. |
+| `memory` | `str` | `8Gi` | Pod memory request/limit for the predictor (e.g. "8Gi", "32Gi"). |
+| `cpu` | `str` | `2` | CPU request/limit for the predictor (e.g. "2"). |
+| `runtime_image` | `str` | `registry.redhat.io/rhaii/vllm-cuda-rhel9@sha256:ad06abf3bb5235ebb5b2df84cd1b9fd09e823f0ff2eebfc82bb4590275ccfe0b` | Container image for the ServingRuntime (RHOAI vLLM default). |
+| `trust_remote_code` | `bool` | `False` | Pass --trust-remote-code to vLLM (enables arbitrary code from model repos). |
+| `verify_tls` | `bool` | `False` | Verify TLS certificates for Eval Hub API calls (False for self-signed certs). |
 | `isvc_ready_timeout` | `int` | `600` | Max seconds to wait for InferenceService readiness. |
 
-## Outputs
-
-| Artifact | Type | Description |
-| -------- | ---- | ----------- |
-| `output_metrics` | `dsl.Metrics` | Evaluation scores as KFP metrics (logged per benchmark). |
-| `output_results` | `dsl.Artifact` | Full evaluation results JSON from Eval Hub. |
-
-## Prerequisites
-
-1. **Eval Hub** installed on the cluster (operator + CR in the target namespace).
-
-2. **KServe** available (included with RHOAI by default).
-
-3. **RBAC** — the pipeline ServiceAccount needs permissions for KServe resources:
-
-   ```bash
-   oc create role evalhub-kserve-role \
-     --verb=create,delete,get,list,patch \
-     --resource=inferenceservices.serving.kserve.io,servingruntimes.serving.kserve.io,pods,services,secrets \
-     -n <namespace> && \
-   oc create rolebinding evalhub-kserve-binding \
-     --role=evalhub-kserve-role \
-     --serviceaccount=<namespace>:<pipeline-sa> \
-     -n <namespace>
-   ```
-
-4. **Workspace PVC** must use `ReadWriteMany` access mode (NFS-backed) so the KServe
-   predictor pod can mount the fine-tuned model.
-
-5. **`kubernetes-credentials` Secret** containing `KUBERNETES_SERVER_URL` and
-   `KUBERNETES_AUTH_TOKEN` for K8s API access from within the component.
-
-## Known Limitations
-
-- **`trust_remote_code`**: Some HuggingFace datasets used by benchmarks require
-  `trust_remote_code=True`. The 5 default leaderboard benchmarks (ifeval, bbh,
-  mmlu_pro, musr, math_hard) work without it. For other benchmarks, a custom
-  provider ConfigMap with `HF_DATASETS_TRUST_REMOTE_CODE=1` must be created and
-  referenced in the Eval Hub CR.
-
-- **Tokenizer resolution**: Eval Hub's lm_eval adapter uses the served model name
-  to download the tokenizer. Since the served model is a local fine-tuned checkpoint,
-  the `base_model_name` parameter is used to resolve the correct HF tokenizer.
-
-## Metadata
+## Metadata 🗂️
 
 - **Name**: evalhub_kserve
 - **Stability**: experimental
@@ -98,6 +46,7 @@ in a `finally` block after evaluation completes or on failure.
   - External Services:
     - Name: Eval Hub, Version: >=0.1.0
     - Name: KServe, Version: >=0.11.0
+    - Name: Kubernetes, Version: >=1.28.0
     - Name: vLLM (RHOAI), Version: >=0.6.0
 - **Tags**:
   - evaluation
@@ -107,8 +56,15 @@ in a `finally` block after evaluation completes or on failure.
   - benchmarks
   - metrics
   - mlflow
-
-## Additional Resources
-
-- **Eval Hub**: [https://github.com/opendatahub-io/eval-hub](https://github.com/opendatahub-io/eval-hub)
-- **KServe**: [https://kserve.github.io/website/](https://kserve.github.io/website/)
+- **Last Verified**: 2026-05-20 00:00:00+00:00
+- **Owners**:
+  - Approvers:
+    - briangallagher
+    - Fiona-Waters
+    - kramaranya
+    - MStokluska
+    - szaher
+
+## Additional Resources 📚
+
+- **Documentation**: [https://github.com/opendatahub-io/eval-hub](https://github.com/opendatahub-io/eval-hub)
@@ -6,13 +6,10 @@
 Eval Hub, and polls for results. Both resources are cleaned up after evaluation.
 """
 
-import kfp
 from kfp import dsl
 
-
 RHOAI_VLLM_IMAGE = (
-    "registry.redhat.io/rhaii/vllm-cuda-rhel9"
-    "@sha256:ad06abf3bb5235ebb5b2df84cd1b9fd09e823f0ff2eebfc82bb4590275ccfe0b"
+    "registry.redhat.io/rhaii/vllm-cuda-rhel9@sha256:ad06abf3bb5235ebb5b2df84cd1b9fd09e823f0ff2eebfc82bb4590275ccfe0b"
 )
 
 
@@ -47,7 +44,11 @@ def evalhub_evaluator_kserve(
     gpu_count: int = 1,
     memory: str = "8Gi",
     cpu: str = "2",
-    runtime_image: str = RHOAI_VLLM_IMAGE,
+    runtime_image: str = (  # noqa: E501
+        "registry.redhat.io/rhaii/vllm-cuda-rhel9@sha256:ad06abf3bb5235ebb5b2df84cd1b9fd09e823f0ff2eebfc82bb4590275ccfe0b"
+    ),
+    trust_remote_code: bool = False,
+    verify_tls: bool = False,
     isvc_ready_timeout: int = 600,
 ):
     """Evaluate a model via Eval Hub with a KServe InferenceService.
@@ -58,12 +59,14 @@ def evalhub_evaluator_kserve(
     benchmark evaluation. Both resources are cleaned up after completion.
 
     Args:
+        output_metrics: KFP Metrics artifact for evaluation scores.
+        output_results: KFP Artifact for full evaluation results JSON.
         evalhub_url: Eval Hub API endpoint (empty = skip evaluation).
         benchmarks: List of benchmark specs [{"provider_id": "...", "id": "..."}].
         collection_id: Eval Hub collection ID (alternative to benchmarks list).
         pvc_mount_path: Workspace PVC mount path (triggers KFP PVC mount).
         model_artifact: Model artifact from upstream training step.
-        model_path: HuggingFace model ID or local path (if no artifact).
+        model_path: Local filesystem path to model directory (if no artifact).
         evalhub_tenant: Eval Hub tenant / namespace header (X-Tenant).
         evalhub_auth_token: Bearer token for Eval Hub auth.
         evalhub_model_name: Display name for the model in Eval Hub.
@@ -76,6 +79,8 @@ def evalhub_evaluator_kserve(
         memory: Pod memory request/limit for the predictor (e.g. "8Gi", "32Gi").
         cpu: CPU request/limit for the predictor (e.g. "2").
         runtime_image: Container image for the ServingRuntime (RHOAI vLLM default).
+        trust_remote_code: Pass --trust-remote-code to vLLM (enables arbitrary code from model repos).
+        verify_tls: Verify TLS certificates for Eval Hub API calls (False for self-signed certs).
         isvc_ready_timeout: Max seconds to wait for InferenceService readiness.
     """
     import json
@@ -112,8 +117,12 @@ def _k8s_api(method, path, body=None):
             "Content-Type": "application/json",
         }
         resp = requests.request(
-            method, url, headers=headers,
-            json=body, verify=SA_CA_PATH, timeout=30,
+            method,
+            url,
+            headers=headers,
+            json=body,
+            verify=SA_CA_PATH,
+            timeout=30,
         )
         if resp.status_code >= 400:
             logger.error(f"K8s API {method} {path} -> {resp.status_code}: {resp.text[:500]}")
@@ -123,6 +132,7 @@ def _get_own_pod(namespace):
         hostname = os.environ.get("HOSTNAME", "")
         if not hostname:
             import socket
+
             hostname = socket.gethostname()
         resp = _k8s_api("GET", f"/api/v1/namespaces/{namespace}/pods/{hostname}")
         if resp.status_code == 200:
@@ -151,21 +161,26 @@ def _find_workspace_pvc(pod_spec, model_path):
             for vm in c.get("volumeMounts", []):
                 vol_name = vm["name"]
                 mount_path = vm["mountPath"]
-                if vol_name in pvc_volumes and model_path.startswith(mount_path):
+                normalized_mount = mount_path.rstrip("/") + "/"
+                if vol_name in pvc_volumes and (model_path + "/").startswith(normalized_mount):
                     return pvc_volumes[vol_name], mount_path
 
-        raise RuntimeError(
-            f"Could not find workspace PVC for model path {model_path}. "
-            f"PVC volumes: {pvc_volumes}"
-        )
+        raise RuntimeError(f"Could not find workspace PVC for model path {model_path}. PVC volumes: {pvc_volumes}")
 
     # =========================================================================
     # KServe resource helpers
     # =========================================================================
     KSERVE_SR_API = "/apis/serving.kserve.io/v1alpha1"
     KSERVE_ISVC_API = "/apis/serving.kserve.io/v1beta1"
 
-    def _create_serving_runtime(namespace, name, image, served_model_name):
+    def _create_serving_runtime(namespace, name, image, served_model_name, enable_trust_remote_code=False):
+        vllm_args = [
+            "--port=8080",
+            "--model=/mnt/models",
+            f"--served-model-name={served_model_name}",
+        ]
+        if enable_trust_remote_code:
+            vllm_args.append("--trust-remote-code")
         sr = {
             "apiVersion": "serving.kserve.io/v1alpha1",
             "kind": "ServingRuntime",
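
Review note on the `_find_workspace_pvc` change in the hunk above: the new comparison appends a trailing slash to both sides so that a mount path only matches itself or its descendants. The sketch below isolates that comparison to show the case the old plain-prefix check got wrong. The helper names `old_prefix_match` and `path_is_under_mount` are hypothetical and do not appear in the component; the example paths are made up.

```python
def old_prefix_match(model_path: str, mount_path: str) -> bool:
    """Pre-change behaviour: a bare string-prefix comparison."""
    return model_path.startswith(mount_path)


def path_is_under_mount(model_path: str, mount_path: str) -> bool:
    """Post-change behaviour: compare on directory boundaries.

    Both sides get exactly one trailing slash, so "/mnt/workspace"
    no longer matches the sibling directory "/mnt/workspace-old".
    """
    normalized_mount = mount_path.rstrip("/") + "/"
    return (model_path + "/").startswith(normalized_mount)


if __name__ == "__main__":
    sibling = "/mnt/workspace-old/model"
    print(old_prefix_match(sibling, "/mnt/workspace"))
    print(path_is_under_mount(sibling, "/mnt/workspace"))
```

With two PVCs mounted at paths that share a string prefix, the old check could return whichever volume happened to be iterated first; the normalized check only accepts the volume that actually contains the model directory.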
@@ -187,21 +202,18 @@ def _create_serving_runtime(namespace, name, image, served_model_name):
                     "prometheus.io/path": "/metrics",
                     "prometheus.io/port": "8080",
                 },
-                "containers": [{
-                    "name": "kserve-container",
-                    "image": image,
-                    "command": ["python", "-m", "vllm.entrypoints.openai.api_server"],
-                    "args": [
-                        "--port=8080",
-                        "--model=/mnt/models",
-                        f"--served-model-name={served_model_name}",
-                        "--trust-remote-code",
-                    ],
-                    "env": [
-                        {"name": "HF_HOME", "value": "/tmp/hf_home"},
-                    ],
-                    "ports": [{"containerPort": 8080, "protocol": "TCP"}],
-                }],
+                "containers": [
+                    {
+                        "name": "kserve-container",
+                        "image": image,
+                        "command": ["python", "-m", "vllm.entrypoints.openai.api_server"],
+                        "args": vllm_args,
+                        "env": [
+                            {"name": "HF_HOME", "value": "/tmp/hf_home"},
+                        ],
+                        "ports": [{"containerPort": 8080, "protocol": "TCP"}],
+                    }
+                ],
                 "multiModel": False,
                 "supportedModelFormats": [
                     {"autoSelect": True, "name": "vLLM"},
@@ -214,8 +226,7 @@ def _create_serving_runtime(namespace, name, image, served_model_name):
         logger.info(f"Created ServingRuntime {name}")
         return resp.json()
 
-    def _create_inference_service(namespace, name, runtime_name, pvc_name,
-                                  model_relative_path, n_gpu, mem, n_cpu):
+    def _create_inference_service(namespace, name, runtime_name, pvc_name, model_relative_path, n_gpu, mem, n_cpu):
         storage_uri = f"pvc://{pvc_name}/{model_relative_path}"
 
         isvc = {
@@ -278,7 +289,7 @@ def _wait_for_isvc_ready(namespace, name, timeout_s=600):
                         duration = time.time() - start
                         logger.info(f"InferenceService {name} ready in {duration:.1f}s")
                         return duration
-                reasons = [f"{c['type']}={c.get('status','?')}({c.get('reason','')})" for c in conditions]
+                reasons = [f"{c['type']}={c.get('status', '?')}({c.get('reason', '')})" for c in conditions]
                 logger.info(f"  ISVC conditions: {', '.join(reasons) if reasons else 'none yet'}")
             time.sleep(15)
         raise TimeoutError(f"InferenceService {name} did not become Ready within {timeout_s}s")
@@ -307,8 +318,15 @@ def _get_isvc_url(namespace, name):
 
     def _cleanup_kserve(namespace, sr_name, isvc_name):
         logger.info(f"Cleaning up InferenceService {isvc_name} and ServingRuntime {sr_name}")
-        _k8s_api("DELETE", f"{KSERVE_ISVC_API}/namespaces/{namespace}/inferenceservices/{isvc_name}")
-        _k8s_api("DELETE", f"{KSERVE_SR_API}/namespaces/{namespace}/servingruntimes/{sr_name}")
+        for kind, api, name in [
+            ("InferenceService", KSERVE_ISVC_API, isvc_name),
+            ("ServingRuntime", KSERVE_SR_API, sr_name),
+        ]:
+            resp = _k8s_api("DELETE", f"{api}/namespaces/{namespace}/{kind.lower()}s/{name}")
+            if resp.status_code >= 400 and resp.status_code != 404:
+                logger.warning(f"Failed to delete {kind} {name}: {resp.status_code} {resp.text[:200]}")
+            else:
+                logger.info(f"Deleted {kind} {name}")
         logger.info(f"Cleanup complete for {isvc_name} / {sr_name}")
 
     # =========================================================================
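
Review note on the `_cleanup_kserve` change in the hunk above: the loop derives each resource's plural URL segment from its kind (`kind.lower() + "s"`) and treats a 404 on DELETE as success, since the resource is already gone. The sketch below reproduces just those two pieces without making any HTTP calls. The helper names `delete_paths` and `delete_failed` are hypothetical; the API prefixes are copied from the component.

```python
KSERVE_SR_API = "/apis/serving.kserve.io/v1alpha1"
KSERVE_ISVC_API = "/apis/serving.kserve.io/v1beta1"


def delete_paths(namespace: str, sr_name: str, isvc_name: str) -> list[str]:
    """Build the DELETE paths in the order the component issues them.

    The InferenceService is deleted first, then the ServingRuntime.
    """
    targets = [
        ("InferenceService", KSERVE_ISVC_API, isvc_name),
        ("ServingRuntime", KSERVE_SR_API, sr_name),
    ]
    return [
        f"{api}/namespaces/{namespace}/{kind.lower()}s/{name}"
        for kind, api, name in targets
    ]


def delete_failed(status_code: int) -> bool:
    """A DELETE counts as failed only for error codes other than 404."""
    return status_code >= 400 and status_code != 404


if __name__ == "__main__":
    for path in delete_paths("demo-ns", "demo-sr", "demo-isvc"):
        print(path)
```

Deriving the plural by appending "s" works for these two kinds; a kind with an irregular plural would need an explicit mapping instead.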
@@ -372,11 +390,11 @@ def _cleanup_kserve(namespace, sr_name, isvc_name):
 
     model_relative_path = final_model_path
     if final_model_path.startswith(workspace_mount):
-        model_relative_path = final_model_path[len(workspace_mount):].lstrip("/")
+        model_relative_path = final_model_path[len(workspace_mount) :].lstrip("/")
     logger.info(f"Model relative path in PVC: {model_relative_path}")
     logger.info(f"storageUri will be: pvc://{workspace_pvc_name}/{model_relative_path}")
 
-    _create_serving_runtime(namespace, sr_name, runtime_image, resolved_model_name)
+    _create_serving_runtime(namespace, sr_name, runtime_image, resolved_model_name, trust_remote_code)
 
     try:
         _create_inference_service(
@@ -443,7 +461,7 @@ def _cleanup_kserve(namespace, sr_name, isvc_name):
         logger.info(f"Submitting evaluation job to {submit_url}")
         logger.info(f"Config: {json.dumps(eval_config, indent=2)}")
 
-        resp = requests.post(submit_url, json=eval_config, headers=headers, timeout=30, verify=False)
+        resp = requests.post(submit_url, json=eval_config, headers=headers, timeout=30, verify=verify_tls)
         if resp.status_code not in (200, 201, 202):
             raise RuntimeError(f"Eval Hub returned {resp.status_code}: {resp.text}")
 
@@ -462,7 +480,7 @@ def _cleanup_kserve(namespace, sr_name, isvc_name):
             time.sleep(evalhub_poll_interval)
 
             try:
-                resp = requests.get(job_url, headers=headers, timeout=30, verify=False)
+                resp = requests.get(job_url, headers=headers, timeout=30, verify=verify_tls)
                 if resp.status_code != 200:
                     logger.warning(f"Poll returned {resp.status_code}, retrying...")
                     continue
@@ -484,7 +502,7 @@ def _cleanup_kserve(namespace, sr_name, isvc_name):
             logger.error(f"Evaluation timed out after {evalhub_timeout}s")
             try:
                 cancel_url = f"{evalhub_url.rstrip('/')}/api/v1/evaluations/jobs/{job_id}"
-                requests.delete(cancel_url, headers=headers, timeout=10, verify=False)
+                requests.delete(cancel_url, headers=headers, timeout=10, verify=verify_tls)
                 logger.info(f"Cancelled job {job_id}")
             except Exception:
                 pass
 
@@ -1,3 +1,4 @@
+---
 name: evalhub_kserve
 stability: experimental
 dependencies:
 
@@ -39,6 +39,8 @@ def test_component_has_expected_parameters(self):
             "memory",
             "cpu",
             "runtime_image",
+            "trust_remote_code",
+            "verify_tls",
             "isvc_ready_timeout",
         ]
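
Review note on the test hunk above: it extends the expected-parameter list with `trust_remote_code` and `verify_tls`. How the real test inspects the KFP component is not shown in this diff, so the following is only a sketch of the same idea using `inspect.signature` on a plain function. `evalhub_evaluator_kserve_stub` is a hypothetical stand-in that carries only the tail of the real signature, and its `runtime_image` default is a placeholder.

```python
import inspect


def evalhub_evaluator_kserve_stub(
    gpu_count: int = 1,
    memory: str = "8Gi",
    cpu: str = "2",
    runtime_image: str = "example.invalid/vllm:placeholder",  # placeholder, not the real image
    trust_remote_code: bool = False,
    verify_tls: bool = False,
    isvc_ready_timeout: int = 600,
):
    """Hypothetical stand-in for the component's Python function."""


def missing_parameters(func, expected: list[str]) -> list[str]:
    """Return the expected parameter names that func does not declare."""
    actual = inspect.signature(func).parameters
    return [name for name in expected if name not in actual]


def default_of(func, name: str):
    """Return the declared default value of one parameter."""
    return inspect.signature(func).parameters[name].default


if __name__ == "__main__":
    expected = ["runtime_image", "trust_remote_code", "verify_tls", "isvc_ready_timeout"]
    print(missing_parameters(evalhub_evaluator_kserve_stub, expected))
    print(default_of(evalhub_evaluator_kserve_stub, "verify_tls"))
```

A check on defaults, not just names, would also catch an accidental flip of `trust_remote_code` back to enabled, which is the behaviour this change removes.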
`44`	`46`