opendatahub-io · BSynRedhat · Sep 5, 2025 · Sep 5, 2025 · Sep 8, 2025 · coderabbitai
diff --git a/assemblies/configuring-the-guardrails-orchestrator-service.adoc b/assemblies/configuring-the-guardrails-orchestrator-service.adoc
@@ -37,7 +37,6 @@ include::modules/configuring-the-opentelemetry-exporter.adoc[leveloffset=+1]
 include::modules/using-hugging-face-models-with-guardrails-orchestrator.adoc[leveloffset=+1]
 include::modules/configuring-the-guardrails-detector-hugging-face-serving-runtime.adoc[leveloffset=+1]
 include::modules/using-a-hugging-face-prompt-injection-detector-with-the-guardrails-orchestrator.adoc[leveloffset=+1]
-include::modules/using-guardrails-orchestrator-with-llama-stack.adoc[leveloffset=+1]
 
 
 

diff --git a/modules/running-custom-evaluations-with-LMEval-and-llama-stack.adoc b/modules/running-custom-evaluations-with-LMEval-and-llama-stack.adoc
@@ -0,0 +1,163 @@
+:_module-type: PROCEDURE
+
+ifdef::context[:parent-context: {context}]
+[id="running-custom-evaluations-with-LMEval-and-llama-stack_{context}"]
+= Running custom evaluations with LMEval and Llama Stack
+
+[role='_abstract']
+This example demonstrates how to use the link:https://github.com/trustyai-explainability/llama-stack-provider-lmeval[LMEval Llama Stack external eval provider] to evaluate a language model with a custom dataset. Creating a custom task is useful for evaluating specific model knowledge and behavior. 
+The process involves three steps: uploading the task dataset to your {productname-short} cluster, registering it as a custom benchmark dataset with Llama Stack, and running a benchmark evaluation job on a language model.
+
+.Prerequisites
+
+ifdef::upstream[]
+* You have installed {productname-long}, version 2.29 or later.
+endif::[]
+ifndef::upstream[]
+* You have installed {productname-long}, version 2.20 or later.
+endif::[]
+
+* You have cluster administrator privileges for your {productname-short} cluster.
+
+* You have downloaded and installed the {productname-short} command-line interface (CLI). For more information, see link:https://docs.redhat.com/en/documentation/openshift_container_platform/{ocp-latest-version}/html/cli_tools/openshift-cli-oc[Installing the OpenShift CLI^].
+
+* You have a large language model (LLM) for chat generation or text classification, or both, deployed in your namespace.
+
+* You have installed TrustyAI Operator in your {productname-short} cluster.
+
+* You have set KServe to Raw Deployment mode in your cluster.
+
+* You have a language model deployed on vLLM Serving Runtime in your {productname-short} cluster.
+
+
+.Procedure
+
+.Procedure
+
+. Upload your custom dataset to your OpenShift cluster by using a PersistentVolumeClaim (PVC) and a temporary pod. Create a PVC named `my-pvc` to store your dataset. Run the following command in your CLI, replacing `<MODEL_NAMESPACE>` with the namespace of your language model:
++
+[source,bash]
+----
+oc apply -n <MODEL_NAMESPACE> -f - << EOF
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: my-pvc
+spec:
+  accessModes:
+  - ReadWriteOnce
+  resources:
+    requests:
+      storage: 5Gi
+EOF
+----
+. Create a pod object named `dataset-storage-pod` to download the task dataset into the PVC. This pod is used to copy your dataset from your local machine to the {productname-short} cluster:
++
+[source,bash]
+----
+oc apply -n <MODEL_NAMESPACE> -f - << EOF
+apiVersion: v1
+kind: Pod
+metadata:
+  name: dataset-storage-pod
+spec:
+  containers:
+  - name: dataset-container
+    image: 'quay.io/prometheus/busybox:latest'
+    command: ["/bin/sh", "-c", "sleep 3600"]
+    volumeMounts:
+    - mountPath: "/data/upload_files"
+      name: dataset-storage
+  volumes:
+  - name: dataset-storage
+    persistentVolumeClaim:
+      claimName: my-pvc
+EOF
+----
+. Copy your locally stored task dataset to the pod to place it within the PVC. In this example, the dataset is named `example-dk-bench-input-bmo.jsonl` and is copied to `dataset-storage-pod` under the path `/data/upload_files/`. Replace `<MODEL_NAMESPACE>` with the namespace of the language model that you want to evaluate:
++
+[source,bash]
+----
+oc cp example-dk-bench-input-bmo.jsonl dataset-storage-pod:/data/upload_files/example-dk-bench-input-bmo.jsonl -n <MODEL_NAMESPACE>
+----
+. After the custom dataset is uploaded to the PVC, register it as a benchmark for evaluations. At a minimum, provide the following metadata: the TrustyAI LM-Eval Tasks GitHub web address, your branch, the commit hash, and the path of the custom task. Replace the `DK_BENCH_DATASET_PATH` value and any other metadata fields to match your configuration:
++
+[source,python]
+----
+client.benchmarks.register(
+    benchmark_id="trustyai_lmeval::dk-bench",
+    dataset_id="trustyai_lmeval::dk-bench",
+    scoring_functions=["string"],
+    provider_benchmark_id="string",
+    provider_id="trustyai_lmeval",
+    metadata={
+        "custom_task": {
+            "git": {
+                "url": "https://github.com/trustyai-explainability/lm-eval-tasks.git",
+                "branch": "main",
+                "commit": "8220e2d73c187471acbe71659c98bccecfe77958",
+                "path": "tasks/",
+            }
+        },
+        "env": {
+            # Path of the dataset inside the PVC
+            "DK_BENCH_DATASET_PATH": "/opt/app-root/src/hf_home/example-dk-bench-input-bmo.jsonl",
+            "JUDGE_MODEL_URL": "http://phi-3-predictor:8080/v1/chat/completions",
+            # For simplicity, we use the same model as the one being evaluated
+            "JUDGE_MODEL_NAME": "phi-3",
+            "JUDGE_API_KEY": "",
+        },
+        "tokenized_requests": False,
+        "tokenizer": "google/flan-t5-small",
+        "input": {"storage": {"pvc": "my-pvc"}}
+    },
+)
+
+----
+. Run a benchmark evaluation on your model:
++
+[source,python]
+----
+job = client.eval.run_eval(
+    benchmark_id="trustyai_lmeval::dk-bench",
+    benchmark_config={
+        "eval_candidate": {
+            "type": "model",
+            "model": "phi-3",
+            "provider_id": "trustyai_lmeval",
+            "sampling_params": {
+                "temperature": 0.7,
+                "top_p": 0.9,
+                "max_tokens": 256
+            },
+        },
+        "num_examples": 1000,
+    },
+)
+
+print(f"Starting job '{job.job_id}'")
+
+----
+. Monitor the status of the evaluation job. The job runs asynchronously, so you can check its status periodically:
++
+[source,python]
+----
+import time
+
+def get_job_status(job_id, benchmark_id):
+    return client.eval.jobs.status(job_id=job_id, benchmark_id=benchmark_id)
+
+while True:
+    job = get_job_status(job_id=job.job_id, benchmark_id="trustyai_lmeval::dk-bench")
+    print(job)
+
+    if job.status in ['failed', 'completed']:
+        print(f"Job ended with status: {job.status}")
+        break
+
+    time.sleep(20)
+----
+. Get the job results:
++
+[source,python]
+----
+import pprint
+pprint.pprint(client.eval.jobs.retrieve(job_id=job.job_id, benchmark_id="trustyai_lmeval::dk-bench").scores)
+----
+
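The registration step in this module passes a nested `metadata` dictionary to `client.benchmarks.register`. As a minimal sketch, assuming you would rather assemble that dictionary from a few parameters than edit it by hand, a small helper could look like the following. The function name `build_custom_task_metadata` and its parameters are hypothetical and are not part of the Llama Stack client or the LMEval provider; the keys and values mirror the example in the procedure.

```python
from typing import Any, Dict, Optional


def build_custom_task_metadata(
    git_url: str,
    branch: str,
    commit: str,
    task_path: str,
    dataset_path: str,
    pvc_name: str,
    extra_env: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """Assemble a metadata dict in the shape shown in the registration step.

    Hypothetical helper for illustration only.
    """
    # Environment variables for the evaluation job; the dataset path key
    # matches the one used in the procedure.
    env = {"DK_BENCH_DATASET_PATH": dataset_path}
    env.update(extra_env or {})
    return {
        "custom_task": {
            "git": {
                "url": git_url,
                "branch": branch,
                "commit": commit,
                "path": task_path,
            }
        },
        "env": env,
        "tokenized_requests": False,
        "tokenizer": "google/flan-t5-small",
        # Name of the PVC that holds the uploaded dataset
        "input": {"storage": {"pvc": pvc_name}},
    }


metadata = build_custom_task_metadata(
    git_url="https://github.com/trustyai-explainability/lm-eval-tasks.git",
    branch="main",
    commit="8220e2d73c187471acbe71659c98bccecfe77958",
    task_path="tasks/",
    dataset_path="/opt/app-root/src/hf_home/example-dk-bench-input-bmo.jsonl",
    pvc_name="my-pvc",
    extra_env={
        "JUDGE_MODEL_URL": "http://phi-3-predictor:8080/v1/chat/completions",
        "JUDGE_MODEL_NAME": "phi-3",
        "JUDGE_API_KEY": "",
    },
)
```

The resulting dictionary could then be passed as `metadata=metadata` in the `client.benchmarks.register(...)` call from the procedure.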
diff --git a/modules/using-guardrails-orchestrator-with-llama-stack.adoc b/modules/using-guardrails-orchestrator-with-llama-stack.adoc
@@ -23,7 +23,7 @@ This example demonstrates how to use the built-in link:https://github.com/trusty
 ifdef::upstream[]
 * You have installed {productname-long}, version 2.29 or later.
 endif::[]
-ifdef::upstream[]
+ifndef::upstream[]
 * You have installed {productname-long}, version 2.20 or later.
 endif::[]
 

diff --git a/...-llama-stack-external-eval-provider-with-lm-evaluation-harness-in-TrustyAI.adoc b/...-llama-stack-external-eval-provider-with-lm-evaluation-harness-in-TrustyAI.adoc
@@ -0,0 +1,167 @@
+:_module-type: PROCEDURE
+
+ifdef::context[:parent-context: {context}]
+[id="using-llama-stack-external-eval-provider-with-lm-evaluation-harness-in-TrustyAI_{context}"]
+= Using Llama Stack external eval provider with lm-evaluation-harness in TrustyAI
+
+[role='_abstract']
+This example demonstrates how to evaluate a language model in {productname-long} using the LMEval Llama Stack external eval provider in a Python workbench. To do this, configure a Llama Stack server to use the LMEval Eval provider, register a benchmark dataset, and run a benchmark evaluation job on a language model.
+
+.Prerequisites
+
+ifdef::upstream[]
+* You have installed {productname-long}, version 2.29 or later.
+endif::[]
+ifndef::upstream[]
+* You have installed {productname-long}, version 2.20 or later.
+endif::[]
+
+* You have cluster administrator privileges for your {productname-short} cluster.
+
+* You have downloaded and installed the {productname-short} command-line interface (CLI). For more information, see link:https://docs.redhat.com/en/documentation/openshift_container_platform/{ocp-latest-version}/html/cli_tools/openshift-cli-oc[Installing the OpenShift CLI^].
+
+* You have a large language model (LLM) for chat generation or text classification, or both, deployed in your namespace.
+
+* You have installed TrustyAI Operator in your {productname-short} cluster.
+
+* You have set KServe to Raw Deployment mode in your cluster.
+
+
+.Procedure
+
+. Configure a Python virtual environment for this tutorial in your `DataScienceCluster`:
++	
+[source,bash]
+----
+python3 -m venv .venv
+source .venv/bin/activate
+----
+. Install the link:https://pypi.org/project/llama-stack/[Llama Stack provider] from the Python Package Index (PyPI):
++
+[source,bash]
+----
+pip install llama-stack-provider-lmeval
+----
+. Configure the Llama Stack server. Set the variables to configure the runtime endpoint and namespace. The `VLLM_URL` value should be the `v1/completions` endpoint of your model route, and the `TRUSTYAI_LM_EVAL_NAMESPACE` value should be the namespace where your model is deployed. For example:
++
+[source,bash]
+----
+export VLLM_URL=https://$(oc get $(oc get ksvc -o name | grep predictor) --template='{{.status.url}}')/v1/completions
+export TRUSTYAI_LM_EVAL_NAMESPACE=$(oc project | cut -d '"' -f2)
+----
+. Download the `providers.d` provider configuration directory and the `run.yaml` execution file:
++
+[source,bash]
+----
+curl --create-dirs --output providers.d/remote/eval/trustyai_lmeval.yaml https://raw.githubusercontent.com/trustyai-explainability/llama-stack-provider-lmeval/refs/heads/main/providers.d/remote/eval/trustyai_lmeval.yaml
+
+curl --create-dirs --output run.yaml https://raw.githubusercontent.com/trustyai-explainability/llama-stack-provider-lmeval/refs/heads/main/run.yaml
+----
+. Start the Llama Stack server in a virtual environment, which uses port `8321` by default: 
++
+[source,bash]
+----
+llama stack run run.yaml --image-type venv
+----
+. Create a Python script in a Jupyter workbench and import the following libraries and modules to interact with the server and run an evaluation:
++
+[source,python]
+----
+import os
+import subprocess
+
+import logging
+
+import time
+import pprint 
+----
+. Start the Llama Stack Python client to interact with the running Llama Stack server:
++
+[source,python]
+----
+BASE_URL = "http://localhost:8321"
+
+def create_http_client():
+    from llama_stack_client import LlamaStackClient
+    return LlamaStackClient(base_url=BASE_URL)
+
+client = create_http_client()
+----
+. Print a list of the currently available benchmarks:
++
+[source,python]
+----
+benchmarks = client.benchmarks.list()
+
+print("Available benchmarks:")
+pprint.pprint(benchmarks)
+----
+. LMEval provides access to over 100 preconfigured evaluation datasets. Register the ARC-Easy benchmark, a dataset of grade-school level, multiple-choice science questions:
++
+[source,python]
+----
+client.benchmarks.register(
+    benchmark_id="trustyai_lmeval::arc_easy",
+    dataset_id="trustyai_lmeval::arc_easy",
+    scoring_functions=["string"],
+    provider_benchmark_id="string",
+    provider_id="trustyai_lmeval",
+    metadata={
+        "tokenizer": "google/flan-t5-small",
+        "tokenized_requests": False,
+    },
+)
+----
+. Verify that the benchmark has been registered successfully:
++
+[source,python]
+----
+benchmarks = client.benchmarks.list()
+print("Available benchmarks:")
+pprint.pprint(benchmarks)
+----
+. Run a benchmark evaluation job on your deployed model using the following input. Replace phi-3 with the name of your deployed model:
++
+[source,python]
+----
+job = client.eval.run_eval(
+    benchmark_id="trustyai_lmeval::arc_easy",
+    benchmark_config={
+        "eval_candidate": {
+            "type": "model",
+            "model": "phi-3",
+            "provider_id": "trustyai_lmeval",
+            "sampling_params": {
+                "temperature": 0.7,
+                "top_p": 0.9,
+                "max_tokens": 256
+            },
+        },
+        "num_examples": 1000,
+    },
+)
+
+print(f"Starting job '{job.job_id}'")
+----
+. Monitor the status of the evaluation job by using the following code. The job runs asynchronously, so you can check its status periodically:
++
+[source,python]
+----
+def get_job_status(job_id, benchmark_id):
+    return client.eval.jobs.status(job_id=job_id, benchmark_id=benchmark_id)
+
+while True:
+    job = get_job_status(job_id=job.job_id, benchmark_id="trustyai_lmeval::arc_easy")
+    print(job)
+
+    if job.status in ['failed', 'completed']:
+        print(f"Job ended with status: {job.status}")
+        break
+
+    time.sleep(20)
+----
+. Retrieve the evaluation job results once the job status reports back as `completed`:
++
+[source,python]
+----
+pprint.pprint(client.eval.jobs.retrieve(job_id=job.job_id, benchmark_id="trustyai_lmeval::arc_easy").scores)
+----
+
+
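The monitoring step in this module polls the job status every 20 seconds until it is `failed` or `completed`, with no upper bound on the wait. As a rough sketch, assuming you want the loop to give up after a fixed time, the polling logic could be wrapped as shown below. The function `wait_for_job`, its parameters, and the one-hour default timeout are hypothetical and not part of the Llama Stack client; the intermediate status names in the demonstration are placeholders.

```python
import time
from typing import Callable, Iterable


def wait_for_job(
    fetch_status: Callable[[], str],
    terminal_statuses: Iterable[str] = ("failed", "completed"),
    poll_seconds: float = 20.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status() until a terminal status is seen or the timeout elapses."""
    terminal = set(terminal_statuses)
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        if status in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Job still '{status}' after {timeout_seconds} seconds"
            )
        sleep(poll_seconds)


# Demonstration with a stand-in status source instead of a real server.
# The non-terminal status names here are placeholders.
fake_statuses = iter(["scheduled", "in_progress", "completed"])
final_status = wait_for_job(
    lambda: next(fake_statuses), poll_seconds=0, timeout_seconds=5
)
print(f"Job ended with status: {final_status}")
```

With the client from the procedure, `fetch_status` could be a lambda that returns `client.eval.jobs.status(job_id=job.job_id, benchmark_id="trustyai_lmeval::arc_easy").status`. Passing the status source as a callable keeps the loop testable without a running Llama Stack server.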
diff --git a/modules/using-llama-stack-with-trustyai.adoc b/modules/using-llama-stack-with-trustyai.adoc
@@ -0,0 +1,22 @@
+:_module-type: ASSEMBLY
+
+ifdef::context[:parent-context: {context}]
+[id="modules/using-llama-stack-with-trustyai_{context}"]
+= Using Llama Stack with TrustyAI
+
+This section contains tutorials for working with Llama Stack in TrustyAI. These tutorials demonstrate how to use various Llama Stack components and providers to evaluate and work with language models.
+
+The following sections describe how to work with Llama Stack and provide example use cases:
+
+* Using the Llama Stack external eval provider with lm-evaluation-harness in TrustyAI
+* Running custom evaluations with LMEval and Llama Stack
+* Using the trustyai-fms Guardrails Orchestrator with Llama Stack
+
+
+include::../modules/using-llama-stack-external-eval-provider-with-lm-evaluation-harness-in-TrustyAI.adoc[leveloffset=+1]
+include::../modules/running-custom-evaluations-with-LMEval-and-llama-stack.adoc[leveloffset=+1]
+include::../modules/using-guardrails-orchestrator-with-llama-stack.adoc[leveloffset=+1]
+
+
+
+ifdef::parent-context[:context: {parent-context}]
+ifndef::parent-context[:!context:]
diff --git a/monitoring-data-science-models.adoc b/monitoring-data-science-models.adoc
@@ -31,4 +31,7 @@ include::assemblies/evaluating-large-language-models.adoc[leveloffset=+1]
 
 include::assemblies/configuring-the-guardrails-orchestrator-service.adoc[leveloffset=+1]
 
+include::assemblies/using-llama-stack-with-trustyai.adoc[leveloffset=+1]
+
+// currently bias-monitoring is only in ODH
 include::assemblies/bias-monitoring-tutorial.adoc[leveloffset=+1]