- Runtime: ~15 minutes
- Difficulty: Easy
TrustyAI's LM-Eval framework brings popular open-source evaluation toolkits to OpenShift AI. Currently, we support the lm-evaluation-harness, with more toolkits on the way!
In this example, we'll deploy a Phi-3 model and run an ARC-Easy evaluation against it. ARC (the AI2 Reasoning Challenge) is an immensely popular benchmark that measures a model against a set of grade-school-level, multiple-choice science questions.
You'll need to set up your cluster for a GPU deployment.
Create a `DSCInitialization` named `default-dsci` and subsequently a `DataScienceCluster` named `default-dsc` by going to Operators > Installed Operators > Red Hat OpenShift AI / Open Data Hub Operator in the OpenShift console. Then:

- Navigate to the `DSCInitialization` tab and create the default `DSCInitialization`
- Navigate to the `Data Science Cluster` tab and create the default `DataScienceCluster`
Note: this demo was most recently tested and verified on RHOAI 2.25.0
By default, TrustyAI prevents evaluation jobs from accessing the internet or running downloaded code. A typical evaluation job will download two items from Huggingface:
- The dataset of the evaluation task, and any dataset processing code
- The tokenizer of your model
If you trust the source of your dataset and tokenizer, you can override TrustyAI's default setting. In our case, that means trusting the ARC-Easy dataset and the Phi-3 tokenizer, both downloaded from Huggingface.
If you are happy for TrustyAI to automatically download those two resources, run:

```shell
oc patch datasciencecluster default-dsc \
  -n redhat-ods-applications \
  --type merge \
  -p '{"spec":{"components":{"trustyai":{"eval":{"lmeval":{"permitCodeExecution":"allow","permitOnline":"allow"}}}}}}'
```

Wait for your `trustyai-service-operator-controller-manager` pod in the `redhat-ods-applications` namespace to restart, and then TrustyAI should be ready to go.
```shell
oc new-project model-namespace || oc project model-namespace
oc apply -f model_storage_container.yaml
```

The model container can take a while to spin up: it's downloading a Phi-3-mini from Huggingface and saving it into an emulated AWS data connection.
```shell
oc apply -f phi3.yaml
```

Wait for the model pod to spin up; it should look something like `phi3-predictor-XXXXXX`.
You can test the model by sending some inferences to it:
In one terminal, forward the model's port:

```shell
oc port-forward $(oc get pods -o name | grep phi3) 8080:8080
```

Then, in a second terminal, send a test prompt:

```shell
python3 ../common/prompt.py --url http://localhost:8080/v1/chat/completions --model phi3 --message "Hi, can you tell me about yourself?"
```

❗NOTE: `../common/prompt.py` is a Python script included in this repository for sending chat/completions requests to your deployed model. To run `prompt.py`, make sure the `requests` library is installed: `pip install requests`.
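If you'd rather not use the helper script, the same request can be built by hand: it's a standard OpenAI-style chat/completions payload, where `"phi3"` matches the model name deployed above. A minimal sketch (the `curl` invocation assumes the port-forward above is still running):

```shell
# The same request prompt.py sends, as a raw chat/completions payload
# (standard OpenAI-style schema; "phi3" matches the deployed model name).
BODY='{
  "model": "phi3",
  "messages": [
    {"role": "user", "content": "Hi, can you tell me about yourself?"}
  ]
}'

# Sanity-check that the payload is valid JSON before sending:
echo "$BODY" | jq -r '.model'

# With the port-forward running, send it to the model:
# curl -s http://localhost:8080/v1/chat/completions \
#   -H 'Content-Type: application/json' -d "$BODY"
```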
To start an evaluation, apply an LMEvalJob custom resource:
```shell
oc apply -f evaluation_job.yaml
```

Check out `evaluation_job.yaml` to learn more about the `LMEvalJob` specification.
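For reference, a minimal `LMEvalJob` for a demo like this might look something like the sketch below. Field names follow the TrustyAI `LMEvalJob` API, but the `base_url` and `tokenizer` values here are illustrative assumptions; the actual `evaluation_job.yaml` in this repository may differ:

```yaml
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: arc-easy-eval-job
spec:
  model: local-completions            # evaluate an already-deployed, OpenAI-compatible endpoint
  modelArgs:
    - name: model
      value: phi3                     # must match the served model name
    - name: base_url                  # illustrative in-cluster URL
      value: http://phi3-predictor.model-namespace.svc.cluster.local:8080/v1/completions
    - name: tokenizer                 # downloaded from Huggingface
      value: microsoft/Phi-3-mini-4k-instruct
  taskList:
    taskNames:
      - arc_easy
  logSamples: true
  allowOnline: true                   # permit the dataset/tokenizer downloads discussed above
  allowCodeExecution: true
```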
Note: the evaluation job container image is quite large, so the first evaluation job you run on your cluster might take a while to start up
If everything has worked, you should see a pod called `arc-easy-eval-job` running in your namespace.
You can watch the progress of your evaluation job by running:
```shell
oc logs -f arc-easy-eval-job
```

After the evaluation finishes (it took about 8.5 minutes on my cluster), you can take a look at the results. These are stored in the `status.results` field of the `LMEvalJob` resource:
```shell
oc get LMEvalJob arc-easy-eval-job -o template --template '{{.status.results}}' | jq .results
```

returns:

```json
{
  "arc_easy": {
    "alias": "arc_easy",
    "acc,none": 0.8186026936026936,
    "acc_stderr,none": 0.007907153952801706,
    "acc_norm,none": 0.7836700336700336,
    "acc_norm_stderr,none": 0.00844876352205705
  }
}
```

Now you're free to play around with evaluations! You can see the full list of evaluations supported by lm-evaluation-harness here.