docs/docs/llm_as_judge.rst
+5 −5
@@ -46,7 +46,7 @@ An LLM as a Judge metric consists of several essential components:
1. The judge model, such as *Llama-3-8B-Instruct* or *gpt-3.5-turbo*, which evaluates the performance of other models.
2. The platform responsible for executing the judge model, such as Huggingface, the OpenAI API, and IBM's deployment platforms WatsonX and RITS.
Many of these model and platform combinations are already predefined in our catalog. The models are prefixed by *metrics.llm_as_judge.direct*, followed by the platform and the model name.
- For instance, *metrics.llm_as_judge.direct.rits.llama3_1_70b* refers to *llama3 70B* model that uses RITS deployment service.
+ For instance, *metrics.llm_as_judge.direct.rits.llama3_3_70b* refers to the *llama3 70B* model served through the RITS deployment service (a usage sketch follows this list).
3. The criteria used to evaluate the model's response. There are predefined criteria in the catalog, and the user can also define custom criteria.
Each criterion specifies fine-grained options that help steer the model to evaluate the response more precisely.
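
As a rough illustration of how a catalog judge entry and a predefined criterion might be combined into a single metric reference (a minimal sketch; the bracket-override syntax, the criteria catalog path, and the *context_fields* parameter are assumptions for illustration, not part of this change):

.. code-block:: python

   # Hypothetical sketch: pick the judge model from the catalog and attach a
   # predefined criterion plus the fields the judge should see as context.
   criterion = "metrics.llm_as_judge.direct.criteria.answer_relevance"  # assumed catalog path
   judge_metric = (
       "metrics.llm_as_judge.direct.rits.llama3_3_70b"
       f"[criteria={criterion}, context_fields=[question]]"
   )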
@@ -86,7 +86,7 @@ We pass the criteria to the judge model's metric as criteria and the question as
Once the metric is created, a dataset is created for the appropriate task.
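
A minimal sketch of what that step could look like, assuming Unitxt's top-level *create_dataset* and *evaluate* helpers and an open-QA task from the catalog (the task name, fields, and result accessor below are illustrative):

.. code-block:: python

   from unitxt import create_dataset, evaluate  # assumed top-level helpers

   data = [{"question": "How is the weather today?"}]          # illustrative instance
   predictions = ["It is sunny, around 25 degrees Celsius."]   # model output to be judged

   # Build the dataset for the task and attach the judge metric sketched above.
   dataset = create_dataset(
       task="tasks.qa.open",      # assumed task catalog entry
       test_set=data,
       metrics=[judge_metric],
       split="test",
   )
   results = evaluate(predictions=predictions, data=dataset)
   print(results.global_scores)   # assumed accessor for the aggregated judge scores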
@@ -155,13 +155,13 @@ Below is an example where the user mandates that the model respond with the temp
End to end Direct example
----------------------------
Unitxt can also obtain a model's responses for a given dataset and then run LLM-as-a-judge evaluations on those responses.
- Here, we will get *llama-3.2 1B* instruct's responses and then evaluate them for answer relevance, coherence and conciseness using *llama3_1_70b* judge model
+ Here, we will get *llama-3.2 1B Instruct*'s responses and then evaluate them for answer relevance, coherence, and conciseness using the *llama3_3_70b* judge model.
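
A rough end-to-end sketch of that flow, assuming the Huggingface-pipeline inference engine and the catalog/criteria paths used above (model identifiers, criteria paths, and parameters are illustrative, not prescribed by this change):

.. code-block:: python

   from unitxt import create_dataset, evaluate
   from unitxt.inference import HFPipelineBasedInferenceEngine  # assumed engine class

   criteria = [
       "metrics.llm_as_judge.direct.criteria.answer_relevance",
       "metrics.llm_as_judge.direct.criteria.coherence",
       "metrics.llm_as_judge.direct.criteria.conciseness",
   ]  # assumed criteria catalog paths
   metrics = [
       "metrics.llm_as_judge.direct.rits.llama3_3_70b"
       f"[criteria={c}, context_fields=[question]]"
       for c in criteria
   ]

   data = [{"question": "Summarize the main idea of the article in one sentence."}]
   dataset = create_dataset(task="tasks.qa.open", test_set=data, metrics=metrics, split="test")

   # Get llama-3.2 1B Instruct's responses, then judge them with the 70B model.
   model = HFPipelineBasedInferenceEngine(
       model_name="meta-llama/Llama-3.2-1B-Instruct", max_new_tokens=256
   )
   predictions = model.infer(dataset)

   results = evaluate(predictions=predictions, data=dataset)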