DeepEval streamlines the process of evaluating and tracing your OpenAI applications through an **OpenAI client wrapper**, and supports both end-to-end and component-level evaluations.
13
+
DeepEval streamlines the process of evaluating and tracing your OpenAI applications through an **OpenAI client wrapper**, and supports end-to-end evaluations, component-level evaluations, and online evaluations in production.
10
14
11
15
## End-to-End Evaluation
12
16
13
-
To begin evaluating your OpenAI application, attach the `@observe` decorator to your LLM application, and replace your OpenAI client with DeepEval's OpenAI client.
17
+
To begin evaluating your OpenAI application, simply replace your OpenAI client with DeepEval's OpenAI client, and pass in the `metrics` you wish to use.
from deepeval.metrics import AnswerRelevancyMetric, BiasMetric
51
+
from deepeval.dataset import Golden
52
+
from deepeval import evaluate
37
53
38
-
DeepEval's OpenAI integration also supports component-level evaluations. As with end-to-end evaluation, simply add the `@observe` decorator to any OpenAI function component in your LLM application, and replace your existing OpenAI clients with DeepEval's OpenAI client.
54
+
client = OpenAI()
55
+
user_input = "Hello, how are you?"
39
56
40
-
```python showLineNumbers {2,25}
57
+
client.responses.create(
58
+
    model="gpt-4o",
59
+
    instructions="You are a helpful assistant.",
60
+
    input=user_input,
61
+
    metrics=[AnswerRelevancyMetric(), BiasMetric()]
62
+
)
63
+
```
64
+
65
+
</TabItem>
66
+
</Tabs>
67
+
68
+
DeepEval's OpenAI client accepts **FIVE** optional parameters in its chat completion and response methods:
69
+
70
+
- [Optional] `metrics`: a list of metrics of type `BaseMetric`.
71
+
- [Optional] `expected_output`: a string specifying the expected output of your OpenAI generation.
72
+
- [Optional] `retrieval_context`: a list of strings, representing the retrieved contexts to be passed into your OpenAI generation.
73
+
- [Optional] `context`: a list of strings, representing the ideal retrieved contexts to be passed into your OpenAI generation.
74
+
- [Optional] `expected_tools`: a list of strings, representing the expected tools to be called during OpenAI generation.
75
+
76
+
:::info
77
+
DeepEval’s OpenAI client automatically extracts the `input` and `actual_output` from each API response, enabling you to use metrics like **Answer Relevancy** out of the box. For metrics such as **Faithfulness**—which rely on additional parameters such as retrieval context—you’ll need to explicitly set these parameters when invoking the client.
78
+
:::
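
As a minimal sketch of the point above, the snippet below passes `retrieval_context` alongside `metrics` so that a retrieval-dependent metric has the data it needs. `build_eval_kwargs` and `ask_with_faithfulness` are hypothetical helpers introduced only for illustration and are not part of DeepEval. The sketch assumes `FaithfulnessMetric` is importable from `deepeval.metrics` and that the client accepts the optional parameters exactly as listed above.

```python
from typing import Any, Dict, List

# The five optional evaluation parameters described in the list above.
EVAL_PARAMS = (
    "metrics",
    "expected_output",
    "retrieval_context",
    "context",
    "expected_tools",
)


def build_eval_kwargs(**params: Any) -> Dict[str, Any]:
    """Hypothetical helper: keep only the evaluation parameters that were set."""
    unknown = sorted(set(params) - set(EVAL_PARAMS))
    if unknown:
        raise ValueError(f"Unknown evaluation parameter(s): {unknown}")
    # Drop unset values so the client only receives what was provided.
    return {name: value for name, value in params.items() if value is not None}


def ask_with_faithfulness(question: str, retrieved_chunks: List[str]):
    """Hypothetical wrapper: one Responses call evaluated for faithfulness."""
    # Imported here so build_eval_kwargs can be used without deepeval installed.
    from deepeval.metrics import FaithfulnessMetric
    from deepeval.openai import OpenAI

    client = OpenAI()
    return client.responses.create(
        model="gpt-4o",
        instructions="You are a helpful assistant.",
        input=question,
        # Assumption: retrieval_context must be supplied explicitly because it
        # cannot be extracted from the API response.
        **build_eval_kwargs(
            metrics=[FaithfulnessMetric()],
            retrieval_context=retrieved_chunks,
        ),
    )
```

Calling `ask_with_faithfulness` requires an OpenAI API key and the `deepeval` package. The kwargs helper is only a convenience for omitting unset parameters, and passing the parameters directly to the client works equally well.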
79
+
80
+
## Using OpenAI in Component-Level Evaluation
81
+
82
+
You can also use DeepEval's OpenAI client **within component-level evaluations**. To set up component-level evaluations, add the `@observe` decorator to your LLM application's components, replace existing OpenAI clients with DeepEval's OpenAI client, and pass in the metrics you wish to use.
evaluate(observed_callback=llm_app, goldens=[Golden("Hello, how are you?")])
115
+
evaluate(observed_callback=llm_app, goldens=[Golden(input="Tell me about Paris")])
66
116
```
67
117
68
-
## How it works
118
+
</TabItem>
119
+
<TabItem value="responses" label="Responses">
69
120
70
-
When you integrate DeepEval’s OpenAI client, DeepEval automatically:
121
+
```python showLineNumbers {2,17,22}
122
+
from deepeval.tracing import observe
123
+
from deepeval.openai import OpenAI
71
124
72
-
- Populates span-level and trace-level `LLMTestCase`s with inputs, outputs, and tool calls from OpenAI
73
-
- Records span-level `LLMAttributes` including input, output, and token usage
74
-
- Logs hyperparameters such as model and system prompt for experiment analysis
75
-
- Converts `BaseSpan`s to `LlmSpan`s without the need to define a span type
125
+
from deepeval.metrics import AnswerRelevancyMetric, BiasMetric
126
+
from deepeval.dataset import Golden
127
+
from deepeval import evaluate
76
128
77
-
### Evaluating Retrieval
129
+
@observe()
130
+
def retrieve_docs(query):
131
+
    return [
132
+
        "Paris is the capital and most populous city of France.",
133
+
        "It has been a major European center of finance, diplomacy, commerce, and science."
134
+
    ]
78
135
79
-
Since retrieval context is not available through the OpenAI API, DeepEval **cannot** automatically populate the retrieval context field in the `LLMTestCase`.
80
-
:::info
81
-
To evaluate retrieval-based metrics like Contextual Relevancy, you'll need to manually populate and set the test case using `update_current_span`
136
+
@observe()
137
+
def llm_app(input):
138
+
    client = OpenAI()
139
+
    response = client.responses.create(
140
+
        model="gpt-4o",
141
+
        instructions="You are a helpful assistant.",
142
+
        input=input,
143
+
        metrics=[AnswerRelevancyMetric(), BiasMetric()]
144
+
    )
145
+
    return response.output_text
82
146
83
-
```python showLineNumbers
84
-
...
147
+
evaluate(observed_callback=llm_app, goldens=[Golden(input="Tell me about Paris")])
148
+
```
85
149
86
-
client = OpenAI()
87
-
user_input = "Hello, how are you?"
150
+
</TabItem>
88
151
89
-
client.chat.completions.create(
90
-
    model="gpt-4o",
91
-
    messages=[
92
-
        {"role": "system", "content": "You are a helpful assistant."},