Commit 6d5b8e0

update integration docs
1 parent 8f4700e commit 6d5b8e0

1 file changed

Lines changed: 111 additions & 43 deletions

docs/integrations/frameworks/openai.mdx

@@ -4,15 +4,22 @@ title: OpenAI
44
sidebar_label: OpenAI
55
---
66

7+
import Tabs from "@theme/Tabs";
8+
import TabItem from "@theme/TabItem";
9+
import VideoDisplayer from "@site/src/components/VideoDisplayer";
10+
711
## Quick Summary
812

9-
DeepEval streamlines the process of evaluating and tracing your OpenAI applications through an **OpenAI client wrapper**, and supports both end-to-end and component-level evaluations.
13+
DeepEval streamlines the process of evaluating and tracing your OpenAI applications through an **OpenAI client wrapper**, and supports end-to-end evaluations, component-level evaluations, and online evaluations in production.
1014

1115
## End-to-End Evaluation
1216

13-
To begin evaluating your OpenAI application, attach the `@observe` decorator to your LLM application, and replace your OpenAI client with DeepEval's OpenAI client.
17+
To begin evaluating your OpenAI application, simply replace your OpenAI client with DeepEval's OpenAI client, and pass in the `metrics` you wish to use.
18+
19+
<Tabs>
20+
<TabItem value="chat-completions" label="Chat Completions">
1421

15-
```python showLineNumbers {2,19}
22+
```python showLineNumbers {2,8,17}
1623
from deepeval.tracing import observe
1724
from deepeval.openai import OpenAI
1825

@@ -33,71 +40,132 @@ client.chat.completions.create(
3340
)
3441
```
3542

36-
## Component-level Evaluation
43+
</TabItem>
44+
<TabItem value="response" label="Responses">
45+
46+
```python showLineNumbers {2,8,15}
47+
from deepeval.tracing import observe
48+
from deepeval.openai import OpenAI
49+
50+
from deepeval.metrics import AnswerRelevancyMetric, BiasMetric
51+
from deepeval.dataset import Golden
52+
from deepeval import evaluate
3753

38-
DeepEval's OpenAI integration also supports component-level evaluations. As with end-to-end evaluation, simply add the `@observe` decorator to any OpenAI function component in your LLM application, and replace your existing OpenAI clients with DeepEval's OpenAI client.
54+
client = OpenAI()
55+
user_input = "Hello, how are you?"
3956

40-
```python showLineNumbers {2,25}
57+
client.responses.create(
58+
    model="gpt-4o",
59+
    instructions="You are a helpful assistant.",
60+
    input=user_input,
61+
    metrics=[AnswerRelevancyMetric(), BiasMetric()]
62+
)
63+
```
64+
65+
</TabItem>
66+
</Tabs>
67+
68+
DeepEval's OpenAI client accepts **FIVE** optional parameters in its chat completions and responses methods:
69+
70+
- [Optional] `metrics`: a list of metrics of type `BaseMetric`.
71+
- [Optional] `expected_output`: a string specifying the expected output of your OpenAI generation.
72+
- [Optional] `retrieval_context`: a list of strings, representing the retrieved contexts to be passed into your OpenAI generation.
73+
- [Optional] `context`: a list of strings, representing the ideal retrieved contexts to be passed into your OpenAI generation.
74+
- [Optional] `expected_tools`: a list of strings, representing the expected tools to be called during OpenAI generation.
75+
76+
:::info
77+
DeepEval’s OpenAI client automatically extracts the `input` and `actual_output` from each API response, enabling you to use metrics like **Answer Relevancy** out of the box. For metrics such as **Faithfulness**, which rely on additional parameters such as retrieval context, you’ll need to explicitly set these parameters when invoking the client.
78+
:::
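
As a rough sketch of how these optional parameters might be assembled for a retrieval-based metric, the hypothetical helper below (`build_eval_kwargs` is an illustrative name, not part of DeepEval) collects the usual chat completion arguments together with `retrieval_context` and `expected_output`. It only builds a dictionary and makes no API call. The assumption is that the result is then unpacked into DeepEval's `client.chat.completions.create(...)` alongside `metrics`.

```python
from typing import Any, Dict, List, Optional


def build_eval_kwargs(
    question: str,
    retrieved_docs: List[str],
    expected_output: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: gather request and test-case parameters in one place."""
    kwargs: Dict[str, Any] = {
        "model": "gpt-4o",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": question},
        ],
        # The retrieval context is not extracted automatically, so set it explicitly.
        "retrieval_context": list(retrieved_docs),
    }
    # Only include expected_output when the caller actually provides one.
    if expected_output is not None:
        kwargs["expected_output"] = expected_output
    return kwargs


docs = ["Paris is the capital and most populous city of France."]
kwargs = build_eval_kwargs("What is the capital of France?", docs, expected_output="Paris")
print(sorted(kwargs))
```

With DeepEval installed, these could then be passed as `client.chat.completions.create(**kwargs, metrics=[...])`, with a retrieval-based metric in the list.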
79+
80+
## Using OpenAI in Component-Level Evaluation
81+
82+
You can also use DeepEval's OpenAI client **within component-level evaluations**. To set up component-level evaluations, add the `@observe` decorator to your LLM application's components, and replace existing OpenAI clients with DeepEval's OpenAI client, passing in the metrics you wish to use.
83+
84+
<Tabs>
85+
<TabItem value="chat-completions" label="Chat Completions">
86+
87+
```python showLineNumbers {2,17,24}
4188
from deepeval.tracing import observe
4289
from deepeval.openai import OpenAI
4390

44-
from deepeval.metrics import AnswerRelevancyMetric
91+
from deepeval.metrics import AnswerRelevancyMetric, BiasMetric
4592
from deepeval.dataset import Golden
4693
from deepeval import evaluate
4794

48-
@observe(type="agent", metrics=[AnswerRelevancyMetric()])
49-
def llm_app(input: str):
95+
@observe()
96+
def retrieve_docs(query):
97+
    return [
98+
        "Paris is the capital and most populous city of France.",
99+
        "It has been a major European center of finance, diplomacy, commerce, and science."
100+
    ]
50101

51-
    # OpenAI integration
102+
@observe()
103+
def llm_app(input):
52104
    client = OpenAI()
53105
    response = client.chat.completions.create(
54106
        model="gpt-4o",
55107
        messages=[
56108
            {"role": "system", "content": "You are a helpful assistant."},
57-
            {"role": "user", "content": input}
109+
            {"role": "user", "content": "\n".join(retrieve_docs(input)) + "\n\nQuestion: " + input}
58110
        ],
59-
        metrics=[BiasMetric()]
111+
        metrics=[AnswerRelevancyMetric(), BiasMetric()]
60112
    )
61-
62-
    # Response
63113
    return response.choices[0].message.content
64114

65-
evaluate(observed_callback=llm_app, goldens=[Golden("Hello, how are you?")])
115+
evaluate(observed_callback=llm_app, goldens=[Golden(input="Tell me about Paris")])
66116
```
67117

68-
## How it works
118+
</TabItem>
119+
<TabItem value="responses" label="Responses">
69120

70-
When you integrate DeepEval’s OpenAI client, DeepEval automatically:
121+
```python showLineNumbers {2,17,22}
122+
from deepeval.tracing import observe
123+
from deepeval.openai import OpenAI
71124

72-
- Populates span-level and trace-level `LLMTestCase`s with inputs, outputs, and tool calls from OpenAI
73-
- Records span-level `LLMAttributes` including input, output, and token usage
74-
- Logs hyperparameters such as model and system prompt for experiment analysis
75-
- Converts `BaseSpan`s to `LlmSpan`s without the need to define a span type
125+
from deepeval.metrics import AnswerRelevancyMetric, BiasMetric
126+
from deepeval.dataset import Golden
127+
from deepeval import evaluate
76128

77-
### Evaluating Retrieval
129+
@observe()
130+
def retrieve_docs(query):
131+
    return [
132+
        "Paris is the capital and most populous city of France.",
133+
        "It has been a major European center of finance, diplomacy, commerce, and science."
134+
    ]
78135

79-
Since retrieval context is not available through the OpenAI API, DeepEval **cannot** automatically populate the retrieval context field in the `LLMTestCase`.
80-
:::info
81-
To evaluate retrieval-based metrics like Contextual Relevancy, you'll need to manually populate and set the test case using `update_current_span`
136+
@observe()
137+
def llm_app(input):
138+
    client = OpenAI()
139+
    response = client.responses.create(
140+
        model="gpt-4o",
141+
        instructions="You are a helpful assistant.",
142+
        input=input,
143+
        metrics=[AnswerRelevancyMetric(), BiasMetric()]
144+
    )
145+
    return response.output_text
82146

83-
```python showLineNumbers
84-
...
147+
evaluate(observed_callback=llm_app, goldens=[Golden(input="Tell me about Paris")])
148+
```
85149

86-
client = OpenAI()
87-
user_input = "Hello, how are you?"
150+
</TabItem>
88151

89-
client.chat.completions.create(
90-
model="gpt-4o",
91-
messages=[
92-
{"role": "system", "content": "You are a helpful assistant."},
93-
{"role": "user", "content": user_input}
94-
],
95-
metrics=[AnswerRelevancyMetric(), BiasMetric()]
96-
# Optional test case parameters
97-
expected_output="Hello, how are you?",
98-
retrieval_context=["Document 1", "Document 2", "Document 3"]
99-
context="Document 1"
100-
)
101-
```
152+
</Tabs>
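
Note that `retrieve_docs` in the examples above returns a list of strings, and Python cannot concatenate a list with a string, so the documents need to be joined before the question is appended. A minimal sketch of one way to do this, where `format_prompt` is a hypothetical helper name and the newline-separated layout is just one possible choice:

```python
from typing import List


def format_prompt(docs: List[str], question: str) -> str:
    """Hypothetical helper: join retrieved documents and append the question."""
    context = "\n".join(docs)
    return f"{context}\n\nQuestion: {question}"


docs = [
    "Paris is the capital and most populous city of France.",
    "It has been a major European center of finance, diplomacy, commerce, and science.",
]
prompt = format_prompt(docs, "Tell me about Paris")
print(prompt)
```

The resulting string can be used as the `content` of the user message, or as the `input` of a Responses call.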
102153

103-
:::
154+
When used inside `@observe` components, DeepEval’s OpenAI client automatically:
155+
156+
- Generates an LLM span for every OpenAI API call, including nested Tool spans for any tool invocations.
157+
- Attaches an `LLMTestCase` to each generated LLM span, capturing inputs, outputs, and tools called.
158+
- Records span-level `LLMAttributes`, such as the input prompt, generated output, and token usage.
159+
- Logs hyperparameters such as model name and system prompt for comprehensive experiment analysis.
160+
161+
<div style={{ margin: "2rem 0" }}>
162+
<VideoDisplayer
163+
src="https://deepeval-docs.s3.us-east-1.amazonaws.com/integrations:frameworks:openai.mp4"
164+
label="OpenAI Integration"
165+
confidentUrl="/llm-tracing/integrations/openai"
166+
/>
167+
</div>
168+
169+
## Online Evaluation in Production
170+
171+
...To be documented
