update integration docs

kritinv · kritinv · commit 6d5b8e06980f · 2025-06-16T12:49:04.000+07:00
diff --git a/docs/integrations/frameworks/openai.mdx b/docs/integrations/frameworks/openai.mdx
@@ -4,15 +4,22 @@ title: OpenAI
 sidebar_label: OpenAI
 ---
 
+import Tabs from "@theme/Tabs";
+import TabItem from "@theme/TabItem";
+import VideoDisplayer from "@site/src/components/VideoDisplayer";
+
 ## Quick Summary
 
-DeepEval streamlines the process of evaluating and tracing your OpenAI applications through an **OpenAI client wrapper**, and supports both end-to-end and component-level evaluations.
+DeepEval streamlines the process of evaluating and tracing your OpenAI applications through an **OpenAI client wrapper**, and supports end-to-end evaluations, component-level evaluations, and online evaluations in production.
 
 ## End-to-End Evaluation
 
-To begin evaluating your OpenAI application, attach the `@observe` decorator to your LLM application, and replace your OpenAI client with DeepEval's OpenAI client.
+To begin evaluating your OpenAI application, simply replace your OpenAI client with DeepEval's OpenAI client, and pass in the `metrics` you wish to use.
+
+<Tabs>
+<TabItem value="chat-completions" label="Chat Completions">
 
-```python showLineNumbers {2,19}
+```python showLineNumbers {2,8,17}
 from deepeval.tracing import observe
 from deepeval.openai import OpenAI
 
@@ -33,71 +40,132 @@ client.chat.completions.create(
 )
 ```
 
-## Component-level Evaluation
+</TabItem>
+<TabItem value="response" label="Responses">
+
+```python showLineNumbers {2,8,15}
+from deepeval.tracing import observe
+from deepeval.openai import OpenAI
+
+from deepeval.metrics import AnswerRelevancyMetric, BiasMetric
+from deepeval.dataset import Golden
+from deepeval import evaluate
 
-DeepEval's OpenAI integration also supports component-level evaluations. As with end-to-end evaluation, simply add the `@observe` decorator to any OpenAI function component in your LLM application, and replace your existing OpenAI clients with DeepEval's OpenAI client.
+client = OpenAI()
+user_input = "Hello, how are you?"
 
-```python showLineNumbers {2,25}
+client.responses.create(
+    model="gpt-4o",
+    instructions="You are a helpful assistant.",
+    input=user_input,
+    metrics=[AnswerRelevancyMetric(), BiasMetric()]
+)
+```
+
+</TabItem>
+</Tabs>
+
+There are **FIVE** optional parameters when using the chat completions and responses methods of DeepEval's OpenAI client:
+
+- [Optional] `metrics`: a list of metrics of type `BaseMetric`.
+- [Optional] `expected_output`: a string specifying the expected output of your OpenAI generation.
+- [Optional] `retrieval_context`: a list of strings, representing the retrieved contexts to be passed into your OpenAI generation.
+- [Optional] `context`: a list of strings, representing the ideal retrieved contexts to be passed into your OpenAI generation.
+- [Optional] `expected_tools`: a list of strings, representing the expected tools to be called during OpenAI generation.
+
+:::info
+DeepEval’s OpenAI client automatically extracts the `input` and `actual_output` from each API response, enabling you to use metrics like **Answer Relevancy** out of the box. For metrics such as **Faithfulness**, which rely on additional parameters such as retrieval context, you’ll need to explicitly set these parameters when invoking the client.
+:::
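
As a rough sketch of how the optional parameters listed above might be gathered before a call, the hypothetical helper below keeps only the arguments that were actually supplied. `build_eval_kwargs` is an illustrative name and is not part of DeepEval; the sketch uses plain Python and assumes nothing beyond the five parameter names documented above.

```python
from __future__ import annotations

from typing import Any, Optional


def build_eval_kwargs(
    metrics: Optional[list] = None,
    expected_output: Optional[str] = None,
    retrieval_context: Optional[list[str]] = None,
    context: Optional[list[str]] = None,
    expected_tools: Optional[list[str]] = None,
) -> dict[str, Any]:
    """Hypothetical helper (not part of DeepEval): collect only the parameters that were set."""
    supplied = {
        "metrics": metrics,
        "expected_output": expected_output,
        "retrieval_context": retrieval_context,
        "context": context,
        "expected_tools": expected_tools,
    }
    # Drop anything left as None so the call only receives explicit values.
    return {name: value for name, value in supplied.items() if value is not None}


eval_kwargs = build_eval_kwargs(
    expected_output="Paris is the capital of France.",
    retrieval_context=["Paris is the capital and most populous city of France."],
)
print(eval_kwargs)
```

The resulting dictionary could then be unpacked with `**eval_kwargs` into a chat completions or responses call, alongside the usual `model` and message arguments, assuming the client is DeepEval's OpenAI client.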
+
+## Using OpenAI in Component-Level Evaluation
+
+You can also use DeepEval's OpenAI client **within component-level evaluations**. To set up component-level evaluations, add the `@observe` decorator to the components of your LLM application, replace any existing OpenAI clients with DeepEval's OpenAI client, and pass in the metrics you wish to use.
+
+<Tabs>
+<TabItem value="chat-completions" label="Chat Completions">
+
+```python showLineNumbers {2,17,24}
 from deepeval.tracing import observe
 from deepeval.openai import OpenAI
 
-from deepeval.metrics import AnswerRelevancyMetric
+from deepeval.metrics import AnswerRelevancyMetric, BiasMetric
 from deepeval.dataset import Golden
 from deepeval import evaluate
 
-@observe(type="agent", metrics=[AnswerRelevancyMetric()])
-def llm_app(input: str):
+@observe()
+def retrieve_docs(query):
+    return [
+        "Paris is the capital and most populous city of France.",
+        "It has been a major European center of finance, diplomacy, commerce, and science."
+    ]
 
-    # OpenAI integration
+@observe()
+def llm_app(input):
     client = OpenAI()
     response = client.chat.completions.create(
         model="gpt-4o",
         messages=[
             {"role": "system", "content": "You are a helpful assistant."},
-            {"role": "user", "content": input}
+            {"role": "user", "content": "\n".join(retrieve_docs(input)) + "\n\nQuestion: " + input}
         ],
-        metrics=[BiasMetric()]
+        metrics=[AnswerRelevancyMetric(), BiasMetric()]
     )
-
-    # Response
     return response.choices[0].message.content
 
-evaluate(observed_callback=llm_app, goldens=[Golden("Hello, how are you?")])
+evaluate(observed_callback=llm_app, goldens=[Golden(input="Tell me about Paris")])
 ```
 
-## How it works
+</TabItem>
+<TabItem value="responses" label="Responses">
 
-When you integrate DeepEval’s OpenAI client, DeepEval automatically:
+```python showLineNumbers {2,17,22}
+from deepeval.tracing import observe
+from deepeval.openai import OpenAI
 
-- Populates span-level and trace-level `LLMTestCase`s with inputs, outputs, and tool calls from OpenAI
-- Records span-level `LLMAttributes` including input, output, and token usage
-- Logs hyperparameters such as model and system prompt for experiment analysis
-- Converts `BaseSpan`s to `LlmSpan`s without the need to define a span type
+from deepeval.metrics import AnswerRelevancyMetric, BiasMetric
+from deepeval.dataset import Golden
+from deepeval import evaluate
 
-### Evaluating Retrieval
+@observe()
+def retrieve_docs(query):
+    return [
+        "Paris is the capital and most populous city of France.",
+        "It has been a major European center of finance, diplomacy, commerce, and science."
+    ]
 
-Since retrieval context is not available through the OpenAI API, DeepEval **cannot** automatically populate the retrieval context field in the `LLMTestCase`.
-:::info
-To evaluate retrieval-based metrics like Contextual Relevancy, you'll need to manually populate and set the test case using `update_current_span`
+@observe()
+def llm_app(input):
+    client = OpenAI()
+    response = client.responses.create(
+        model="gpt-4o",
+        instructions="You are a helpful assistant.",
+        input="\n".join(retrieve_docs(input)) + "\n\nQuestion: " + input,
+        metrics=[AnswerRelevancyMetric(), BiasMetric()]
+    )
+    return response.output_text
 
-```python showLineNumbers
-...
+evaluate(observed_callback=llm_app, goldens=[Golden(input="Tell me about Paris")])
+```
 
-client = OpenAI()
-user_input = "Hello, how are you?"
+</TabItem>
 
-client.chat.completions.create(
-    model="gpt-4o",
-    messages=[
-        {"role": "system", "content": "You are a helpful assistant."},
-        {"role": "user", "content": user_input}
-    ],
-    metrics=[AnswerRelevancyMetric(), BiasMetric()]
-    # Optional test case parameters
-    expected_output="Hello, how are you?",
-    retrieval_context=["Document 1", "Document 2", "Document 3"]
-    context="Document 1"
-)
-```
+</Tabs>
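
A retriever like `retrieve_docs` returns a list of strings, while the user message sent to the model is a single string, so the documents need to be joined before the question is appended. The sketch below shows one way to do that. `build_user_prompt` is a hypothetical helper, not something DeepEval provides, and the prompt layout is an assumption.

```python
def build_user_prompt(docs: list[str], question: str) -> str:
    """Hypothetical helper: join retrieved documents and append the question."""
    # One document per line, then a blank line before the question.
    context_block = "\n".join(docs)
    return f"{context_block}\n\nQuestion: {question}"


docs = [
    "Paris is the capital and most populous city of France.",
    "It has been a major European center of finance, diplomacy, commerce, and science.",
]
prompt = build_user_prompt(docs, "Tell me about Paris")
print(prompt)
```

The same string could be used as the user message content in a chat completions call or as the `input` of a responses call.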
 
-:::
+When used inside `@observe` components, DeepEval’s OpenAI client automatically:
+
+- Generates an LLM span for every OpenAI API call, including nested Tool spans for any tool invocations.
+- Attaches an `LLMTestCase` to each generated LLM span, capturing inputs, outputs, and tools called.
+- Records span-level `LLMAttributes`, such as the input prompt, generated output, and token usage.
+- Logs hyperparameters such as model name and system prompt for comprehensive experiment analysis.
+
+<div style={{ margin: "2rem 0" }}>
+  <VideoDisplayer
+    src="https://deepeval-docs.s3.us-east-1.amazonaws.com/integrations:frameworks:openai.mp4"
+    label="OpenAI Integration"
+    confidentUrl="/llm-tracing/integrations/openai"
+  />
+</div>
+
+## Online Evaluation in Production
+
+...To be documented