Commit ca6a659: updated docs
1 parent 82389a4

1 file changed: docs/docs/metrics-task-completion.mdx
Lines changed: 14 additions & 75 deletions
````diff
@@ -27,18 +27,11 @@ Task Completion analyzes your **agent's full trace** to determine task success,
 To begin, [set up tracing](/docs/evaluation-llm-tracing) and simply supply the `TaskCompletionMetric()` to your agent's `@observe` tag.

 ```python
-from deepeval.metrics import TaskCompletionMetric
 from deepeval.tracing import observe
-from deepeval.dataset import Golden
-from deepeval import evaluate
-
-task_completion = TaskCompletionMetric(
-    threshold=0.7,
-    model="gpt-4o",
-    include_reason=True
-)
+from deepeval.dataset import Golden, EvaluationDataset
+from deepeval.metrics import TaskCompletionMetric

-@observe(metrics=[task_completion])
+@observe()
 def trip_planner_agent(input):
     destination = "Paris"
     days = 2
@@ -54,13 +47,18 @@ def trip_planner_agent(input):
     itinerary = itinerary_generator(destination, days)
     restaurants = restaurant_finder(destination)

-    output = []
-    for i in range(days):
-        output.append(f"{itinerary[i]} and eat at {restaurants[i]}")
+    return itinerary + restaurants
+

-    return ". ".join(output) + "."
+# Create dataset
+dataset = EvaluationDataset(goldens=[Golden(input="This is a test query")])

-evaluate(observed_callback=trip_planner_agent, goldens=[Golden(input="Paris, 2")])
+# Initialize metric
+task_completion = TaskCompletionMetric(threshold=0.7, model="gpt-4o")
+
+# Loop through dataset
+for golden in dataset.evals_iterator(metrics=[task_completion]):
+    trip_planner_agent(golden.input)
 ```

 There are **SEVEN** optional parameters when creating a `TaskCompletionMetric`:
@@ -73,66 +71,7 @@ There are **SEVEN** optional parameters when creating a `TaskCompletionMetric`:
 - [Optional] `async_mode`: a boolean which when set to `True`, enables [concurrent execution within the `measure()` method.](/docs/metrics-introduction#measuring-a-metric-in-async) Defaulted to `True`.
 - [Optional] `verbose_mode`: a boolean which when set to `True`, prints the intermediate steps used to calculate said metric to the console, as outlined in the [How Is It Calculated](#how-is-it-calculated) section. Defaulted to `False`.

-### End-to-End
-
-You can also run the `TaskCompletionMetric` for [end-to-end](/docs/evaluation-end-to-end-llm-evals) evaluation as a standalone, though this is not the recommended way to use it, since the full trace is required for thorough evaluation.
-
-```python
-from deepeval import evaluate
-from deepeval.test_case import LLMTestCase
-from deepeval.metrics import TaskCompletionMetric
-from deepeval.test_case import ToolCall
-
-metric = TaskCompletionMetric(
-    threshold=0.7,
-    model="gpt-4.1",
-    include_reason=True
-)
-test_case = LLMTestCase(
-    input="Plan a 3-day itinerary for Paris with cultural landmarks and local cuisine.",
-    actual_output=(
-        "Day 1: Eiffel Tower, dinner at Le Jules Verne. "
-        "Day 2: Louvre Museum, lunch at Angelina Paris. "
-        "Day 3: Montmartre, evening at a wine bar."
-    ),
-    tools_called=[
-        ToolCall(
-            name="Itinerary Generator",
-            description="Creates travel plans based on destination and duration.",
-            input_parameters={"destination": "Paris", "days": 3},
-            output=[
-                "Day 1: Eiffel Tower, Le Jules Verne.",
-                "Day 2: Louvre Museum, Angelina Paris.",
-                "Day 3: Montmartre, wine bar.",
-            ],
-        ),
-        ToolCall(
-            name="Restaurant Finder",
-            description="Finds top restaurants in a city.",
-            input_parameters={"city": "Paris"},
-            output=["Le Jules Verne", "Angelina Paris", "local wine bars"],
-        ),
-    ],
-)
-
-# To run metric as a standalone
-# metric.measure(test_case)
-# print(metric.score, metric.reason)
-
-evaluate(test_cases=[test_case], metrics=[metric])
-```
-
-To use the `TaskCompletionMetric` for end-to-end evaluation, you'll have to provide the following arguments when creating an [`LLMTestCase`](/docs/evaluation-test-cases#llm-test-case):
-
-- `input`
-- `actual_output`
-- `tools_called`
-
-Read the [How Is It Calculated](#how-is-it-calculated) section below to learn how test case parameters are used for metric calculation.
-
-:::caution
-This is not recommended and will be deprecated soon, as a test case does not represent the full execution of an agent.
-:::
+To learn more about how the `evals_iterator` works, [click here.](/docs/evaluation-end-to-end-llm-evals#e2e-evals-for-tracing)

 ## How Is It Calculated?

````