Feature: AgentLoopDetectionMetric — detect infinite loops and cyclical tool-call patterns in agent traces #2643

## Problem

DeepEval's current agentic metrics (`TaskCompletionMetric`, `StepEfficiencyMetric`, 
`PlanAdherenceMetric`, etc.) all evaluate the **quality** of a completed agent run. 
None of them detect a critical class of production failure: **the agent getting stuck 
in a loop before it ever finishes**.

In practice, ReAct-style agents and LangGraph agents fail silently in three ways:

1. **Repeated tool calls** — the agent calls the same tool with identical (or nearly 
   identical) arguments multiple times, making no progress
2. **Circular reasoning** — the LLM produces the same reasoning step in a cycle, 
   spinning context and burning tokens without advancing toward the goal
3. **Stagnation spirals** — each iteration adds context but produces no measurable 
   progress toward task completion, leading to runaway cost

These failures are **invisible to existing metrics** because:
- `StepEfficiencyMetric` penalizes unnecessary steps, but only scores a *completed* run
- `TaskCompletionMetric` requires the agent to have reached a final output to evaluate
- Neither metric examines the *shape* of the trace for cyclical patterns
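
To make "examining the shape of the trace" concrete, here is a minimal sketch that assumes a trace can be reduced to an ordered list of tool names. It builds a directed graph of consecutive tool-call transitions and runs a depth-first search to find one cycle. The helper names `build_transition_graph` and `find_cycle` are hypothetical and are not part of DeepEval.

```python
def build_transition_graph(tool_sequence: list[str]) -> dict[str, set[str]]:
    """Map each tool to the set of tools that were called immediately after it."""
    graph: dict[str, set[str]] = {name: set() for name in tool_sequence}
    for src, dst in zip(tool_sequence, tool_sequence[1:]):
        graph[src].add(dst)
    return graph


def find_cycle(graph: dict[str, set[str]]):
    """Return one cycle as a list of nodes (first == last), or None if acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {node: WHITE for node in graph}
    path: list[str] = []

    def visit(node: str):
        color[node] = GRAY
        path.append(node)
        for nxt in sorted(graph[node]):  # sorted for deterministic results
            if color[nxt] == GRAY:
                # Back edge: nxt is already on the current path, so we found a cycle.
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        color[node] = BLACK
        path.pop()
        return None

    for node in graph:
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


looping = ["plan", "search_web", "read_page", "search_web", "read_page", "answer"]
linear = ["plan", "search_web", "answer"]
cycle = find_cycle(build_transition_graph(looping))
print(cycle)  # ['search_web', 'read_page', 'search_web']
print(find_cycle(build_transition_graph(linear)))  # None
```

A cycle at the tool-name level is only a hint: an agent may legitimately revisit a tool with new arguments, so this signal is probably most useful alongside an argument-aware check.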

This is a real production problem. I've encountered it building episodic evaluation 
frameworks for LangGraph agents (see: [AGeval](https://pypi.org/project/ageval/)), 
and it's one of the most common ways agents silently fail and rack up cost.

## Proposed Solution: `AgentLoopDetectionMetric`

A new trace-only agentic metric that analyzes the tool call sequence and LLM 
reasoning steps for loop/cycle patterns.

**Scoring logic (three sub-signals, combined into a 0–1 score):**

| Sub-signal | What it detects | Method |
|---|---|---|
| Tool call repetition | Same tool + same args called N times | Hash (tool_name, args) → count duplicates |
| Reasoning stagnation | LLM output similarity across steps | Sliding window cosine similarity or n-gram overlap |
| Call graph cycles | Circular dependency in tool call sequence | DFS cycle detection on the directed call graph |
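
The first row of the table could be sketched as follows. This assumes each tool call is available as a `(tool_name, args)` pair with JSON-serializable arguments; `call_fingerprint` and `find_repeated_calls` are hypothetical helper names, and the default of 3 mirrors the proposed `repetition_threshold`.

```python
import hashlib
import json
from collections import Counter


def call_fingerprint(tool_name: str, args: dict) -> str:
    """Stable hash of a tool call; the key order inside args does not matter."""
    canonical = json.dumps([tool_name, args], sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_repeated_calls(calls, repetition_threshold: int = 3):
    """Return (tool_name, args, count) for every call seen >= threshold times."""
    counts: Counter = Counter()
    examples = {}
    for tool_name, args in calls:
        key = call_fingerprint(tool_name, args)
        counts[key] += 1
        examples.setdefault(key, (tool_name, args))  # remember one readable example
    return [
        (examples[key][0], examples[key][1], count)
        for key, count in counts.items()
        if count >= repetition_threshold
    ]


calls = [
    ("search_web", {"query": "Paris weather"}),
    ("search_web", {"query": "Paris weather"}),
    ("read_page", {"url": "https://example.com"}),
    ("search_web", {"query": "Paris weather"}),
    ("search_web", {"query": "Paris weather"}),
]
repeated = find_repeated_calls(calls)
print(repeated)  # [('search_web', {'query': 'Paris weather'}, 4)]
```

Exact hashing only catches identical arguments. Catching "nearly identical" arguments would need a normalization or fuzzy-matching step that this sketch leaves out.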

**Score interpretation:**
- `1.0` = No loops detected, agent execution was linear and progressed
- `0.0` = Severe looping detected (agent clearly stuck)
- Intermediate = partial repetition detected with `reason` explaining which pattern was found
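
How the three sub-signals fold into one score is still open for discussion. One possible rule, shown purely as an assumption, is to give each sub-signal a severity in [0, 1] and let the worst one drive the score (`1 - max severity`), with the `reason` listing every signal that fired. `SubSignal` and `combine` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class SubSignal:
    name: str
    severity: float  # 0.0 = no evidence of looping, 1.0 = clearly stuck
    detail: str = ""


def combine(signals: list[SubSignal]) -> tuple[float, str]:
    """Score is 1 minus the worst severity; the reason names each signal that fired."""
    fired = [s for s in signals if s.severity > 0.0]
    if not fired:
        return 1.0, "No loop patterns detected."
    # Clamp each severity into [0, 1] before taking the worst one.
    worst = max(min(max(s.severity, 0.0), 1.0) for s in fired)
    reason = " ".join(f"{s.name}: {s.detail}" for s in fired)
    return round(1.0 - worst, 2), reason


score, reason = combine([
    SubSignal("Tool call repetition", 0.88, "'search_web' called 4 times with identical arguments."),
    SubSignal("Reasoning stagnation", 0.60, "Steps 3-6 are near-duplicates."),
    SubSignal("Call graph cycles", 0.0),
])
print(score)  # 0.12
print(reason)
```

A weighted average would be a gentler alternative. Taking the maximum reflects the view that one severe loop is enough to call the run stuck.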

**API design (mirrors existing agentic metrics):**

```python
from deepeval.metrics import AgentLoopDetectionMetric
from deepeval.tracing import observe
from deepeval.dataset import Golden, EvaluationDataset

loop_metric = AgentLoopDetectionMetric(
    threshold=0.5,
    repetition_threshold=3,      # flag if same tool+args called ≥ N times
    similarity_threshold=0.85,   # flag if consecutive LLM outputs are ≥ X similar
    model="gpt-4o",              # for reasoning stagnation detection
)

@observe(type="agent", metrics=[loop_metric])
def my_agent(input: str):
    ...

dataset = EvaluationDataset(goldens=[Golden(input="...")])
for golden in dataset.evals_iterator():
    my_agent(golden.input)
```

**Metric output:**

    AgentLoopDetectionMetric
    Score:  0.12
    Reason: Detected tool call repetition — 'search_web' called 4 times with identical
    arguments {'query': 'Paris weather'}. Reasoning stagnation detected in steps
    3-6 (cosine similarity 0.91). Agent likely stuck in retrieval loop.
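
The stagnation part of a reason like this needs a similarity measure between consecutive reasoning steps. Embedding-based cosine similarity requires an embedding model, so the sketch below uses the n-gram overlap alternative instead: Jaccard similarity over word bigrams, standard library only. The function names are hypothetical, and the default threshold mirrors the proposed `similarity_threshold`.

```python
def ngrams(text: str, n: int = 2) -> set[tuple[str, ...]]:
    """Lower-cased word n-grams; texts shorter than n tokens yield one short gram."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection over size of the union; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def stagnant_pairs(steps: list[str], similarity_threshold: float = 0.85, n: int = 2):
    """Return (i, i + 1, similarity) for consecutive steps at or above the threshold."""
    grams = [ngrams(step, n) for step in steps]
    flagged = []
    for i in range(len(grams) - 1):
        sim = jaccard(grams[i], grams[i + 1])
        if sim >= similarity_threshold:
            flagged.append((i, i + 1, sim))
    return flagged


steps = [
    "I need the weather in Paris so I will search the web",
    "The search returned nothing useful so I will search the web again",
    "The search returned nothing useful so I will search the web again",
    "The search returned nothing useful so I will search the web again",
]
flagged = stagnant_pairs(steps)
print([(i, j) for i, j, _ in flagged])  # [(1, 2), (2, 3)]
```

Comparing only adjacent steps misses cycles of period two or more (A, B, A, B). Widening the comparison window would be one way to cover that case.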

## Why this fits DeepEval's architecture

- **Trace-only metric** — follows the same pattern as `StepEfficiencyMetric`, 
  `PlanAdherenceMetric`, and `PlanQualityMetric` (no `LLMTestCase` required, 
  operates directly on the `@observe` trace)
- **Referenceless** — works in production without any labeled data (critical for 
  online evaluation)
- **Additive** — doesn't touch any existing metrics or evaluation logic
- **Complementary** — pairs naturally with `TaskCompletionMetric`: 
  loop detection tells you *why* a task failed; task completion tells you *that* it failed

## My background / intent

I'm building this. I've built [AGeval](https://pypi.org/project/ageval/), an 
episodic evaluation framework for LangGraph agents published as a pip package, and 
have direct experience with this failure class. I'll read the existing agentic 
metric source code (starting from `step_efficiency`) before writing a line.

Happy to discuss the scoring design here before opening a PR. If this is already 
in progress internally, please let me know — I don't want to duplicate work.

## Related

- `StepEfficiencyMetric` — scores efficiency of completed runs, not loop detection
- `TaskCompletionMetric` — scores outcome, not execution pathology
- `ConversationalDAGMetric` — DAG cycle detection in *evaluation graph*, not agent traces
