Feature: AgentLoopDetectionMetric — detect infinite loops and cyclical tool-call patterns in agent traces #2643

@Jeel3011

Problem

DeepEval's current agentic metrics (TaskCompletionMetric, StepEfficiencyMetric,
PlanAdherenceMetric, etc.) all evaluate the quality of a completed agent run.
None of them detect a critical class of production failure: the agent getting stuck
in a loop before it ever finishes.

In practice, ReAct-style agents and LangGraph agents fail silently in three ways:

  1. Repeated tool calls — the agent calls the same tool with identical (or nearly
    identical) arguments multiple times, making no progress
  2. Circular reasoning — the LLM produces the same reasoning step in a cycle,
    spinning context and burning tokens without advancing toward the goal
  3. Stagnation spirals — each iteration adds context but produces no measurable
    progress toward task completion, leading to runaway cost

These failures are invisible to existing metrics because:

  • StepEfficiencyMetric penalizes unnecessary steps, but only scores a completed run
  • TaskCompletionMetric requires the agent to have reached a final output to evaluate
  • Neither metric examines the shape of the trace for cyclical patterns

This is a real production problem. I've encountered it building episodic evaluation
frameworks for LangGraph agents (see: AGeval),
and it's one of the most common ways agents silently fail and rack up cost.

Proposed Solution: AgentLoopDetectionMetric

A new trace-only agentic metric that analyzes the tool call sequence and LLM
reasoning steps for loop/cycle patterns.

Scoring logic (three sub-signals, combined into a 0–1 score):

| Sub-signal | What it detects | Method |
|---|---|---|
| Tool call repetition | Same tool + same args called N times | Hash (tool_name, args) → count duplicates |
| Reasoning stagnation | LLM output similarity across steps | Sliding window cosine similarity or n-gram overlap |
| Call graph cycles | Circular dependency in tool call sequence | DFS cycle detection on the directed call graph |
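
To make the table concrete, here is a rough sketch of how the three checks might look. Everything in it is an assumption for discussion, not existing DeepEval code: the helper names (`call_key`, `repeated_calls`, `has_cycle`, `stagnant_steps`) are hypothetical, a trace is simplified to a list of `(tool_name, args)` pairs plus a list of LLM output strings, and word-level n-gram Jaccard overlap stands in for the similarity measure.

```python
import json
from collections import Counter


def call_key(tool_name, args):
    """Hashable key for a tool call: name plus canonical (sorted-key) JSON args."""
    return (tool_name, json.dumps(args, sort_keys=True, default=str))


def repeated_calls(calls, repetition_threshold=3):
    """Return {key: count} for calls seen at least `repetition_threshold` times."""
    counts = Counter(call_key(name, args) for name, args in calls)
    return {key: n for key, n in counts.items() if n >= repetition_threshold}


def has_cycle(calls):
    """DFS cycle check on the directed graph linking each call to the next one."""
    keys = [call_key(name, args) for name, args in calls]
    graph = {key: set() for key in keys}
    for src, dst in zip(keys, keys[1:]):
        graph[src].add(dst)

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished
    color = {key: WHITE for key in graph}

    def visit(node):
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY:  # back edge: we returned to an open node
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in graph)


def ngram_jaccard(a, b, n=3):
    """Jaccard overlap of word n-grams (an assumed stand-in for cosine similarity)."""
    def grams(text):
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1))}

    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb)


def stagnant_steps(outputs, similarity_threshold=0.85):
    """Indices of LLM outputs that are near-duplicates of the previous output."""
    return [
        i for i in range(1, len(outputs))
        if ngram_jaccard(outputs[i - 1], outputs[i]) >= similarity_threshold
    ]
```

Note that in this sketch any call repeated with identical arguments also forms a cycle (a self-loop when the repeats are consecutive), so the cycle signal overlaps with the repetition signal and would probably need a lower weight. The recursive DFS is fine for short traces; a very long trace would need an iterative version.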

Score interpretation:

  • 1.0 = No loops detected, agent execution was linear and progressed
  • 0.0 = Severe looping detected (agent clearly stuck)
  • Intermediate = partial repetition detected; the reason explains which pattern was found
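
One possible way to fold the sub-signals into a single 0–1 score is sketched below. The weights, the linear ramp for the repetition penalty, and the `loop_score` name are all placeholders I made up to anchor the discussion; the actual combination rule is exactly what I'd like feedback on.

```python
def loop_score(max_repeats, stagnant_fraction, cycle_found,
               repetition_threshold=3, weights=(0.5, 0.3, 0.2)):
    """Combine three loop signals into (score, reasons); 1.0 means no loops.

    max_repeats:       highest count of any identical (tool, args) call
    stagnant_fraction: share of LLM steps flagged as near-duplicates (0..1)
    cycle_found:       whether the call graph contains a cycle
    weights:           hypothetical weights for the three penalties
    """
    # Repetition penalty: 0 below the threshold, ramping linearly to 1
    # once the call count reaches roughly twice the threshold.
    ramp = (max_repeats - repetition_threshold + 1) / repetition_threshold
    rep_penalty = min(max(ramp, 0.0), 1.0)
    stag_penalty = min(max(stagnant_fraction, 0.0), 1.0)
    cycle_penalty = 1.0 if cycle_found else 0.0

    w_rep, w_stag, w_cycle = weights
    penalty = w_rep * rep_penalty + w_stag * stag_penalty + w_cycle * cycle_penalty
    score = min(max(1.0 - penalty, 0.0), 1.0)

    # Collect a human-readable reason for each signal that fired.
    reasons = []
    if rep_penalty > 0:
        reasons.append(f"tool call repeated {max_repeats} times")
    if stag_penalty > 0:
        reasons.append(f"{stag_penalty:.0%} of reasoning steps stagnant")
    if cycle_found:
        reasons.append("cycle in call graph")
    return score, reasons
```

With these placeholder weights, a clean trace (`loop_score(1, 0.0, False)`) scores 1.0 with no reasons, and a trace that trips all three signals at full strength scores 0.0.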

API design (mirrors existing agentic metrics):

from deepeval.metrics import AgentLoopDetectionMetric
from deepeval.tracing import observe
from deepeval.dataset import Golden, EvaluationDataset

loop_metric = AgentLoopDetectionMetric(
    threshold=0.5,
    repetition_threshold=3,      # flag if same tool+args called ≥ N times
    similarity_threshold=0.85,   # flag if consecutive LLM outputs are ≥ X similar
    model="gpt-4o",              # for reasoning stagnation detection
)

@observe(type="agent", metrics=[loop_metric])
def my_agent(input: str):
    ...

dataset = EvaluationDataset(goldens=[Golden(input="...")])
for golden in dataset.evals_iterator():
    my_agent(golden.input)

Metric output:
AgentLoopDetectionMetric
Score: 0.12
Reason: Detected tool call repetition — 'search_web' called 4 times with identical
arguments {'query': 'Paris weather'}. Reasoning stagnation detected in steps
3-6 (cosine similarity 0.91). Agent likely stuck in retrieval loop.

Why this fits DeepEval's architecture

  • Trace-only metric — follows the same pattern as StepEfficiencyMetric,
    PlanAdherenceMetric, and PlanQualityMetric (no LLMTestCase required,
    operates directly on the @observe trace)
  • Referenceless — works in production without any labeled data (critical for
    online evaluation)
  • Additive — doesn't touch any existing metrics or evaluation logic
  • Complementary — pairs naturally with TaskCompletionMetric:
    loop detection tells you why a task failed; task completion tells you that it failed

My background / intent

I'm building this. I've built AGeval, an
episodic evaluation framework for LangGraph agents published as a pip package, and
have direct experience with this failure class. I'll read the existing agentic
metric source code (starting from step_efficiency) before writing a line.

Happy to discuss the scoring design here before opening a PR. If this is already
in progress internally, please let me know — I don't want to duplicate work.

Related

  • StepEfficiencyMetric — scores efficiency of completed runs, not loop detection
  • TaskCompletionMetric — scores outcome, not execution pathology
  • ConversationalDAGMetric — DAG cycle detection in evaluation graph, not agent traces
