Problem
DeepEval's current agentic metrics (TaskCompletionMetric, StepEfficiencyMetric,
PlanAdherenceMetric, etc.) all evaluate the quality of a completed agent run.
None of them detect a critical class of production failure: the agent getting stuck
in a loop before it ever finishes.
In practice, ReAct-style agents and LangGraph agents fail silently in three ways:
- Repeated tool calls — the agent calls the same tool with identical (or nearly
identical) arguments multiple times, making no progress
- Circular reasoning — the LLM produces the same reasoning step in a cycle,
spinning context and burning tokens without advancing toward the goal
- Stagnation spirals — each iteration adds context but produces no measurable
progress toward task completion, leading to runaway cost
These failures are invisible to existing metrics because:
StepEfficiencyMetric penalizes unnecessary steps, but only scores a completed run
TaskCompletionMetric requires the agent to have reached a final output to evaluate
- Neither metric examines the shape of the trace for cyclical patterns
This is a real production problem. I've encountered it building episodic evaluation
frameworks for LangGraph agents (see: AGeval),
and it's one of the most common ways agents silently fail and rack up cost.
Proposed Solution: AgentLoopDetectionMetric
A new trace-only agentic metric that analyzes the tool call sequence and LLM
reasoning steps for loop/cycle patterns.
Scoring logic (three sub-signals, combined into a 0–1 score):
| Sub-signal |
What it detects |
Method |
| Tool call repetition |
Same tool + same args called N times |
Hash (tool_name, args) → count duplicates |
| Reasoning stagnation |
LLM output similarity across steps |
Sliding window cosine similarity or n-gram overlap |
| Call graph cycles |
Circular dependency in tool call sequence |
DFS cycle detection on call DAG |
Score interpretation:
1.0 = No loops detected, agent execution was linear and progressed
0.0 = Severe looping detected (agent clearly stuck)
- Intermediate = partial repetition detected with
reason explaining which pattern was found
API design (mirrors existing agentic metrics):
from deepeval.metrics import AgentLoopDetectionMetric
from deepeval.tracing import observe
from deepeval.dataset import Golden, EvaluationDataset
loop_metric = AgentLoopDetectionMetric(
threshold=0.5,
repetition_threshold=3, # flag if same tool+args called ≥ N times
similarity_threshold=0.85, # flag if consecutive LLM outputs are ≥ X similar
model="gpt-4o", # for reasoning stagnation detection
)
@observe(type="agent", metrics=[loop_metric])
def my_agent(input: str):
...
dataset = EvaluationDataset(goldens=[Golden(input="...")])
for golden in dataset.evals_iterator():
my_agent(golden.input)
Metric output:
AgentLoopDetectionMetric
Score: 0.12
Reason: Detected tool call repetition — 'search_web' called 4 times with identical
arguments {'query': 'Paris weather'}. Reasoning stagnation detected in steps
3-6 (cosine similarity 0.91). Agent likely stuck in retrieval loop.
Why this fits DeepEval's architecture
- Trace-only metric — follows the same pattern as
StepEfficiencyMetric,
PlanAdherenceMetric, and PlanQualityMetric (no LLMTestCase required,
operates directly on the @observe trace)
- Referenceless — works in production without any labeled data (critical for
online evaluation)
- Additive — doesn't touch any existing metrics or evaluation logic
- Complementary — pairs naturally with
TaskCompletionMetric:
loop detection tells you why a task failed, task completion tells you that it failed
My background / intent
I'm building this. I've built AGeval, an
episodic evaluation framework for LangGraph agents published as a pip package, and
have direct experience with this failure class. I'll read the existing agentic
metric source code (starting from step_efficiency) before writing a line.
Happy to discuss the scoring design here before opening a PR. If this is already
in progress internally, please let me know — I don't want to duplicate work.
Related
StepEfficiencyMetric — scores efficiency of completed runs, not loop detection
TaskCompletionMetric — scores outcome, not execution pathology
ConversationalDAGMetric — DAG cycle detection in evaluation graph, not agent traces
Problem
DeepEval's current agentic metrics (
TaskCompletionMetric,StepEfficiencyMetric,PlanAdherenceMetric, etc.) all evaluate the quality of a completed agent run.None of them detect a critical class of production failure: the agent getting stuck
in a loop before it ever finishes.
In practice, ReAct-style agents and LangGraph agents fail silently in three ways:
identical) arguments multiple times, making no progress
spinning context and burning tokens without advancing toward the goal
progress toward task completion, leading to runaway cost
These failures are invisible to existing metrics because:
StepEfficiencyMetricpenalizes unnecessary steps, but only scores a completed runTaskCompletionMetricrequires the agent to have reached a final output to evaluateThis is a real production problem. I've encountered it building episodic evaluation
frameworks for LangGraph agents (see: AGeval),
and it's one of the most common ways agents silently fail and rack up cost.
Proposed Solution:
AgentLoopDetectionMetricA new trace-only agentic metric that analyzes the tool call sequence and LLM
reasoning steps for loop/cycle patterns.
Scoring logic (three sub-signals, combined into a 0–1 score):
Score interpretation:
1.0= No loops detected, agent execution was linear and progressed0.0= Severe looping detected (agent clearly stuck)reasonexplaining which pattern was foundAPI design (mirrors existing agentic metrics):
Metric output:
AgentLoopDetectionMetric
Score: 0.12
Reason: Detected tool call repetition — 'search_web' called 4 times with identical
arguments {'query': 'Paris weather'}. Reasoning stagnation detected in steps
3-6 (cosine similarity 0.91). Agent likely stuck in retrieval loop.
Why this fits DeepEval's architecture
StepEfficiencyMetric,PlanAdherenceMetric, andPlanQualityMetric(noLLMTestCaserequired,operates directly on the
@observetrace)online evaluation)
TaskCompletionMetric:loop detection tells you why a task failed, task completion tells you that it failed
My background / intent
I'm building this. I've built AGeval, an
episodic evaluation framework for LangGraph agents published as a pip package, and
have direct experience with this failure class. I'll read the existing agentic
metric source code (starting from
step_efficiency) before writing a line.Happy to discuss the scoring design here before opening a PR. If this is already
in progress internally, please let me know — I don't want to duplicate work.
Related
StepEfficiencyMetric— scores efficiency of completed runs, not loop detectionTaskCompletionMetric— scores outcome, not execution pathologyConversationalDAGMetric— DAG cycle detection in evaluation graph, not agent traces