feat: add AgentLoopDetectionMetric — deterministic loop detection for agent traces by Jeel3011 · Pull Request #2645 · confident-ai/deepeval

Jeel3011 · 2026-05-01T16:54:46Z

Summary

Adds AgentLoopDetectionMetric — a fully deterministic (no LLM / API key required) metric that detects infinite loops and cyclical execution patterns in agent traces. It analyzes three independent sub-signals and returns a weighted score from 0.0 (severe looping) to 1.0 (clean execution).

This metric is designed to run in production with no API cost and no added network latency: every sub-signal is computed locally with hashing, set operations, or sequence comparison.

Sub-signals

| Signal | Weight | What it detects |
| --- | --- | --- |
| Tool Call Repetition | 40% | Same tool called with identical `(name, args)` ≥ N times |
| Reasoning Stagnation | 35% | Consecutive LLM outputs that are structurally identical or reordered |
| Call Graph Cycles | 25% | A span that recursively invokes itself (same type + name + input) |
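
As a rough illustration of how the three weighted sub-signals could combine into a single score, here is a minimal sketch. The function name `combine_scores`, the dictionary keys, and the choice to renormalize weights over the enabled checks are assumptions made for illustration, not the metric's actual code.

```python
from typing import Dict, Optional

# Weights taken from the sub-signal table above.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "tool_call_repetition": 0.40,
    "reasoning_stagnation": 0.35,
    "call_graph_cycles": 0.25,
}


def combine_scores(
    sub_scores: Dict[str, Optional[float]],
    weights: Dict[str, float] = DEFAULT_WEIGHTS,
) -> float:
    """Weighted average over the sub-signals that actually ran.

    A value of None marks a disabled check. Assumption: disabled checks
    are dropped and the remaining weights are renormalized to sum to 1.
    """
    active = {k: v for k, v in sub_scores.items() if v is not None}
    if not active:
        # Assumption: with every check disabled there is nothing to penalize.
        return 1.0
    total_weight = sum(weights[k] for k in active)
    return sum(weights[k] * v for k, v in active.items()) / total_weight
```

Under these assumptions, a trace that only trips the tool-repetition check at its 0.5 tier would score about 0.8, and disabling a check leaves the relative weighting of the other two unchanged.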

Tool Call Repetition

Each tool call is hashed by `(name, sorted_args)`. The score drops to 0.5 once an identical call reaches `repetition_threshold` occurrences and to 0.0 at 2× the threshold. Calls with different arguments count as different calls and are not penalized.
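
The tiered scoring described here can be sketched roughly as follows. This is a simplified illustration, not the code in the PR: the helper names `call_key` and `tool_repetition_score`, the default threshold of 3, and the use of JSON plus SHA-256 for hashing are all assumptions.

```python
import hashlib
import json
from collections import Counter
from typing import Any, Dict, Iterable, Tuple


def call_key(name: str, args: Dict[str, Any]) -> str:
    """Stable hash of a tool call; sort_keys makes argument order irrelevant."""
    payload = json.dumps([name, args], sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def tool_repetition_score(
    calls: Iterable[Tuple[str, Dict[str, Any]]],
    repetition_threshold: int = 3,  # hypothetical default, not confirmed by the PR
) -> float:
    """Return 1.0, 0.5, or 0.0 based on the most-repeated identical call."""
    counts = Counter(call_key(name, args) for name, args in calls)
    worst = max(counts.values(), default=0)
    if worst >= 2 * repetition_threshold:
        return 0.0
    if worst >= repetition_threshold:
        return 0.5
    return 1.0
```

Because the key covers both the name and the arguments, calling the same tool with a different query on each step keeps every count at 1 and the score at 1.0.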

Reasoning Stagnation

Uses two complementary signals and takes the maximum:

- Bigram Jaccard: catches literal repetition after stop-word removal
- SequenceMatcher ratio (`difflib`): catches reordered-but-identical phrasing

Outputs shorter than 20 meaningful words are skipped (Jaccard is noisy at that scale).
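
To make the dual-signal idea concrete, the sketch below compares two consecutive outputs. It is an approximation under stated assumptions: the stop-word list is a small made-up sample, the tokenizer is a simple regex, `SequenceMatcher` is run on word lists rather than raw strings, and the names `meaningful_words` and `pair_similarity` are hypothetical.

```python
import re
from difflib import SequenceMatcher
from typing import List, Optional, Set, Tuple

# Illustrative stop-word sample; the PR's actual _STOP_WORDS set is not shown here.
STOP_WORDS = frozenset(
    {"a", "an", "the", "is", "are", "to", "of", "and", "in", "for", "on", "i", "will", "it"}
)
MIN_WORDS = 20  # pairs with fewer meaningful words are skipped


def meaningful_words(text: str) -> List[str]:
    """Lowercase, tokenize, and drop stop words."""
    return [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOP_WORDS]


def bigrams(words: List[str]) -> Set[Tuple[str, str]]:
    """Set of adjacent word pairs."""
    return set(zip(words, words[1:]))


def pair_similarity(a: str, b: str) -> Optional[float]:
    """max(bigram Jaccard, SequenceMatcher ratio), or None if either text is too short."""
    wa, wb = meaningful_words(a), meaningful_words(b)
    if len(wa) < MIN_WORDS or len(wb) < MIN_WORDS:
        return None
    ba, bb = bigrams(wa), bigrams(wb)
    jaccard = len(ba & bb) / len(ba | bb)
    seq_ratio = SequenceMatcher(None, wa, wb).ratio()
    return max(jaccard, seq_ratio)
```

Returning `None` for short outputs lets a caller skip the pair entirely instead of treating it as either clean or stagnant.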

Call Graph Cycles

DFS on the nested children tree. Labels each span as type:name:input_hash. A cycle is flagged when the same label appears twice on the same root-to-leaf ancestry path. Including input_hash prevents false positives when two different agents share a name but receive different inputs.
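
A minimal sketch of this ancestry-path check is shown below. The span dictionary shape (`type`, `name`, `input`, `children` keys), the 8-character hash truncation, and the function names are assumptions for illustration; the real trace structure produced by deepeval may differ.

```python
import hashlib
import json
from typing import Any, Dict, FrozenSet


def span_label(span: Dict[str, Any]) -> str:
    """Build a type:name:input_hash label from a (hypothetical) span dict."""
    raw = json.dumps(span.get("input"), sort_keys=True, default=str)
    input_hash = hashlib.sha256(raw.encode("utf-8")).hexdigest()[:8]
    return f"{span.get('type')}:{span.get('name')}:{input_hash}"


def has_cycle(span: Dict[str, Any], ancestors: FrozenSet[str] = frozenset()) -> bool:
    """True if any label repeats along a single root-to-leaf path."""
    label = span_label(span)
    if label in ancestors:
        return True
    # Each child gets its own copy of the path, so siblings never see each other.
    path = ancestors | {label}
    return any(has_cycle(child, path) for child in span.get("children", []))
```

Passing an immutable ancestor set down each branch is what keeps two sibling calls to the same tool from being reported as a cycle: neither sibling is on the other's path.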

Files changed

| File | Change |
| --- | --- |
| `deepeval/metrics/agent_loop_detection/agent_loop_detection.py` | Core metric implementation |
| `deepeval/metrics/agent_loop_detection/__init__.py` | Module export (unchanged from initial) |
| `tests/test_agent_loop_detection.py` | 10 passing tests, zero API key required |
| `docs/content/docs/(agentic)/metrics-agent-loop-detection.mdx` | Full doc page (Usage, How It Works, Limitations) |
| `docs/content/docs/(agentic)/meta.json` | Registered in sidebar |

Design decisions & known limitations

Why no `model` parameter?

Deterministic by design. Accepting model would be misleading since no sub-signal uses an LLM. A future LLM-as-judge stagnation mode could be added behind a feature flag.

Cycle detection uses `type:name:input_hash`, not span UUIDs

create_nested_spans_dict strips UUIDs from the trace dict, so exact span identity is unavailable. The input_hash heuristic is the best available — true recursive loops pass the same input back, so the hash matches. Two same-name spans with different inputs are correctly treated as distinct.

Stagnation will miss semantically-different wording

"Search for Paris weather" vs "Look up the forecast in Paris" won't be caught. Fixing this would require an LLM-as-judge, sacrificing the zero-cost property.

Test matrix (10 tests, all pass)

| # | Test | Validates |
| --- | --- | --- |
| 1 | `test_clean_trace_passes` | Clean agent → 1.0 |
| 2 | `test_repeated_tool_calls_detected` | 4× identical calls → ≤ 0.5 |
| 3 | `test_no_trace_returns_zero` | Missing trace → 0.0 + reason |
| 4 | `test_reasoning_stagnation_detected` | Identical outputs → stagnation flagged |
| 5 | `test_disable_tool_repetition_check` | Disabled check → score unaffected |
| 6 | `test_score_combines_with_correct_weights` | Weight normalization math |
| 7 | `test_call_graph_cycle_detected` | True recursive loop → 0.0 |
| 8 | `test_sequential_same_name_not_a_cycle` | Siblings ≠ cycle (no false positive) |
| 9 | `test_same_name_different_input_not_a_cycle` | Same name, different input ≠ cycle |
| 10 | `test_reordered_stagnation_detected` | Reordered phrasing caught by SequenceMatcher |

CI note

The test_core/ collection error (async_app.py:67 instantiating AnswerRelevancyMetric() at module level without an API key) is a pre-existing issue on main (commit 90f398afc). This PR does not modify that file.

…ent traces

Adds a new trace-only, referenceless agentic metric that detects infinite loops, cyclical tool-call patterns, and reasoning stagnation in agent execution traces.

Three configurable sub-signals:

- Tool call repetition: hash-based duplicate counting
- Reasoning stagnation: bigram sliding window overlap
- Call graph cycles: DFS back-edge detection

Closes confident-ai#2643

vercel · 2026-05-01T16:54:51Z

@Jeel3011 is attempting to deploy a commit to the Confident AI Team on Vercel.

A member of the Team first needs to authorize it.

…mprove stagnation scorer

- `_score_call_graph_cycles`: rewrite to traverse the real parent→child span tree (nested 'children' list) instead of treating sequential order as edges. DFS now tracks the ancestry path per root-to-leaf walk. A back-edge is only reported when a type:name label appears twice on the *same* path, so sibling repetition (tool A called twice at the same depth) is correctly ignored.
- Remove 'model' parameter from `__init__`: it was accepted but never used in any scoring logic. Metric is now fully deterministic, with no LLM / API key required (document as a feature, not a limitation).
- `_score_reasoning_stagnation`: add `_STOP_WORDS` filter and a minimum-words guard (< 20 meaningful tokens → skip pair) to prevent boilerplate preambles from inflating Jaccard similarity into false stagnation alerts.

…nation signal, document limitations

Cycle detection:

- DFS labels now include a truncated input hash (type:name:input_hash) so two same-name spans with different inputs are correctly treated as distinct nodes. True recursive loops (same name + same input) are still caught. Documented trade-off vs UUID-based identity.

Stagnation detection:

- Added difflib.SequenceMatcher ratio as a secondary signal alongside bigram Jaccard. Take max(jaccard, seq_ratio) to catch both literal repetition and reordered-but-identical phrasing.

Tests:

- Test 7 updated: true cycle now has matching input on both spans.
- Test 9 (new): same-name agents with different inputs → no false positive.
- Test 10 (new): reordered phrasing caught by SequenceMatcher.
- All 10 tests pass, zero API key required.

Documentation:

- Added 'Limitations & Design Decisions' section to the MDX doc page explaining why model param is absent, why input_hash is used for cycle detection, and why dual-signal stagnation is strictly better but still has known gaps.

Jeel3011 · 2026-05-12T13:13:36Z

I would like to hear from the maintainers soon!

- Add Required Arguments section (trace-only, needs input + actual_output)
- Add 'Within components' usage pattern with code example
- Add 'As a standalone' usage with caution note
- Document all 8 parameters including include_reason and async_mode
- Add explicit note that no model parameter exists and why (deterministic design)
- Expand How Is It Calculated with per-tier score tables for all 3 sub-signals
- Add Score Breakdown section with score_breakdown attribute example
- Add Configuring Sub-signals section with toggle examples
- Add community attribution note (author, AGeval, issue confident-ai#2643)

Also fix tests/test_core/test_tracing/apps/async_app.py: avoid instantiating metrics at module level inside decorator arguments to prevent API-key validation at import time in keyless CI environments.

Jeel3011 · 2026-05-12T19:26:34Z

Hey @jeffreyip — just pushed the completed documentation page for AgentLoopDetectionMetric as requested.

The doc page (docs/content/docs/(agentic)/metrics-agent-loop-detection.mdx) now follows the full structure of the existing agentic metric docs:

- Required Arguments: clarifies this is a trace-only metric (no `tools_called` / `expected_tools` needed)
- Usage: `evals_iterator`, within components, and standalone patterns with full code examples
- All 8 parameters documented, including `include_reason`, `async_mode`, and an explicit note that there is no `model` parameter and why
- How Is It Calculated: expanded with per-tier score tables (1.0 / 0.5 / 0.0) for each of the three sub-signals
- Score Breakdown: documents the `score_breakdown` attribute
- Configuring Sub-signals: shows how to toggle individual checks
- Limitations & Design Decisions: explains the deterministic design choice, cycle detection heuristics, and stagnation trade-offs
- Community attribution note at the bottom linking to AGeval and issue #2643

Happy to adjust anything — structure, wording, depth of any section. Just let me know 🙏

Jeel3011 added 3 commits May 12, 2026 18:10

style: apply black formatting to agent_loop_detection and test file

09444d0

Jeel3011 changed the title ~~feat: add AgentLoopDetectionMetric for detecting infinite loops in agent traces~~ feat: add AgentLoopDetectionMetric — deterministic loop detection for agent traces May 12, 2026

Jeel3011 wants to merge 5 commits into confident-ai:main from Jeel3011:feat/agent-loop-detection-metric
