feat: add AgentLoopDetectionMetric — deterministic loop detection for agent traces #2645

Open
Jeel3011 wants to merge 5 commits into confident-ai:main from Jeel3011:feat/agent-loop-detection-metric

Conversation

Jeel3011 commented May 1, 2026

Summary

Adds AgentLoopDetectionMetric — a fully deterministic (no LLM / API key required) metric that detects infinite loops and cyclical execution patterns in agent traces. It analyzes three independent sub-signals and returns a weighted score from 0.0 (severe looping) to 1.0 (clean execution).

This metric is designed to run in production with no API cost and no network latency: every sub-signal is computed locally with hashing, set operations, or sequence comparison.


Sub-signals

| Signal | Weight | What it detects |
| --- | --- | --- |
| Tool Call Repetition | 40% | Same tool called with identical (name, args) ≥ N times |
| Reasoning Stagnation | 35% | Consecutive LLM outputs that are structurally identical or reordered |
| Call Graph Cycles | 25% | A span that recursively invokes itself (same type + name + input) |
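
As a rough illustration of how the three weights could combine into one score, here is a minimal sketch. The names `WEIGHTS` and `combine` are hypothetical, and renormalizing over only the enabled sub-signals is an assumption based on the "weight normalization" test below, not a copy of the PR's code.

```python
# Weights taken from the table above; the dict keys are illustrative names.
WEIGHTS = {
    "tool_repetition": 0.40,
    "reasoning_stagnation": 0.35,
    "call_graph_cycles": 0.25,
}


def combine(scores: dict) -> float:
    """Weighted mean over the sub-signals present in `scores`.

    Disabled checks are simply left out of `scores`, and the remaining
    weights are renormalized so they sum to 1 (assumed behavior).
    """
    total = sum(WEIGHTS[name] for name in scores)
    if total == 0:
        return 1.0  # nothing was checked, so nothing was flagged
    weighted = sum(WEIGHTS[name] * value for name, value in scores.items())
    return weighted / total
```

Under this scheme, disabling a check does not drag the score down: the two remaining signals share the full weight between them.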

Tool Call Repetition

Hashes each tool call by (name, sorted_args). The score drops to 0.5 once any identical call has been repeated repetition_threshold times, and to 0.0 at 2× the threshold. Calls with different arguments hash differently, so they are counted separately and not penalized.
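
The hashing and tiering described here can be sketched as follows. This is an illustration only: `tool_call_key`, `tool_repetition_score`, and the default threshold of 3 are assumptions, and the tiers are read as 1.0 below the threshold, 0.5 at it, and 0.0 at twice it.

```python
import hashlib
import json
from collections import Counter


def tool_call_key(name: str, args: dict) -> str:
    """Stable hash of (name, args); key order inside args does not matter."""
    payload = json.dumps([name, args], sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def tool_repetition_score(calls, repetition_threshold: int = 3) -> float:
    """Score a list of (name, args) tool calls by the most-repeated call."""
    if not calls:
        return 1.0
    counts = Counter(tool_call_key(name, args) for name, args in calls)
    worst = max(counts.values())
    if worst >= 2 * repetition_threshold:
        return 0.0
    if worst >= repetition_threshold:
        return 0.5
    return 1.0
```

With the assumed threshold of 3, four identical calls land in the 0.5 tier, while two calls to the same tool with different arguments still score 1.0.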

Reasoning Stagnation

Uses two complementary signals and takes the maximum:

  • Bigram Jaccard — catches literal repetition after stop-word removal
  • SequenceMatcher ratio (difflib) — catches reordered-but-identical phrasing

Outputs shorter than 20 meaningful words are skipped (Jaccard is noisy at that scale).
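
A minimal sketch of the dual-signal comparison for one pair of consecutive outputs, assuming a small illustrative stop-word list and whitespace tokenization. `STOP_WORDS`, `meaningful_words`, and `pair_similarity` are hypothetical names, not the PR's actual helpers.

```python
from difflib import SequenceMatcher
from typing import List, Optional

# Illustrative stop-word list; the real filter is not shown in this PR text.
STOP_WORDS = {"the", "a", "an", "to", "of", "and", "i", "is", "in", "for", "will"}
MIN_WORDS = 20  # pairs with fewer meaningful words are skipped


def meaningful_words(text: str) -> List[str]:
    """Lowercase, strip surrounding punctuation, and drop stop words."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]


def pair_similarity(prev: str, curr: str) -> Optional[float]:
    """Return max(bigram Jaccard, SequenceMatcher ratio), or None if skipped."""
    a, b = meaningful_words(prev), meaningful_words(curr)
    if len(a) < MIN_WORDS or len(b) < MIN_WORDS:
        return None
    bigrams_a, bigrams_b = set(zip(a, a[1:])), set(zip(b, b[1:]))
    union = bigrams_a | bigrams_b
    jaccard = len(bigrams_a & bigrams_b) / len(union) if union else 0.0
    seq_ratio = SequenceMatcher(None, a, b).ratio()
    return max(jaccard, seq_ratio)
```

A caller would compare the returned value against a similarity threshold to decide whether the pair counts as stagnation; that threshold is not specified here.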

Call Graph Cycles

DFS on the nested children tree. Labels each span as type:name:input_hash. A cycle is flagged when the same label appears twice on the same root-to-leaf ancestry path. Including input_hash prevents false positives when two different agents share a name but receive different inputs.
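
The ancestry-path check can be sketched as a short recursive DFS. The span field names `type`, `name`, and `input`, the 8-character hash truncation, and the helper names are assumptions made for illustration; only the nested `children` list and the type:name:input_hash label come from the description above.

```python
import hashlib
import json


def span_label(span: dict) -> str:
    """Build a type:name:input_hash label for one span."""
    raw = json.dumps(span.get("input"), sort_keys=True, default=str)
    input_hash = hashlib.sha256(raw.encode("utf-8")).hexdigest()[:8]
    return f"{span.get('type')}:{span.get('name')}:{input_hash}"


def has_cycle(span: dict, ancestors: frozenset = frozenset()) -> bool:
    """True if any label repeats along a single root-to-leaf path."""
    label = span_label(span)
    if label in ancestors:
        return True
    path = ancestors | {label}  # new set per branch, so siblings never share state
    return any(has_cycle(child, path) for child in span.get("children", []))
```

Because each branch gets its own copy of the ancestor set, two sibling spans with the same label are not flagged; only a descendant that repeats an ancestor's label is.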


Files changed

| File | Change |
| --- | --- |
| deepeval/metrics/agent_loop_detection/agent_loop_detection.py | Core metric implementation |
| deepeval/metrics/agent_loop_detection/__init__.py | Module export (unchanged from initial) |
| tests/test_agent_loop_detection.py | 10 passing tests, zero API key required |
| docs/content/docs/(agentic)/metrics-agent-loop-detection.mdx | Full doc page (Usage, How It Works, Limitations) |
| docs/content/docs/(agentic)/meta.json | Registered in sidebar |

Design decisions & known limitations

Why no model parameter?

Deterministic by design. Accepting model would be misleading since no sub-signal uses an LLM. A future LLM-as-judge stagnation mode could be added behind a feature flag.

Cycle detection uses type:name:input_hash, not span UUIDs

create_nested_spans_dict strips UUIDs from the trace dict, so exact span identity is unavailable. The input_hash heuristic is the best available — true recursive loops pass the same input back, so the hash matches. Two same-name spans with different inputs are correctly treated as distinct.

Stagnation will miss semantically-different wording

"Search for Paris weather" vs "Look up the forecast in Paris" won't be caught. Fixing this would require an LLM-as-judge, sacrificing the zero-cost property.


Test matrix (10 tests, all pass)

| # | Test | Validates |
| --- | --- | --- |
| 1 | test_clean_trace_passes | Clean agent → 1.0 |
| 2 | test_repeated_tool_calls_detected | 4× identical calls → ≤ 0.5 |
| 3 | test_no_trace_returns_zero | Missing trace → 0.0 + reason |
| 4 | test_reasoning_stagnation_detected | Identical outputs → stagnation flagged |
| 5 | test_disable_tool_repetition_check | Disabled check → score unaffected |
| 6 | test_score_combines_with_correct_weights | Weight normalization math |
| 7 | test_call_graph_cycle_detected | True recursive loop → 0.0 |
| 8 | test_sequential_same_name_not_a_cycle | Siblings ≠ cycle (no false positive) |
| 9 | test_same_name_different_input_not_a_cycle | Same name, different input ≠ cycle |
| 10 | test_reordered_stagnation_detected | Reordered phrasing caught by SequenceMatcher |

CI note

The test_core/ collection error (async_app.py:67 instantiating AnswerRelevancyMetric() at module level without an API key) is a pre-existing issue on main (commit 90f398afc). This PR does not modify that file.

…ent traces

Adds a new trace-only, referenceless agentic metric that detects infinite
loops, cyclical tool-call patterns, and reasoning stagnation in agent
execution traces.

Three configurable sub-signals:
- Tool call repetition: hash-based duplicate counting
- Reasoning stagnation: bigram sliding window overlap
- Call graph cycles: DFS back-edge detection

Closes confident-ai#2643

vercel Bot commented May 1, 2026

@Jeel3011 is attempting to deploy a commit to the Confident AI Team on Vercel.

A member of the Team first needs to authorize it.

Jeel3011 added 3 commits May 12, 2026 18:10
…mprove stagnation scorer

- _score_call_graph_cycles: rewrite to traverse the real parent→child span
  tree (nested 'children' list) instead of treating sequential order as edges.
  DFS now tracks the ancestry path per root-to-leaf walk — a back-edge is only
  reported when a type:name label appears twice on the *same* path, so sibling
  repetition (tool A called twice at the same depth) is correctly ignored.

- Remove 'model' parameter from __init__: it was accepted but never used in
  any scoring logic. Metric is now fully deterministic — no LLM / API key
  required (document as a feature, not a limitation).

- _score_reasoning_stagnation: add _STOP_WORDS filter and a minimum-words
  guard (< 20 meaningful tokens → skip pair) to prevent boilerplate preambles
  from inflating Jaccard similarity into false stagnation alerts.
…nation signal, document limitations

Cycle detection:
- DFS labels now include a truncated input hash (type:name:input_hash)
  so two same-name spans with different inputs are correctly treated as
  distinct nodes. True recursive loops (same name + same input) are
  still caught. Documented trade-off vs UUID-based identity.

Stagnation detection:
- Added difflib.SequenceMatcher ratio as a secondary signal alongside
  bigram Jaccard. Take max(jaccard, seq_ratio) to catch both literal
  repetition and reordered-but-identical phrasing.

Tests:
- Test 7 updated: true cycle now has matching input on both spans.
- Test 9 (new): same-name agents with different inputs → no false positive.
- Test 10 (new): reordered phrasing caught by SequenceMatcher.
- All 10 tests pass, zero API key required.

Documentation:
- Added 'Limitations & Design Decisions' section to the MDX doc page
  explaining why model param is absent, why input_hash is used for
  cycle detection, and why dual-signal stagnation is strictly better
  but still has known gaps.
Jeel3011 changed the title from "feat: add AgentLoopDetectionMetric for detecting infinite loops in agent traces" to "feat: add AgentLoopDetectionMetric — deterministic loop detection for agent traces" on May 12, 2026
@Jeel3011

Author

I would like to hear from the managers soon!

- Add Required Arguments section (trace-only, needs input + actual_output)
- Add 'Within components' usage pattern with code example
- Add 'As a standalone' usage with caution note
- Document all 8 parameters including include_reason and async_mode
- Add explicit note that no model parameter exists and why (deterministic design)
- Expand How Is It Calculated with per-tier score tables for all 3 sub-signals
- Add Score Breakdown section with score_breakdown attribute example
- Add Configuring Sub-signals section with toggle examples
- Add community attribution note (author, AGeval, issue confident-ai#2643)

Also fix tests/test_core/test_tracing/apps/async_app.py: avoid
instantiating metrics at module level inside decorator arguments to
prevent API-key validation at import time in keyless CI environments.
@Jeel3011

Author

Hey @jeffreyip — just pushed the completed documentation page for AgentLoopDetectionMetric as requested.

The doc page (docs/content/docs/(agentic)/metrics-agent-loop-detection.mdx) now follows the full structure of the existing agentic metric docs:

  • Required Arguments — clarifies this is a trace-only metric (no tools_called / expected_tools needed)
  • Usage: evals_iterator, within components, and standalone patterns with full code examples
  • All 8 parameters documented including include_reason, async_mode, and an explicit note that there is no model parameter and why
  • How Is It Calculated — expanded with per-tier score tables (1.0 / 0.5 / 0.0) for each of the three sub-signals
  • Score Breakdown — documents the score_breakdown attribute
  • Configuring Sub-signals — shows how to toggle individual checks
  • Limitations & Design Decisions — explains the deterministic design choice, cycle detection heuristics, and stagnation trade-offs
  • Community attribution note at the bottom linking to AGeval and issue Feature: AgentLoopDetectionMetric — detect infinite loops and cyclical tool-call patterns in agent traces #2643

Happy to adjust anything — structure, wording, depth of any section. Just let me know 🙏
