feat: add AgentLoopDetectionMetric — deterministic loop detection for agent traces#2645
Open
Jeel3011 wants to merge 5 commits into
…ent traces

Adds a new trace-only, referenceless agentic metric that detects infinite loops, cyclical tool-call patterns, and reasoning stagnation in agent execution traces.

Three configurable sub-signals:
- Tool call repetition: hash-based duplicate counting
- Reasoning stagnation: bigram sliding window overlap
- Call graph cycles: DFS back-edge detection

Closes confident-ai#2643
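As a rough illustration of the hash-based duplicate counting named above, the sketch below hashes each tool call by its name and key-sorted arguments, then counts repeats. The call shape (`name`, `args`), the helper names, and the example threshold of 3 are assumptions made for illustration, not the PR's actual code. The tiering in `repetition_score` is one reading of the PR summary (0.5 once the count reaches `repetition_threshold`, 0.0 at twice that).

```python
from __future__ import annotations

import hashlib
import json
from collections import Counter


def tool_call_key(name: str, args: dict) -> str:
    """Hash a tool call by its name and key-sorted arguments."""
    # sort_keys=True makes the hash independent of argument order.
    payload = json.dumps([name, args], sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def max_repetition(tool_calls: list[dict]) -> int:
    """Return how often the most-repeated (name, args) pair occurs."""
    counts = Counter(
        tool_call_key(call["name"], call.get("args", {})) for call in tool_calls
    )
    return max(counts.values(), default=0)


def repetition_score(max_count: int, threshold: int) -> float:
    """Assumed tiers: 1.0 below threshold, 0.5 at threshold, 0.0 at 2x threshold."""
    if max_count >= 2 * threshold:
        return 0.0
    if max_count >= threshold:
        return 0.5
    return 1.0


calls = [
    {"name": "search", "args": {"q": "paris weather", "limit": 5}},
    {"name": "search", "args": {"limit": 5, "q": "paris weather"}},  # same call, keys reordered
    {"name": "search", "args": {"q": "rome weather", "limit": 5}},  # different args, not a repeat
]
print(max_repetition(calls))  # → 2
```

Hashing the serialized `(name, args)` pair rather than the name alone is what keeps calls with different arguments from being penalized.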
…mprove stagnation scorer

- _score_call_graph_cycles: rewrite to traverse the real parent→child span tree (nested 'children' list) instead of treating sequential order as edges. DFS now tracks the ancestry path per root-to-leaf walk; a back-edge is only reported when a type:name label appears twice on the *same* path, so sibling repetition (tool A called twice at the same depth) is correctly ignored.
- Remove 'model' parameter from __init__: it was accepted but never used in any scoring logic. The metric is now fully deterministic, with no LLM / API key required (document as a feature, not a limitation).
- _score_reasoning_stagnation: add _STOP_WORDS filter and a minimum-words guard (< 20 meaningful tokens → skip pair) to prevent boilerplate preambles from inflating Jaccard similarity into false stagnation alerts.
…nation signal, document limitations

Cycle detection:
- DFS labels now include a truncated input hash (type:name:input_hash) so two same-name spans with different inputs are correctly treated as distinct nodes. True recursive loops (same name + same input) are still caught. Documented trade-off vs UUID-based identity.

Stagnation detection:
- Added difflib.SequenceMatcher ratio as a secondary signal alongside bigram Jaccard. Take max(jaccard, seq_ratio) to catch both literal repetition and reordered-but-identical phrasing.

Tests:
- Test 7 updated: true cycle now has matching input on both spans.
- Test 9 (new): same-name agents with different inputs → no false positive.
- Test 10 (new): reordered phrasing caught by SequenceMatcher.
- All 10 tests pass, zero API key required.

Documentation:
- Added 'Limitations & Design Decisions' section to the MDX doc page explaining why model param is absent, why input_hash is used for cycle detection, and why dual-signal stagnation is strictly better but still has known gaps.
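The ancestry-path check described in this commit could look roughly like the sketch below. It is not the PR's `_score_call_graph_cycles`. The span dict shape (`type`, `name`, `input`, `children`), the helper names, and the 8-character hash truncation are all assumptions made for illustration.

```python
from __future__ import annotations

import hashlib
import json


def span_label(span: dict) -> str:
    """Build a type:name:input_hash label (8-char truncation is an assumption)."""
    raw = json.dumps(span.get("input"), sort_keys=True, default=str)
    input_hash = hashlib.sha256(raw.encode("utf-8")).hexdigest()[:8]
    return f"{span.get('type')}:{span.get('name')}:{input_hash}"


def has_ancestry_cycle(span: dict, ancestors: frozenset = frozenset()) -> bool:
    """Return True if a label repeats on a single root-to-leaf path."""
    label = span_label(span)
    if label in ancestors:
        return True  # same type, name, and input seen higher on this path
    path = ancestors | {label}  # new set per branch, so siblings never share state
    return any(has_ancestry_cycle(child, path) for child in span.get("children", []))


# A planner that re-invokes itself with the same input: flagged.
recursive = {
    "type": "agent", "name": "planner", "input": "plan trip",
    "children": [{"type": "agent", "name": "planner", "input": "plan trip"}],
}
# The same tool called twice at the same depth: not a cycle.
siblings = {
    "type": "agent", "name": "root", "input": "x",
    "children": [
        {"type": "tool", "name": "search", "input": "q"},
        {"type": "tool", "name": "search", "input": "q"},
    ],
}
print(has_ancestry_cycle(recursive), has_ancestry_cycle(siblings))  # → True False
```

Passing an immutable `frozenset` down each branch means a label only counts as a repeat when it sits on the current path, which is why sibling repetition is ignored.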
Author
I would like to hear from the managers soon!
- Add Required Arguments section (trace-only, needs input + actual_output)
- Add 'Within components' usage pattern with code example
- Add 'As a standalone' usage with caution note
- Document all 8 parameters including include_reason and async_mode
- Add explicit note that no model parameter exists and why (deterministic design)
- Expand How Is It Calculated with per-tier score tables for all 3 sub-signals
- Add Score Breakdown section with score_breakdown attribute example
- Add Configuring Sub-signals section with toggle examples
- Add community attribution note (author, AGeval, issue confident-ai#2643)

Also fix tests/test_core/test_tracing/apps/async_app.py: avoid instantiating metrics at module level inside decorator arguments to prevent API-key validation at import time in keyless CI environments.
Author
Hey @jeffreyip — just pushed the completed documentation page for The doc page (
Happy to adjust anything: structure, wording, depth of any section. Just let me know 🙏
Summary
Adds `AgentLoopDetectionMetric`, a fully deterministic (no LLM / API key required) metric that detects infinite loops and cyclical execution patterns in agent traces. It analyzes three independent sub-signals and returns a weighted score from 0.0 (severe looping) to 1.0 (clean execution).

This metric is designed to run in production with no LLM cost and no model-call latency: every sub-signal is computed locally with hashing, set operations, or sequence comparison.
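To make "weighted score" concrete, here is a minimal sketch of combining per-signal scores into one value in [0.0, 1.0]. The weights (0.4 / 0.3 / 0.3), the signal keys, and the choice to renormalize over enabled signals are hypothetical. The PR description does not state the actual weights or how disabled sub-signals are handled.

```python
from __future__ import annotations

# Hypothetical weights, chosen only so the example sums to 1.0.
EXAMPLE_WEIGHTS = {"tool_repetition": 0.4, "stagnation": 0.3, "cycles": 0.3}


def combine_scores(sub_scores: dict[str, float | None], weights: dict[str, float]) -> float:
    """Weighted average over enabled sub-signals (None marks a disabled signal)."""
    active = {name: score for name, score in sub_scores.items() if score is not None}
    total_weight = sum(weights[name] for name in active)
    if total_weight == 0:
        raise ValueError("at least one sub-signal must be enabled")
    # Renormalize so disabling a signal does not lower the maximum score.
    return sum(weights[name] * score for name, score in active.items()) / total_weight


all_enabled = combine_scores(
    {"tool_repetition": 0.5, "stagnation": 1.0, "cycles": 1.0}, EXAMPLE_WEIGHTS
)
repetition_disabled = combine_scores(
    {"tool_repetition": None, "stagnation": 1.0, "cycles": 1.0}, EXAMPLE_WEIGHTS
)
print(round(all_enabled, 2), round(repetition_disabled, 2))  # → 0.8 1.0
```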
Sub-signals

Tool Call Repetition
Flags the same `(name, args)` call repeated ≥ N times. Hashes each tool call by `(name, sorted_args)`. Score degrades at `repetition_threshold` (0.5) and at `2× threshold` (0.0). Different arguments = different calls = not penalized.

Reasoning Stagnation
Uses two complementary signals and takes the maximum:
- Bigram Jaccard overlap: catches literal repetition
- `SequenceMatcher` ratio (`difflib`): catches reordered-but-identical phrasing

Outputs shorter than 20 meaningful words are skipped (Jaccard is noisy at that scale).
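The dual-signal comparison might be sketched as follows. This is not the PR's `_score_reasoning_stagnation`. The stop-word list, the tokenizer regex, and the choice to run `SequenceMatcher` over token lists rather than raw strings are assumptions. Only the 20-word guard and the `max(jaccard, seq_ratio)` rule come from the PR text.

```python
from __future__ import annotations

import re
from difflib import SequenceMatcher

# Illustrative stop-word list; the PR's _STOP_WORDS contents are not shown.
STOP_WORDS = frozenset({"the", "a", "an", "to", "of", "and", "i", "is", "in", "for", "will", "now"})
MIN_WORDS = 20  # outputs with fewer meaningful tokens are skipped


def meaningful_tokens(text: str) -> list[str]:
    """Lowercase, tokenize, and drop stop words."""
    return [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOP_WORDS]


def bigram_jaccard(a: list[str], b: list[str]) -> float:
    """Jaccard overlap between the sets of adjacent token pairs."""
    bigrams_a, bigrams_b = set(zip(a, a[1:])), set(zip(b, b[1:]))
    union = bigrams_a | bigrams_b
    return len(bigrams_a & bigrams_b) / len(union) if union else 0.0


def stagnation_similarity(prev: str, curr: str) -> float | None:
    """Return max(jaccard, seq_ratio), or None when either output is too short."""
    a, b = meaningful_tokens(prev), meaningful_tokens(curr)
    if len(a) < MIN_WORDS or len(b) < MIN_WORDS:
        return None  # too little text for a stable similarity estimate
    seq_ratio = SequenceMatcher(None, a, b).ratio()
    return max(bigram_jaccard(a, b), seq_ratio)


long_output = " ".join(f"step{i}" for i in range(25))
print(stagnation_similarity(long_output, long_output))  # → 1.0
print(stagnation_similarity("retrying", "retrying"))  # → None
```

Returning `None` for short outputs lets the caller skip the pair entirely instead of treating it as either similar or dissimilar.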
Call Graph Cycles
DFS on the nested `children` tree. Labels each span as `type:name:input_hash`. A cycle is flagged when the same label appears twice on the same root-to-leaf ancestry path. Including `input_hash` prevents false positives when two different agents share a name but receive different inputs.

Files changed
- `deepeval/metrics/agent_loop_detection/agent_loop_detection.py`
- `deepeval/metrics/agent_loop_detection/__init__.py`
- `tests/test_agent_loop_detection.py`
- `docs/content/docs/(agentic)/metrics-agent-loop-detection.mdx`
- `docs/content/docs/(agentic)/meta.json`

Design decisions & known limitations
Why no `model` parameter?
Deterministic by design. Accepting `model` would be misleading since no sub-signal uses an LLM. A future LLM-as-judge stagnation mode could be added behind a feature flag.

Cycle detection uses `type:name:input_hash`, not span UUIDs
`create_nested_spans_dict` strips UUIDs from the trace dict, so exact span identity is unavailable. The `input_hash` heuristic is the best available: true recursive loops pass the same input back, so the hash matches. Two same-name spans with different inputs are correctly treated as distinct.

Stagnation will miss semantically-different wording
"Search for Paris weather" vs "Look up the forecast in Paris" won't be caught. Fixing this would require an LLM-as-judge, sacrificing the zero-cost property.
Test matrix (10 tests, all pass)
- `test_clean_trace_passes`
- `test_repeated_tool_calls_detected`
- `test_no_trace_returns_zero`
- `test_reasoning_stagnation_detected`
- `test_disable_tool_repetition_check`
- `test_score_combines_with_correct_weights`
- `test_call_graph_cycle_detected`
- `test_sequential_same_name_not_a_cycle`
- `test_same_name_different_input_not_a_cycle`
- `test_reordered_stagnation_detected`

CI note
The `test_core/` collection error (`async_app.py:67` instantiating `AnswerRelevancyMetric()` at module level without an API key) is a pre-existing issue on `main` (commit `90f398afc`). This PR does not modify that file.