@Steve-Dusty Steve-Dusty commented Dec 23, 2025

Problem

ReflexionAgent had the same death-spiral issue as the IRE agent, causing excessive iterations and API calls:

  1. Fragile score extraction - Failed to parse scores from LLM responses with markdown or varied formatting
  2. Never triggered early termination - Score defaulted to 0.5 when parsing failed, which never reached the 0.9 threshold
  3. Always ran full iterations - Even simple tasks used all max_loops iterations
  4. Exceeded timeout thresholds - Simple tasks: 61s, Complex tasks: 203s
  5. Wasted API calls - 9-15 LLM calls per task instead of 3

Root Cause

Identical to the IRE agent issue:

  • Early termination logic existed but depended on score extraction
  • Score parsing used a single fragile regex pattern: r"(?:final|overall)\s+score:?\s*(\d+(?:\.\d+)?)"
  • LLM responses with markdown (**Score**: 8/10) or different formats (Rating: 8/10) failed to parse
  • When extraction failed → defaulted to 0.5 → never met 0.9 threshold → ran all iterations
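
The failure mode above can be reproduced in isolation. The following is a minimal sketch, not the original code: `extract_score_old` is a hypothetical reconstruction, and both the case-insensitive matching and the division of 0-10 values by 10 are assumptions. Only the regex and the 0.5 default come from this description.

```python
import re

# The single pattern quoted above. Case-insensitive matching is an assumption.
OLD_PATTERN = re.compile(
    r"(?:final|overall)\s+score:?\s*(\d+(?:\.\d+)?)", re.IGNORECASE
)
OLD_DEFAULT = 0.5  # fallback value named in this description


def extract_score_old(evaluation: str) -> float:
    """Hypothetical reconstruction of the pre-fix extraction."""
    match = OLD_PATTERN.search(evaluation)
    if match is None:
        # Parsing failed: fall back to the default, which is below 0.9.
        return OLD_DEFAULT
    value = float(match.group(1))
    # Assumption: values above 1 are on a 0-10 scale and get rescaled.
    return value / 10 if value > 1 else value


for text in ("Final score: 9", "**Final Score**: 8/10", "Rating: 8/10"):
    print(f"{text!r} -> {extract_score_old(text)}")
```

The markdown case fails because the closing `**` sits between "Score" and the colon, so the pattern never reaches the digits.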

Solution

Applied the proven fix pattern from the IRE agent: robust score extraction plus improved termination logic.

1. Robust Score Extraction

Added _extract_score_robust() method with multiple fallback strategies:

def _extract_score_robust(self, evaluation: str) -> float:
    # Strategy 1: Multiple regex patterns (handles markdown, different formats)
    score_patterns = [
        r"(?:final|overall)\s+score:?\s*(\d+(?:\.\d+)?)",
        r"score:?\s*(\d+(?:\.\d+)?)\s*/\s*10",
        r"(?:rating|grade):?\s*(\d+(?:\.\d+)?)\s*/\s*10",
        r"(?:rating|grade):?\s*(\d+(?:\.\d+)?)",
    ]

    # Strategy 2: Context-aware patterns (X/10, X out of 10)
    # Strategy 3: Sentiment analysis fallback
    # Default: 0.6 (better than old 0.5)
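
The excerpt above is abridged, so here is a self-contained sketch of how the three strategies could fit together. It is an illustration under stated assumptions rather than the merged implementation: the function name `extract_score_sketch`, the markdown stripping step, the context patterns, the sentiment word lists, and the sentiment scores 0.8 and 0.3 are all hypothetical. The first four patterns and the 0.6 default are taken from this description.

```python
import re

DEFAULT_SCORE = 0.6  # default named in this description

# Strategy 1: the labelled patterns listed in the PR excerpt.
SCORE_PATTERNS = [
    r"(?:final|overall)\s+score:?\s*(\d+(?:\.\d+)?)",
    r"score:?\s*(\d+(?:\.\d+)?)\s*/\s*10",
    r"(?:rating|grade):?\s*(\d+(?:\.\d+)?)\s*/\s*10",
    r"(?:rating|grade):?\s*(\d+(?:\.\d+)?)",
]
# Strategy 2: unlabelled "X/10" and "X out of 10" (assumed patterns).
CONTEXT_PATTERNS = [
    r"(\d+(?:\.\d+)?)\s*/\s*10",
    r"(\d+(?:\.\d+)?)\s+out\s+of\s+10",
]
# Strategy 3: crude keyword sentiment (word lists are placeholders).
POSITIVE = {"excellent", "accurate", "strong", "thorough"}
NEGATIVE = {"poor", "incorrect", "weak", "incomplete"}


def _normalize(value: float) -> float:
    """Rescale 0-10 values to 0-1 (assumption) and clamp to [0, 1]."""
    if value > 1:
        value /= 10
    return max(0.0, min(1.0, value))


def extract_score_sketch(evaluation: str) -> float:
    # Drop markdown emphasis characters so "**Score**:" reads as "score:".
    text = re.sub(r"[*_`]", "", evaluation).lower()
    for pattern in SCORE_PATTERNS + CONTEXT_PATTERNS:
        match = re.search(pattern, text)
        if match:
            return _normalize(float(match.group(1)))
    # No number found: compare counts of positive and negative keywords.
    words = set(re.findall(r"[a-z]+", text))
    positive, negative = len(words & POSITIVE), len(words & NEGATIVE)
    if positive > negative:
        return 0.8  # placeholder value
    if negative > positive:
        return 0.3  # placeholder value
    return DEFAULT_SCORE
```

Ordering the patterns from most to least specific means a labelled "Final Score" wins over an incidental "3/10" elsewhere in the evaluation text.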

Now handles:
- **Final Score**: 8/10
- Rating: 8.6/10
- Grade: 8 out of 10
- Markdown formatting
- Sentiment-based scoring

2. Configuration Constants

EARLY_TERMINATION_THRESHOLD = 0.8  # Lower than 0.9 for realistic termination
DEFAULT_SCORE = 0.6  # Higher than 0.5 to increase termination chance
SCORE_IMPROVEMENT_THRESHOLD = 0.05  # Minimum improvement to continue

3. Dual Termination Conditions

# Condition 1: Score is high enough
if current_score >= EARLY_TERMINATION_THRESHOLD:  # 0.8 instead of 0.9
    logger.info(f"✓ High score achieved ({current_score:.2f}). Stopping early.")
    break

# Condition 2: Score not improving
if iteration > 0 and (current_score - prev_score) < SCORE_IMPROVEMENT_THRESHOLD:
    logger.info("✓ Score improvement minimal. Stopping early.")
    break
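
To show how the two conditions interact inside a loop, here is a small runnable sketch. The helper `run_with_early_stop` and its `score_for_iteration` callback are hypothetical stand-ins for the agent's act/evaluate/reflect cycle. Only the two constants and the two conditions come from this description, and the point at which `prev_score` is updated is an assumption.

```python
from typing import Callable, Tuple

EARLY_TERMINATION_THRESHOLD = 0.8
SCORE_IMPROVEMENT_THRESHOLD = 0.05


def run_with_early_stop(
    score_for_iteration: Callable[[int], float], max_loops: int = 3
) -> Tuple[int, float]:
    """Return (iterations used, best score) for a stubbed scoring function."""
    best_score = 0.0
    prev_score = 0.0
    iterations_used = 0
    for iteration in range(max_loops):
        # Stand-in for one generate-and-evaluate round of the agent.
        current_score = score_for_iteration(iteration)
        iterations_used = iteration + 1
        best_score = max(best_score, current_score)

        # Condition 1: score is high enough.
        if current_score >= EARLY_TERMINATION_THRESHOLD:
            break
        # Condition 2: score is not improving (skipped on the first pass).
        if iteration > 0 and (current_score - prev_score) < SCORE_IMPROVEMENT_THRESHOLD:
            break
        prev_score = current_score  # assumed update point
    return iterations_used, best_score


stalled = [0.5, 0.52, 0.9]
print(run_with_early_stop(lambda i: stalled[i]))
```

In the stalled sequence the loop stops after the second iteration, so the third score is never computed. That is the trade-off of the second condition: it saves calls but can stop before a later improvement.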

4. Progress Logging

============================================================
Processing task 1/1
============================================================
Task: Explain photosynthesis in one sentence...

--- Iteration 1/3 ---
Evaluation complete. Score: 0.80
Iteration 1 complete | Score: 0.80 | Best: 0.80
High score achieved (0.80 >= 0.8). Stopping early.

============================================================
Task complete | Iterations used: 1/3 | Best score: 0.80
============================================================

Changes

Modified Files:
- swarms/agents/flexion_agent.py - Core ReflexionAgent implementation

Key Changes:
- Line 2: Added import re
- Lines 11-14: Added configuration constants
- Lines 306-371: Added _extract_score_robust() method
- Lines 439-443: Updated evaluate() to use robust extraction
- Lines 603-673: Enhanced termination logic and progress logging


<!-- readthedocs-preview swarms start -->
----
📚 Documentation preview 📚: https://swarms--1266.org.readthedocs.build/en/1266/

<!-- readthedocs-preview swarms end -->
