The Self-Evolving Agent now supports two operational modes:
- Decoupled Mode (Recommended): Separates execution from learning - see ARCHITECTURE_DECOUPLED.md
- Legacy Mode (Documented Below): Traditional synchronous self-improvement loop
For the new decoupled architecture with DoerAgent and ObserverAgent, please refer to ARCHITECTURE_DECOUPLED.md.
The Self-Evolving Agent implements a continuous improvement loop where the agent learns from its mistakes and improves its behavior over time.
┌─────────────────────────────────────────────────────────────┐
│ SelfEvolvingAgent │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Memory │ │ AgentTools │ │ OpenAI API │ │
│ │ System │ │ │ │ Client │ │
│ └─────────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ system_instructions.json │ │
│ │ { │ │
│ │ "version": 1, │ │
│ │ "instructions": "...", │ │
│ │ "improvements": [...] │ │
│ │ } │ │
│ └──────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
START
│
├──► Attempt 1
│ │
│ ├──► TASK: Receive query from user
│ │
│ ├──► ACT: Agent processes query
│ │ • Load system instructions from JSON
│ │ • Add tool information to context
│ │ • Call LLM to generate response
│ │ • Return agent response
│ │
│ ├──► REFLECT: Evaluate response
│ │ • Call reflection LLM with query + response
│ │ • Get score (0-1) and critique
│ │
│ ├──► Check score >= 0.8?
│ │ │
│ │ ├─YES─► SUCCESS! Return results
│ │ │
│ │ └─NO──► EVOLVE
│ │ • Call evolution LLM with critique
│ │ • Generate new system instructions
│ │ • Save to JSON with version++
│ │ • Log improvement
│ │
├──► Attempt 2 (with evolved instructions)
│ │
│ └──► [Same flow as Attempt 1]
│
├──► Attempt 3 (with further evolved instructions)
│ │
│ └──► [Same flow as Attempt 1]
│
└──► Max retries reached
• Return best attempt
• Mark as failure if score < 0.8
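The retry loop above can be sketched in Python. Here `act`, `reflect`, and `evolve` are hypothetical stand-ins for the three LLM calls (the real implementation wires these to the OpenAI client), and the sketch reads the same `SCORE_THRESHOLD` and `MAX_RETRIES` environment variables listed in the configuration section:

```python
import os

# Success threshold and retry bound, with the documented defaults.
SCORE_THRESHOLD = float(os.environ.get("SCORE_THRESHOLD", "0.8"))
MAX_RETRIES = int(os.environ.get("MAX_RETRIES", "3"))

def run_with_self_evolution(query, act, reflect, evolve):
    """Attempt the task up to MAX_RETRIES times, evolving instructions on failure.

    act(query) -> response string          (LLM Call 1)
    reflect(query, response) -> (score, critique)   (LLM Call 2)
    evolve(critique) -> None, updates instructions  (LLM Call 3)
    """
    best = {"score": -1.0, "response": None}
    for attempt in range(1, MAX_RETRIES + 1):
        response = act(query)                        # ACT
        score, critique = reflect(query, response)   # REFLECT
        if score > best["score"]:
            best = {"score": score, "response": response}
        if score >= SCORE_THRESHOLD:
            return {"success": True, "attempts": attempt, **best}
        evolve(critique)                             # EVOLVE, then retry
    return {"success": False, "attempts": MAX_RETRIES, **best}
```

Note that the best-scoring attempt is tracked throughout, so "Return best attempt" on exhaustion falls out naturally.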
Purpose: Persist and manage system instructions
class MemorySystem:
load_instructions() → dict
save_instructions(dict) → None
get_system_prompt() → str
    update_instructions(new_text, critique) → None

File Format (system_instructions.json):
{
"version": 2,
"instructions": "Current system prompt text",
"improvements": [
{
"version": 2,
"timestamp": "2024-01-01T12:00:00",
"critique": "What was wrong that led to this update"
}
]
}

Purpose: Provide capabilities the agent can use
Available tools:
- calculate(expression) - Mathematical evaluation
- get_current_time() - Current date/time
- string_length(text) - String length calculation
- get_available_tools() - List all available tools
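A minimal sketch of what `AgentTools` could look like, assuming each tool is a public method and `get_available_tools()` uses introspection (the real class may be organized differently):

```python
import datetime

class AgentTools:
    """Hypothetical sketch of the tool class named above."""

    def calculate(self, expression: str) -> float:
        # eval() with stripped builtins is enough for a sketch;
        # a production agent should use a safe expression parser.
        return eval(expression, {"__builtins__": {}}, {})

    def get_current_time(self) -> str:
        return datetime.datetime.now().isoformat()

    def string_length(self, text: str) -> int:
        return len(text)

    def get_available_tools(self) -> list:
        # List every public method, i.e. every tool.
        return [name for name in dir(self) if not name.startswith("_")]
```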
LLM Call 1: Agent Execution
Input:
• System: Current instructions + Tool descriptions
• User: Query
Processing:
• LLM analyzes query
• Determines if tools are needed
• Generates response
Output:
• Agent response (string)
LLM Call 2: Response Evaluation
Input:
• Query: Original user question
• Response: Agent's answer
• Criteria: Correctness, Completeness, Clarity, Tool Usage
Processing:
• Evaluator LLM scores response
• Generates detailed critique
• Returns as JSON
Output:
• score: float (0-1)
• critique: string
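Since the evaluator returns JSON, a small parsing helper keeps the loop robust to out-of-range scores. This helper and its field names are an assumption about the response shape, not the project's actual code:

```python
import json

def parse_reflection(raw: str) -> tuple:
    """Parse the evaluator's JSON reply into (score, critique).

    Hypothetical helper: assumes a {"score": ..., "critique": ...} payload
    and clamps the score into [0, 1].
    """
    data = json.loads(raw)
    score = max(0.0, min(1.0, float(data.get("score", 0.0))))
    critique = str(data.get("critique", ""))
    return score, critique
```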
LLM Call 3: Instruction Evolution
Input:
• Current instructions
• Recent query & response
• Critique explaining issues
Processing:
• Evolution LLM analyzes critique
• Identifies what to improve
• Rewrites system instructions
Output:
• new_instructions: string
Side Effect:
• Updates system_instructions.json
• Increments version number
• Logs improvement history
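The side effect above maps cleanly onto the file format shown earlier: bump the version, swap in the new instructions, and append a record to `improvements`. A sketch of that state update (file I/O omitted so the transformation is easy to see; function name is hypothetical):

```python
import datetime

def apply_evolution(state: dict, new_instructions: str, critique: str) -> dict:
    """Return an updated system_instructions state with version incremented
    and the critique logged to the improvement history."""
    new_version = state["version"] + 1
    return {
        "version": new_version,
        "instructions": new_instructions,
        "improvements": state["improvements"] + [{
            "version": new_version,
            "timestamp": datetime.datetime.now().isoformat(),
            "critique": critique,
        }],
    }
```

Returning a new dict rather than mutating in place keeps the previous version available until the JSON write succeeds.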
Initial Instructions (v1):
"You are a helpful AI assistant. Provide accurate responses."
Query: "What is 15 * 24 + 100?"
Attempt 1:
- Act: "15 times 24 is 360, plus 100 equals 460."
- Reflect: Score: 0.6
- Critique: "The agent calculated correctly but didn't mention the calculator tool. The agent should explicitly state when tools are available and appropriate to use."
- Evolve: Generate new instructions
Evolved Instructions (v2):
"You are a helpful AI assistant with access to tools. When you have a
calculation tool available, always mention it explicitly and explain
that you would use it. Be clear about which tools apply to each query."
Attempt 2:
- Act: "I can use the calculate() tool for this. Let me solve 15 * 24 + 100. The result would be 460."
- Reflect: Score: 0.9
- Success! ✓
- Separation of Concerns: Three separate LLM calls for different purposes
  - Agent (task execution)
  - Reflector (evaluation)
  - Evolver (improvement)
- Persistence: All improvements saved to JSON for continuity
- Iterative Learning: Each attempt uses lessons from previous failures
- Bounded Retries: Maximum 3 attempts to prevent infinite loops
- Threshold-based: Clear success criteria (score >= 0.8)
- Audit Trail: Full history of improvements tracked
Environment variables control behavior:
OPENAI_API_KEY=sk-... # Required
AGENT_MODEL=gpt-4o-mini # Model for acting
REFLECTION_MODEL=gpt-4o-mini # Model for reflection
EVOLUTION_MODEL=gpt-4o-mini # Model for evolution
SCORE_THRESHOLD=0.8 # Success threshold
MAX_RETRIES=3                 # Maximum attempts

The system can be extended by:
- Adding Tools: Extend the AgentTools class
- Custom Evaluators: Modify reflection criteria
- Different Models: Use different models for each phase
- Persistence: Add database instead of JSON
- Monitoring: Add logging, metrics, dashboards
- Multi-turn: Support conversation history
- Tool Execution: Actually execute tools, not just describe them
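For the first extension point, adding a tool can be as simple as adding one public method, assuming (as sketched earlier) that the tool class exposes tools via introspection. The class and method names here are illustrative only:

```python
class MyTools:
    """Hypothetical tools class showing the 'Adding Tools' extension point."""

    def reverse_string(self, text: str) -> str:
        """A new custom tool: reverse the input string."""
        return text[::-1]

    def get_available_tools(self) -> list:
        # Introspection picks up the new method automatically.
        return [name for name in dir(self) if not name.startswith("_")]
```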