In productivity tools, a single success metric doesn't work:
The Trapped User Scenario:
- User asks: "How do I reset my password?"
- 20 turns later, they're still talking to the bot
- They are not "engaged" — they are trapped
- High engagement = FAILURE
The Creative Conversation Scenario:
- User asks: "Help me design a microservices architecture"
- 20 turns of deep exploration
- This is a valuable brainstorming session
- High engagement = SUCCESS
Key Insight: We cannot use a single metric for success. We must detect the user's intent from the first interaction.
Troubleshooting Intent
Characteristics:
- User has a specific problem
- Wants quick resolution
- Examples: "How do I reset my password?", "Why isn't this working?", "Fix this error"
Success Metric: Time-to-Resolution
- Success: Resolved in ≤ 3 turns
- Failure: > 3 turns means user is trapped, not engaged
Reasoning:
- Users want to get unstuck and move on
- Every additional turn is friction
- 20 turns = trapped in a support loop
Brainstorming Intent
Characteristics:
- User wants to explore ideas
- Open-ended discussion
- Examples: "Help me design a system", "Let's explore approaches", "What are the trade-offs?"
Success Metric: Depth of Context
- Success: ≥ 5 turns with rich discussion
- Failure: Too short means we failed to be creative enough
Reasoning:
- Users want deep exploration
- Short conversations miss opportunities
- 2 turns = insufficient creative depth
The IntentDetector class analyzes the first user query to determine intent:
```python
from intent_detection import IntentDetector

detector = IntentDetector()
result = detector.detect_intent("How do I reset my password?")
# Result: {"intent": "troubleshooting", "confidence": 0.95, "reasoning": "..."}
```

Detection Process:
- User's first query is sent to LLM
- LLM classifies as "troubleshooting" or "brainstorming"
- Returns intent type, confidence score, and reasoning
- Intent is stored in telemetry for the entire conversation
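To make the shape of that classification step concrete, here is a minimal sketch. It substitutes a keyword heuristic for the LLM call, so the cue list, function name, and confidence values are illustrative assumptions, not the actual implementation — only the shape of the returned dict mirrors `detect_intent`.

```python
# Hypothetical sketch: the real IntentDetector sends the query to an LLM;
# this keyword heuristic only illustrates the shape of the result.
TROUBLESHOOTING_CUES = ("how do i", "why isn't", "fix", "error", "not working")

def classify_intent(query: str) -> dict:
    q = query.lower()
    if any(cue in q for cue in TROUBLESHOOTING_CUES):
        intent, confidence = "troubleshooting", 0.9
    else:
        intent, confidence = "brainstorming", 0.6
    return {
        "intent": intent,
        "confidence": confidence,
        "reasoning": f"heuristic match against cues in: {query!r}",
    }

print(classify_intent("How do I reset my password?")["intent"])  # troubleshooting
```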
The system tracks multi-turn conversations:
```python
# Turn 1: Intent detected
doer.run(
    query="How do I reset my password?",
    conversation_id="conv-123",
    turn_number=1,  # Intent detected here
)

# Turn 2+: Same conversation
doer.run(
    query="I tried that, still not working",
    conversation_id="conv-123",
    turn_number=2,  # Same intent used
)
```

Key Features:
- conversation_id: Groups related turns together
- turn_number: Tracks position in the conversation (1-indexed)
- intent_type: Set on the first turn, inherited by subsequent turns
- intent_confidence: Confidence in the detected intent
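One way to implement that first-turn inheritance is to cache the detected intent per conversation. The sketch below is an assumption about mechanism (the real system stores intent in telemetry), and the `ConversationIntentStore` name is hypothetical.

```python
# Sketch: intent detected on turn 1 is cached and reused for every later
# turn of the same conversation. Assumes turn 1 is always seen first.
class ConversationIntentStore:
    def __init__(self):
        self._intents = {}  # conversation_id -> (intent_type, confidence)

    def record_turn(self, conversation_id, turn_number, detect):
        if turn_number == 1:
            # Detect intent only on the first turn.
            self._intents[conversation_id] = detect()
        # Subsequent turns inherit the stored intent.
        return self._intents[conversation_id]

store = ConversationIntentStore()
first = store.record_turn("conv-123", 1, lambda: ("troubleshooting", 0.95))
second = store.record_turn("conv-123", 2, lambda: ("brainstorming", 0.5))
assert second == first == ("troubleshooting", 0.95)  # turn 2 inherits turn 1's intent
```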
The ObserverAgent evaluates conversations using intent-specific metrics:
```python
from observer import ObserverAgent

observer = ObserverAgent()
evaluation = observer.evaluate_conversation_by_intent("conv-123")

# For troubleshooting:
# {"success": False, "turn_count": 5, "reasoning": "User trapped..."}

# For brainstorming:
# {"success": True, "turn_count": 10, "depth_score": 0.8, "reasoning": "..."}
```

Evaluation Process:
- Observer collects all events for a conversation
- Retrieves intent from first turn
- Applies intent-specific success criteria
- Returns evaluation with success/failure status
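The dispatch step above can be sketched as a single function over a conversation's events. The event shape and inline thresholds are simplified assumptions; in the real system the criteria live in IntentMetrics.

```python
# Sketch of intent-specific evaluation dispatch over collected events.
def evaluate_conversation(events):
    intent = events[0]["intent_type"]           # intent is set on turn 1
    turn_count = max(e["turn_number"] for e in events)
    if intent == "troubleshooting":
        # Quick resolution is success; long threads mean a trapped user.
        success = turn_count <= 3
        reasoning = "resolved quickly" if success else "user trapped"
    else:  # brainstorming
        # Depth is success; short threads mean we failed to engage.
        success = turn_count >= 5
        reasoning = "deep exploration" if success else "too shallow"
    return {"success": success, "turn_count": turn_count, "reasoning": reasoning}

events = [{"intent_type": "troubleshooting", "turn_number": n} for n in (1, 2, 3, 4, 5)]
print(evaluate_conversation(events)["success"])  # False: 5 turns exceeds the 3-turn limit
```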
```python
from intent_detection import IntentMetrics

# Quick resolution = SUCCESS
result = IntentMetrics.evaluate_troubleshooting(turn_count=2, resolved=True)
# {"success": True, "metric": "time_to_resolution", ...}

# User trapped = FAILURE
result = IntentMetrics.evaluate_troubleshooting(turn_count=5, resolved=True)
# {"success": False, "reasoning": "User trapped in conversation..."}
```

Thresholds:
- Max acceptable turns: 3
- > 3 turns = Trapped user (failure)
```python
# Deep exploration = SUCCESS
result = IntentMetrics.evaluate_brainstorming(
    turn_count=10,
    context_depth_score=0.8,
)
# {"success": True, "metric": "depth_of_context", ...}

# Too short = FAILURE
result = IntentMetrics.evaluate_brainstorming(
    turn_count=2,
    context_depth_score=0.5,
)
# {"success": False, "reasoning": "Too short, failed to be creative..."}
```

Thresholds:
- Min acceptable turns: 5
- Min acceptable depth: 0.6 (on a scale of 0-1)
- Too few turns or low depth = Failed to engage creatively
Context Depth Calculation:
- Analyzes conversation history
- Considers response length, diversity of topics
- Returns score 0-1 (0 = shallow, 1 = deep)
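A simple score along those lines could combine turn length with vocabulary size. This is an illustrative formula under stated assumptions (the saturation points of 100 words per turn and 50 unique words are invented), not the system's actual calculation.

```python
def context_depth_score(responses):
    """Rough depth score in [0, 1]: longer turns and a larger vocabulary
    both push the score up. Illustrative formula only."""
    if not responses:
        return 0.0
    words = " ".join(responses).split()
    avg_len = sum(len(r.split()) for r in responses) / len(responses)
    length_component = min(avg_len / 100.0, 1.0)             # saturate at ~100 words/turn
    vocabulary_component = min(len(set(words)) / 50.0, 1.0)  # saturate at ~50 unique words
    return round(0.5 * length_component + 0.5 * vocabulary_component, 2)

shallow = context_depth_score(["ok", "sure"])
deep = context_depth_score([
    "We could compare event sourcing with CQRS and weigh "
    "consistency against latency for each service boundary"
])
assert shallow < deep  # terse replies score lower than substantive ones
```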
```python
from agent import DoerAgent
from observer import ObserverAgent
import uuid

# Initialize
doer = DoerAgent()
conversation_id = str(uuid.uuid4())

# Multi-turn conversation
doer.run("How do I reset my password?", conversation_id=conversation_id, turn_number=1)
doer.run("Thanks!", conversation_id=conversation_id, turn_number=2)

# Evaluate
observer = ObserverAgent()
observer.process_events()  # Applies intent-based evaluation
```

Run the demo script:

```
python example_intent_detection.py
```

This demonstrates:
- Troubleshooting with quick resolution (SUCCESS)
- Troubleshooting with user trapped (FAILURE)
- Brainstorming with deep exploration (SUCCESS)
- Brainstorming that's too shallow (FAILURE)
```
python test_intent_detection.py
```

Tests the intent detection system without requiring API keys.
Environment variables (in .env):

```
# Model for intent detection (optional, defaults to gpt-4o-mini)
INTENT_MODEL=gpt-4o-mini
```
- Accurate Success Measurement
  - Different intents have different success criteria
  - No more false positives from "engaged" trapped users
- Better Learning
  - Observer learns from intent-specific failures
  - Troubleshooting: "Resolve faster"
  - Brainstorming: "Go deeper"
- Prevents Misinterpretation
  - 20-turn troubleshooting = FAILURE (trapped)
  - 20-turn brainstorming = SUCCESS (engaged)
- Automatic Detection
  - Intent detected from the first interaction
  - No manual labeling required
  - Works with any query type
```json
{
  "event_type": "task_complete",
  "timestamp": "2024-01-01T12:00:00",
  "query": "How do I reset my password?",
  "agent_response": "...",
  "conversation_id": "conv-123",
  "turn_number": 1,
  "intent_type": "troubleshooting",
  "intent_confidence": 0.95
}
```

Intent-Based Evaluation Statistics:

```
🔧 Troubleshooting Conversations: 2
   ❌ Failed (>3 turns): 1
💡 Brainstorming Conversations: 2
   ❌ Failed (too shallow): 1
```
- intent_detection.py: Core intent detection and metrics
- example_intent_detection.py: Demo script
- test_intent_detection.py: Test suite
- INTENT_DETECTION.md: This documentation
Potential improvements:
- More intent types (research, comparison, tutorial)
- Dynamic threshold adjustment based on domain
- Intent confidence thresholds
- Intent transition detection (troubleshooting → brainstorming)
- Per-user intent patterns
- Main README: README.md
- Agent Architecture: ARCHITECTURE.md
- Decoupled Architecture: ARCHITECTURE_DECOUPLED.md