Android World Agent - Phase 3: Evaluation & Improvement

This phase implements advanced evaluation capabilities, few-shot prompting, and self-reflection for the Android World agent.

🚀 Features

1. Enhanced Prompting Strategies

- Few-shot Examples: Pre-defined examples to improve model understanding
- Chain-of-Thought: Step-by-step reasoning prompts
- Enhanced Templates: More detailed and structured prompts

2. Self-Reflection

- Automatic Reflection: Agent analyzes its own decisions
- Learning Feedback: Identifies areas for improvement
- Decision Tracking: Records reasoning behind each action

3. Comprehensive Evaluation Metrics

- Step-level Accuracy: Per-step performance measurement
- Episode Success Rate: Complete task success tracking
- Task-specific Analysis: Performance breakdown by task type
- Error Pattern Analysis: Common mistake identification
- Visualization: Charts and graphs for performance analysis

📊 Latest Benchmark Results (2025-07-17)

Summary Statistics

- Total Episodes: 27
- Total Steps: 85
- Correct Steps: 51
- Step Accuracy: 60.00%
- Successful Episodes: 7
- Episode Success Rate: 25.93%
- Average Steps per Episode: 3.1
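
The derived figures follow directly from the raw counts. A minimal sketch of the arithmetic, using only the totals reported in this section (the variable names are illustrative and not taken from `src/evaluation.py`):

```python
# Raw counts from the 2025-07-17 benchmark run reported above.
total_episodes = 27
total_steps = 85
correct_steps = 51
successful_episodes = 7

# Derived metrics, expressed the same way as in the summary.
step_accuracy = 100 * correct_steps / total_steps
episode_success_rate = 100 * successful_episodes / total_episodes
avg_steps_per_episode = total_steps / total_episodes

print(f"Step Accuracy: {step_accuracy:.2f}%")
print(f"Episode Success Rate: {episode_success_rate:.2f}%")
print(f"Average Steps per Episode: {avg_steps_per_episode:.1f}")
```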

Task-Specific Performance

- uninstall_slack: 66.67%
- take_photo: 100.00%
- send_message: 33.33%
- search_wifi: 66.67%
- set_alarm: 75.00%
- check_battery: 0.00%
- turn_on_bluetooth: 50.00%
- search_weather_chrome: 100.00%
- add_contact_alice: 100.00%
- mute_phone: 50.00%
- directions_airport: 33.33%
- change_wallpaper: 16.67%
- share_photo_email: 20.00%
- enable_dark_mode: 50.00%
- update_software: 33.33%
- connect_bluetooth_speaker: 100.00%
- block_spam_number: 33.33%
- turn_on_airplane_mode: 50.00%
- delete_all_alarms: 0.00%
- reply_email_bob: 66.67%
- change_language_spanish: 100.00%
- RecipeAddMultipleRecipesFromImage: 100.00%
- RecipeAddMultipleRecipesFromMarkor: 100.00%
- RecipeAddMultipleRecipesFromMarkor2: 60.00%
- RecipeAddSingleRecipe: 50.00%
- RecipeDeleteDuplicateRecipes: 75.00%
- RecipeDeleteDuplicateRecipes2: 75.00%

Error Analysis

- Wrong Element Clicked: 27 occurrences
- Wrong Action Type (Click vs Type): 4 occurrences
- Wrong Text Typed: 2 occurrences
- Wrong Action Type (Type vs Click): 1 occurrence
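
These four categories account for all 34 incorrect steps (85 total minus 51 correct), and wrong-element clicks make up roughly four fifths of them. A small sketch that checks this and prints each category's share, using only the counts reported here:

```python
from collections import Counter

# Error counts as reported above; step totals come from Summary Statistics.
error_counts = Counter({
    "Wrong Element Clicked": 27,
    "Wrong Action Type (Click vs Type)": 4,
    "Wrong Text Typed": 2,
    "Wrong Action Type (Type vs Click)": 1,
})
total_steps, correct_steps = 85, 51

total_errors = sum(error_counts.values())
# Every incorrect step should fall into exactly one category.
assert total_errors == total_steps - correct_steps

for category, count in error_counts.most_common():
    share = 100 * count / total_errors
    print(f"{category}: {count} ({share:.1f}% of errors)")
```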

Sample Errors

Error 1:
- Episode: uninstall_slack
- Goal: Uninstall the Slack app
- App: Apps
- Predicted: CLICK("Unknown")
- Ground Truth: CLICK("Slack")

Error 2:
- Episode: send_message
- Goal: Send a message to John
- App: Chat
- Predicted: TYPE("Text Input", "Hello John")
- Ground Truth: TYPE("Text Input", "Hello John!")

Error 3:
- Episode: send_message
- Goal: Send a message to John
- App: Chat
- Predicted: TYPE("Text Input", "Hello John")
- Ground Truth: CLICK("Send")

📈 Latest Visualization

Full PDF report and detailed logs:

- See c_reports/ for the latest PDF report (with all stepwise logs and visualizations)
- See results/ for all raw data, logs, and per-run reports

📁 File Structure

llm_agent_eval/
├── src/
│   ├── agent.py              # Enhanced agent with reflection
│   ├── prompts.py            # Few-shot and enhanced prompts
│   ├── evaluation.py         # Comprehensive evaluation metrics
│   └── __init__.py
├── test_enhanced_agent.py    # Phase 3 test script
├── requirements.txt          # Dependencies
├── README_Phase3.md          # This file
├── evaluation_summary.png    # Latest run visualization
├── c_reports/                # PDF reports
└── results/                  # All run data, logs, and visualizations

🛠️ Installation

Install Dependencies:
```
pip install -r requirements.txt
```
Ensure Ollama is Running:
```
ollama serve
```
Verify Model Availability:
```
ollama list
```

🧪 Usage

Basic Enhanced Agent Test

```
python test_enhanced_agent.py
```

Custom Evaluation

```python
from src.agent import OllamaProvider, AndroidWorldAgent
from src.evaluation import EvaluationAnalyzer

# Initialize agent with enhanced features
provider = OllamaProvider(model="gemma3:12b-it-qat")
agent = AndroidWorldAgent(
    provider,
    prompt_template="enhanced",  # Use few-shot examples
    enable_reflection=True       # Enable self-reflection
)

# Run evaluation
analyzer = EvaluationAnalyzer()
# ... add episode results ...
report = analyzer.generate_report()
```

📊 Evaluation Metrics

Basic Metrics

- Step Accuracy: Percentage of correct individual actions
- Episode Success Rate: Percentage of fully completed tasks
- Average Steps per Episode: Efficiency measurement
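
The three basic metrics can be computed from per-step correctness flags. The sketch below is not the code in `src/evaluation.py`; `basic_metrics` is a hypothetical helper, the sample data is made up, and it assumes an episode counts as successful only when every one of its steps is correct.

```python
from typing import Dict, List


def basic_metrics(episodes: Dict[str, List[bool]]) -> Dict[str, float]:
    """Compute basic metrics from per-step correctness flags.

    Assumption: an episode is successful only if all of its steps are correct.
    """
    n_episodes = len(episodes)
    total_steps = sum(len(steps) for steps in episodes.values())
    correct_steps = sum(sum(steps) for steps in episodes.values())
    successes = sum(1 for steps in episodes.values() if steps and all(steps))
    return {
        "step_accuracy": correct_steps / total_steps if total_steps else 0.0,
        "episode_success_rate": successes / n_episodes if n_episodes else 0.0,
        "avg_steps_per_episode": total_steps / n_episodes if n_episodes else 0.0,
    }


# Illustrative data only, not taken from the benchmark logs.
example = {
    "take_photo": [True, True],
    "send_message": [True, False, False],
    "check_battery": [False],
}
metrics = basic_metrics(example)
print(metrics)
```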

Advanced Metrics

- Task-specific Accuracy: Performance by task type
- App-specific Accuracy: Performance by application
- Error Pattern Analysis: Common mistake categories
- Response Time Analysis: Performance timing
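
Task-specific and app-specific accuracy are the same computation grouped by a different key. A minimal sketch, assuming each step is recorded as a (task, app, was_correct) tuple; the record layout, the `accuracy_by` helper, and the sample rows are hypothetical:

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Group (key, was_correct) pairs by key and return accuracy per key."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for key, was_correct in records:
        totals[key] += 1
        correct[key] += int(was_correct)
    return {key: correct[key] / totals[key] for key in totals}


# Hypothetical step records: (task, app, was_correct).
steps = [
    ("uninstall_slack", "Apps", False),
    ("uninstall_slack", "Apps", True),
    ("send_message", "Chat", True),
    ("send_message", "Chat", False),
    ("send_message", "Chat", False),
]

by_task = accuracy_by((task, ok) for task, _app, ok in steps)
by_app = accuracy_by((app, ok) for _task, app, ok in steps)
print(by_task)
print(by_app)
```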

Self-Reflection Metrics

- Reflection Quality: Depth of self-analysis
- Learning Improvement: Performance improvement over time
- Decision Confidence: Certainty in predictions

🎯 Prompting Strategies

1. Enhanced (Few-shot)

```python
agent = AndroidWorldAgent(provider, prompt_template="enhanced")
```

- Includes four pre-defined examples
- Shows the reasoning behind each example
- Encourages structured, step-by-step thinking
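
To illustrate what a few-shot prompt of this kind might look like, here is a sketch that prepends worked examples (goal, visible UI elements, reasoning, action) to the current step. It is not the template in `src/prompts.py`; the `Example` type, the `build_few_shot_prompt` function, and the example content are all hypothetical.

```python
from typing import List, NamedTuple, Sequence


class Example(NamedTuple):
    goal: str
    ui_elements: List[str]
    reasoning: str
    action: str


# Hypothetical worked example, written for illustration only.
EXAMPLES = (
    Example(
        goal="Uninstall the Slack app",
        ui_elements=["Slack", "Chrome", "Settings"],
        reasoning="The goal names Slack, and a Slack element is visible.",
        action='CLICK("Slack")',
    ),
)


def build_few_shot_prompt(
    goal: str, ui_elements: List[str], examples: Sequence[Example] = EXAMPLES
) -> str:
    """Render the worked examples first, then the step the model must solve."""
    parts = []
    for i, ex in enumerate(examples, start=1):
        parts.append(
            f"Example {i}\n"
            f"Goal: {ex.goal}\n"
            f"UI elements: {', '.join(ex.ui_elements)}\n"
            f"Reasoning: {ex.reasoning}\n"
            f"Action: {ex.action}"
        )
    parts.append(
        "Now solve this step.\n"
        f"Goal: {goal}\n"
        f"UI elements: {', '.join(ui_elements)}\n"
        "Reasoning:"
    )
    return "\n\n".join(parts)


prompt = build_few_shot_prompt("Send a message to John", ["Text Input", "Send"])
print(prompt)
```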

2. Chain-of-Thought (CoT)

```python
agent = AndroidWorldAgent(provider, prompt_template="cot")
```

- Explicit reasoning steps
- Goal analysis → Context understanding → Strategy → Action
- More detailed thought process
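
The four-stage flow can be turned into a prompt skeleton that asks the model to fill in each stage before committing to an action. This is a sketch only: `build_cot_prompt` and the exact wording are assumptions, not the template shipped in `src/prompts.py`.

```python
from typing import List

# The four stages named above, in order.
COT_STAGES = ["Goal analysis", "Context understanding", "Strategy", "Action"]


def build_cot_prompt(goal: str, ui_elements: List[str]) -> str:
    """Ask the model to work through each stage, ending with the action."""
    stage_lines = "\n".join(
        f"{i}. {stage}:" for i, stage in enumerate(COT_STAGES, start=1)
    )
    return (
        f"Goal: {goal}\n"
        f"UI elements: {', '.join(ui_elements)}\n\n"
        "Think step by step and fill in each stage:\n"
        f"{stage_lines}"
    )


cot_prompt = build_cot_prompt("Send a message to John", ["Text Input", "Send"])
print(cot_prompt)
```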

3. Simple (Basic)

```python
agent = AndroidWorldAgent(provider, prompt_template="simple")
```

- Basic prompt without examples
- Minimal guidance
- Serves as the baseline for comparison

📈 Visualization

The evaluation system automatically generates:

- Accuracy Comparison Charts: Task and app-specific performance
- Error Pattern Distribution: Pie chart of common mistakes
- Overall Performance Metrics: Bar charts of key indicators

🔍 Error Analysis

Error Categories

- Wrong Element Clicked: Correct action type, wrong target
- Wrong Text Typed: Incorrect text input
- Wrong Action Type: Click vs Type confusion
- Format Error: Malformed response
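
One way these categories could be assigned is by parsing the predicted and ground-truth action strings and comparing them field by field. The sketch below assumes the `CLICK("target")` / `TYPE("target", "text")` format shown in the sample errors; `parse_action` and `classify_error` are hypothetical helpers, and treating every TYPE/TYPE mismatch as "Wrong Text Typed" is an assumption.

```python
import re
from typing import Optional, Tuple

# Matches CLICK("target") or TYPE("target", "text").
ACTION_RE = re.compile(r'(CLICK|TYPE)\("([^"]*)"(?:,\s*"([^"]*)")?\)')

Action = Tuple[str, str, Optional[str]]


def parse_action(action: str) -> Optional[Action]:
    """Return (kind, target, text), or None if the string is malformed."""
    match = ACTION_RE.fullmatch(action.strip())
    if match is None:
        return None
    kind, target, text = match.groups()
    # CLICK takes exactly one argument, TYPE exactly two.
    if (kind == "CLICK") != (text is None):
        return None
    return kind, target, text


def classify_error(predicted: str, ground_truth: str) -> Optional[str]:
    """Return the error category, or None when the prediction is correct."""
    truth = parse_action(ground_truth)
    if truth is None:
        raise ValueError(f"Unparseable ground truth: {ground_truth!r}")
    pred = parse_action(predicted)
    if pred is None:
        return "Format Error"
    if pred == truth:
        return None
    if pred[0] != truth[0]:
        return "Wrong Action Type"
    if pred[0] == "CLICK":
        return "Wrong Element Clicked"
    # Assumption: any mismatch between two TYPE actions counts as wrong text.
    return "Wrong Text Typed"


print(classify_error('CLICK("Unknown")', 'CLICK("Slack")'))
print(classify_error('TYPE("Text Input", "Hello John")', 'CLICK("Send")'))
```

Applied to the three sample errors, this yields "Wrong Element Clicked", "Wrong Text Typed", and "Wrong Action Type" respectively.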

📝 Self-Reflection Example

```python
# Enable reflection
agent = AndroidWorldAgent(provider, enable_reflection=True)

# After running an episode
for reflection in agent.reflection_history:
    print(f"Step {reflection['step_index']}:")
    print(f"Correct: {reflection['was_correct']}")
    print(f"Reflection: {reflection['reflection']}")
```
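
The reflection entries can also be summarized rather than printed one by one. The following sketch works on a plain list of dicts with the three keys used above (`step_index`, `was_correct`, `reflection`); the sample entries and the `summarize_reflections` helper are made up for illustration.

```python
# Illustrative entries using the keys shown above; the values are made up.
reflection_history = [
    {"step_index": 0, "was_correct": True,
     "reflection": "Opened the app list as intended."},
    {"step_index": 1, "was_correct": False,
     "reflection": "Picked an unlabeled element instead of Slack."},
    {"step_index": 2, "was_correct": True,
     "reflection": "Confirmed the uninstall dialog."},
]


def summarize_reflections(history):
    """Return (fraction of correct steps, entries for the incorrect steps)."""
    if not history:
        return 0.0, []
    mistakes = [entry for entry in history if not entry["was_correct"]]
    accuracy = 1 - len(mistakes) / len(history)
    return accuracy, mistakes


accuracy, mistakes = summarize_reflections(reflection_history)
print(f"Correct steps: {accuracy:.0%}")
for entry in mistakes:
    print(f"Step {entry['step_index']}: {entry['reflection']}")
```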

🎛️ Configuration Options

Agent Configuration

```python
agent = AndroidWorldAgent(
    llm_provider=provider,
    prompt_template="enhanced",    # "enhanced", "cot", "simple"
    enable_reflection=True         # True/False
)
```

Evaluation Configuration

```python
analyzer = EvaluationAnalyzer()
analyzer.add_episode_result(result)
analyzer.generate_report("report.md")
analyzer.create_visualizations("plots/")
```

🚀 Next Steps

- Scale Testing: Run on larger episode datasets
- Model Comparison: Test different Ollama models
- Prompt Engineering: Optimize few-shot examples
- Real Android World: Integrate with actual benchmark data
- Performance Optimization: Improve response times
