Detailed documentation for the agent frameworks used to evaluate Next-Gen CAPTCHAs. For a quick start, see the README.
- POMDP Model
- Browser-Use CLI (`browseruse_cli.py`)
- Extended CLI (`browseruse_extended_cli.py`)
- CrewAI CLI (`crewai_cli.py`)
- Framework Comparison: Browser-Use vs CrewAI

We model the VLM web agent as an extended POMDP (Partially Observable Markov Decision Process):
Note: The extended CLI supports additional components (`A_api`, separate `U_mem`/`U_think`) that are not part of the paper's core formulation.

Below is how each theoretical component maps to our browser-use implementation, with corresponding code variable names.

| Theory | Description | Code Variable |
|---|---|---|
| | Screenshot | `browser_state.screenshot` |
| | DOM tree | `page.evaluate()` extracts `window.currentPuzzle` |
| | URL, viewport | `args.url`, `browser_kwargs['viewport']` |

| Theory | Description | Code Variable |
|---|---|---|
| | Procedural memory | `MemoryConfig(memory_interval=N)`, `enable_memory` |
| | Thought trace | `agent_output` in step callback, `use_thinking=True` |

| Theory | Description | Code Variable |
|---|---|---|
| | Browser actions | Default browser-use actions: `click`, `scroll`, `type`, `drag` |
| | External API calls (optional, not in paper) | Extended CLI only: `http_get`, `http_post` |
| | Planner reasoning | `planner_llm`, `planner_interval`, `is_planner_reasoning` |

| Theory | Description | Code Variable |
|---|---|---|
| | Policy (action selection) | `llm` param to `Agent()`, created by `llm_factories[llm_name](args)` |
| | World transition | Playwright browser via `Browser(**browser_kwargs)` |
| | State update | `memory_interval` in `MemoryConfig`, agent state updates |
| | Reward signal | `detect_result_from_page()` returns `is_correct` |
| | Deliberation cost | `usage` dict: `input_tokens`, `output_tokens`, `reasoning_tokens` |
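
The deliberation-cost row can be made concrete with a short sketch. It assumes each agent step reports a `usage` dict with the three keys named in the table; the helper name `total_deliberation_cost` and the sample numbers are hypothetical and not taken from the CLI code.

```python
from typing import Iterable, Mapping

# Keys as listed in the mapping table; everything else is illustrative.
USAGE_KEYS = ("input_tokens", "output_tokens", "reasoning_tokens")


def total_deliberation_cost(step_usages: Iterable[Mapping[str, int]]) -> dict:
    """Sum per-step token usage into one episode-level total (hypothetical helper)."""
    totals = {key: 0 for key in USAGE_KEYS}
    for usage in step_usages:
        for key in USAGE_KEYS:
            # A provider may omit a counter or report None; treat both as zero.
            totals[key] += int(usage.get(key) or 0)
    return totals


if __name__ == "__main__":
    # Made-up usage records for two steps of one puzzle.
    steps = [
        {"input_tokens": 1200, "output_tokens": 80},
        {"input_tokens": 1350, "output_tokens": 95, "reasoning_tokens": 400},
    ]
    print(total_deliberation_cost(steps))
```

Keeping the three counters separate, rather than collapsing them into one number, leaves room to weight reasoning tokens differently when comparing models.
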

| Component | Basic CLI | Extended CLI | Dependencies |
|---|---|---|---|
| Observations | Screenshot + DOM | Same | - |
| Memory | Default | Custom interval | mem0 |
| Thought | Built-in | Planner LLM | - |
| Web Actions | All | Same | - |
| API Actions | Extensible | Optional extension | aiohttp |
| Cost Tracking | Token logging | Same | - |

`browseruse_cli.py` runs a standard browser-use agent with:
- Screenshot + DOM observations
- Web actions (click, type, scroll, drag)
- LLM-based policy
- Token cost tracking
# Run with OpenAI
uv run agent_frameworks/browseruse_cli.py --llm openai --model gpt-4o
# Run with Anthropic
uv run agent_frameworks/browseruse_cli.py --llm anthropic --model claude-sonnet-4-20250514
# Run with Google
uv run agent_frameworks/browseruse_cli.py --llm google --model gemini-2.0-flash

The extended CLI (`browseruse_extended_cli.py`) adds POMDP-aligned features:

| Feature | Flag | POMDP Component |
|---|---|---|
| Procedural Memory | `--procedural-memory-interval N` | `U_mem` (memory consolidation) |
| Disable Memory | `--disable-procedural-memory` | Disable `m_t` updates |
| Planner LLM | `--enable-planner` | `A_think` (separate reasoning) |
| Planner Model | `--planner-model MODEL` | Different model for planning |
| Planner Interval | `--planner-interval N` | Plan every N steps |
| Planner Reasoning | `--planner-reasoning` | Extended reasoning format |
| API Actions | `--enable-api-actions` | `A_api` (external HTTP calls) |
| API Timeout | `--api-timeout N` | Timeout for API calls |
| API Domains | `--api-allowed-domains` | Security: restrict allowed domains |
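
How `--api-allowed-domains` is enforced is not described here. One plausible check is sketched below, assuming the flag takes a comma-separated list of hostnames, that matching is exact (subdomains are not implicitly allowed), and that an empty list means no restriction. The function names `parse_allowed_domains` and `is_url_allowed` are hypothetical.

```python
from urllib.parse import urlparse


def parse_allowed_domains(raw: str) -> frozenset:
    """Turn 'a.com, b.com' into a normalized set of hostnames."""
    return frozenset(part.strip().lower() for part in raw.split(",") if part.strip())


def is_url_allowed(url: str, allowed: frozenset) -> bool:
    """Return True if `url` may be called under the allow-list (illustrative policy)."""
    parsed = urlparse(url)
    # Only plain web requests are considered; other schemes are rejected outright.
    if parsed.scheme not in ("http", "https"):
        return False
    # Assumption: an empty allow-list means the restriction is switched off.
    if not allowed:
        return True
    host = (parsed.hostname or "").lower()
    # Exact hostname match, so 'api.example.com.evil.org' does not pass.
    return host in allowed


if __name__ == "__main__":
    allowed = parse_allowed_domains("api.example.com,api.openai.com")
    for candidate in ("https://api.example.com/v1/items", "https://evil.example.org/"):
        print(candidate, is_url_allowed(candidate, allowed))
```

Matching on the parsed hostname rather than on a substring of the URL avoids the common mistake of accepting an allowed domain that merely appears somewhere in the path or query string.
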
# With procedural memory (consolidates every 5 steps)
uv run agent_frameworks/browseruse_extended_cli.py \
--llm openai --model gpt-4o \
--procedural-memory-interval 5
# With separate planner (uses cheaper model for planning)
uv run agent_frameworks/browseruse_extended_cli.py \
--llm openai --model gpt-4o \
--enable-planner --planner-model gpt-4o-mini
# With API actions enabled (allows agent to call external HTTP APIs)
uv run agent_frameworks/browseruse_extended_cli.py \
--llm openai --model gpt-4o \
--enable-api-actions
# With API actions and domain restrictions (security)
uv run agent_frameworks/browseruse_extended_cli.py \
--llm openai --model gpt-4o \
--enable-api-actions --api-allowed-domains "api.example.com,api.openai.com"
# Full configuration with all features
uv run agent_frameworks/browseruse_extended_cli.py \
--llm openai --model gpt-4o \
--procedural-memory-interval 5 \
--enable-planner --planner-model gpt-4o-mini --planner-interval 3 \
--enable-api-actions --api-timeout 60

# Memory features require mem0 and sentence-transformers
pip install mem0 sentence-transformers
# API actions require aiohttp
pip install aiohttp
# Or with uv
uv add mem0 sentence-transformers aiohttp

If a dependency is not installed, the corresponding feature is disabled gracefully with a warning.
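
A minimal sketch of this kind of graceful degradation follows. It is not the CLI's actual code; the helper name `feature_available` is hypothetical, and the import names `mem0`, `sentence_transformers`, and `aiohttp` are assumed to correspond to the packages listed above.

```python
import importlib.util
import warnings


def feature_available(feature: str, modules: tuple) -> bool:
    """Return True if every top-level module is importable; otherwise warn and return False."""
    # find_spec returns None for a missing top-level module without importing it.
    missing = [name for name in modules if importlib.util.find_spec(name) is None]
    if missing:
        warnings.warn(
            f"{feature} disabled: missing optional package(s): {', '.join(missing)}",
            RuntimeWarning,
            stacklevel=2,
        )
        return False
    return True


if __name__ == "__main__":
    # Assumed import names for the optional dependencies.
    memory_enabled = feature_available("Procedural memory", ("mem0", "sentence_transformers"))
    api_enabled = feature_available("API actions", ("aiohttp",))
    print({"memory": memory_enabled, "api_actions": api_enabled})
```

Checking availability once at startup keeps the rest of the run free of scattered `try`/`except ImportError` blocks.
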
An alternative agent framework using CrewAI for benchmarking comparisons. This implementation mirrors browseruse_cli.py functionality for fair performance testing.
Key Differences from Browser-Use:
| Feature | browseruse_cli.py | crewai_cli.py |
|---|---|---|
| Browser Control | Direct Playwright via browser-use Agent | CrewAI's BrowserTool abstraction |
| Agent Architecture | Single agent with built-in reasoning | CrewAI Agent + Crew orchestration |
| Step Callbacks | Custom step hooks with screenshots | CrewAI's built-in step_callback |
| Token Capture | HTTP interception for all providers | Same HTTP interception approach |
Supported LLM Providers:

- `openai` - GPT-4o, GPT-5, etc.
- `anthropic` - Claude models
- `google` - Gemini models
- `groq` - Groq-hosted models
- `azure-openai` - Azure OpenAI Service
- `vllm` - Self-hosted vLLM inference
- `qwen` - Alibaba Qwen models (via DashScope)
- `doubao` - ByteDance Doubao models

Basic Usage:
# Run with OpenAI
uv run agent_frameworks/crewai_cli.py \
--url http://127.0.0.1:7860 \
--llm openai --model gpt-4o
# Run with Anthropic Claude
uv run agent_frameworks/crewai_cli.py \
--url http://127.0.0.1:7860 \
--llm anthropic --model claude-sonnet-4-20250514
# Run with Google Gemini (thinking model)
uv run agent_frameworks/crewai_cli.py \
--url http://127.0.0.1:7860 \
--llm google --model gemini-2.5-pro-preview-05-06 \
--max-output-tokens 32768 --thinking-budget 16384
# Run with vLLM (self-hosted)
uv run agent_frameworks/crewai_cli.py \
--url http://127.0.0.1:7860 \
--llm vllm --model Qwen/Qwen3-VL-8B-Thinking \
--base-url http://localhost:8000/v1 \
--max-output-tokens 65536
# Run with Qwen (thinking mode enabled)
uv run agent_frameworks/crewai_cli.py \
--url http://127.0.0.1:7860 \
--llm qwen --model qwen-vl-max-latest

CLI Arguments:
uv run agent_frameworks/crewai_cli.py [OPTIONS]
Required:
--url URL Target URL (e.g., http://127.0.0.1:7860)
LLM Provider:
--llm PROVIDER openai, anthropic, google, groq, azure-openai, vllm, qwen, doubao
--model MODEL Model name (provider-specific)
--base-url URL API base URL (vLLM, Azure)
--api-key KEY API key override
Model Parameters:
--temperature FLOAT Sampling temperature (default: provider-specific)
--reasoning-effort LVL OpenAI: none, low, medium, high, xhigh
--max-output-tokens N Max output tokens (Gemini, vLLM)
--thinking-budget N Gemini 2.5 thinking budget
--thinking-level LVL Gemini 3 thinking: minimal, low, medium, high
--disable-thinking Disable thinking for vLLM Qwen models
Execution:
--max-steps N Max steps per puzzle (default: 1000)
--max-actions-per-step Max actions per step (default: 10)
--max-failures N Max consecutive failures (default: 5)
--llm-timeout SECS LLM request timeout (default: 1800)
--step-timeout SECS Step execution timeout (default: 1800)
--isolate-puzzles Fresh agent per puzzle (no memory)
--headless Run browser headless
Logging:
--no-log-llm Disable LLM logging
--llm-log-dir DIR Log directory (default: llm_logs)
--run-id ID Shared run ID for logs
--debug-vllm Debug output for vLLM

Use this guide to choose the right framework for your benchmarking needs:

| Use Case | Recommended Framework | Reason |
|---|---|---|
| Production benchmarking | browseruse_cli.py | More mature, direct Playwright control |
| CrewAI performance testing | crewai_cli.py | Direct comparison with browseruse |
| Custom step callbacks | browseruse_cli.py | Better step-level hooks |
| Multi-agent scenarios | crewai_cli.py | CrewAI's orchestration features |
| Token cost analysis | Both | Same HTTP interception approach |
Running Comparative Benchmarks:
# Test the same puzzle with both frameworks
# Browser-Use framework
./test_benchmark.sh --llm openai --model gpt-4o \
--puzzles 'Dice_Roll_Path:5' --isolate-puzzles --seed 0 --headless
# CrewAI framework (run directly)
uv run agent_frameworks/crewai_cli.py \
--url http://127.0.0.1:7860 \
--llm openai --model gpt-4o \
--isolate-puzzles --headless

Both frameworks produce compatible output:

- `benchmark_results_*.json` - Results compatible with analysis tools
- `llm_logs/` - Per-puzzle logs with token usage
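
To collect the result files from both runs for comparison, something like the following sketch could be used. It relies only on the `benchmark_results_*.json` naming pattern above and makes no assumption about the JSON schema; the helper names `find_result_files` and `load_results` are hypothetical.

```python
import json
from pathlib import Path


def find_result_files(root: str = ".") -> list:
    """Return benchmark result files directly under `root`, sorted by path."""
    return sorted(Path(root).glob("benchmark_results_*.json"))


def load_results(path: Path):
    """Parse one result file; the structure of the returned object is not assumed."""
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    for result_path in find_result_files("."):
        data = load_results(result_path)
        # Report only the file name and top-level JSON type, since the schema is unknown here.
        print(result_path.name, type(data).__name__)
```
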