An intelligent web automation agent that uses vision models and reasoning agents to perform tasks on web applications.
The system consists of four main components:
- Web Agent (
agents/web_agent.py) - Playwright-based browser controller - Vision Agent (
agents/vision_agent.py) - Qwen3-VL visual interpreter via Ollama - Reasoning Agent (
agents/reasoning_agent.py) - Cloud LLM task planner via Ollama - Orchestrator (
agents/orchestrator.py) - Pipeline controller coordinating the flow
pip install -r requirements.txt
playwright install chromiumEnsure Ollama is running and you have the required models:
# Pull vision model
ollama pull qwen3-vl:4b
# Pull reasoning model
ollama pull deepseek-v3.1:671b-cloudEdit configs/settings.yaml to adjust:
- Model names
- API endpoints
- Browser settings
- Timeouts
python main.py --task "Create a project in Linear" --url "https://linear.app"python main.py \
--task "Create a project in Linear" \
--url "https://linear.app" \
--task-name "linear_create_project"--task: Natural language task description (required)--url: Initial URL to start from (optional)--task-name: Custom task name for data storage (optional)--config: Path to configuration file (default:configs/settings.yaml)
- Task Description → Reasoning Agent plans initial action
- Web Agent → Opens browser, navigates, captures screenshot
- Vision Agent → Analyzes screenshot, returns structured UI description
- Reasoning Agent → Plans next action based on UI state
- Web Agent → Executes action
- State Manager → Saves step data
- Loop continues until task is complete or max steps reached
All task data is stored in data/{task_name}/:
step_XX.png- Screenshots for each stepmetadata.json- Complete execution history with vision/reasoning data
data/linear_create_project/
├── step_00.png
├── step_01.png
├── step_02_modal.png
└── metadata.json
The metadata.json contains:
- Task information
- Step-by-step execution history
- Vision agent descriptions
- Reasoning agent decisions
- DOM tree snapshots
Provides browser automation:
navigate(url)- Navigate to URLcapture_screenshot(path)- Save screenshotclick(text/selector)- Click elementfill(field, value)- Fill form fieldwait_for_modal(text)- Wait for modalget_dom_tree()- Extract DOM structure
Analyzes screenshots:
- Takes screenshot path
- Returns structured JSON with:
- Title
- Buttons
- Form fields
- Links
- Text content
- Layout description
Plans actions:
- Receives task description + vision state
- Returns structured action:
action: click, fill, navigate, wait, donetarget: element identifiervalue: value for fill actionsconfidence: confidence scoredone: completion flag
Main execution loop:
- Coordinates all agents
- Manages step iteration
- Handles errors
- Saves state
See configs/settings.yaml for:
- Model configurations
- API endpoints
- Browser settings
- Timeouts and limits
- Logging configuration
The agent supports human-in-the-loop functionality, allowing you to manually intervene when needed (e.g., entering email addresses, passwords, or other sensitive information).
-
Automatic Pausing: When the agent detects it needs to fill sensitive fields (email, password, username), it will automatically pause and wait for manual input.
-
Manual Typing: During the pause, you can:
- Click on the browser window
- Type your email, password, or any other information manually
- The browser is fully interactive during this time
-
Configuration: In
configs/settings.yaml:task: manual_intervention: enabled: true # Enable/disable manual intervention wait_on_fill: true # Pause when filling sensitive fields wait_time: 30 # Seconds to wait for manual input
-
Example Flow:
- Agent navigates to login page
- Agent identifies email field
- Agent pauses → You manually type your email
- Agent continues and fills password (or you can type that too)
- Agent proceeds with the task
- The browser window stays open and interactive during execution
- You can manually click, type, or interact at any time
- The agent will continue after the wait time expires
- For best results, type during the pause period when prompted
- Ollama connection errors: Ensure Ollama is running on
localhost:11434 - Model not found: Pull required models using
ollama pull <model-name> - Browser errors: Run
playwright install chromium - Timeout errors: Increase timeout values in
configs/settings.yaml - 405 Method Not Allowed in Chrome: This is normal - Ollama API requires POST requests, not GET. The browser will show 405, but the agent works fine.