UI Capture Agent

A web automation agent that combines a vision model and a reasoning LLM to carry out natural-language tasks in web applications, saving a screenshot and metadata for each step.

Architecture

The system consists of four main components:

Web Agent (agents/web_agent.py) - Playwright-based browser controller
Vision Agent (agents/vision_agent.py) - Qwen3-VL visual interpreter via Ollama
Reasoning Agent (agents/reasoning_agent.py) - Cloud LLM task planner via Ollama
Orchestrator (agents/orchestrator.py) - Pipeline controller coordinating the flow

Setup

1. Install Dependencies

pip install -r requirements.txt
playwright install chromium

2. Set Up Ollama

Make sure Ollama is running and that the required models have been pulled:

# Pull vision model
ollama pull qwen3-vl:4b

# Pull reasoning model
ollama pull deepseek-v3.1:671b-cloud

3. Configure Settings

Edit configs/settings.yaml to adjust:

Model names
API endpoints
Browser settings
Timeouts

Usage

Basic Usage

python main.py --task "Create a project in Linear" --url "https://linear.app"

With Custom Task Name

python main.py \
  --task "Create a project in Linear" \
  --url "https://linear.app" \
  --task-name "linear_create_project"

Command Line Arguments

--task: Natural language task description (required)
--url: Initial URL to start from (optional)
--task-name: Custom task name for data storage (optional)
--config: Path to configuration file (default: configs/settings.yaml)
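
The arguments listed above could be declared with argparse roughly as follows. This is a minimal sketch, not the contents of main.py: the function name build_parser and the help strings are assumptions, and only the flag names, the required/optional status, and the --config default come from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="UI Capture Agent")
    parser.add_argument("--task", required=True,
                        help="Natural language task description")
    parser.add_argument("--url", default=None,
                        help="Initial URL to start from")
    parser.add_argument("--task-name", default=None,
                        help="Custom task name for data storage")
    parser.add_argument("--config", default="configs/settings.yaml",
                        help="Path to configuration file")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(
    ["--task", "Create a project in Linear", "--url", "https://linear.app"]
)
print(args.task, args.url, args.task_name, args.config)
```

argparse converts --task-name to the attribute args.task_name, so the optional flags can be checked against None to decide whether a default should be used.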

Data Flow

Task Description → Reasoning Agent plans initial action
Web Agent → Opens browser, navigates, captures screenshot
Vision Agent → Analyzes screenshot, returns structured UI description
Reasoning Agent → Plans next action based on UI state
Web Agent → Executes action
State Manager → Saves step data
The capture, analyze, plan, execute, and save steps repeat until the task is complete or the maximum number of steps is reached
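
The flow above can be sketched as a plain loop. This is an illustration under stated assumptions rather than the code in agents/orchestrator.py: the function run_loop, its callable parameters, the record keys, and the max_steps default are all hypothetical, and the agents are replaced by stubs so the example runs without a browser or Ollama.

```python
from typing import Callable, Dict, List


def run_loop(task: str,
             capture: Callable[[int], str],
             describe: Callable[[str], Dict],
             plan: Callable[[str, Dict], Dict],
             execute: Callable[[Dict], None],
             save: Callable[[Dict], None],
             max_steps: int = 10) -> List[Dict]:
    """Capture, describe, plan, execute, save; stop on done or max_steps."""
    history: List[Dict] = []
    for step in range(max_steps):
        screenshot = capture(step)          # Web Agent: screenshot path
        ui_state = describe(screenshot)     # Vision Agent: structured UI
        action = plan(task, ui_state)       # Reasoning Agent: next action
        finished = bool(action.get("done")) or action.get("action") == "done"
        if not finished:
            execute(action)                 # Web Agent: perform the action
        record = {"step": step, "screenshot": screenshot,
                  "ui_state": ui_state, "action": action}
        save(record)                        # State Manager: persist the step
        history.append(record)
        if finished:
            break
    return history


# Stub agents: two scripted decisions, the second one ends the task.
scripted = iter([
    {"action": "click", "target": "New project"},
    {"action": "done", "done": True},
])
executed: List[Dict] = []
history = run_loop(
    task="Create a project in Linear",
    capture=lambda step: f"step_{step:02d}.png",
    describe=lambda path: {"title": "stub page", "buttons": []},
    plan=lambda task, ui_state: next(scripted),
    execute=executed.append,
    save=lambda record: None,
)
print(len(history), "steps recorded;", len(executed), "action executed")
```

Passing the agents in as callables keeps the loop testable in isolation; the real components would be substituted for the lambdas.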

Output

All task data is stored in data/{task_name}/:

step_XX.png - One screenshot per step, where XX is the zero-padded step number (a descriptive suffix may follow, as in step_02_modal.png)
metadata.json - Complete execution history with vision/reasoning data

Example Output Structure

data/linear_create_project/
├── step_00.png
├── step_01.png
├── step_02_modal.png
└── metadata.json

The metadata.json contains:

Task information
Step-by-step execution history
Vision agent descriptions
Reasoning agent decisions
DOM tree snapshots
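
A finished run can be inspected with a few lines of Python. The sketch below assumes only the layout shown above (step_*.png files next to a metadata.json); the helper name load_task_output is hypothetical, and because the internal schema of metadata.json is not documented here, the file is returned as generic JSON without assuming any key names.

```python
import json
from pathlib import Path
from typing import Any, List, Tuple


def load_task_output(task_dir) -> Tuple[List[str], Any]:
    """Return the sorted screenshot file names and the parsed metadata.json."""
    task_dir = Path(task_dir)
    # Matches step_00.png as well as suffixed names such as step_02_modal.png.
    screenshots = sorted(path.name for path in task_dir.glob("step_*.png"))
    with open(task_dir / "metadata.json", encoding="utf-8") as handle:
        metadata = json.load(handle)
    return screenshots, metadata
```

For the example structure above, load_task_output("data/linear_create_project") would list the three PNG files in step order alongside whatever metadata.json contains.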

Components

Web Agent

Provides browser automation:

navigate(url) - Navigate to URL
capture_screenshot(path) - Save screenshot
click(text/selector) - Click element
fill(field, value) - Fill form field
wait_for_modal(text) - Wait for modal
get_dom_tree() - Extract DOM structure

Vision Agent

Analyzes screenshots:

Takes screenshot path
Returns structured JSON with:
- Title
- Buttons
- Form fields
- Links
- Text content
- Layout description
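
Model replies often wrap JSON in extra prose, so a tolerant parser helps when consuming the description. The following is a sketch only: the key names (title, buttons, form_fields, links, text_content, layout) are assumptions derived from the list above, not the confirmed schema of agents/vision_agent.py, and parse_vision_reply is a hypothetical helper.

```python
import copy
import json

# Assumed key names and their fallback values; the real schema may differ.
EXPECTED_KEYS = {
    "title": "",
    "buttons": [],
    "form_fields": [],
    "links": [],
    "text_content": "",
    "layout": "",
}


def parse_vision_reply(reply: str) -> dict:
    """Extract the outermost JSON object from a reply and fill missing keys."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model reply")
    data = json.loads(reply[start:end + 1])
    # Copy defaults so callers never share the module-level empty lists.
    return {key: data.get(key, copy.copy(default))
            for key, default in EXPECTED_KEYS.items()}


reply = 'Here is the page: {"title": "Login", "buttons": ["Sign in"]}'
state = parse_vision_reply(reply)
print(state["title"], state["buttons"], state["links"])
```

Filling in defaults means the reasoning step can rely on every key being present even when the model omits a section.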

Reasoning Agent

Plans actions:

Receives task description + vision state
Returns structured action:
- action: click, fill, navigate, wait, done
- target: element identifier
- value: value for fill actions
- confidence: confidence score
- done: completion flag
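
The action fields above map naturally onto a small data class with validation. This is a minimal sketch, assuming the planner's reply has already been parsed into a dict; the names Action, ALLOWED_ACTIONS, and parse_action are hypothetical, and the rule that fill requires a value is an assumption inferred from the field description.

```python
from dataclasses import dataclass
from typing import Optional

ALLOWED_ACTIONS = {"click", "fill", "navigate", "wait", "done"}


@dataclass
class Action:
    action: str
    target: Optional[str] = None
    value: Optional[str] = None
    confidence: float = 0.0
    done: bool = False


def parse_action(raw: dict) -> Action:
    """Validate a raw planner reply and convert it into an Action."""
    name = raw.get("action")
    if name not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {name!r}")
    if name == "fill" and raw.get("value") is None:
        raise ValueError("fill actions need a value")
    return Action(
        action=name,
        target=raw.get("target"),
        value=raw.get("value"),
        confidence=float(raw.get("confidence", 0.0)),
        # Treat the "done" action as completion even if the flag is missing.
        done=bool(raw.get("done", False)) or name == "done",
    )


step = parse_action({"action": "fill", "target": "Project name",
                     "value": "Roadmap", "confidence": 0.9})
print(step)
```

Rejecting unknown actions early keeps a malformed model reply from reaching the browser controller.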

Orchestrator

Main execution loop:

Coordinates all agents
Manages step iteration
Handles errors
Saves state

Configuration

See configs/settings.yaml for:

Model configurations
API endpoints
Browser settings
Timeouts and limits
Logging configuration

Manual Intervention

The agent supports human-in-the-loop functionality, allowing you to manually intervene when needed (e.g., entering email addresses, passwords, or other sensitive information).

How It Works

Automatic Pausing: When the agent detects it needs to fill sensitive fields (email, password, username), it will automatically pause and wait for manual input.
Manual Typing: The browser stays fully interactive during the pause, so you can:
- Click on the browser window
- Type your email, password, or any other information manually

Configuration: In configs/settings.yaml:

task:
  manual_intervention:
    enabled: true        # Enable/disable manual intervention
    wait_on_fill: true  # Pause when filling sensitive fields
    wait_time: 30       # Seconds to wait for manual input

Example Flow:
- Agent navigates to login page
- Agent identifies email field
- Agent pauses → You manually type your email
- Agent continues and fills password (or you can type that too)
- Agent proceeds with the task
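
The pause decision described above could look roughly like this. It is a sketch under assumptions, not the project's implementation: the settings dict mirrors the YAML keys shown earlier, while SENSITIVE_KEYWORDS, should_pause, pause_for_manual_input, and the substring match on the target are hypothetical choices made for illustration.

```python
import time
from typing import Callable

# Field types named in this section; matching by substring is an assumption.
SENSITIVE_KEYWORDS = ("email", "password", "username")


def should_pause(action: dict, settings: dict) -> bool:
    """Pause only for fill actions on sensitive fields when enabled."""
    cfg = settings.get("task", {}).get("manual_intervention", {})
    if not (cfg.get("enabled") and cfg.get("wait_on_fill")):
        return False
    if action.get("action") != "fill":
        return False
    target = str(action.get("target", "")).lower()
    return any(keyword in target for keyword in SENSITIVE_KEYWORDS)


def pause_for_manual_input(settings: dict,
                           sleep: Callable[[float], None] = time.sleep) -> int:
    """Wait wait_time seconds; sleep is injectable so tests need not block."""
    wait_time = settings["task"]["manual_intervention"]["wait_time"]
    sleep(wait_time)
    return wait_time


settings = {"task": {"manual_intervention": {
    "enabled": True, "wait_on_fill": True, "wait_time": 30}}}
print(should_pause({"action": "fill", "target": "Email address"}, settings))
print(should_pause({"action": "click", "target": "Sign in"}, settings))
```

Injecting the sleep function keeps the 30-second wait out of automated tests while leaving the default behavior unchanged.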

Tips

The browser window stays open and interactive during execution
You can manually click, type, or interact at any time
The agent resumes automatically once the configured wait_time expires
For best results, type during the pause period when prompted

Troubleshooting

Ollama connection errors: Ensure Ollama is running on localhost:11434
Model not found: Pull required models using ollama pull <model-name>
Browser errors: Run playwright install chromium
Timeout errors: Increase timeout values in configs/settings.yaml
405 Method Not Allowed in Chrome: This is expected. A browser sends a GET request, while the Ollama API endpoints used by the agent accept only POST. The browser shows 405, but the agent itself works fine.
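
To check the API the same way the agent would, send a POST instead of opening the URL in a browser. The sketch below builds such a request against Ollama's /api/generate endpoint using only the standard library; the helper name build_generate_request and the prompt text are assumptions, and the actual send is left commented out because it needs a running Ollama server.

```python
import json
import urllib.request


def build_generate_request(model: str, prompt: str,
                           base_url: str = "http://localhost:11434"
                           ) -> urllib.request.Request:
    """Build a POST request for Ollama's /api/generate endpoint."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_generate_request("qwen3-vl:4b", "Describe this page.")
print(request.get_method(), request.full_url)

# To send it (requires Ollama running and the model pulled):
# with urllib.request.urlopen(request, timeout=60) as response:
#     print(json.loads(response.read())["response"])
```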
