DOM Processing and Element Extraction

Overview

The Browser.AI DOM Processing Service analyzes web pages, extracts their interactive elements, and gives AI agents a structured view of the DOM. It runs JavaScript inside the browser to inspect the live page. It then processes the results in Python to model the page's structure and the elements a user can interact with.

Architecture

DOM Service Components

graph TB
    subgraph "DOM Processing Pipeline"
        DS[DOM Service]
        JS[JavaScript Injector]
        Builder[DOM Tree Builder]
        Parser[Node Parser]
        Processor[History Tree Processor]
    end
    
    subgraph "Browser Integration"
        Page[Playwright Page]
        Eval[JavaScript Evaluation]
        Elements[DOM Elements]
        Viewport[Viewport State]
    end
    
    subgraph "Data Structures"
        Tree[DOM Tree]
        SelectorMap[Selector Map]
        Coordinates[Coordinate System]
        History[Element History]
    end
    
    DS --> JS
    JS --> Builder
    Builder --> Parser
    Parser --> Processor
    Page --> Eval
    Eval --> Elements
    Elements --> Tree
    Tree --> SelectorMap
    Viewport --> Coordinates
    SelectorMap --> History
    
    style DS fill:#fff3e0
    style Tree fill:#e8f5e8
    style SelectorMap fill:#e1f5fe

Core DOM Service Structure

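from importlib import resources

from playwright.async_api import Page

class DomService:
    def __init__(self, page: Page):
        self.page = page
        self.xpath_cache = {}  # Memoized XPath strings
        # Client-side analysis script bundled with the package
        self.js_code = resources.read_text('browser_ai.dom', 'buildDomTree.js')

    async def get_clickable_elements(
        self,
        highlight_elements: bool = True,
        focus_element: int = -1,
        viewport_expansion: int = 0,
    ) -> 'DOMState':
        ...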

JavaScript DOM Tree Builder

Client-Side DOM Analysis

DOM processing begins in the browser. The service executes buildDomTree.js there to analyze the page structure:

// buildDomTree.js - Core DOM analysis engine
(args = {
    doHighlightElements: true,
    focusHighlightIndex: -1,
    viewportExpansion: 0,
}) => {
    const { doHighlightElements, focusHighlightIndex, viewportExpansion } = args;
    let highlightIndex = 0;
    
    const DOM_HASH_MAP = {};
    
    // Element highlighting and analysis functions
    function highlightElement(element, index, parentIframe = null) {
        // Visual highlighting with color-coded indices
        // Coordinate calculation and viewport analysis
    }
    
    function extractElementInfo(element) {
        // Extract all relevant element properties
        // Calculate positions and visibility
    }
}

Element Analysis Process

sequenceDiagram
    participant Python as DOM Service
    participant JS as JavaScript Engine
    participant Browser as Browser DOM
    participant Elements as DOM Elements
    
    Python->>JS: Execute buildDomTree.js
    JS->>Browser: Traverse DOM Tree
    Browser->>Elements: Query Element Properties
    Elements-->>Browser: Element Data
    Browser-->>JS: Processed Information
    
    loop For Each Interactive Element
        JS->>Browser: Check Visibility
        JS->>Browser: Calculate Coordinates
        JS->>Browser: Extract Attributes
        JS->>Browser: Apply Highlighting
    end
    
    JS-->>Python: Return DOM Map + Metadata
    Python->>Python: Build Python DOM Tree

Element Selection Criteria

The JavaScript engine identifies interactive elements based on:

Tag-based Selection
- Input elements (input, textarea, select)
- Interactive elements (button, a, label)
- Form elements (form, fieldset)
Attribute-based Selection
- Elements with onclick handlers
- Elements with href attributes
- Elements with tabindex
- Elements with ARIA roles
Computed Style Analysis
- cursor: pointer elements
- Non-zero dimensions
- Visible elements (not display: none)
Contextual Analysis
- Parent container interactivity
- Event listener presence
- Focus capability
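The tag, attribute, and computed-style criteria above can be sketched as a single predicate. This is a Python illustration, not the logic of buildDomTree.js. `ElementSnapshot`, `is_interactive`, and the role list are hypothetical. The contextual checks (event listeners, parent interactivity, focus capability) are left out because they need live browser state.

```python
from dataclasses import dataclass, field

@dataclass
class ElementSnapshot:
    # Hypothetical snapshot of the properties the criteria inspect
    tag: str
    attributes: dict[str, str] = field(default_factory=dict)
    cursor: str = "auto"    # computed style `cursor`
    display: str = "block"  # computed style `display`
    width: float = 0.0
    height: float = 0.0

# Tag list mirrors the criteria above; the role set is illustrative
INTERACTIVE_TAGS = {"input", "textarea", "select", "button", "a", "label", "form", "fieldset"}
INTERACTIVE_ROLES = {"button", "link", "checkbox", "menuitem", "tab", "textbox"}
INTERACTIVE_ATTRIBUTES = ("onclick", "href", "tabindex")

def is_visible(el: ElementSnapshot) -> bool:
    # Non-zero dimensions and not `display: none`
    return el.width > 0 and el.height > 0 and el.display != "none"

def is_interactive(el: ElementSnapshot) -> bool:
    if not is_visible(el):
        return False
    if el.tag.lower() in INTERACTIVE_TAGS:  # tag-based selection
        return True
    if any(name in el.attributes for name in INTERACTIVE_ATTRIBUTES):  # attribute-based
        return True
    if el.attributes.get("role") in INTERACTIVE_ROLES:  # ARIA roles
        return True
    return el.cursor == "pointer"  # computed-style hint
```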

Python DOM Tree Structure

DOM Node Hierarchy

classDiagram
    class DOMBaseNode {
        +bool is_visible
        +Optional[DOMElementNode] parent
    }
    
    class DOMElementNode {
        +str tag_name
        +Optional[int] highlight_index
        +dict[str, str] attributes
        +list[DOMBaseNode] children
        +str xpath
        +Optional[CoordinateSet] viewport_coordinates
        +Optional[CoordinateSet] page_coordinates
        +Optional[ViewportInfo] viewport_info
        +bool shadow_root
        +get_advanced_css_selector()
        +to_dict()
        +from_dict()
    }
    
    class DOMTextNode {
        +str text
        +str type = 'TEXT_NODE'
        +has_parent_with_highlight_index()
    }
    
    DOMBaseNode <|-- DOMElementNode
    DOMBaseNode <|-- DOMTextNode
    DOMElementNode --> DOMBaseNode : children

Data Structures

Coordinate System

from pydantic import BaseModel

class Coordinates(BaseModel):
    x: int
    y: int

class CoordinateSet(BaseModel):
    top_left: Coordinates
    top_right: Coordinates
    bottom_left: Coordinates
    bottom_right: Coordinates
    center: Coordinates
    width: int
    height: int
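Given a bounding box, such as the DOMRect returned by `getBoundingClientRect()`, every field of a `CoordinateSet` follows from simple arithmetic. The sketch below uses dataclass stand-ins for the models so it runs without pydantic. `coordinate_set_from_rect` is a hypothetical helper. Rounding to integers is an assumption based on the `int` fields.

```python
from dataclasses import dataclass

# Stand-ins for the pydantic models above
@dataclass
class Coordinates:
    x: int
    y: int

@dataclass
class CoordinateSet:
    top_left: Coordinates
    top_right: Coordinates
    bottom_left: Coordinates
    bottom_right: Coordinates
    center: Coordinates
    width: int
    height: int

def coordinate_set_from_rect(x: float, y: float, width: float, height: float) -> CoordinateSet:
    """Derive corners and center from a box whose top-left corner is (x, y)."""
    left, top = round(x), round(y)
    w, h = round(width), round(height)
    right, bottom = left + w, top + h
    return CoordinateSet(
        top_left=Coordinates(left, top),
        top_right=Coordinates(right, top),
        bottom_left=Coordinates(left, bottom),
        bottom_right=Coordinates(right, bottom),
        center=Coordinates(left + w // 2, top + h // 2),
        width=w,
        height=h,
    )
```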

Viewport Information

class ViewportInfo(BaseModel):
    scroll_x: int
    scroll_y: int
    width: int
    height: int
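Viewport and page coordinates differ only by the scroll offset. A point's page coordinates are its viewport coordinates plus `scroll_x` and `scroll_y`. Here is a minimal sketch, using dataclass stand-ins and hypothetical helper names:

```python
from dataclasses import dataclass

@dataclass
class Coordinates:
    x: int
    y: int

@dataclass
class ViewportInfo:
    scroll_x: int
    scroll_y: int
    width: int
    height: int

def viewport_to_page(point: Coordinates, viewport: ViewportInfo) -> Coordinates:
    # Page position = position within the viewport + how far the page is scrolled
    return Coordinates(point.x + viewport.scroll_x, point.y + viewport.scroll_y)

def page_to_viewport(point: Coordinates, viewport: ViewportInfo) -> Coordinates:
    return Coordinates(point.x - viewport.scroll_x, point.y - viewport.scroll_y)

def is_point_in_viewport(point: Coordinates, viewport: ViewportInfo) -> bool:
    # `point` must already be in viewport coordinates
    return 0 <= point.x < viewport.width and 0 <= point.y < viewport.height
```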

DOM State

class DOMState(BaseModel):
    element_tree: DOMElementNode      # Root of the DOM tree
    selector_map: SelectorMap         # Index -> Element mapping
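Assume `SelectorMap` is a plain dictionary from highlight index to element node, consistent with the comment above. Under that assumption, one walk over the tree fills it. The node classes here are simplified stand-ins, and `build_selector_map` is hypothetical:

```python
from dataclasses import dataclass, field
from typing import Optional, Union

@dataclass
class TextNode:
    text: str

@dataclass
class ElementNode:
    tag_name: str
    highlight_index: Optional[int] = None
    children: list[Union["ElementNode", TextNode]] = field(default_factory=list)

SelectorMap = dict[int, ElementNode]

def build_selector_map(root: ElementNode) -> SelectorMap:
    """Collect every highlighted element, keyed by its highlight index."""
    selector_map: SelectorMap = {}
    stack = [root]
    while stack:
        node = stack.pop()
        if node.highlight_index is not None:
            selector_map[node.highlight_index] = node
        # Only element nodes carry highlight indices, so text nodes are skipped
        stack.extend(c for c in node.children if isinstance(c, ElementNode))
    return selector_map
```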

Element Processing Workflow

Complete DOM Processing Pipeline

graph TD
    Start[Page Load] --> Validate[Validate JavaScript]
    Validate --> Inject[Inject buildDomTree.js]
    Inject --> Execute[Execute with Parameters]
    Execute --> Parse[Parse JavaScript Results]
    Parse --> BuildTree[Build Python DOM Tree]
    BuildTree --> ProcessNodes[Process Node Relationships]
    ProcessNodes --> MapSelectors[Create Selector Map]
    MapSelectors --> Cleanup[Memory Cleanup]
    Cleanup --> Return[Return DOM State]
    
    subgraph "Node Processing"
        ProcessNodes --> ElementNodes[Process Element Nodes]
        ProcessNodes --> TextNodes[Process Text Nodes]
        ProcessNodes --> Coordinates[Calculate Coordinates]
        ProcessNodes --> Attributes[Extract Attributes]
    end
    
    subgraph "Tree Building"
        BuildTree --> ParentChild[Establish Parent-Child]
        BuildTree --> Highlighting[Apply Highlighting]
        BuildTree --> Indexing[Create Index Map]
    end
    
    style Start fill:#e8f5e8
    style Return fill:#e8f5e8

Node Parsing Algorithm

def _parse_node(self, node_data: dict) -> tuple[Optional[DOMBaseNode], list[int]]:
    """Parse JavaScript node data into Python DOM node"""
    
    # Handle text nodes
    if node_data.get('type') == 'TEXT_NODE':
        return DOMTextNode(
            text=node_data['text'],
            is_visible=node_data['isVisible'],
            parent=None
        ), []
    
    # Process element coordinates
    viewport_coordinates = None
    page_coordinates = None
    viewport_info = None
    
    # Extract coordinate information
    if 'viewportCoordinates' in node_data:
        viewport_coordinates = CoordinateSet(...)
        
    if 'pageCoordinates' in node_data:
        page_coordinates = CoordinateSet(...)
        
    if 'viewport' in node_data:
        viewport_info = ViewportInfo(...)
    
    # Create element node
    return DOMElementNode(...), children_ids
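`_parse_node` returns child IDs rather than child objects, so the tree has to be linked in a second pass. The sketch below assumes the JavaScript result is a flat map from integer node ID to node data, with a `children` list of IDs. The `tagName` and `children` keys, `Node`, `parse_node`, and `build_tree` are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    tag_name: str
    # Back-reference excluded from repr/eq to avoid infinite recursion
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)
    children: list["Node"] = field(default_factory=list)

def parse_node(node_data: dict) -> tuple[Node, list[int]]:
    # Stand-in for _parse_node: build the node and report its child IDs
    return Node(tag_name=node_data["tagName"]), node_data.get("children", [])

def build_tree(js_node_map: dict[int, dict], js_root_id: int) -> Node:
    """Pass 1 parses every entry; pass 2 wires up parent/child links."""
    node_map: dict[int, Node] = {}
    children_by_id: dict[int, list[int]] = {}
    for node_id, node_data in js_node_map.items():
        node, children_ids = parse_node(node_data)
        node_map[node_id] = node
        children_by_id[node_id] = children_ids

    for node_id, children_ids in children_by_id.items():
        parent = node_map[node_id]
        for child_id in children_ids:
            child = node_map.get(child_id)
            if child is None:
                continue  # Skip IDs with no parsed node
            child.parent = parent
            parent.children.append(child)

    return node_map[js_root_id]
```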

History Tree Processing

Element Tracking and Persistence

The History Tree Processor tracks elements across page changes. It reduces each element to a set of hashes that identify it, then searches a later DOM tree for an element with the same hashes:

graph LR
    subgraph "History Processing"
        Convert[Convert to History Element]
        Hash[Generate Element Hash]
        Find[Find in Current Tree]
        Match[Match Elements]
    end
    
    subgraph "Hashing Strategy"
        Branch[Branch Path Hash]
        Attr[Attributes Hash]
        XPath[XPath Hash]
        Combined[Combined Hash]
    end
    
    Convert --> Hash
    Hash --> Branch
    Hash --> Attr
    Hash --> XPath
    Branch --> Combined
    Attr --> Combined
    XPath --> Combined
    Combined --> Find
    Find --> Match
    
    style Convert fill:#fff3e0
    style Match fill:#e8f5e8

Element Identity Management

class HistoryTreeProcessor:
    """Operations on DOM elements for history tracking"""
    
    @staticmethod
    def convert_dom_element_to_history_element(
        dom_element: DOMElementNode
    ) -> DOMHistoryElement:
        """Convert current DOM element to history-trackable format"""
        
        parent_branch_path = HistoryTreeProcessor._get_parent_branch_path(dom_element)
        css_selector = dom_element.get_advanced_css_selector()
        
        return DOMHistoryElement(
            dom_element.tag_name,
            dom_element.xpath,
            dom_element.highlight_index,
            parent_branch_path,
            dom_element.attributes,
            dom_element.shadow_root,
            css_selector=css_selector,
            page_coordinates=dom_element.page_coordinates,
            viewport_coordinates=dom_element.viewport_coordinates,
            viewport_info=dom_element.viewport_info,
        )
    
    @staticmethod
    def find_history_element_in_tree(
        dom_history_element: DOMHistoryElement,
        tree: DOMElementNode
    ) -> Optional[DOMElementNode]:
        """Find historical element in current DOM tree"""
        
        hashed_history = HistoryTreeProcessor._hash_dom_history_element(dom_history_element)
        
        def process_node(node: DOMElementNode):
            if node.highlight_index is not None:
                hashed_node = HistoryTreeProcessor._hash_dom_element(node)
                if hashed_node == hashed_history:
                    return node
            
            for child in node.children:
                if isinstance(child, DOMElementNode):
                    result = process_node(child)
                    if result is not None:
                        return result
            return None
        
        return process_node(tree)

Element Hashing Strategy

from dataclasses import dataclass

@dataclass
class HashedDomElement:
    """Hash representation of DOM element for identity tracking"""
    branch_path_hash: str     # Hash of parent element path
    attributes_hash: str      # Hash of element attributes
    xpath_hash: str           # Hash of XPath selector
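One way to produce and compare these hashes is sketched below, using SHA-256 from `hashlib`. Several details are assumptions: the exact encodings (tags joined with `/`, attributes sorted and rendered as `key=value;`), whether the branch path includes the element itself, and the names `get_parent_branch_path` and `hash_element`. Only the three-field structure comes from the class above. The dataclass is frozen, so two hashed elements compare equal exactly when all three fields match, as `hashed_node == hashed_history` requires.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

@dataclass
class Element:
    # Simplified stand-in for DOMElementNode
    tag_name: str
    parent: Optional["Element"] = None

@dataclass(frozen=True)
class HashedDomElement:
    branch_path_hash: str
    attributes_hash: str
    xpath_hash: str

def _sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def get_parent_branch_path(element: Element) -> list[str]:
    """Tag names from the root down to (and including) the element."""
    path = []
    node: Optional[Element] = element
    while node is not None:
        path.append(node.tag_name)
        node = node.parent
    path.reverse()
    return path

def hash_element(branch_path: list[str], attributes: dict[str, str], xpath: str) -> HashedDomElement:
    # Sorting makes the attribute hash independent of attribute order
    attrs_text = "".join(f"{k}={v};" for k, v in sorted(attributes.items()))
    return HashedDomElement(
        branch_path_hash=_sha256("/".join(branch_path)),
        attributes_hash=_sha256(attrs_text),
        xpath_hash=_sha256(xpath),
    )
```

Because every hash must match, changing any attribute value or moving the element makes it unfindable. The strategy favors precision over tolerance of page changes.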

Advanced DOM Features

Element Highlighting System

graph TB
    subgraph "Visual Highlighting"
        Create[Create Highlight Container]
        Generate[Generate Color Scheme]
        Position[Calculate Positions]
        Overlay[Create Visual Overlay]
    end
    
    subgraph "Color System"
        Colors[Iron Man Theme Colors]
        Index[Index-based Selection]
        Focus[Focus Element Emphasis]
    end
    
    subgraph "Coordinate Mapping"
        Viewport[Viewport Coordinates]
        Page[Page Coordinates]
        Relative[Relative Positioning]
    end
    
    Create --> Generate
    Generate --> Position
    Position --> Overlay
    Colors --> Index
    Index --> Focus
    Viewport --> Relative
    Page --> Relative
    
    style Create fill:#e8f5e8
    style Colors fill:#fff3e0

Shadow DOM Support

The DOM service descends into Shadow DOM subtrees. `element.shadowRoot` is null for closed shadow roots, so only open shadow roots can be traversed this way:

// Shadow DOM traversal in buildDomTree.js
function processShadowRoots(element) {
    if (element.shadowRoot) {
        // Process shadow DOM content
        const shadowElements = extractFromShadowRoot(element.shadowRoot);
        return processShadowElements(shadowElements);
    }
    return [];
}

Viewport-Aware Processing

async def get_clickable_elements(
    self,
    highlight_elements: bool = True,
    focus_element: int = -1,
    viewport_expansion: int = 0,  # Expand viewport analysis area
) -> DOMState:
    """Extract clickable elements with viewport awareness"""

The viewport_expansion parameter extends analysis beyond the visible viewport, which is useful for:

- Pre-loading elements that will become visible
- Analyzing scroll-triggered content
- Comprehensive page understanding
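Assume `viewport_expansion` is measured in CSS pixels and applied both above and below the visible area. The check then reduces to an interval-overlap test on the element's viewport-relative bounds. `is_within_expanded_viewport` is hypothetical and handles only the vertical axis:

```python
def is_within_expanded_viewport(
    top: float, bottom: float, viewport_height: float, viewport_expansion: int
) -> bool:
    """True if [top, bottom] overlaps [-expansion, viewport_height + expansion].

    `top` and `bottom` are viewport-relative, as from getBoundingClientRect().
    """
    return bottom >= -viewport_expansion and top <= viewport_height + viewport_expansion
```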

Performance Optimization

Memory Management

import gc

# Keep only the root of the parsed tree, then drop the intermediate maps
html_to_dict = node_map[js_root_id]

del node_map
del js_node_map
del js_root_id

gc.collect()  # Force a garbage-collection pass

Caching Strategy

class DomService:
    def __init__(self, page: Page):
        self.xpath_cache = {}  # Cache XPath calculations
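Caching helps because an element's XPath is built from its ancestors' XPaths, so siblings share most of the work. The sketch below builds positional XPaths like `html/body/div[2]` and memoizes them per node. `Node` and `xpath_for` are hypothetical, and keying the cache by node identity is an assumption about how it might be used.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass(eq=False)  # identity-based equality/hashing so nodes can be dict keys
class Node:
    tag_name: str
    parent: Optional["Node"] = None
    children: list["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child

def xpath_for(node: Node, cache: dict) -> str:
    """Build an XPath such as 'html/body/div[2]', reusing cached ancestor paths."""
    if node in cache:
        return cache[node]
    if node.parent is None:
        xpath = node.tag_name
    else:
        same_tag = [c for c in node.parent.children if c.tag_name == node.tag_name]
        step = node.tag_name
        if len(same_tag) > 1:
            step += f"[{same_tag.index(node) + 1}]"  # XPath positions are 1-based
        xpath = f"{xpath_for(node.parent, cache)}/{step}"
    cache[node] = xpath
    return xpath
```

Positions change whenever the DOM changes, so a cache like this is only valid for one snapshot of the tree.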

Selective Processing

The DOM service supports selective element processing:

- Attribute Filtering: Include only specified attributes
- Element Type Filtering: Focus on specific element types
- Visibility Filtering: Process only visible elements
- Region-based Processing: Analyze specific page regions
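The first three filters can be expressed as one pass over a selector map. `filter_elements`, its parameters, and the output shape are hypothetical. Region-based filtering would add a bounds check on each element's coordinates.

```python
from dataclasses import dataclass, field
from typing import Iterable, Optional

@dataclass
class Element:
    # Simplified stand-in for DOMElementNode
    tag_name: str
    attributes: dict[str, str] = field(default_factory=dict)
    is_visible: bool = True

def filter_elements(
    selector_map: dict[int, Element],
    tags: Optional[set[str]] = None,
    include_attributes: Optional[Iterable[str]] = None,
    visible_only: bool = True,
) -> dict[int, dict]:
    keep = set(include_attributes) if include_attributes is not None else None
    result = {}
    for index, el in selector_map.items():
        if visible_only and not el.is_visible:  # visibility filtering
            continue
        if tags is not None and el.tag_name not in tags:  # element type filtering
            continue
        attrs = el.attributes
        if keep is not None:  # attribute filtering
            attrs = {k: v for k, v in attrs.items() if k in keep}
        result[index] = {"tag_name": el.tag_name, "attributes": attrs}
    return result
```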

Integration with Agent System

DOM State in Agent Flow

sequenceDiagram
    participant Agent
    participant Browser
    participant DOMService
    participant JavaScript
    
    Agent->>Browser: get_current_state()
    Browser->>DOMService: get_clickable_elements()
    DOMService->>JavaScript: Execute buildDomTree.js
    JavaScript-->>DOMService: DOM Map + Metadata
    DOMService->>DOMService: Build Python Tree
    DOMService-->>Browser: DOM State
    Browser-->>Agent: Browser State with DOM
    
    Agent->>Agent: Analyze Interactive Elements
    Agent->>Browser: Execute Action on Element[index]

Element Selection for Actions

# Agent uses selector map to find elements
selector_map = dom_state.selector_map
target_element = selector_map[highlight_index]

# Execute action on specific element
await controller.execute_action({
    "click_element": {"index": highlight_index}
})

Error Handling and Resilience

JavaScript Execution Validation

# Validate JavaScript execution capability
if await self.page.evaluate('1+1') != 2:
    raise ValueError('The page cannot evaluate javascript code properly')

DOM Parsing Robustness

def _parse_node(self, node_data: dict):
    if not node_data:
        return None, []
    
    # Graceful handling of missing data
    # Fallback strategies for malformed DOM
    # Type validation for all extracted data

Memory Safety

- Automatic cleanup of large DOM structures
- Garbage collection after processing
- Cache size limits for XPath storage
- Selective data retention
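A least-recently-used policy is one way to enforce the XPath cache-size limit listed above. The `xpath_cache` in the service is a plain dict, so `BoundedCache` below is a hypothetical illustration built on `collections.OrderedDict`, not the service's actual implementation.

```python
from collections import OrderedDict

class BoundedCache:
    """LRU cache holding at most `max_size` entries."""

    def __init__(self, max_size: int = 1000):
        if max_size <= 0:
            raise ValueError("max_size must be positive")
        self.max_size = max_size
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict the least recently used entry

    def __len__(self) -> int:
        return len(self._data)
```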

Usage Examples

Basic DOM Extraction

import asyncio

from browser_ai import DomService
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto("https://example.com")

        dom_service = DomService(page)
        dom_state = await dom_service.get_clickable_elements(
            highlight_elements=True,
            focus_element=-1,
            viewport_expansion=100,
        )

        print(f"Found {len(dom_state.selector_map)} interactive elements")

        # Access a specific element by highlight index, if any were found
        if 0 in dom_state.selector_map:
            first_element = dom_state.selector_map[0]
            print(f"Element: {first_element.tag_name}")
            print(f"Attributes: {first_element.attributes}")

        await browser.close()

asyncio.run(main())

Advanced Processing with History Tracking

from browser_ai.dom.history_tree_processor.service import HistoryTreeProcessor

# Runs inside an async function, reusing `page` and `dom_service` from the previous example

# Extract DOM state
dom_state = await dom_service.get_clickable_elements()

# Convert elements to history format
history_elements = []
for element in dom_state.selector_map.values():
    history_element = HistoryTreeProcessor.convert_dom_element_to_history_element(element)
    history_elements.append(history_element)

# Navigate to new page
await page.goto("https://example.com/page2")

# Get new DOM state
new_dom_state = await dom_service.get_clickable_elements()

# Find matching elements
for history_element in history_elements:
    current_element = HistoryTreeProcessor.find_history_element_in_tree(
        history_element, 
        new_dom_state.element_tree
    )
    if current_element:
        print(f"Found matching element: {current_element.highlight_index}")

Best Practices

1. Resource Management

# Close the browser even if DOM extraction fails; the DOM state itself
# is ordinary Python data and is reclaimed by garbage collection
try:
    dom_state = await dom_service.get_clickable_elements()
    # Process DOM state
finally:
    await browser.close()

2. Error Handling

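import logging

logger = logging.getLogger(__name__)

try:
    dom_state = await dom_service.get_clickable_elements()
except ValueError as e:
    # Raised, for example, when the page cannot evaluate JavaScript
    logger.error(f"DOM processing failed: {e}")
    # Implement fallback strategy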

3. Performance Tuning

# For large pages, consider viewport expansion limits
dom_state = await dom_service.get_clickable_elements(
    highlight_elements=True,      # Enable for visual debugging
    focus_element=-1,             # No focus highlighting
    viewport_expansion=0          # Limit to visible area only
)

In summary, the DOM service pairs in-browser JavaScript extraction with Python-side processing. Together they give AI-driven browser automation a structured, index-addressable view of each page.
