Web scraping fallback for retailers without APIs #491

@kovtcharov

Description

Summary

Add a web scraping fallback using BeautifulSoup (and optionally Playwright from #458) to extract product prices from retailer websites that lack public APIs — extending coverage beyond Best Buy and SerpApi.

Motivation

Many retailers (Amazon, Walmart, Target, Newegg) don't offer free public product APIs. Web scraping fills this gap, enabling the DealAgent to track prices across a wider range of sources. This builds on the BrowserToolsMixin (#458) being developed in the v0.17.0 milestone.

Design

Scraper Architecture

# src/gaia/agents/deals/tools/scraper_tools.py
from abc import ABC, abstractmethod
from typing import Dict, List

# ProductResult is the normalized product schema shared with the API-backed tools.

class RetailerScraper(ABC):
    """Base class for retailer-specific scrapers."""
    name: str = ""
    base_url: str = ""

    @abstractmethod
    def search(self, query: str, max_results: int = 10) -> List[ProductResult]: ...

    @abstractmethod
    def get_price(self, url: str) -> ProductResult: ...

class AmazonScraper(RetailerScraper):
    name = "amazon"
    base_url = "https://www.amazon.com"
    # Uses BeautifulSoup for static extraction
    # Falls back to Playwright for JS-rendered content

class WalmartScraper(RetailerScraper):
    name = "walmart"
    base_url = "https://www.walmart.com"

class NeweggScraper(RetailerScraper):
    name = "newegg"
    base_url = "https://www.newegg.com"

class ScraperRegistry:
    """Registry of available scrapers."""
    scrapers: Dict[str, RetailerScraper] = {}

    def register(self, scraper: RetailerScraper): ...
    def get(self, name: str) -> RetailerScraper: ...
    def search_all(self, query: str) -> List[ProductResult]: ...
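A minimal sketch of how the registry could behave. `ProductResult` is stubbed here as a plain dataclass purely for illustration, since its real definition lives elsewhere in the deals agent; the error-handling choices (skip a failing retailer rather than abort the whole search) are assumptions, not decided design.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ProductResult:  # stand-in for the project's normalized schema
    name: str
    price: float
    url: str
    source: str


class ScraperRegistry:
    """Registry of available scrapers, keyed by retailer name."""

    def __init__(self) -> None:
        self.scrapers: Dict[str, object] = {}

    def register(self, scraper) -> None:
        self.scrapers[scraper.name] = scraper

    def get(self, name: str):
        try:
            return self.scrapers[name]
        except KeyError:
            raise ValueError(f"No scraper registered for {name!r}")

    def search_all(self, query: str) -> List[ProductResult]:
        # Aggregate results across every registered scraper; one failing
        # retailer should not sink the whole search.
        results: List[ProductResult] = []
        for scraper in self.scrapers.values():
            try:
                results.extend(scraper.search(query))
            except Exception:
                continue
        return results
```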

Scraper Tool

class ScraperToolsMixin:
    def register_scraper_tools(self) -> None:
        from gaia.agents.base.tools import tool

        @tool
        def scrape_price(url: str) -> Dict:
            """Extract current price from a product URL.

            Args:
                url: Direct product page URL from any supported retailer
            """

        @tool
        def scrape_search(query: str, retailers: str = "all") -> Dict:
            """Search for products by scraping retailer websites (fallback when APIs unavailable).

            Args:
                query: Product search query
                retailers: Comma-separated retailer names or "all"
            """

Ethical Scraping Practices

  • Respect robots.txt — check before scraping
  • Rate limit: max 1 request/second per domain
  • User-Agent: identify as GAIA bot
  • Cache scraped results for 1 hour to reduce load
  • Terms of Service: document which sites allow scraping
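The first two practices above could be sketched with the standard library alone. `urllib.robotparser` handles the robots.txt check, and a per-domain timestamp map enforces the one-request-per-second limit; the `gaia-bot` user-agent string and helper names are placeholders, not settled project conventions.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "gaia-bot"          # placeholder identifier
MIN_INTERVAL = 1.0               # seconds between requests to one domain

_robots: dict = {}               # domain -> RobotFileParser (cached)
_last_request: dict = {}         # domain -> monotonic timestamp


def allowed_by_robots(url: str) -> bool:
    """Check robots.txt before scraping, caching the parsed file per domain."""
    domain = urlparse(url).netloc
    rp = _robots.get(domain)
    if rp is None:
        rp = RobotFileParser(f"https://{domain}/robots.txt")
        try:
            rp.read()
        except OSError:
            # If robots.txt is unreachable, err on the side of not scraping.
            return False
        _robots[domain] = rp
    return rp.can_fetch(USER_AGENT, url)


def throttle(url: str) -> None:
    """Sleep as needed to keep at most one request per second per domain."""
    domain = urlparse(url).netloc
    elapsed = time.monotonic() - _last_request.get(domain, 0.0)
    if elapsed < MIN_INTERVAL:
        time.sleep(MIN_INTERVAL - elapsed)
    _last_request[domain] = time.monotonic()
```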

Integration with BrowserToolsMixin (#458)

If Playwright is available (from v0.17.0 BrowserToolsMixin), use it for JavaScript-heavy sites. Otherwise, fall back to requests + BeautifulSoup for static HTML.

def _fetch_page(self, url: str) -> str:
    """Fetch page HTML, using Playwright if available, else requests."""
    try:
        # Capability check: BrowserToolsMixin is only importable when the
        # optional Playwright extra is installed.
        from gaia.agents.base.browser_tools import BrowserToolsMixin  # noqa: F401
        return self._fetch_with_playwright(url)
    except ImportError:
        return self._fetch_with_requests(url)
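On the static path, the BeautifulSoup side might look like the sketch below. The `.price` selector and the `extract_price` helper are illustrative assumptions (each `RetailerScraper` would supply its own selector); the stdlib `html.parser` backend is used here so the sketch runs even without lxml installed.

```python
import re

from bs4 import BeautifulSoup


def extract_price(html: str, selector: str = ".price") -> float:
    """Pull the first price-looking value out of a CSS-selected element.

    The selector is illustrative; real retailer pages need per-site selectors.
    """
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one(selector)
    if node is None:
        raise ValueError(f"No element matches {selector!r}")
    match = re.search(r"\$?(\d[\d,]*\.?\d*)", node.get_text())
    if match is None:
        raise ValueError("No price found in element text")
    return float(match.group(1).replace(",", ""))
```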

Acceptance Criteria

  • RetailerScraper base class with search() and get_price() methods
  • At least 2 retailer scrapers implemented (e.g., Amazon, Newegg)
  • scrape_price extracts price from a product URL
  • scrape_search searches across scraped retailers
  • robots.txt respected before scraping
  • Rate limiting: 1 req/s per domain
  • Results cached for 1 hour
  • Falls back gracefully if Playwright unavailable
  • Results normalized to ProductResult matching API results
  • Unit tests with saved HTML fixtures (no live scraping in CI)
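The one-hour result cache from the criteria above could be as simple as a timestamped dict; the names and the injectable `now` parameter (which makes expiry testable without sleeping) are assumptions for this sketch.

```python
import time
from typing import Any, Dict, Optional, Tuple

CACHE_TTL = 3600.0  # seconds (1 hour)

_cache: Dict[str, Tuple[float, Any]] = {}


def cache_put(key: str, value: Any, now: Optional[float] = None) -> None:
    """Store a scraped result with its insertion time."""
    now = time.monotonic() if now is None else now
    _cache[key] = (now, value)


def cache_get(key: str, now: Optional[float] = None) -> Optional[Any]:
    """Return a cached value if younger than CACHE_TTL, evicting stale entries."""
    now = time.monotonic() if now is None else now
    entry = _cache.get(key)
    if entry is None:
        return None
    stored_at, value = entry
    if now - stored_at > CACHE_TTL:
        del _cache[key]
        return None
    return value
```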

Phase

Phase 3 — Visualization & Intelligence

Dependencies

Cross-References

New Dependencies

| Package | Version | License | Purpose |
|---|---|---|---|
| beautifulsoup4 | >=4.12 | MIT | HTML parsing for price extraction |
| lxml | >=4.9 | BSD | Fast HTML parser backend |

Metadata

Assignees: none

Labels: agent, deals (DealAgent: price tracking and deal discovery), enhancement (New feature or request), p2 (low priority)
