Summary
Add a web scraping fallback using BeautifulSoup (and optionally Playwright from #458) to extract product prices from retailer websites that lack public APIs — extending coverage beyond Best Buy and SerpApi.
Motivation
Many retailers (Amazon, Walmart, Target, Newegg) don't offer free public product APIs. Web scraping fills this gap, enabling the DealAgent to track prices across a wider range of sources. This builds on the BrowserToolsMixin (#458) being developed in the v0.17.0 milestone.
Design
Scraper Architecture
```python
# src/gaia/agents/deals/tools/scraper_tools.py
from abc import ABC, abstractmethod
from typing import Dict, List


class RetailerScraper(ABC):
    """Base class for retailer-specific scrapers."""

    name: str = ""
    base_url: str = ""

    @abstractmethod
    def search(self, query: str, max_results: int = 10) -> List[ProductResult]: ...

    @abstractmethod
    def get_price(self, url: str) -> ProductResult: ...


class AmazonScraper(RetailerScraper):
    name = "amazon"
    base_url = "https://www.amazon.com"
    # Uses BeautifulSoup for static extraction.
    # Falls back to Playwright for JS-rendered content.


class WalmartScraper(RetailerScraper):
    name = "walmart"
    base_url = "https://www.walmart.com"


class NeweggScraper(RetailerScraper):
    name = "newegg"
    base_url = "https://www.newegg.com"


class ScraperRegistry:
    """Registry of available scrapers."""

    scrapers: Dict[str, RetailerScraper] = {}

    def register(self, scraper: RetailerScraper): ...
    def get(self, name: str) -> RetailerScraper: ...
    def search_all(self, query: str) -> List[ProductResult]: ...
```
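The `ProductResult` type referenced above is assumed to match the shape of the existing API results; a minimal sketch of it, together with a concrete registry (field names and method bodies are illustrative, not the actual schema), might look like:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ProductResult:
    # Illustrative fields only; the real schema should mirror the
    # existing Best Buy / SerpApi result type.
    name: str
    price: float
    url: str
    retailer: str
    in_stock: Optional[bool] = None


class ScraperRegistry:
    """Registry of available scrapers (sketch)."""

    def __init__(self) -> None:
        self.scrapers: Dict[str, object] = {}

    def register(self, scraper) -> None:
        # Scrapers are keyed by their declared name ("amazon", "walmart", ...).
        self.scrapers[scraper.name] = scraper

    def get(self, name: str):
        return self.scrapers[name]

    def search_all(self, query: str) -> List[ProductResult]:
        # Flatten results from every registered scraper.
        results: List[ProductResult] = []
        for scraper in self.scrapers.values():
            results.extend(scraper.search(query))
        return results
```

Keying the registry by `scraper.name` keeps tool-level lookups (`retailers="amazon,newegg"`) a simple dictionary access.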
Scraper Tool
```python
class ScraperToolsMixin:
    def register_scraper_tools(self) -> None:
        from gaia.agents.base.tools import tool

        @tool
        def scrape_price(url: str) -> Dict:
            """Extract current price from a product URL.

            Args:
                url: Direct product page URL from any supported retailer
            """

        @tool
        def scrape_search(query: str, retailers: str = "all") -> Dict:
            """Search for products by scraping retailer websites
            (fallback when APIs are unavailable).

            Args:
                query: Product search query
                retailers: Comma-separated retailer names or "all"
            """
```
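One way to normalize the comma-separated `retailers` argument before dispatching to the registry (the helper name is hypothetical, not part of the proposed API) could be:

```python
from typing import List


def parse_retailers(retailers: str, known: List[str]) -> List[str]:
    """Normalize the comma-separated `retailers` argument (hypothetical helper).

    "all" (the default) expands to every registered scraper; unknown
    names are dropped rather than raising, so one bad entry does not
    fail the whole search.
    """
    if retailers.strip().lower() == "all":
        return list(known)
    requested = [r.strip().lower() for r in retailers.split(",") if r.strip()]
    return [r for r in requested if r in known]
```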
Ethical Scraping Practices
- Respect robots.txt — check before scraping
- Rate limit: max 1 request/second per domain
- User-Agent: identify as a GAIA bot
- Cache scraped results for 1 hour to reduce load
- Terms of Service: document which sites allow scraping
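The robots.txt check and per-domain rate limit can be sketched with the standard library alone; the class name and User-Agent string below are placeholders, while the 1 request/second figure comes from this proposal:

```python
import time
import urllib.robotparser
from urllib.parse import urlparse

USER_AGENT = "GAIA-DealAgent/0.1"  # placeholder; real UA should link to bot docs


class PoliteFetcher:
    """Per-domain throttling plus robots.txt checks (sketch)."""

    def __init__(self, min_interval: float = 1.0) -> None:
        self.min_interval = min_interval   # max 1 request/second per domain
        self._last_request: dict = {}      # domain -> last request timestamp
        self._robots: dict = {}            # domain -> cached RobotFileParser

    def allowed(self, url: str) -> bool:
        """Check robots.txt before scraping, caching the parser per domain."""
        domain = urlparse(url).netloc
        rp = self._robots.get(domain)
        if rp is None:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(f"https://{domain}/robots.txt")
            try:
                rp.read()                  # network call; may fail offline
            except OSError:
                return True                # policy choice: allow if robots.txt unreachable
            self._robots[domain] = rp
        return rp.can_fetch(USER_AGENT, url)

    def throttle(self, url: str) -> None:
        """Sleep as needed so each domain sees at most one request per interval."""
        domain = urlparse(url).netloc
        elapsed = time.monotonic() - self._last_request.get(domain, 0.0)
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_request[domain] = time.monotonic()
```

Throttling per domain (rather than globally) lets searches across several retailers proceed in parallel without hammering any single site.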
Integration with BrowserToolsMixin (#458)
If Playwright is available (from v0.17.0 BrowserToolsMixin), use it for JavaScript-heavy sites. Otherwise, fall back to requests + BeautifulSoup for static HTML.
```python
def _fetch_page(self, url: str) -> str:
    """Fetch page HTML, using Playwright if available, else requests."""
    try:
        # Probe for the BrowserToolsMixin from #458; the import itself
        # is the availability check.
        from gaia.agents.base.browser_tools import BrowserToolsMixin  # noqa: F401
        return self._fetch_with_playwright(url)
    except ImportError:
        return self._fetch_with_requests(url)
```
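For the static-HTML path, price extraction with BeautifulSoup might look like the following; the CSS selector and function name are illustrative, since real retailer pages need per-site selectors:

```python
import re

from bs4 import BeautifulSoup  # beautifulsoup4; pass "lxml" instead of "html.parser" for speed


def extract_price(html: str, selector: str = "span.price") -> float:
    """Pull a numeric price out of product-page HTML (illustrative selector)."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one(selector)
    if node is None:
        raise ValueError(f"no element matched {selector!r}")
    # Accept "$1,299.99", "1299.99", etc.; strip thousands separators.
    match = re.search(r"[\d,]+\.?\d*", node.get_text())
    if match is None:
        raise ValueError("no numeric price found")
    return float(match.group().replace(",", ""))
```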
Acceptance Criteria
- `RetailerScraper` base class with `search()` and `get_price()` methods
- `scrape_price` extracts price from a product URL
- `scrape_search` searches across scraped retailers
- `ProductResult` matching API results
Phase
Phase 3 — Visualization & Intelligence
Dependencies
Cross-References
New Dependencies

| Package | Version | License | Purpose |
|---------|---------|---------|---------|
| beautifulsoup4 | >=4.12 | MIT | HTML parsing for price extraction |
| lxml | >=4.9 | BSD | Fast HTML parser backend |