Open
Description
Overview
Implement a robust, extensible product data scraper in Python that works with Shopify stores built on the Dawn theme. The tool should:
- Scrape single product pages for title, price, description, images, SKU, brand, variants, and raw JSON-LD.
- Crawl collection pages, including multi-page pagination, and extract all product URLs and details.
- Support both static HTML scraping (requests + BeautifulSoup) and JavaScript-rendered pages (Playwright integration).
- Output results to JSON or CSV.
- Identify collection and product pages via CSS selectors or schema.org markup (JSON-LD).
- Include rate limiting, retry logic, and error handling.
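To make the JSON-LD requirement concrete, here is a minimal sketch of how product fields could be pulled from a page's schema.org data. It uses only the standard library (`html.parser`) so it runs without dependencies; the real tool would presumably use BeautifulSoup as listed above. The names `JsonLdCollector` and `extract_product` and the output keys are hypothetical, and the field mapping assumes the page exposes a standard schema.org `Product` with `offers`, which may not hold for every store.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_product(html):
    """Return a flat dict for the first schema.org Product found, else None."""
    collector = JsonLdCollector()
    collector.feed(html)
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the whole page
        for item in data if isinstance(data, list) else [data]:
            if not (isinstance(item, dict) and item.get("@type") == "Product"):
                continue
            # schema.org allows a single object or a list for these fields
            offers = item.get("offers") or []
            if isinstance(offers, dict):
                offers = [offers]
            brand = item.get("brand")
            if isinstance(brand, dict):
                brand = brand.get("name")
            images = item.get("image") or []
            if isinstance(images, str):
                images = [images]
            return {
                "title": item.get("name"),
                "description": item.get("description"),
                "sku": item.get("sku"),
                "brand": brand,
                "images": images,
                "variants": [
                    {"sku": o.get("sku"), "price": o.get("price"),
                     "currency": o.get("priceCurrency")}
                    for o in offers if isinstance(o, dict)
                ],
                "raw_jsonld": item,
            }
    return None
```

A page without usable JSON-LD returns `None`, which could be the signal to fall back to selector-based extraction or Playwright.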
Features
- CLI and library API
- Single product scraper (requests + BeautifulSoup, falling back to Playwright if needed)
- Collection crawler (pagination detected via rel="next", a "Next" anchor, or configurable CSS selectors)
- Concurrency and optional delay
- Output format: JSON/CSV
- Site adapter for Shopify Dawn (structured extraction of variants/pricing/SKU)
- Logs and progress reporting
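As a rough illustration of the pagination detection mentioned for the collection crawler, the sketch below checks for `rel="next"` first and falls back to an anchor whose text starts with "Next". It is written against the standard library rather than BeautifulSoup to stay dependency-free, and the names `NextLinkFinder`, `find_next_page`, and `crawl_collection` are hypothetical. The `fetch` argument is any callable that maps a URL to HTML, so the same loop could sit on top of requests or Playwright. Selector-based detection is not covered here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Records the first rel="next" href and the first "Next"-labelled anchor."""

    def __init__(self):
        super().__init__()
        self.rel_next = None
        self.text_next = None
        self._open_href = None
        self._open_text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href")
        rel = (attrs.get("rel") or "").lower().split()
        if tag in ("a", "link") and href and "next" in rel and self.rel_next is None:
            self.rel_next = href
        if tag == "a" and href:
            # start collecting the anchor's visible text
            self._open_href = href
            self._open_text = []

    def handle_data(self, data):
        if self._open_href is not None:
            self._open_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._open_href is not None:
            label = "".join(self._open_text).strip().lower()
            if label.startswith("next") and self.text_next is None:
                self.text_next = self._open_href
            self._open_href = None


def find_next_page(html, base_url):
    """Return the absolute URL of the next page, or None if there is none."""
    finder = NextLinkFinder()
    finder.feed(html)
    href = finder.rel_next or finder.text_next  # rel="next" takes priority
    return urljoin(base_url, href) if href else None


def crawl_collection(start_url, fetch, max_pages=50):
    """Yield (url, html) per page; stops on a repeated URL or at max_pages."""
    url, seen = start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        html = fetch(url)
        yield url, html
        url = find_next_page(html, url)
```

The `seen` set and `max_pages` cap are there so a store whose last page links back to itself cannot trap the crawler.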
Motivation
Shopify Dawn stores often use dynamic rendering and structured data (JSON-LD, schema.org). Manual extraction is error-prone and slow. This tool will automate data extraction for bulk operations, analytics, and migrations.
Labels
- Category: Enhancement
- 📁 Section: Featured Product
- 🗂️ Template: Collection
Acceptance Criteria
- Scrapes all key product data (including variants and images) from product pages
- Crawls collection pages and follows pagination automatically
- Works on both static and JS-heavy Shopify Dawn pages
- CLI exposes options for concurrency, selectors, and output format
- Includes a site adapter for the Shopify Dawn page structure
- Error logs for failed pages
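For discussion, one possible shape for the CLI options and output formats named in the criteria above, using `argparse`, `json`, and `csv` from the standard library. The program name `dawn-scraper`, every flag name, and the defaults are assumptions made for illustration only; nothing here is decided.

```python
import argparse
import csv
import io
import json


def build_parser():
    """Hypothetical CLI surface; flag names and defaults are placeholders."""
    parser = argparse.ArgumentParser(prog="dawn-scraper")
    parser.add_argument("url", help="product or collection URL")
    parser.add_argument("--concurrency", type=int, default=4,
                        help="number of pages fetched in parallel")
    parser.add_argument("--delay", type=float, default=0.0,
                        help="seconds to wait between requests")
    parser.add_argument("--product-selector", default=None,
                        help="CSS selector for product links on collection pages")
    parser.add_argument("--format", choices=["json", "csv"], default="json")
    parser.add_argument("--output", default="-",
                        help="output path, or '-' for stdout")
    return parser


def serialize(products, fmt):
    """Render a list of flat product dicts as JSON or CSV text."""
    if fmt == "json":
        return json.dumps(products, indent=2)
    # CSV: use the union of all keys so rows with missing fields still fit;
    # nested values such as variant lists would need flattening first.
    fields = sorted({key for product in products for key in product})
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(products)
    return buf.getvalue()
```

An invocation under these assumptions might look like `dawn-scraper https://example.com/collections/all --format csv --concurrency 8`.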
Out of Scope
- Scraping of non-Shopify storefronts
- Bypassing CAPTCHAs or login walls
Let me know if you want Playwright integration or adapters for other platforms to be part of the MVP.