Hybrid Product & Collection Data Scraper Tool for Shopify Dawn #3869

@waleednaeem330-gif

Overview

Implement a robust, extensible product data scraper in Python that works with Shopify stores built on Dawn. The tool should:

  • Scrape single product pages for title, price, description, images, SKU, brand, variants, and raw JSON-LD.
  • Crawl collection pages, including multi-page pagination, and extract all product URLs and details.
  • Support both static HTML scraping (requests + BeautifulSoup) and JavaScript-rendered pages (Playwright integration).
  • Output results to JSON or CSV.
  • Identify collection and product pages via CSS selectors or schema.org markup.
  • Include rate limiting, retry logic, and error handling.
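As a rough illustration of the single-product path, here is a minimal sketch that pulls a schema.org `Product` out of a page's JSON-LD. It uses only the standard library's `html.parser` in place of BeautifulSoup so it runs without dependencies. The names `JsonLdCollector` and `extract_product` and the output field names are hypothetical, and it assumes the page exposes a top-level `Product` object, which may not hold for every Dawn store.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_product(html):
    """Return a flat dict for the first schema.org Product found, else None."""
    collector = JsonLdCollector()
    collector.feed(html)
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the whole page
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            if not (isinstance(item, dict) and item.get("@type") == "Product"):
                continue
            # "offers", "brand" and "image" may each be a single value or a list.
            offers = item.get("offers") or []
            if isinstance(offers, dict):
                offers = [offers]
            first_offer = offers[0] if offers and isinstance(offers[0], dict) else {}
            brand = item.get("brand")
            if isinstance(brand, dict):
                brand = brand.get("name")
            images = item.get("image") or []
            if isinstance(images, str):
                images = [images]
            return {
                "title": item.get("name"),
                "description": item.get("description"),
                "sku": item.get("sku"),
                "brand": brand,
                "images": images,
                "price": first_offer.get("price"),
                "raw_jsonld": item,  # keep the untouched JSON-LD for later use
            }
    return None
```

Variants are not covered here; a Dawn adapter would presumably need another source for those, which this sketch leaves open.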

Features

  • CLI and library API
  • Single product scraper (requests + BeautifulSoup, falling back to Playwright if needed)
  • Collection crawler (pagination detection via rel="next", a "Next" anchor, or custom selectors)
  • Concurrency and optional delay
  • Output format: JSON/CSV
  • Site adapter for Shopify Dawn (structured extraction of variants/pricing/SKU)
  • Logs and progress reporting
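To make the pagination idea concrete, the sketch below checks for `rel="next"` first and falls back to an anchor whose text starts with "Next". It again uses the standard library rather than BeautifulSoup. `NextLinkFinder`, `find_next_page`, and `crawl_collection` are hypothetical names, and `fetch` is a caller-supplied function (for example a thin wrapper around `requests.get`), so no network access is assumed here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Record the first rel="next" href and the first anchor labelled "Next"."""

    def __init__(self):
        super().__init__()
        self.rel_next = None
        self.text_next = None
        self._open_href = None
        self._open_text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href")
        rels = (attrs.get("rel") or "").lower().split()
        if tag in ("a", "link") and href and "next" in rels and self.rel_next is None:
            self.rel_next = href
        if tag == "a" and href:
            self._open_href = href
            self._open_text = []

    def handle_data(self, data):
        if self._open_href is not None:
            self._open_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._open_href is not None:
            label = "".join(self._open_text).strip().lower()
            if label.startswith("next") and self.text_next is None:
                self.text_next = self._open_href
            self._open_href = None


def find_next_page(html, base_url):
    """Return the absolute URL of the next page, or None if there is none."""
    finder = NextLinkFinder()
    finder.feed(html)
    href = finder.rel_next or finder.text_next
    return urljoin(base_url, href) if href else None


def crawl_collection(start_url, fetch, max_pages=50):
    """Yield (url, html) for each page; stop on a loop, a dead end, or max_pages."""
    seen = set()
    url = start_url
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        html = fetch(url)
        yield url, html
        url = find_next_page(html, url)
```

The `seen` set and `max_pages` cap are there so a misdetected "Next" link cannot send the crawler into an endless loop.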

Motivation

Shopify Dawn stores often combine client-side rendering with structured data (JSON-LD, schema.org). Manual extraction is slow and error-prone. This tool will automate data extraction for bulk operations, analytics, and migrations.

Labels

  • Category: Enhancement
  • 📁 Section: Featured Product
  • 🗂️ Template: Collection

Acceptance Criteria

  • Scrapes all key product data (including variants and images) from product pages
  • Crawls collection pages and follows pagination automatically
  • Works on both static and JS-heavy Shopify Dawn pages
  • CLI exposes options for concurrency, selectors, and output format
  • Includes a site adapter for the Shopify Dawn page structure
  • Error logs for failed pages
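One possible shape for the CLI and output criteria is sketched below with `argparse`, `json`, and `csv`. Every flag name (`--mode`, `--concurrency`, `--delay`, `--product-selector`, `--format`, `--output`), the program name `dawn-scraper`, and the defaults are assumptions for discussion, not a settled interface.

```python
import argparse
import csv
import io
import json


def build_parser():
    """Build the argument parser; all option names here are placeholders."""
    parser = argparse.ArgumentParser(
        prog="dawn-scraper",
        description="Scrape product or collection data from a Shopify Dawn store.",
    )
    parser.add_argument("url", help="product or collection URL to start from")
    parser.add_argument("--mode", choices=["product", "collection"], default="product")
    parser.add_argument("--concurrency", type=int, default=4,
                        help="maximum number of pages fetched in parallel")
    parser.add_argument("--delay", type=float, default=0.0,
                        help="seconds to wait between requests")
    parser.add_argument("--product-selector", default=None,
                        help="CSS selector for product links on collection pages")
    parser.add_argument("--format", choices=["json", "csv"], default="json")
    parser.add_argument("--output", default="-", help="output path, or - for stdout")
    return parser


def serialize(records, fmt):
    """Render a list of product dicts as a JSON or CSV string."""
    if fmt == "json":
        return json.dumps(records, indent=2, ensure_ascii=False)
    # CSV columns are the union of all keys, sorted for a stable header.
    fieldnames = sorted({key for record in records for key in record})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Nested values (variants, images) are JSON-encoded to fit in one cell.
        writer.writerow({
            key: json.dumps(value) if isinstance(value, (list, dict)) else value
            for key, value in record.items()
        })
    return buffer.getvalue()
```

For example, `build_parser().parse_args(["https://shop.example/collections/all", "--mode", "collection", "--format", "csv"])` would select a collection crawl with CSV output under these assumed flags.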

Out of Scope

  • Scraping of non-Shopify storefronts
  • Bypassing CAPTCHAs or login walls

Let me know if you want Playwright integration or adapters for other platforms to be part of the MVP.
