Feature: Scrapy plugin for Pydoll (scrapy-pydoll) #248

@thalissonvs

Description

Make it trivial to use Pydoll inside Scrapy without custom glue code. The plugin should let a spider opt in per request to drive a headless tab, run small actions (clicks, waits), and get back a rendered HtmlResponse that works with Scrapy selectors. It should feel like standard Scrapy, powered by Pydoll only when needed.

Proposed API

  • Installable optional plugin: pip install scrapy-pydoll
  • Enable via settings:
PYDOLL_ENABLED = True
PYDOLL_CONCURRENCY = 2
PYDOLL_BROWSER_OPTIONS = { "geolocation": "GB", "headless": True }
  • Per-request opt-in (meta) or helper Request:
yield scrapy.Request(
    url,
    meta={
        "pydoll": {
            "actions": [
                {"type": "wait", "for": "networkidle"},
                {"type": "click", "selector": "#show-more"},
            ],
            "timeout": 15000,
        },
        "cookiejar": "sessionA",
    },
    callback=self.parse_page,
)

# or
yield PydollRequest(url, actions=[...], timeout=15000)
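
The issue does not say how PydollRequest would map onto the meta form above. A minimal sketch, assuming the helper is plain sugar that fills in meta["pydoll"] (and optionally meta["cookiejar"]). The function name build_pydoll_meta and its parameters are hypothetical, not part of any existing API:

```python
from typing import Any, Optional


def build_pydoll_meta(
    actions: Optional[list[dict[str, Any]]] = None,
    timeout: int = 15000,
    cookiejar: Optional[str] = None,
    **extra_meta: Any,
) -> dict[str, Any]:
    """Build a Request.meta dict in the shape proposed above (hypothetical helper)."""
    meta: dict[str, Any] = dict(extra_meta)
    # Copy each action dict so later edits by the caller do not leak into the request.
    meta["pydoll"] = {
        "actions": [dict(action) for action in (actions or [])],
        "timeout": timeout,
    }
    # Only set the cookiejar key when a session name was actually requested.
    if cookiejar is not None:
        meta["cookiejar"] = cookiejar
    return meta


meta = build_pydoll_meta(
    actions=[
        {"type": "wait", "for": "networkidle"},
        {"type": "click", "selector": "#show-more"},
    ],
    timeout=15000,
    cookiejar="sessionA",
)
```

Under that assumption, a PydollRequest subclass would only need to pass this dict as meta to scrapy.Request, so both spellings would reach the plugin in the same shape.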

Requirements (MVP)

  • Deterministic rendered HtmlResponse compatible with .css() / .xpath()
  • Wait strategies: networkidle, selector, sleep(ms)
  • Small action set: click, type, scroll
  • Per-request headers/cookies merged with Pydoll context
  • Session reuse by cookiejar; graceful shutdown on spider_closed
  • Timeouts and exhausted retries surfaced as IgnoreRequest or a similar exception
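
One way the plugin could enforce the small action set is to validate meta["pydoll"]["actions"] before opening a tab. A rough sketch, assuming a schema that extends the two action shapes shown in the proposal. The "selector" key on selector waits, the "ms" key on sleep waits, and the "text" key on type actions are guesses, as is the function name validate_actions:

```python
from typing import Any

# Required keys per wait strategy; "selector" and "ms" are assumed key names.
WAIT_REQUIRED_KEYS = {
    "networkidle": (),
    "selector": ("selector",),
    "sleep": ("ms",),
}

# Required keys per interaction; "text" is an assumed key name.
ACTION_REQUIRED_KEYS = {
    "click": ("selector",),
    "type": ("selector", "text"),
    "scroll": (),
}


def validate_actions(actions: list[dict[str, Any]]) -> None:
    """Raise ValueError for the first action that does not match the assumed schema."""
    for index, action in enumerate(actions):
        kind = action.get("type")
        if kind == "wait":
            strategy = action.get("for")
            if strategy not in WAIT_REQUIRED_KEYS:
                raise ValueError(f"action {index}: unknown wait strategy {strategy!r}")
            required = WAIT_REQUIRED_KEYS[strategy]
        elif kind in ACTION_REQUIRED_KEYS:
            required = ACTION_REQUIRED_KEYS[kind]
        else:
            raise ValueError(f"action {index}: unknown type {kind!r}")
        missing = [key for key in required if key not in action]
        if missing:
            raise ValueError(f"action {index}: missing keys {missing}")


# The action list from the proposal passes without raising.
validate_actions([
    {"type": "wait", "for": "networkidle"},
    {"type": "click", "selector": "#show-more"},
])
```

Failing fast on a malformed action keeps the error in the spider's own process and avoids spending a browser tab on a request that cannot succeed.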

Follow-ups

  • Optionally attach Markdown (return_markdown=True) once the exporter exists
  • Network record on error (integration with recorder feature)
  • Page bundle snapshot on exception for offline debugging
  • WebPoet/Scrapy-Poet provider to inject a Tab or rendered HTML

Example Spider

import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com/products",
            meta={"pydoll": {
                "actions": [{"type": "wait", "for": "networkidle"}],
                "timeout": 15000
            }},
            callback=self.parse_list
        )

    def parse_list(self, response):
        for href in response.css(".item a::attr(href)").getall():
            yield scrapy.Request(
                response.urljoin(href),
                meta={"pydoll": {"actions": [{"type": "click", "selector": "#accept"}]}},
                callback=self.parse_item
            )

    def parse_item(self, response):
        yield {
            "title": response.css("h1::text").get(),
            "price": response.css(".price::text").get(),
        }

Metadata

Labels

enhancement (New feature or request), future planning (Ideas or features proposed for future development)
