Add optional dynamic crawler middleware for click simulation and anti-crawling retries #37

Draft
Copilot wants to merge 5 commits into release-2026 from copilot/add-dynamic-web-crawlers

Conversation

Copilot AI commented May 10, 2026

Contributor

This PR introduces first-class support for dynamic crawling workflows in scrapy-distributed, focused on simulated click flows and basic anti-crawling resilience without changing existing scheduler/pipeline behavior.

  • Dynamic crawling middleware

    • Added scrapy_distributed.middlewares.dynamic.DynamicCrawlerMiddleware.
    • Supports request-level click simulation via dynamic_click_selectors metadata.
    • When scrapy-playwright is available, selectors are translated into playwright_page_methods automatically.
  • Anti-crawling hardening

    • Adds configurable User-Agent rotation (DYNAMIC_CRAWLER_USER_AGENTS).
    • Adds configurable proxy rotation (DYNAMIC_CRAWLER_PROXIES).
    • Adds bounded retry-on-block behavior for configurable statuses (defaults include 403/429) via:
      • DYNAMIC_CRAWLER_BLOCK_STATUSES
      • DYNAMIC_CRAWLER_MAX_RETRY_TIMES
  • Public surface + docs

    • Exported the new middleware module from scrapy_distributed.middlewares.
    • Extended README.md with setup/usage for dynamic crawling and click selector metadata.
  • Focused test coverage

    • Added tests/test_dynamic_middleware.py covering:
      • UA/proxy assignment
      • click selector → Playwright method conversion
      • retry creation for blocked responses
      • retry cap behavior
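
The click selector to Playwright method conversion described above could look roughly like the following. This is a minimal sketch, not the code in this PR: the helper name `apply_click_selectors` is hypothetical, `playwright_page_methods` is assumed to be a list, and a small stand-in `PageMethod` class is defined so the snippet still runs when scrapy-playwright is not installed (here the conversion always happens, whereas the middleware only does it when scrapy-playwright is available).

```python
try:
    # Real class from scrapy-playwright, used when the package is installed.
    from scrapy_playwright.page import PageMethod
except ImportError:
    class PageMethod:
        """Stand-in with the same constructor shape (assumption for this sketch)."""

        def __init__(self, method, *args, **kwargs):
            self.method = method
            self.args = args
            self.kwargs = kwargs


def apply_click_selectors(meta):
    """Return a copy of `meta` with one Playwright click per configured selector.

    Hypothetical helper: the middleware in this PR may structure this differently.
    """
    selectors = meta.get("dynamic_click_selectors") or []
    if not selectors:
        # Nothing to simulate, so leave the request metadata untouched.
        return meta

    new_meta = dict(meta)
    new_meta["playwright"] = True  # ask scrapy-playwright to handle the request
    # Preserve any page methods the spider already set (assumed to be a list).
    methods = list(new_meta.get("playwright_page_methods", []))
    methods.extend(PageMethod("click", selector) for selector in selectors)
    new_meta["playwright_page_methods"] = methods
    return new_meta
```

Returning a copy rather than mutating the input keeps the sketch easy to test; a real downloader middleware would more likely update `request.meta` in place inside `process_request`.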

Example usage:

DOWNLOADER_MIDDLEWARES = {
    "scrapy_distributed.middlewares.dynamic.DynamicCrawlerMiddleware": 540,
}

DYNAMIC_CRAWLER_USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
]
DYNAMIC_CRAWLER_PROXIES = ["http://127.0.0.1:7890"]
DYNAMIC_CRAWLER_BLOCK_STATUSES = [403, 429]
DYNAMIC_CRAWLER_MAX_RETRY_TIMES = 2

# Per-request dynamic click simulation (inside a spider callback; requires `import scrapy`)
yield scrapy.Request(
    url,
    meta={"dynamic_click_selectors": ["#load-more", ".next-page"]},
)
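
For reviewers who want a mental model of the rotation and bounded retry settings above, here is a dependency-free sketch. It is an illustration under stated assumptions rather than the middleware's actual code: rotation is modeled as round-robin (the PR does not say whether selection is sequential or random), the retry counter key `dynamic_retry_times` and all function names are hypothetical, and plain dicts stand in for Scrapy request headers and meta.

```python
from itertools import cycle

USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
]
PROXIES = ["http://127.0.0.1:7890"]
BLOCK_STATUSES = {403, 429}  # mirrors DYNAMIC_CRAWLER_BLOCK_STATUSES
MAX_RETRY_TIMES = 2          # mirrors DYNAMIC_CRAWLER_MAX_RETRY_TIMES
RETRY_KEY = "dynamic_retry_times"  # hypothetical meta key for the retry counter


def make_rotator(values):
    """Return an endless round-robin iterator, or None if nothing is configured."""
    return cycle(values) if values else None


def assign_identity(headers, meta, ua_rotator, proxy_rotator):
    """Set the next User-Agent header and proxy on dict-based request data."""
    if ua_rotator is not None:
        headers["User-Agent"] = next(ua_rotator)
    if proxy_rotator is not None:
        # "proxy" is the meta key read by Scrapy's built-in HttpProxyMiddleware.
        meta["proxy"] = next(proxy_rotator)


def next_retry_meta(meta, status):
    """Return meta for a retry request, or None when no retry should be issued."""
    if status not in BLOCK_STATUSES:
        return None  # response is not treated as blocked
    retries = meta.get(RETRY_KEY, 0)
    if retries >= MAX_RETRY_TIMES:
        return None  # retry cap reached; let the blocked response through
    new_meta = dict(meta)
    new_meta[RETRY_KEY] = retries + 1
    return new_meta
```

In a real Scrapy downloader middleware, `assign_identity` would correspond to work done in `process_request`, and `next_retry_meta` to a check in `process_response` that returns a copy of the request with the updated meta.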

Copilot AI linked an issue May 10, 2026 that may be closed by this pull request
Copilot AI changed the title from "[WIP] Add dynamic web crawlers with simulation click technology" to "Add optional dynamic crawler middleware for click simulation and anti-crawling retries" May 10, 2026
Copilot AI requested a review from Insutanto May 10, 2026 13:08

Development

Successfully merging this pull request may close these issues.

dynamic web crawlers

2 participants