
How to gracefully stop spider from CrawlerProcess after N consecutive timeouts? #10

@nilsinhojrx

Description

Hi bro, how's it going? Thanks for the project! 👋

I’m using aio-scrapy with CrawlerProcess and I’m struggling to cleanly stop a spider when the target site starts timing out.


Goal

I’d like to stop the spider (and end the process) when there are 5 consecutive timeout errors (after retries), while running the spider via CrawlerProcess.


Environment

  • OS: Windows (ProactorEventLoop)
  • Python: 3.11 (Anaconda)
  • aio-scrapy: (please fill in version)
  • Also using: aiohttp, playwright (for login/credentials), but the problem seems limited to aio-scrapy’s crawler/engine.

What I’m doing

I have a spider that:

  • sends POST requests to a single endpoint
  • uses RETRY_TIMES and DOWNLOAD_TIMEOUT
  • counts consecutive final failures in an errback:
import logging

from aioscrapy import Spider
from aioscrapy.http import FormRequest  # import paths may differ across aio-scrapy versions

logger = logging.getLogger(__name__)


class StockFetcherSpider(Spider):
    custom_settings = {
        "RETRY_TIMES": 1,
        "DOWNLOAD_TIMEOUT": 20,
        "CLOSE_SPIDER_ON_IDLE": True,
        "CONCURRENT_REQUESTS": 8,
    }

    def __init__(self, output_dir, product_codes, credentials, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._response_errors = 0
        self._max_response_errors = 5
        self._stop_flag = False
        self.product_codes = product_codes
        self.COOKIES = credentials["cookies"]
        self.REQUEST_TOKEN = credentials["request_token"]

    async def start_requests(self):
        for code in self.product_codes:
            if self._stop_flag:
                logger.warning("Stop flag set, not scheduling new requests.")
                break

            form_data = {...}  # real payload omitted

            yield FormRequest(
                url=self.URL,  # endpoint URL, set elsewhere in the real spider
                formdata=form_data,
                cookies=self.COOKIES,
                callback=self.parse,
                meta={"product_code": code},
                errback=self.errback_request,
            )

    async def errback_request(self, failure):
        self._response_errors += 1
        logger.warning(
            f"Final failure (after retries): {repr(failure)} "
            f"- {self._response_errors}/{self._max_response_errors}"
        )

        if self._response_errors >= self._max_response_errors:
            logger.critical("Too many consecutive failures.")
            # here I tried different ways to stop the spider

Runner (simplified):

from aioscrapy.crawler import CrawlerProcess  # import path may differ across aio-scrapy versions

process = CrawlerProcess()
process.crawl(
    StockFetcherSpider,
    output_dir=output_dir,
    product_codes=product_codes,
    credentials=creds,
)
process.start()
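
For context, this is roughly the event-loop setup on Windows before the runner above (my environment; on Python 3.8+ the ProactorEventLoop is already the default policy on Windows, so this mostly documents what the process runs on):

import asyncio
import sys

# Explicitly select the Proactor loop on Windows (already the default on 3.8+).
if sys.platform == "win32":
    asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())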

What I tried

  1. raise CloseSpider(...) in errback_request
from aioscrapy.exceptions import CloseSpider

if self._response_errors >= self._max_response_errors:
    raise CloseSpider("too_many_consecutive_failures")

Logs show:

Closing spider (too_many_consecutive_failures)
Dumping aioscrapy stats:
{
  'finish_reason': 'too_many_consecutive_failures',
  ...
}
Spider closed (too_many_consecutive_failures)

But after that, the process ends with:

RuntimeError: Event loop stopped before Future completed.
Task was destroyed but it is pending!

So CloseSpider works in terms of stats, but shutdown is not clean.


  2. self.crawler.engine.close_spider(...) in errback_request (sketch below)
await self.crawler.engine.close_spider(self, "too_many_consecutive_failures")
  • Without await:
    RuntimeWarning: coroutine 'ExecutionEngine.close_spider' was never awaited

  • With await:
    eventually: AssertionError: assert self.spider is not None inside _spider_idle.
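
Roughly what the errback looks like in this attempt (sketch only; the counter setup is the same as in the spider shown above):

# Inside StockFetcherSpider, attempt 2: close the spider through the engine.
async def errback_request(self, failure):
    self._response_errors += 1
    logger.warning(f"Final failure (after retries): {failure!r}")

    if self._response_errors >= self._max_response_errors:
        logger.critical("Too many consecutive failures.")
        # without await: "coroutine 'ExecutionEngine.close_spider' was never awaited"
        # with await: eventually AssertionError in _spider_idle
        await self.crawler.engine.close_spider(self, "too_many_consecutive_failures")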


  3. Soft stop with _stop_flag + CLOSE_SPIDER_ON_IDLE = True (sketch below)
  • In errback_request, when limit is reached: self._stop_flag = True
  • In start_requests, check flag and break
  • This stops scheduling new requests, but with many timeouts I can still see the same kind of shutdown noise (RuntimeError: Event loop stopped before Future completed), and sometimes the process doesn’t seem to exit cleanly.
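
For reference, the soft-stop variant is roughly this (sketch; start_requests keeps the flag check shown earlier):

# Inside StockFetcherSpider, attempt 3: only set a flag and rely on
# CLOSE_SPIDER_ON_IDLE = True to close the spider once it goes idle.
async def errback_request(self, failure):
    self._response_errors += 1
    logger.warning(f"Final failure (after retries): {failure!r}")

    if self._response_errors >= self._max_response_errors:
        logger.critical("Too many consecutive failures, raising stop flag.")
        self._stop_flag = True  # start_requests checks this and stops yielding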

Questions

  1. What is the recommended way to stop a spider (running under CrawlerProcess) after N consecutive download timeouts?
  2. Are the RuntimeError: Event loop stopped before Future completed and "Task was destroyed but it is pending!" messages expected when using CloseSpider or engine.close_spider with CrawlerProcess?

Any guidance or a minimal example showing the “correct” pattern for stopping a spider after N consecutive timeouts with CrawlerProcess would be really helpful. 🙏

I look forward to your reply.
Best regards.
