Title: How to gracefully stop spider from CrawlerProcess after N consecutive timeouts?
Hi bro, how's it going? Thanks for the project! 👋
I’m using aio-scrapy with CrawlerProcess and I’m struggling to cleanly stop a spider when the target site starts timing out.
Goal
I’d like to stop the spider (and end the process) when there are 5 consecutive timeout errors (after retries), while running the spider via CrawlerProcess.
Environment
- OS: Windows (ProactorEventLoop)
- Python: 3.11 (Anaconda)
- aio-scrapy: (please fill in version)
- Also using: aiohttp, playwright (for login/credentials), but the problem seems limited to aio-scrapy’s crawler/engine.
What I’m doing
I have a spider that:
- sends POST requests to a single endpoint
- uses RETRY_TIMES and DOWNLOAD_TIMEOUT
- counts consecutive final failures in an errback:
class StockFetcherSpider(Spider):
    custom_settings = {
        "RETRY_TIMES": 1,
        "DOWNLOAD_TIMEOUT": 20,
        "CLOSE_SPIDER_ON_IDLE": True,
        "CONCURRENT_REQUESTS": 8,
    }

    def __init__(self, output_dir, product_codes, credentials, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._response_errors = 0
        self._max_response_errors = 5
        self._stop_flag = False
        self.product_codes = product_codes
        self.COOKIES = credentials["cookies"]
        self.REQUEST_TOKEN = credentials["request_token"]

    async def start_requests(self):
        for code in self.product_codes:
            if self._stop_flag:
                logger.warning("Stop flag set, not scheduling new requests.")
                break
            form_data = {...}
            yield FormRequest(
                url=self.URL,
                formdata=form_data,
                cookies=self.COOKIES,
                callback=self.parse,
                meta={"product_code": code},
                errback=self.errback_request,
            )

    async def errback_request(self, failure):
        self._response_errors += 1
        logger.warning(
            f"Final failure (after retries): {repr(failure)} "
            f"- {self._response_errors}/{self._max_response_errors}"
        )
        if self._response_errors >= self._max_response_errors:
            logger.critical("Too many consecutive failures.")
            # here I tried different ways to stop the spider

Runner (simplified):
process = CrawlerProcess()
process.crawl(
    StockFetcherSpider,
    output_dir=output_dir,
    product_codes=product_codes,
    credentials=creds,
)
process.start()

What I tried
- raise CloseSpider(...) in errback_request
from aioscrapy.exceptions import CloseSpider

if self._response_errors >= self._max_response_errors:
    raise CloseSpider("too_many_consecutive_failures")

Logs show:
Closing spider (too_many_consecutive_failures)
Dumping aioscrapy stats:
{
'finish_reason': 'too_many_consecutive_failures',
...
}
Spider closed (too_many_consecutive_failures)
But after that, the process ends with:
RuntimeError: Event loop stopped before Future completed.
Task was destroyed but it is pending!
So CloseSpider works in terms of stats, but shutdown is not clean.
- self.crawler.engine.close_spider(...) in errback_request
await self.crawler.engine.close_spider(self, "too_many_consecutive_failures")

- Without await:
  RuntimeWarning: coroutine 'ExecutionEngine.close_spider' was never awaited
- With await:
  eventually: AssertionError: assert self.spider is not None inside _spider_idle.
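For context, this is roughly how that call sits inside errback_request (simplified, same names as in the spider above):

# inside StockFetcherSpider (second attempt)
async def errback_request(self, failure):
    self._response_errors += 1
    if self._response_errors >= self._max_response_errors:
        # ask the engine to close the spider directly
        await self.crawler.engine.close_spider(self, "too_many_consecutive_failures")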
- Soft stop with _stop_flag + CLOSE_SPIDER_ON_IDLE = True
- In errback_request, when limit is reached: self._stop_flag = True
- In start_requests, check flag and break
- This stops scheduling new requests, but with many timeouts I can still see the same kind of shutdown noise (RuntimeError: Event loop stopped before Future completed), and sometimes the process doesn’t seem to exit cleanly.
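To make the soft-stop variant concrete, here is a minimal sketch of what I mean (simplified, reusing only the attributes already shown in the spider above; the idea is that with CLOSE_SPIDER_ON_IDLE = True the spider should then close once it goes idle):

# inside StockFetcherSpider (soft-stop variant)
async def errback_request(self, failure):
    self._response_errors += 1
    if self._response_errors >= self._max_response_errors:
        logger.critical("Too many consecutive failures, setting stop flag.")
        self._stop_flag = True  # soft stop: only prevents scheduling new requests

async def start_requests(self):
    for code in self.product_codes:
        if self._stop_flag:
            logger.warning("Stop flag set, not scheduling new requests.")
            break
        # ... build form_data and yield FormRequest as in the full spider above ...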
Questions
- What is the recommended way to stop a spider (running under CrawlerProcess) after N consecutive download timeouts?
- Is the RuntimeError: Event loop stopped before Future completed + "Task was destroyed but it is pending" expected when using CloseSpider or engine.close_spider with CrawlerProcess?
Any guidance or a minimal example showing the “correct” pattern for stopping a spider after N consecutive timeouts with CrawlerProcess would be really helpful. 🙏
I look forward to your reply.
Best regards.