Description
Demo spider, with this setting:

```python
DNS_RESOLVER = "scrapy.resolver.CachingHostnameResolver"
```
```python
import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo_spider'
    start_urls = ['https://[2402:4e00:40:40::2:3b6]']

    def parse(self, response, **kwargs):
        print(response.body)
        print(response)
```
Command to start the spider:

```shell
scrapy crawl demo_spider -s JOBDIR=./jobs/run-1
```
When I use the JOBDIR parameter, it causes an exception:
```
Traceback (most recent call last):
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/commands/crawl.py", line 27, in run
    self.crawler_process.start()
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/crawler.py", line 348, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/twisted/internet/base.py", line 1318, in run
    self.mainLoop()
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/twisted/internet/base.py", line 1328, in mainLoop
    reactorBaseSelf.runUntilCurrent()
--- <exception caught here> ---
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/twisted/internet/base.py", line 994, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/utils/reactor.py", line 51, in __call__
    return self._func(*self._a, **self._kw)
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/core/engine.py", line 147, in _next_request
    while not self._needs_backout() and self._next_request_from_scheduler() is not None:
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/core/engine.py", line 176, in _next_request_from_scheduler
    request = self.slot.scheduler.next_request()
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/core/scheduler.py", line 263, in next_request
    request = self._dqpop()
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/core/scheduler.py", line 299, in _dqpop
    return self.dqs.pop()
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/pqueues.py", line 99, in pop
    m = q.pop()
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/squeues.py", line 78, in pop
    return request_from_dict(request, spider=self.spider)
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/utils/request.py", line 124, in request_from_dict
    return request_cls(**kwargs)
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/http/request/__init__.py", line 60, in __init__
    self._set_url(url)
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/http/request/__init__.py", line 100, in _set_url
    s = safe_url_string(url, self.encoding)
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/w3lib/url.py", line 103, in safe_url_string
    parts.port,
  File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/urllib/parse.py", line 178, in port
    raise ValueError(message) from None
builtins.ValueError: Port could not be cast to integer value as '4e00:40:40::2:3b6'
```
```
2022-10-09 13:57:19 [twisted] CRITICAL: Unhandled Error
```
I debugged and found that the problem is in urllib.parse#L202.
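The failure mode is reproducible with the standard library alone: once the brackets around the IPv6 literal are gone, `urllib.parse` treats everything after the first `:` in the netloc as the port, and accessing `.port` raises exactly the ValueError from the traceback above.

```python
from urllib.parse import urlsplit

# With brackets, the IPv6 literal is parsed as the hostname.
ok = urlsplit("https://[2402:4e00:40:40::2:3b6]")
print(ok.hostname)  # 2402:4e00:40:40::2:3b6
print(ok.port)      # None

# Without brackets (the re-serialized form), everything after the
# first ':' in the netloc is taken to be the port.
bad = urlsplit("https://2402:4e00:40:40::2:3b6")
try:
    bad.port
except ValueError as exc:
    print(exc)  # Port could not be cast to integer value as '4e00:40:40::2:3b6'
```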
When I stopped using the JOBDIR parameter and debugged again, I found that the problem still existed. This time it showed up in middlewares such as CookieJar, RetryMiddleware, RobotsTxtMiddleware, and so on.

The problem seems to be in the creation of the Request instance: it calls self._set_url, which re-serializes the URL https://[2402:4e00:40:40::2:3b6] as https://2402:4e00:40:40::2:3b6, dropping the brackets around the IPv6 address. When a middleware then creates another Request instance based on Request.url, calling self._set_url produces the wrong hostname and port.
Versions

```
$ scrapy version --verbose
Scrapy       : 2.6.3
lxml         : 4.9.1.0
libxml2      : 2.9.4
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 2.0.1
Twisted      : 22.8.0
Python       : 3.9.6 (default, Sep 13 2022, 22:03:16) - [Clang 14.0.0 (clang-1400.0.29.102)]
pyOpenSSL    : 22.0.0 (OpenSSL 3.0.5 5 Jul 2022)
cryptography : 37.0.4
Platform     : macOS-12.6-arm64-arm-64bit
```