
safe_url_string handling IPv6 URLs #193

Open
@Cash111

Description


Demo spider with settings:

DNS_RESOLVER = "scrapy.resolver.CachingHostnameResolver"
import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo_spider'
    start_urls = ['https://[2402:4e00:40:40::2:3b6]']

    def parse(self, response, **kwargs):
        print(response.body)
        print(response)

Command to start the spider:

scrapy crawl demo_spider -s JOBDIR=./jobs/run-1

When I use the JOBDIR parameter, it causes an exception:

Traceback (most recent call last):
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/commands/crawl.py", line 27, in run
    self.crawler_process.start()
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/crawler.py", line 348, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/twisted/internet/base.py", line 1318, in run
    self.mainLoop()
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/twisted/internet/base.py", line 1328, in mainLoop
    reactorBaseSelf.runUntilCurrent()
--- <exception caught here> ---
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/twisted/internet/base.py", line 994, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/utils/reactor.py", line 51, in __call__
    return self._func(*self._a, **self._kw)
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/core/engine.py", line 147, in _next_request
    while not self._needs_backout() and self._next_request_from_scheduler() is not None:
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/core/engine.py", line 176, in _next_request_from_scheduler
    request = self.slot.scheduler.next_request()
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/core/scheduler.py", line 263, in next_request
    request = self._dqpop()
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/core/scheduler.py", line 299, in _dqpop
    return self.dqs.pop()
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/pqueues.py", line 99, in pop
    m = q.pop()
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/squeues.py", line 78, in pop
    return request_from_dict(request, spider=self.spider)
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/utils/request.py", line 124, in request_from_dict
    return request_cls(**kwargs)
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/http/request/__init__.py", line 60, in __init__
    self._set_url(url)
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/scrapy/http/request/__init__.py", line 100, in _set_url
    s = safe_url_string(url, self.encoding)
  File "/Users/ql/workspace/py3env/lib/python3.9/site-packages/w3lib/url.py", line 103, in safe_url_string
    parts.port,
  File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/urllib/parse.py", line 178, in port
    raise ValueError(message) from None
builtins.ValueError: Port could not be cast to integer value as '4e00:40:40::2:3b6'

2022-10-09 13:57:19 [twisted] CRITICAL: Unhandled Error

I debugged and found that the problem was in urllib.parse#L202, as shown below:

[Debugger screenshot: `_hostinfo` is called on a `SplitResult(scheme='https', netloc='2402:4e00:40:40::2:3b6', ...)` — the brackets are already missing from the netloc]

And when I stopped using the JOBDIR parameter and debugged again, I found that the problem still existed. At this point the problem shows up in middlewares such as CookieJar, RetryMiddleware, RobotsTxtMiddleware, and so on.

[Debugger screenshot: `hosts = potential_domain_matches(req_host)` with `req_host` already reduced to `'2402'`]

The problem seems to be in the creation of the Request instance: it calls self._set_url, which parses the URL https://[2402:4e00:40:40::2:3b6] into https://2402:4e00:40:40::2:3b6 .
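The effect of losing the brackets can be reproduced with the standard library alone (no Scrapy needed); this minimal sketch shows why the unbracketed form makes `urlsplit` misparse the host and blow up on `.port`:

```python
from urllib.parse import urlsplit

# Bracketed IPv6 URL: parses correctly.
good = urlsplit("https://[2402:4e00:40:40::2:3b6]")
print(good.hostname)  # 2402:4e00:40:40::2:3b6 (the .hostname property strips the brackets)
print(good.port)      # None

# Unbracketed form, as produced after the brackets are lost:
bad = urlsplit("https://2402:4e00:40:40::2:3b6")
print(bad.hostname)   # 2402 -- everything after the first colon is treated as a port
try:
    bad.port
except ValueError as e:
    print(e)  # Port could not be cast to integer value as '4e00:40:40::2:3b6'
```

The second `ValueError` is exactly the one in the traceback above: `safe_url_string` accesses `parts.port`, which tries to cast `'4e00:40:40::2:3b6'` to an integer.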

When the middlewares create another Request instance based on Request.url, calling self._set_url returns the wrong hostname and port.
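Until this is fixed upstream, the brackets can be restored before the URL string is reused to build another Request. The helper below is hypothetical (not part of w3lib or Scrapy) and assumes the netloc carries no userinfo; it treats a bracket-less netloc with more than one colon as a bare IPv6 address:

```python
from urllib.parse import urlsplit, urlunsplit


def rebracket_ipv6(url: str) -> str:
    """Re-add square brackets around a bare IPv6 host if they were lost.

    Hypothetical workaround helper; assumes the netloc has no userinfo.
    """
    parts = urlsplit(url)
    netloc = parts.netloc
    # More than one colon and no opening bracket => a bare IPv6 address,
    # not a host:port pair such as "example.com:8080".
    if netloc.count(":") > 1 and not netloc.startswith("["):
        parts = parts._replace(netloc=f"[{netloc}]")
    return urlunsplit(parts)


print(rebracket_ipv6("https://2402:4e00:40:40::2:3b6"))
# https://[2402:4e00:40:40::2:3b6]
```

URLs with a regular `host:port` netloc, or an already-bracketed IPv6 host like `[::1]:8080`, pass through unchanged.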

Versions

$ scrapy version --verbose
Scrapy       : 2.6.3
lxml         : 4.9.1.0
libxml2      : 2.9.4
cssselect    : 1.1.0
parsel       : 1.6.0
w3lib        : 2.0.1
Twisted      : 22.8.0
Python       : 3.9.6 (default, Sep 13 2022, 22:03:16) - [Clang 14.0.0 (clang-1400.0.29.102)]
pyOpenSSL    : 22.0.0 (OpenSSL 3.0.5 5 Jul 2022)
cryptography : 37.0.4
Platform     : macOS-12.6-arm64-arm-64bit
