
QNetworkReplyImplPrivate::error: Internal problem, this method must only be called once. #313

@JWBWork

Description

This specific website triggers an exception I can't understand:

QNetworkReplyImplPrivate::error: Internal problem, this method must only be called once.

It causes the Splash Docker container to hang: it becomes unresponsive to all subsequent requests. Running with more verbose logging did not reveal any additional information.

The logs

(.venv) C:\Users\me\path\to\project>docker run -p 8050:8050 scrapinghub/splash:latest                                                                          
2024-06-25 20:39:41+0000 [-] Log opened.
2024-06-25 20:39:41.947216 [-] Xvfb is started: ['Xvfb', ':769163157', '-screen', '0', '1024x768x24', '-nolisten', 'tcp']
QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-splash'
2024-06-25 20:39:42.012362 [-] Splash version: 3.5
2024-06-25 20:39:42.045852 [-] Qt 5.14.1, PyQt 5.14.2, WebKit 602.1, Chromium 77.0.3865.129, sip 4.19.22, Twisted 19.7.0, Lua 5.2
2024-06-25 20:39:42.046036 [-] Python 3.6.9 (default, Jul 17 2020, 12:50:27) [GCC 8.4.0]
2024-06-25 20:39:42.046099 [-] Open files limit: 1048576
2024-06-25 20:39:42.046140 [-] Can't bump open files limit
2024-06-25 20:39:42.061355 [-] proxy profiles support is enabled, proxy profiles path: /etc/splash/proxy-profiles
2024-06-25 20:39:42.061513 [-] memory cache: enabled, private mode: enabled, js cross-domain access: disabled
2024-06-25 20:39:42.170427 [-] verbosity=1, slots=20, argument_cache_max_entries=500, max-timeout=90.0
2024-06-25 20:39:42.170695 [-] Web UI: enabled, Lua: enabled (sandbox: enabled), Webkit: enabled, Chromium: enabled
2024-06-25 20:39:42.171427 [-] Site starting on 8050
2024-06-25 20:39:42.171615 [-] Starting factory <twisted.web.server.Site object at 0x7f96c40ae5c0>
2024-06-25 20:39:42.172103 [-] Server listening on http://0.0.0.0:8050
QNetworkReplyImplPrivate::error: Internal problem, this method must only be called once.
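Since the container stops answering entirely once this error appears, one workaround (not a fix) is an external liveness probe. Splash's HTTP API exposes a `/_ping` endpoint that returns `{"status": "ok"}` when healthy; below is a stdlib-only sketch that checks it so a supervisor could recycle the container. The host/port and the idea of restarting on failure are assumptions on my part:

```python
import json
import urllib.error
import urllib.request


def splash_is_alive(base_url: str = "http://localhost:8050",
                    timeout: float = 5.0) -> bool:
    """Return True if the Splash instance answers its /_ping endpoint."""
    try:
        with urllib.request.urlopen(f"{base_url}/_ping", timeout=timeout) as resp:
            body = json.loads(resp.read().decode("utf-8"))
            return body.get("status") == "ok"
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or malformed reply: treat as hung.
        return False
```

A cron job or Docker `HEALTHCHECK` (combined with a restart policy) could call this and restart the container whenever it returns False.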

Minimal reproduction

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy_splash import SplashRequest


class ResearchSpider(scrapy.Spider):
    name = "research_spider"

    custom_settings = {
        'SPLASH_URL': 'http://localhost:8050',
        'ROBOTSTXT_OBEY': True,
        'DOWNLOAD_DELAY': 2,
        "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        'SPIDER_MIDDLEWARES': {
            'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
        },
        'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
    }

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse
            )

    def parse(self, response):
        print(f"parsing {response.url=}")


def crawl_process(websites: list[str]):
    print(f"Initializing crawler process - {websites=}")
    process = CrawlerProcess()
    process.crawl(ResearchSpider, start_urls=websites)
    process.start()
    print("Completed crawl")


if __name__ == "__main__":
    crawl_process([
        "http://www.crazyplumbers.com/",
    ])
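I haven't found the root cause, but bounding each render can at least keep one bad page from wedging the whole container indefinitely. Splash's render endpoints accept a `timeout` argument (the container's `max-timeout` is 90.0 per the logs above, so anything below that is allowed), and Scrapy requests accept an `errback`. The sketch below shows the arguments I would pass to `SplashRequest`; the specific values are assumptions, and this is a mitigation, not a confirmed fix:

```python
# Splash endpoint arguments: cap each render so a stuck page fails fast
# instead of tying up a slot forever. Values are illustrative.
SPLASH_ARGS = {
    "timeout": 30,  # hard cap (seconds) on the whole render inside Splash
    "wait": 1,      # seconds to wait after page load before returning
}


def splash_errback(failure):
    # Called by Scrapy when the render fails or times out; log and move on.
    print(f"Splash render failed: {failure!r}")


# Inside start_requests() these would be wired up as:
#   yield SplashRequest(url, self.parse, args=SPLASH_ARGS,
#                       errback=splash_errback)
```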
