How to implement Scrapy Splash in Virtual Machine #301

@lime-n

How do I run scrapy-splash on a Linux virtual machine? I have a Lua script that sends keys to a site to log in and then scrapes it.

I have installed Docker, but I cannot get the scraper to work because it won't connect to the server.
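One way to narrow down a connection problem like this is to check, from the machine that runs the spider, whether anything is listening on the Splash port at all. The following is only a minimal sketch: the helper name `splash_reachable` is hypothetical, and the host and port are assumed from the `-p 8050:8050` mapping used in this issue.

```python
import socket


def splash_reachable(host="localhost", port=8050, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds.

    Hypothetical helper; host and port are assumed to match the
    Docker port mapping (8050:8050).
    """
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("Splash reachable:", splash_reachable())
```

If this reports that the port is unreachable, the container is probably not running or the port is not published. If the spider runs on a different host than Docker, the `SPLASH_URL` setting of scrapy-splash would need to point at that host instead of `localhost`.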

Are there any simple steps I can follow to get this working on a VM? What should I install, and what should I do before running scrapy crawl spider?

As for Docker, I ran the following with admin privileges:

docker run -p 8050:8050 scrapinghub/splash --max-timeout 3600

However, this runs in the foreground, and I'd like it to run in the background. I cannot figure out how to do this; I have tried:

docker run -d 8050:8050 scrapinghub/splash --max-timeout 3600

But I just get the error:

Unable to find image '8050:8050' locally
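The error is consistent with how `docker run` parses its arguments: `-d` is a boolean flag that takes no value, so with `-p` dropped, `8050:8050` becomes the first positional argument and is treated as the image name. A detached run needs both `-d` and `-p`. The sketch below builds that command line; the function name `splash_run_command` and its parameters are hypothetical and only illustrate the argument order.

```python
import shlex


def splash_run_command(port=8050, max_timeout=3600, detached=True):
    """Build the argv for starting Splash in Docker.

    Hypothetical helper; the defaults mirror the values used in this issue.
    """
    cmd = ["docker", "run"]
    if detached:
        cmd.append("-d")  # boolean flag: runs in the background, takes no value
    # -p publishes the container port; it must still be given alongside -d
    cmd += ["-p", f"{port}:{port}", "scrapinghub/splash",
            "--max-timeout", str(max_timeout)]
    return cmd


if __name__ == "__main__":
    # Prints: docker run -d -p 8050:8050 scrapinghub/splash --max-timeout 3600
    print(shlex.join(splash_run_command()))
```

The printed command can be pasted into a shell, or the list can be passed to `subprocess.run` on a machine where Docker is installed.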

Getting this to run in the background may solve my issue, or perhaps I need some further installations. Please let me know! I really need expert guidance to figure this out.

I opened a second instance while Docker was running in the first.

I get the following error when running the scrapy crawler:

2022-02-16 02:55:26 [scrapy_splash.middleware] WARNING: Bad request to Splash: {'error': 400, 'type': 'ScriptError', 'description': 'Error happened while executing Lua script', 'info': {'type': 'JS_ERROR', 'js_error_type': 'TypeError', 'js_error_message': 'null is not an object (evaluating \'document.querySelector("button:nth-child(2)").getClientRects\')', 'js_error': 'TypeError: null is not an object (evaluating \'document.querySelector("button:nth-child(2)").getClientRects\')', 'message': '[string "..."]:12: error during JS function call: \'TypeError: null is not an object (evaluating \\\'document.querySelector("button:nth-child(2)").getClientRects\\\')\'', 'source': '[string "..."]', 'line_number': 12, 'error': 'error during JS function call: \'TypeError: null is not an object (evaluating \\\'document.querySelector("button:nth-child(2)").getClientRects\\\')\''}}
2022-02-16 02:55:26 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://instagram.com/ via http://localhost:8050/execute> (referer: None)
2022-02-16 02:55:26 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://instagram.com/>: HTTP status code is not handled or not allowed

The scraper works perfectly fine on my Mac, so I suspect I am missing an installation somewhere on the VM.
