This repository was archived by the owner on Jan 22, 2020. It is now read-only.

Python Documentation #42

Open
@tegansnyder

Description

It might be worth noting that you need a few things on your system to get this working for the Python example.

Python Modules:

You will receive this error if you try to run the example without first installing a few modules.

  File "crawl_executor.py", line 25, in <module>
    from bs4 import BeautifulSoup
ImportError: No module named bs4

Install the following:

sudo pip install wget
sudo pip install beautifulsoup4
sudo pip install html5lib
sudo yum install -y libxml2-devel
sudo yum install -y libxslt-devel
sudo yum install -y python-devel
sudo pip install lxml
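
To confirm the installs above took effect before launching the executors, a quick import check can help. This is only a minimal sketch: `missing_modules` and `REQUIRED` are hypothetical names, not part of the project, and the list uses import names rather than pip package names (beautifulsoup4 is imported as bs4).

```python
import importlib

# Import names, not pip package names: beautifulsoup4 is imported as bs4.
REQUIRED = ["wget", "bs4", "html5lib", "lxml"]


def missing_modules(names):
    """Return the entries of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Print one line per module that still needs to be installed.
    for name in missing_modules(REQUIRED):
        print("missing module: " + name)
```

If nothing is printed, all four modules were importable by the interpreter that ran the check.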

PhantomJS:

If PhantomJS is not installed, you will get errors like the following:

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
    self.run()
  File "/usr/lib64/python2.7/threading.py", line 764, in run
    self.__target(*self.__args, **self.__kwargs)
  File "render_executor.py", line 62, in run_task
    if call(["phantomjs", "render.js", url, destination]) != 0:
  File "/usr/lib64/python2.7/subprocess.py", line 524, in call
    return Popen(*popenargs, **kwargs).wait()
  File "/usr/lib64/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.7/subprocess.py", line 1308, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

To resolve that, you need a phantomjs executable on your PATH. If you can find a binary for your Linux distro, go with that. I used a binary I found for CentOS 7 here. Note that there are some issues with bundling binaries for PhantomJS; see the thread here. If you must build from source, follow the steps below. The build can take an hour or so.

# needed to build phantomjs from source
sudo yum -y install gcc gcc-c++ make flex bison gperf ruby \
  openssl-devel freetype-devel fontconfig-devel libicu-devel sqlite-devel \
  libpng-devel libjpeg-devel

git clone --recurse-submodules https://github.com/ariya/phantomjs.git
cd phantomjs
./build.py
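
Once a binary is in place, a pre-flight check can turn the bare `OSError: [Errno 2]` shown above into a clearer message. The sketch below is an assumption about how one might guard the call, not code from the project: `require_executable` is a hypothetical helper, and `shutil.which` needs Python 3.3 or later (on Python 2.7, `distutils.spawn.find_executable` plays a similar role).

```python
import shutil


def require_executable(name):
    """Return the full path of `name` on PATH, or raise a readable error."""
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(name + " not found on PATH; install it or adjust PATH")
    return path
```

Calling `require_executable("phantomjs")` before the render step would fail early with a message that names the missing program.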

Parser Warning on BS4:

Also, the executor emits a warning about not explicitly specifying the parser for BS4, and that warning appears to halt the script.

Executor registered on slave 586d51bc-408a-4191-bce7-8527a6c0f2f4-S0
/usr/lib/python2.7/site-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

To get rid of this warning, change this (see PR #41):

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "lxml")
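
On whether the warning really halts the script: under Python's default warning filters a UserWarning is printed and execution continues. It only becomes fatal when warnings are escalated to errors (for example with `-W error`). A minimal standard-library sketch of the difference follows; `parse_without_parser` is a hypothetical stand-in for the BS4 call, not bs4 itself, and `run` is likewise an illustrative name.

```python
import warnings


def parse_without_parser(markup):
    # Hypothetical stand-in for BeautifulSoup(markup): warns, then returns.
    warnings.warn("No parser was explicitly specified", UserWarning)
    return markup.strip()


def run(strict):
    """Return the parsed text, or None if the warning was raised as an error."""
    with warnings.catch_warnings():
        # "error" turns warnings into exceptions; "always" only reports them.
        warnings.simplefilter("error" if strict else "always")
        try:
            return parse_without_parser(" <p>hi</p> ")
        except UserWarning:
            return None
```

If the executor still stops after the parser is specified explicitly, the halt likely has another cause worth checking.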
