This is the component of Qcumber, built by students at Queen's University, that scrapes data from FSU's my.fsu.edu, parses it, and generates structured data that a site can then display.
- This guide has been verified for Ubuntu 11.10 and 12.10.
- Setting up on Mac OS X should be quite similar. It will be verified soon.
- It works on Windows, but installation there is left as an exercise for the reader.
- Installing the Prerequisites
- Make sure you have all the needed permissions to install.
- For most users, this means prepending each install command with
sudo
- Ex:
sudo apt-get install ...
This project has been designed to work with Python versions 2.7.x and 3.3.x. You can try other versions, but no promises.
Python 3.3.x is recommended.
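To confirm which interpreter you are about to use, a small check like the one below can help. This is a minimal sketch: the `SUPPORTED` tuple and the `is_supported` helper are illustrative names, not part of the project, and they encode only the two version lines named above.

```python
import sys

# Versions named in this guide; anything else is untested.
SUPPORTED = ((2, 7), (3, 3))


def is_supported(version_info=None):
    """Return True if the (major, minor) pair is one this guide lists."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) in SUPPORTED


if __name__ == "__main__":
    current = "%d.%d" % tuple(sys.version_info[:2])
    if is_supported():
        print("Python %s is one of the versions this guide targets." % current)
    else:
        print("Python %s is untested with this project; no promises." % current)
```

Run it with the same interpreter you plan to use for the scraper, since a system can have several Python versions installed side by side.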
- Install a compatible version of Python. Use a package manager (Ex:
apt-get install python3 python3-dev
), or get the source from http://www.python.org/download/ if your distribution doesn't have the correct version of Python available.
- Make sure to also install the development libraries (packages
python3-dev
or python2-dev
). If you compile from source, these are already included.
- Install extra libraries needed for compiling the
lxml
module:
- Most Debian-based distros:
apt-get install libxml2-dev libxslt1-dev
- Red Hat/Fedora:
yum install libxml2-devel libxslt-devel
- Arch:
pacman -S libxml2 libxslt
- Go to https://github.com/ and follow the instructions to register an account.
- Run
apt-get install git
to install Git.
- Follow the guide at https://help.github.com/articles/set-up-git to set up Git.
Pip is used to install extra Python modules that aren't included by default. A virtual environment is an isolated Python environment. It allows for per-program environment configuration.
- Install Pip by running
apt-get install python3-pip
(or python-pip
for 2.7.x users)
- Once Pip is installed, run
pip install virtualenv
- The virtual environment will be configured later.
- Fork the Repository
- Click the "Fork" button at the top-right of https://github.com/Queens-Hacks/qcumber-scraper
- You now have your own copy of qcumber-scraper that you can safely mess around with!
- Clone it to your computer
- Copy the
git@github.com:[yourusername]/qcumber-scraper.git
link on the page.
- Open up a terminal window.
- Navigate to the folder in which you want to store your local copy of the scraper.
- Clone the repository.
git clone [repository]
, where [repository]
is the URL you copied.
- You should now have a
qcumber-scraper
folder.
- Create and Activate a Virtual Environment
- Navigate into the
FSUCourseScraper
folder
- Create a new virtual environment:
virtualenv venv
- If you have multiple versions of Python on your system, make sure to specify the correct one with a
-p
switch (Ex:
virtualenv -p /usr/bin/python3 venv
)
- Activate the new environment:
source venv/bin/activate
- NOTE: you will need to activate the virtual environment every time you want to run the local project.
- To deactivate the virtual environment:
deactivate
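Because the environment has to be activated in every new terminal, it is easy to forget. The following is a minimal sketch of a best-effort check you could run from Python; `in_virtual_environment` is a hypothetical helper, not something the project provides.

```python
import sys


def in_virtual_environment():
    """Best-effort check for an active virtualenv or venv."""
    # Older virtualenv releases set sys.real_prefix on the interpreter.
    if hasattr(sys, "real_prefix"):
        return True
    # venv and newer virtualenv releases make sys.prefix differ from
    # sys.base_prefix; Python 2 has no base_prefix, so fall back to prefix.
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


if __name__ == "__main__":
    if in_virtual_environment():
        print("Virtual environment active: %s" % sys.prefix)
    else:
        print("No virtual environment detected; activate it first.")
```

The check relies only on attributes of the running interpreter, so it should be run with the plain `python` command in the terminal you intend to use.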
- Install Required Packages
Make sure you have activated your virtual environment (see above) before running this command!
pip install -r requirements.txt
- If this command reports an error, check the log to see if you have all the dependencies required.
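One quick way to see whether a dependency actually installed is to try importing it. The sketch below assumes `lxml` is among the packages in requirements.txt, which this guide implies but does not list; `module_available` is an illustrative helper name.

```python
import importlib


def module_available(name):
    """Return True if `name` can be imported in the current environment."""
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # "lxml" is an assumption based on the prerequisites section above.
    if module_available("lxml"):
        print("lxml imported successfully.")
    else:
        print("lxml is missing; check that libxml2 and libxslt headers are installed.")
```

If the import fails inside the virtual environment but works outside it, the package was most likely installed for a different interpreter.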
- Make sure your virtual environment is activated.
- Make sure you have created a config.py
- To do a my.fsu.edu web scrape, run
python main.py
- To do a textbook scrape, run
python textbooks.py
To make later debugging easier, it is recommended to redirect the output to log files. Something like:
python main.py >logs/debug.log 2>logs/error.log
To watch the logs as they happen, first open two other terminals and run tail -f logs/debug.log
in one and tail -f logs/error.log
in the other. Then start the main scrape command as shown above.
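The redirect above splits output by stream: standard output goes to debug.log and standard error goes to error.log. As an illustration of how a script can feed those two streams, here is a minimal sketch using Python's `logging` module. It is not taken from main.py, whose logging setup is not described here; `build_logger` and `BelowWarning` are hypothetical names.

```python
import logging
import sys


class BelowWarning(logging.Filter):
    """Let through only records less severe than WARNING."""

    def filter(self, record):
        return record.levelno < logging.WARNING


def build_logger(name="scrape-demo", out=None, err=None):
    """Send DEBUG/INFO records to `out` and WARNING and above to `err`."""
    out = out if out is not None else sys.stdout
    err = err if err is not None else sys.stderr

    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    # Drop handlers from any earlier call so messages are not duplicated.
    for handler in list(logger.handlers):
        logger.removeHandler(handler)

    out_handler = logging.StreamHandler(out)
    out_handler.addFilter(BelowWarning())
    err_handler = logging.StreamHandler(err)
    err_handler.setLevel(logging.WARNING)

    logger.addHandler(out_handler)
    logger.addHandler(err_handler)
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.debug("fetched a page")      # written to standard output
    log.error("could not parse it")  # written to standard error
```

Note that the shell does not create the logs directory for you, so the redirect fails if that folder does not already exist.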