ETL process for legacy Rose of Eternity reviews archived on the Wayback Machine.
Steps:
- downloads review HTML pages
- scrapes the review data (user name, score, comments, and posted-on date)
- cleans data
- saves data as CSV
- loads data into database
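The steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the app's actual code: the page markup (a "review" element containing one element per field, identified by its class attribute), the function names, the reviews table, and the use of SQLite are all assumptions made for the example.

```python
import csv
import sqlite3
from html.parser import HTMLParser
from urllib.request import urlopen

# Hypothetical field names; the real pages may label these differently.
FIELDS = ("user_name", "score", "comments", "posted_on")


def download(url):
    """Extract: fetch one archived review page as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


class ReviewParser(HTMLParser):
    """Collect text from elements whose class attribute names a review field.

    Assumes hypothetical, non-nested markup such as
    <div class="review"><span class="user_name">...</span>...</div>.
    """

    def __init__(self):
        super().__init__()
        self.reviews = []
        self._field = None

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class == "review":
            # A new review starts: begin with every field empty.
            self.reviews.append(dict.fromkeys(FIELDS, ""))
        elif css_class in FIELDS and self.reviews:
            self._field = css_class

    def handle_endtag(self, tag):
        self._field = None

    def handle_data(self, data):
        if self._field:
            self.reviews[-1][self._field] += data


def scrape(html):
    """Extract: turn one page of HTML into a list of raw review dicts."""
    parser = ReviewParser()
    parser.feed(html)
    parser.close()
    return parser.reviews


def clean(review):
    """Transform: collapse whitespace and convert the score to an int."""
    row = {key: " ".join(value.split()) for key, value in review.items()}
    row["score"] = int(row["score"]) if row["score"].isdigit() else None
    return row


def save_csv(rows, path):
    """Write the cleaned rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def load_csv(path, connection):
    """Load: copy the CSV rows into a reviews table (empty cells become NULL)."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS reviews "
        "(user_name TEXT, score INTEGER, comments TEXT, posted_on TEXT)"
    )
    with open(path, newline="", encoding="utf-8") as handle:
        rows = [
            tuple(row[field] or None for field in FIELDS)
            for row in csv.DictReader(handle)
        ]
    connection.executemany("INSERT INTO reviews VALUES (?, ?, ?, ?)", rows)
    connection.commit()
```

In this sketch a full run would chain the functions in the order of the steps: download a page, pass the HTML to scrape, apply clean to each review, write the result with save_csv, and finally call load_csv with an open sqlite3 connection. Keeping the CSV as an intermediate file means the load step can be rerun without downloading the pages again.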
Before you start:
- apps should always be run inside of virtualenvs
- dependencies are managed with pip and pip-tools. Virtualenvs come with
pip preinstalled; pip-tools should be installed manually:
$ python -m pip install pip-tools
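As a small aid for the virtualenv rule, a script can check whether it is running inside an environment before doing any work. The sketch below is not part of the app; the function name in_virtualenv is hypothetical, and the check relies on the sys.prefix and sys.base_prefix comparison described in the standard venv documentation.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter is running inside a virtual environment."""
    # Inside an environment, sys.prefix points at the environment itself,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix
```

A caller could, for example, exit early with a message asking the user to activate a virtualenv when this returns False.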
Before the first run of the app:
$ pip-compile --upgrade requirements.in
$ python -m pip install -r dev-requirements.txt
Running the app:
$ python -m etl.etl