# A fork of steam scraper
This is a fork of steam scraper. Key differences:
- Updated to a later version
- Output is now stored in an SQLite database
- Additional fields are scraped (e.g. the product description)
- Added a script for fetching news from the Steam API
- Added a script for minimizing the SQLite dataset
This repository contains Scrapy spiders for crawling products and scraping all user-submitted reviews from the Steam game store. A few scripts for more easily managing and deploying the spiders are included as well.

It accompanies the *Scraping the Steam Game Store* article published on the Scrapinghub blog and the Intoli blog.
## Installation

After cloning the repository with

```shell
git clone git@github.com:lkrsnik/steam-scraper.git
```

start and activate a Python 3.6+ virtualenv with

```shell
cd steam-scraper
virtualenv -p python3 venv
. venv/bin/activate
```

Install Python requirements via:

```shell
pip install -r requirements.txt
```

By the way, on macOS you can install Python 3.6 via homebrew:

```shell
brew install python3
```

On Ubuntu you can use instructions posted on askubuntu.com.
## Crawling the Products

The purpose of `ProductSpider` is to discover product pages on the Steam product listing and extract useful metadata from them. A neat feature of this spider is that it automatically navigates through Steam's age verification checkpoints.

You can initiate the multi-hour crawl with

```shell
mkdir output
scrapy crawl products --logfile=output/products_all.log --loglevel=INFO \
    -s JOBDIR=output/products_all_job -s HTTPCACHE_ENABLED=False \
    -a sqlite_path=output/db.sqlite3
```

When it completes you should have metadata for all games (products) on Steam stored in `db.sqlite3`.
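Once the crawl finishes, you can sanity-check the database with Python's built-in `sqlite3` module. This is only a minimal sketch: the table name `products` is an assumption, and the actual schema created by the pipeline may differ.

```python
import os
import sqlite3

# Path produced by the crawl command above; fall back to an in-memory
# database so the snippet also runs before the crawl has been started.
db_path = "output/db.sqlite3" if os.path.exists("output/db.sqlite3") else ":memory:"
con = sqlite3.connect(db_path)
cur = con.cursor()

# List the tables the pipeline created (empty before the first crawl).
tables = [row[0] for row in cur.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
print(tables)

# Count scraped products, assuming the pipeline uses a `products` table.
if "products" in tables:
    (count,) = cur.execute("SELECT COUNT(*) FROM products").fetchone()
    print(f"{count} products scraped")
con.close()
```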
## Scraping the Reviews

The purpose of `ReviewSpider` is to scrape all user-submitted reviews of a particular product from the Steam community portal. By default, it scrapes reviews of products whose `reviews_scraped` column is empty (NULL) and whose `n_reviews` is greater than 10.

```shell
scrapy crawl reviews --logfile=output/reviews_all.log --loglevel=INFO \
    -s JOBDIR=output/reviews -s HTTPCACHE_ENABLED=False \
    -a sqlite_path=output/db.sqlite3
```

If you scrape all reviews, the whole job takes a few days with Steam's generous rate limits.
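The default selection described above corresponds to a simple SQL filter. The following illustrative sketch reproduces it on a toy in-memory database; the table name `products` and the exact schema are assumptions (the column names `reviews_scraped` and `n_reviews` come from the description above), so adjust to the real database as needed.

```python
import sqlite3

# Illustrative in-memory database; the real crawl writes output/db.sqlite3.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("""
    CREATE TABLE products (       -- table name is an assumption
        id INTEGER PRIMARY KEY,
        title TEXT,
        n_reviews INTEGER,
        reviews_scraped INTEGER   -- NULL until the review spider runs
    )
""")
cur.executemany(
    "INSERT INTO products (title, n_reviews, reviews_scraped) VALUES (?, ?, ?)",
    [("Game A", 250, None),   # selected: unscraped, more than 10 reviews
     ("Game B", 5, None),     # skipped: too few reviews
     ("Game C", 80, 1)])      # skipped: already scraped

# The review spider's default selection, per the description above.
pending = cur.execute(
    "SELECT title FROM products "
    "WHERE reviews_scraped IS NULL AND n_reviews > 10").fetchall()
print(pending)  # [('Game A',)]
con.close()
```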
## Fetching News

The repository also includes a script that adds news for all products to the database. This uses the Steam API rather than scraping.

```shell
python -m scripts.get_news_api --sqlite_path output/db.sqlite3
```

## Minimizing the Dataset

If you have a complete database but would like a smaller sample of it, you can use the `minimize_dataset.py` script.

```shell
python -m scripts.minimize_dataset --sqlite_path output/db.sqlite3 \
    --minimized_sqlite_path output/db_mini.psql --size 1000
```
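For intuition, drawing a fixed-size random sample from an SQLite table can be done with `ORDER BY RANDOM() LIMIT n`. The sketch below shows that idea on a toy in-memory table; it is not the actual `minimize_dataset.py` implementation, which may sample differently and copy related tables as well.

```python
import sqlite3

# Toy table standing in for the full database.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, title TEXT)")
cur.executemany("INSERT INTO products (title) VALUES (?)",
                [(f"Game {i}",) for i in range(5000)])

# Draw a random sample of 1000 rows -- one way a minimized dataset can
# be produced; the real minimize_dataset script may work differently.
sample = cur.execute(
    "SELECT id FROM products ORDER BY RANDOM() LIMIT 1000").fetchall()
print(len(sample))  # 1000
con.close()
```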