This is the open-source portion of the back-end website-scraping software that powers www.macovidvaccines.com. It is built with Node.js and Puppeteer. In production, this code runs every 5 minutes via AWS Lambda and posts its results to a JSON file in an AWS S3 bucket.
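As a rough sketch of the publish step described above, each scraper's output can be combined into a single JSON document before it is uploaded to S3. The function and field names below are illustrative assumptions, not this project's actual schema:

```javascript
// Hypothetical sketch: merge individual scraper results into the JSON
// document that would be written to the S3 bucket. Field names are
// assumptions for illustration only.
function combineResults(results) {
    return JSON.stringify(
        {
            // A timestamp lets consumers of the JSON see how fresh it is.
            lastUpdated: new Date().toISOString(),
            results: results,
        },
        null,
        2
    );
}

// Example with two made-up scraper outputs:
console.log(
    combineResults([
        { name: "Example Clinic", hasAvailability: true },
        { name: "Another Site", hasAvailability: false },
    ])
);
```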
- Download a recent version of Chromium locally: https://download-chromium.appspot.com/
- Create a `.env` file with the following:

  ```
  DEVELOPMENT=true
  CHROMEPATH="path/to/chromium/that/you/downloaded" # e.g. /Applications/Chromium.app/Contents/MacOS/Chromium
  PROPRIETARY_SITE_SCRAPERS_PATH="./../proprietary/site-scrapers" # optional, example
  ```

- Install `prettier` and `eslint`; make sure you run them before making any commits.
- In your terminal, install dependencies with `npm install`.
- To run all scrapers: `node main.js`. To run an individual scraper, specify the base filename from `site-scrapers`, e.g. `node main.js MAImmunizations` to run `site-scrapers/MAImmunizations.js`.
- If you have your own scrapers you want to add, mimic the structure of `./site-scrapers/` inside a folder named `proprietary/site-scrapers`. In your `.env` file, set `PROPRIETARY_SITE_SCRAPERS_PATH` to `./../proprietary/site-scrapers`. This naming is recommended since the `.gitignore` lists the folder `proprietary`.
- When you're ready to deploy via AWS Lambda, run `npm run predeploy`, which will generate `lambda.zip` for you. This file needs to stay under 50 MB for you to upload it manually.
- Your production environment needs to have the environment variables `AWSS3BUCKETNAME`, `AWSACCESSKEYID`, and `AWSSECRETACCESSKEY` so that it can publish to S3. If you are inserting your own scrapers, set `PROPRIETARY_SITE_SCRAPERS_PATH` in production as well. If you have any scrapers that need to solve reCAPTCHAs, you will also need a `RECAPTCHATOKEN` from the 2captcha service.
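For reference, a production configuration might define those variables as below. All values are placeholders; in AWS Lambda you would typically set these in the function's environment-variable configuration rather than in a file:

```
AWSS3BUCKETNAME="your-bucket-name"
AWSACCESSKEYID="your-access-key-id"
AWSSECRETACCESSKEY="your-secret-access-key"
PROPRIETARY_SITE_SCRAPERS_PATH="./../proprietary/site-scrapers"   # only if using your own scrapers
RECAPTCHATOKEN="your-2captcha-api-token"                          # only if scrapers must solve reCAPTCHAs
```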