
Storing Data

chris48s edited this page Feb 12, 2020 · 5 revisions

Databases

Each bot/scraper may store data in its own local SQLite database. We use scraperwiki-python to interact with it. A common pattern is to store a record of things we already know about in a database, then perform an action (e.g. send a notification) when a new thing that isn't already in the database is found.
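The bots themselves use scraperwiki-python's save/select helpers; as a minimal sketch of the pattern using only the standard library's sqlite3 module (the `seen` table schema and the `notify` stub are hypothetical, for illustration):

```python
import sqlite3

NOTIFIED = []


def notify(thing_id):
    # Stub: a real bot would post to Slack or open a GitHub issue here
    NOTIFIED.append(thing_id)
    print("new thing found: %s" % thing_id)


def process_scraped_ids(conn, scraped_ids):
    """Notify about ids we haven't seen before, then record them."""
    conn.execute("CREATE TABLE IF NOT EXISTS seen (id TEXT PRIMARY KEY)")
    for thing_id in scraped_ids:
        already_known = conn.execute(
            "SELECT 1 FROM seen WHERE id = ?", (thing_id,)
        ).fetchone()
        if not already_known:
            notify(thing_id)
            conn.execute("INSERT INTO seen (id) VALUES (?)", (thing_id,))
    conn.commit()


conn = sqlite3.connect(":memory:")  # a real bot uses a local file on disk
process_scraped_ids(conn, ["a", "b"])  # both are new: notifies about "a" and "b"
process_scraped_ids(conn, ["b", "c"])  # only "c" is new
```

On the second run only the unseen id triggers a notification; everything else is already in the database.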

Because the data is essentially disposable and there is no mechanism to apply migrations to a DB, sometimes we need to delete the DB and do a clean run. For this case, it is common for bots/scrapers to include a way to turn notifications off, or disable some functionality, on the basis that we are initializing an empty DB and don't want to flood Slack/GitHub with spurious notifications about everything we scrape or discover.
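One way to implement this (a sketch; the exact flag name varies by bot and the `SEND_NOTIFICATIONS` environment variable here is hypothetical) is to guard the notification call behind a setting that can be switched off for the clean run:

```python
import os

# Set SEND_NOTIFICATIONS=0 in the environment (or flip a constant in the
# bot's settings module) before doing a clean run against an empty DB.
SEND_NOTIFICATIONS = os.environ.get("SEND_NOTIFICATIONS", "1") != "0"


def maybe_notify(message, send_notifications=SEND_NOTIFICATIONS):
    """Send a notification unless they are disabled for this run."""
    if not send_notifications:
        # Initializing an empty DB: record the data but stay quiet
        return False
    print(message)  # a real bot would post to Slack or GitHub here
    return True


maybe_notify("found a new thing")
```

With the flag off, the scraper still populates its database; it just skips the noisy side effects.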

Because bots are automatically updated from GitHub, we need to commit the settings change and then revert it after the next run.

This is also useful for the first time we ever deploy/run a new scraper.

GitHub

Sometimes it is useful to store data in a GitHub repo, mainly when we want to be able to view how some data has changed over time as a diff, but we can also write bots that open a pull request based on an event (as in the case of Ubuntu-AMI-Scraper). The conventions for this are:

  • commitment provides some handy abstractions over the GitHub API.
  • We use the polling-bot-4000 GitHub account credentials.
  • Polling Bot can open an issue on any repo, but it does not automatically have permission to commit to our repos. If we want the account to be able to commit to a repo, we must explicitly grant it write access to the target repo.
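For illustration, committing a single file to a repo can also be done directly against the GitHub Contents API (`PUT /repos/{owner}/{repo}/contents/{path}`, with the file content base64 encoded). A stdlib-only sketch that builds such a request; the repo name, path, and content here are placeholders, and actually sending the request would additionally need an auth token:

```python
import base64
import json


def build_commit_request(repo, path, content, message, branch="master"):
    """Build the URL and JSON body for a GitHub Contents API 'create file' call.

    The API requires the file content to be base64 encoded. (Updating an
    existing file additionally requires the file's current blob sha.)
    """
    url = "https://api.github.com/repos/%s/contents/%s" % (repo, path)
    body = {
        "message": message,
        "content": base64.b64encode(content.encode("utf-8")).decode("ascii"),
        "branch": branch,
    }
    return url, json.dumps(body)


url, body = build_commit_request(
    "some-org/some-data-repo",  # placeholder repo name
    "data/output.json",
    '{"example": true}',
    "Update scraped data",
)
```

In practice the bots don't hand-roll this; commitment hides the request building and authentication behind a small client.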

Some bots are set up to persist their data to a separate data repo.
