This repository was archived by the owner on Aug 30, 2021. It is now read-only.

v1.0.0

@nschiefer nschiefer released this 30 Apr 01:09
· 24 commits to main since this release
d1bd7cf

The 1.0.0 release makes big changes to the schema to make it much easier to extract clean, useful data from our fetch/parse/normalize stages.

High-level goal

The downstream pipelines that consume our data have to adapt to the wide variety of scraped data formats. To help us along, we are going to impose more structure on the schema formats so that the consumers of our data have to deal with fewer edge cases. In most cases, these changes should also make it easier to write correct ingestion stages.

Specific changes

  • New pydantic enums to reduce ambiguity:
    • State for describing US states and territories, with USPS two-character abbreviations.
    • ContactType, with "general" and "booking" options.
    • DayOfWeek for the days Monday - Sunday and "public holidays".
    • VaccineType, with options for Pfizer/BioNTech, Moderna, Johnson & Johnson, and Oxford/AstraZeneca vaccines.
    • VaccineSupply, with options for vaccine stock status.
    • WheelchairAccessLevel, with various options for describing the wheelchair accessibility of the location.
    • VaccineProvider, for common parent organizations such as retail pharmacy chains.
    • LocationAuthority, for other authorities that identify locations, such as Google Places.
  • Format validation for certain fields:
    • Address:
      • zip must be a ZIP or ZIP+4 code, if present.
      • state must be a State, if present.
    • LatLng:
      • latitude must be between -90 and 90, inclusive, if present.
      • longitude must be between -180 and 180, inclusive, if present.
    • Contact:
      • contact_type must be from the ContactType enum, if present.
      • phone must be in the format of a 9 or 10 digit US phone number, if present.
      • website must be an HTTP/HTTPS URL, if present.
      • email must be formatted as an email address, if present.
    • OpenHour:
      • day must be a DayOfWeek.
      • open has been renamed opens for better parallelism with closes.
    • Vaccine:
      • vaccine must be a VaccineType.
      • supply_level must be from the VaccineSupply enum, if present.
    • Organization:
      • id should be from the VaccineProvider enum, if possible, but may be a string or empty. Using an enum value makes it easier for consumers to interpret the value.
      • id must use only lowercase alphanumeric characters and underscores.
    • Link:
      • authority should be a LocationAuthority/VaccineProvider, if possible, but may be a string or empty. Using an enum value makes it easier to use these links to match locations.
      • authority must use only lowercase alphanumeric characters and underscores.
      • uri must be a URL, if present.
    • Source:
      • source must use only lowercase alphanumeric characters and underscores.
      • id must not use a space or colon. These must be replaced with another character, such as a dash.
      • fetched_from_uri must be a URL, if present.
    • Location:
      • id must consist of only lowercase alphanumeric characters and underscores, with precisely one colon. The colon should separate the part of the ID that reflects the data source and the part of the ID that reflects the specific location.
  • Additional requirements:
    • Each Contact should populate precisely one of the fields phone, website, email, or other. Do not coalesce several contact methods into a single Contact.
    • The opens value should be before or the same as the closes value on an OpenDate.
    • The opens value should be before or the same as the closes value on an OpenHour.
    • The id of a Location must be prefixed with the source name (specified in Location.source.source).
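
To make a few of the rules above concrete, here is a minimal sketch of the checks in plain Python. It is not the schema's actual pydantic code: the helper names (check_zip, check_lat_lng, check_location_id, check_contact, check_open_hour), the regular expressions, and the DayOfWeek value strings are all assumptions made for illustration.

```python
import re
from datetime import time
from enum import Enum
from typing import Optional


class DayOfWeek(str, Enum):
    # Member values are assumed; the real enum may spell them differently.
    MONDAY = "monday"
    TUESDAY = "tuesday"
    WEDNESDAY = "wednesday"
    THURSDAY = "thursday"
    FRIDAY = "friday"
    SATURDAY = "saturday"
    SUNDAY = "sunday"
    PUBLIC_HOLIDAYS = "public_holidays"


# Assumed patterns: ZIP or ZIP+4, and "source:local" with exactly one colon.
ZIP_RE = re.compile(r"\d{5}(-\d{4})?")
LOCATION_ID_RE = re.compile(r"[a-z0-9_]+:[a-z0-9_]+")


def check_zip(zip_code: Optional[str]) -> bool:
    """A missing zip is allowed; otherwise it must be ZIP or ZIP+4."""
    return zip_code is None or ZIP_RE.fullmatch(zip_code) is not None


def check_lat_lng(latitude: Optional[float], longitude: Optional[float]) -> bool:
    """Both bounds are inclusive; missing values are allowed."""
    lat_ok = latitude is None or -90 <= latitude <= 90
    lng_ok = longitude is None or -180 <= longitude <= 180
    return lat_ok and lng_ok


def check_location_id(location_id: str, source_name: str) -> bool:
    """Lowercase alphanumerics/underscores, one colon, prefixed by the source."""
    if LOCATION_ID_RE.fullmatch(location_id) is None:
        return False
    prefix, _local_part = location_id.split(":")
    return prefix == source_name


def check_contact(phone=None, website=None, email=None, other=None) -> bool:
    """Exactly one of the four contact fields should be populated."""
    return sum(v is not None for v in (phone, website, email, other)) == 1


def check_open_hour(day: str, opens: time, closes: time) -> bool:
    """day must be a DayOfWeek value, and opens must not be after closes."""
    try:
        DayOfWeek(day)
    except ValueError:
        return False
    return opens <= closes
```

For example, under these assumptions check_location_id("example_source:abc123", "example_source") passes, while an ID with a second colon or a prefix that differs from Location.source.source does not.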