This repository was archived by the owner on Aug 30, 2021. It is now read-only.
v1.0.0
The 1.0.0 release makes big changes to the schema to make it much easier to extract clean, useful data from our `fetch`/`parse`/`normalize` stages.
High-level goal
The downstream pipelines that consume our data have to adapt to the wide variety of scraped data formats. To help us along, we are going to impose more structure on the schema formats so that the consumers of our data have to deal with fewer edge cases. In most cases, these changes should also make it easier to write correct ingestion stages.
Specific changes
- New pydantic enums to reduce ambiguity:
  - `State` for describing US states and territories, with USPS two-character abbreviations.
  - `ContactType`, with `"general"` and `"booking"` options.
  - `DayOfWeek` for the days Monday - Sunday and "public holidays".
  - `VaccineType`, with options for Pfizer/BioNTech, Moderna, Johnson & Johnson, and Oxford/AstraZeneca vaccines.
  - `VaccineSupply`, with options for vaccine stock status.
  - `WheelchairAccessLevel`, with various options for describing the wheelchair accessibility of the location.
  - `VaccineProvider`, for common parent organizations such as retail pharmacy chains.
  - `LocationAuthority`, for other authorities that identify locations, such as Google Places.
- Format validation for certain fields:
  - `Address`:
    - `zip` must be a ZIP or ZIP+4 code, if present.
    - `state` must be a `State`, if present.
  - `LatLng`:
    - `latitude` must be between -90 and 90, inclusive, if present.
    - `longitude` must be between -180 and 180, inclusive, if present.
  - `Contact`:
    - `contact_type` must be from the `ContactType` enum, if present.
    - `phone` must be in the format of a 9 or 10 digit US phone number, if present.
    - `website` must be an HTTP/HTTPS URL, if present.
    - `email` must be formatted as an email address, if present.
  - `OpenHour`:
    - `day` must be a `DayOfWeek`.
    - `open` has been renamed `opens` for better parallelism with `closes`.
  - `Vaccine`:
    - `vaccine` must be a `VaccineType`.
    - `supply_level` must be from the `VaccineSupply` enum, if present.
  - `Organization`:
    - `id` should be from the `VaccineProvider` enum, if possible, but may be a string or empty. Using an enum value makes it easier for consumers to interpret the value.
    - `id` must use only lowercase alphanumeric characters and underscores.
  - `Link`:
    - `authority` should be a `LocationAuthority`/`VaccineProvider`, if possible, but may be a string or empty. Using an enum value makes it easier to use these links to match locations.
    - `authority` must use only lowercase alphanumeric characters and underscores.
    - `uri` must be a URL, if present.
  - `Source`:
    - `source` must use only lowercase alphanumeric characters and underscores.
    - `id` must not use a space or colon. These must be replaced with another character, such as a dash.
    - `fetched_from_uri` must be a URL, if present.
  - `Location`:
    - `id` must consist of only lowercase alphanumeric characters and underscores, with precisely one colon. The colon should separate the part of the ID that reflects the data source and the part of the ID that reflects the specific location.
- Additional requirements:
  - Each `Contact` should have precisely one field (`phone`, `website`, `email`, `other`). Do not coalesce several of these into a single method.
  - The `opens` value should be before or the same as the `closes` value on an `OpenDate`.
  - The `opens` value should be before or the same as the `closes` value on an `OpenHour`.
  - The `id` of a `Location` must be prefixed with the source name (specified in `Location.source.source`).
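To make a few of the rules above concrete, here is a minimal sketch of how an ingestion stage might check its own output before emitting it. It is not the schema's actual implementation: it uses standard-library dataclasses and `re` in place of pydantic, and the regular expressions and the names `is_valid_zip` and `check_location_id` are assumptions made up for illustration. Only the `"general"` and `"booking"` values and the field names come from the notes above.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ContactType(str, Enum):
    """The two contact types named in the release notes."""

    GENERAL = "general"
    BOOKING = "booking"


# Hypothetical patterns: the real schema may be stricter or looser.
ZIP_PATTERN = re.compile(r"[0-9]{5}(-[0-9]{4})?")  # ZIP or ZIP+4
LOCATION_ID_PATTERN = re.compile(r"[a-z0-9_]+:[a-z0-9_]+")  # exactly one colon


def is_valid_zip(value: str) -> bool:
    """Return True when the whole string is a ZIP or ZIP+4 code."""
    return ZIP_PATTERN.fullmatch(value) is not None


@dataclass
class LatLng:
    latitude: float
    longitude: float

    def __post_init__(self) -> None:
        # Both bounds are inclusive, as stated in the format rules.
        if not -90 <= self.latitude <= 90:
            raise ValueError("latitude must be between -90 and 90")
        if not -180 <= self.longitude <= 180:
            raise ValueError("longitude must be between -180 and 180")


@dataclass
class Contact:
    contact_type: Optional[ContactType] = None
    phone: Optional[str] = None
    website: Optional[str] = None
    email: Optional[str] = None
    other: Optional[str] = None

    def __post_init__(self) -> None:
        # Each Contact should carry precisely one of these four fields.
        names = ("phone", "website", "email", "other")
        populated = [n for n in names if getattr(self, n) is not None]
        if len(populated) != 1:
            raise ValueError(
                f"expected exactly one contact field, got {populated}"
            )


def check_location_id(location_id: str, source_name: str) -> None:
    """Raise ValueError unless the ID is well formed and source-prefixed."""
    if LOCATION_ID_PATTERN.fullmatch(location_id) is None:
        raise ValueError(f"malformed location id: {location_id!r}")
    # Assumption: the part before the colon equals the source name exactly.
    prefix = location_id.split(":")[0]
    if prefix != source_name:
        raise ValueError(
            f"location id must start with {source_name!r}, got {prefix!r}"
        )
```

With this sketch, a location that has both a phone number and a website would be emitted as two `Contact` entries rather than one, and `check_location_id("example_source:abc123", "example_source")` passes while an ID with a different prefix raises `ValueError`. The source name `example_source` is a placeholder, not a real data source.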