This repository was archived by the owner on Nov 12, 2024. It is now read-only.

Automatic sanity check of data, flagging out-of-range or suspiciously large changes #452

@geening

Description

Proposal: on each run of the pipeline, automatically sanity check the data for values that fall outside a reasonable range, or that change by a suspiciously large amount from one run to the next. Ideally the checks would apply at every stage of the pipeline: input sources, intermediate data, and generated data (output cells indexed by table/variable/key/date). We could start wherever implementation is easiest. The results would be reported in a pipeline status report and/or stored for future reference, either appended to a log or kept in a more structured format or database. For instance, when suspicious data is discovered manually, one could look up the run in which it was introduced.
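As a rough illustration of the run-to-run comparison and the append-only log, here is a minimal sketch. It assumes each run can be reduced to a dict mapping a `(table, variable, key, date)` tuple to a number. The function names, the thresholds, and the JSON Lines log format are all hypothetical choices, not anything the pipeline currently defines.

```python
import json
from datetime import datetime, timezone


def flag_large_changes(previous, current, max_rel_change=0.5, min_abs_change=100.0):
    """Flag cells whose value moved suspiciously far between two runs.

    `previous` and `current` map (table, variable, key, date) -> number.
    Both thresholds are illustrative assumptions and would need tuning.
    """
    findings = []
    for cell, new in current.items():
        old = previous.get(cell)
        if old is None:
            continue  # cell did not exist in the previous run; nothing to compare
        delta = abs(new - old)
        # Require both an absolute and a relative jump so small counts stay quiet.
        if delta >= min_abs_change and delta > max_rel_change * abs(old):
            findings.append({
                "check": "large_change",
                "cell": list(cell),
                "previous": old,
                "current": new,
            })
    return findings


def append_findings(findings, log_path):
    """Append one JSON object per finding, so the history of flags is searchable."""
    run_at = datetime.now(timezone.utc).isoformat()
    with open(log_path, "a", encoding="utf-8") as fh:
        for finding in findings:
            fh.write(json.dumps({"run_at": run_at, **finding}) + "\n")
```

Because every log line carries a `run_at` timestamp, finding the first run that flagged a given cell becomes a text search over the log.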

Some errors that have come up and would likely have been caught by such a system:
- Regions with confirmed cases > population
- Regions with area > area of the Earth
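The two errors above could be expressed as simple range rules. The sketch below assumes each region is a dict with hypothetical field names `confirmed`, `population`, and `area_km2`; the real pipeline's column names may differ. The Earth's total surface area (about 510.1 million km²) is used as a deliberately loose upper bound.

```python
# Approximate total surface area of the Earth (land and ocean), in km^2.
EARTH_SURFACE_AREA_KM2 = 510_100_000


def check_region(region):
    """Return a list of human-readable problems found in one region record.

    Field names are assumptions for illustration; missing fields are skipped
    rather than treated as errors.
    """
    problems = []
    confirmed = region.get("confirmed")
    population = region.get("population")
    area = region.get("area_km2")

    if confirmed is not None and population is not None and confirmed > population:
        problems.append(
            f"confirmed cases ({confirmed}) exceed population ({population})"
        )
    if area is not None and area > EARTH_SURFACE_AREA_KM2:
        problems.append(
            f"area ({area} km^2) exceeds the surface area of the Earth"
        )
    return problems
```

A caller might run `check_region` over every record and include any non-empty results in the pipeline status report.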

Labels

enhancement (New feature or request)
