Commit a7b0945, authored by martonvago, lwjohnst86, and signekb

docs: 📝 Why Pandera for data verification/validation (#135)

Co-authored-by: Luke W. Johnston <[email protected]>
Co-authored-by: Signe Kirk Brødbæk <[email protected]>

1 file changed: why-pandera/index.qmd (+242 lines)
---
title: "Why Pandera"
description: |
  A key functionality of Sprout is checking whether user-supplied data match the metadata that describe them.
  This post explains why we chose to use `pandera` for automating this process.
date: "2024-11-08"
categories:
  - backend
  - develop
  - check
---

::: content-hidden
Use other decision posts as inspiration when writing these. Leave the
content-hidden sections in the text for future reference.
:::

## Context and problem statement

::: content-hidden
State the context and some background on the issue, then write a
statement in the form of a question for the problem.
:::

A key functionality of Sprout is checking whether user-supplied data
match the metadata that describe them. Metadata are stored in JSON files
following the [Data Package](https://datapackage.org/) standard.
Checking data against metadata has two components: verification and
validation. Verification involves checking whether the overall structure
of the data (e.g. number of columns, column data types) is as expected,
while validation involves checking that all individual data items meet
constraints listed in the metadata (e.g. maximum values, specific
formats). We are looking for a tool to automate the data verification
and validation process. The question then is:

*Which data verification and validation tools are available and which
one should we use?*
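
To make the distinction concrete, here is a small, invented Table Schema
fragment (the field name and bound are made up for illustration), written
here as a Python dictionary:

```python
# A made-up Table Schema field descriptor (Data Package standard).
# Verification would check that the data contain an "age" column with
# integer values; validation would check each value against the
# "constraints" entry (here: required, and at most 120).
table_schema = {
    "fields": [
        {
            "name": "age",
            "type": "integer",
            "constraints": {"required": True, "maximum": 120},
        }
    ]
}
```
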
## Decision drivers

::: content-hidden
List some reasons for why we need to make this decision and what things
have arisen that impact work.
:::

- The new tool should support both data verification and validation.
- Ideally, it should support multiple tabular data formats, including
  [`polars`](https://pola.rs/) data frames.
- It should be easy to transform JSON metadata into the representation
  required by the tool.
- The tool should be able to handle relatively large datasets
  efficiently.
- Support for extracting metadata from data would be a plus.

## Considered options

::: content-hidden
List and describe some of the options, as well as some of the benefits
and drawbacks for each option.
:::

- [`frictionless-py`](https://framework.frictionlessdata.io/)
- [`pandera`](https://pandera.readthedocs.io/en/stable/index.html)
- [Great
  Expectations](https://docs.greatexpectations.io/docs/core/introduction/)
- [Pydantic](https://docs.pydantic.dev/latest/)

### `frictionless-py`

[`frictionless-py`](https://framework.frictionlessdata.io/) is the
Python implementation of the Data Package standard by its parent
organisation, and as such it would be the obvious choice for our use
case. As well as functionality for data verification and validation, it
supports checking metadata against the Data Package standard and
building pipelines for transforming data.
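
As a rough sketch of how a check with `frictionless-py` might look (the
file name is hypothetical, and the exact report layout varies between
versions):

```python
# Minimal sketch: validate a whole Data Package so that every resource
# is checked against its Table Schema. Structural and constraint errors
# are reported together.
from frictionless import validate

report = validate("datapackage.json")

if not report.valid:
    print(report)  # human-readable summary of all errors found
```
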

::::: columns
::: column
#### Benefits

- Supports both data verification and validation, although it is not
  possible to run these checks separately.
- Multiple tabular data formats are supported, including `pandas` data
  frames.
- Directly compatible with our JSON metadata, as it implements the
  Data Package standard.
- Supports large data files.
- Supports extracting metadata from data, matching the Data Package
  standard.
:::

::: column
#### Drawbacks

- The API suggests that it is possible to filter for specific errors,
  but this functionality does not seem to work fully.
- There are a number of different entry points to the
  verification/validation flow and it is quite difficult to foresee
  how these differ in behaviour.
- `polars` data frames are not supported.
- So far we've found it a bit difficult to navigate the
  `frictionless-py` codebase and documentation.
:::
:::::

### Pandera

[`pandera`](https://pandera.readthedocs.io/en/stable/index.html) is a
flexible data validation library operating on data frames. Its
validation mechanism is based on the concept of a schema expressing
expectations about the data. It also has capabilities for preprocessing
data and generating synthetic data from `pandera` schemas.
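
As a minimal sketch of how this looks with `polars` data frames (the
column name, bounds, and values are made up for illustration):

```python
# Minimal sketch of a pandera schema checked against a polars data frame.
import pandera.polars as pa
import polars as pl

schema = pa.DataFrameSchema(
    {
        # Verification: the column must exist and hold integers.
        # Validation: every value must fall within the stated range.
        "age": pa.Column(int, pa.Check.in_range(0, 120)),
    }
)

df = pl.DataFrame({"age": [34, 56, 78]})

# lazy=True collects all failures into one report instead of raising
# on the first one.
validated = schema.validate(df, lazy=True)
```
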

::::: columns
::: column
#### Benefits

- Supports both data verification and validation, and can run these
  checks separately.
- Supports `polars` data frames to a large extent (see the drawbacks on
  the right).
- Supports large datasets.
- Offers schema inference, although not with `polars`.
- `pandera` is widely used, extensively tested, and has good
  documentation.
:::

::: column
#### Drawbacks

- Only data frames are accepted as input, so other formats (e.g. CSV)
  have to be loaded into a data frame first.
- While `polars` is supported, the integration is [not yet
  complete](https://pandera.readthedocs.io/en/stable/index.html#supported-features).
  E.g., it cannot yet extract metadata from `polars` data frames.
- We would need to write custom code to translate our table metadata
  from JSON to `pandera` schemas in Python. For its own schemas,
  `pandera` provides JSON conversion out of the box.
:::
:::::

### Great Expectations

[Great
Expectations](https://docs.greatexpectations.io/docs/core/introduction/)
is a larger framework for testing and validating data. It also offers a
range of other functionality, which includes data visualisation, data
collation from remote sources, and statistical summary generation. It is
structured around expectations about the data, which are organised into
expectation suites.
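
As a rough sketch of how a single expectation is declared (assuming the
GX Core 1.x API; running it would additionally require setting up a data
context and a batch, which is omitted here):

```python
# Minimal sketch: declare one expectation about a column.
# Column name and bounds are made up for illustration.
import great_expectations as gx

expectation = gx.expectations.ExpectColumnValuesToBeBetween(
    column="age", min_value=0, max_value=120
)
```
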

::::: columns
::: column
#### Benefits

- Supports both data verification and validation, and can run these
  checks separately.
- Supports a wide range of data formats, although not `polars`.
- Supports large datasets.
- Can generate an expectation suite based on data.
:::

::: column
#### Drawbacks

- No support for `polars`.
- We would need to write custom code to translate our table metadata
  from JSON to expectations in Python. For its own expectation
  suites, Great Expectations provides JSON conversion out of the box.
- The API for declaring expectations matches the structure of the Data
  Package standard less closely than that of the other options.
- Significantly larger and more complex to set up than any of the
  other options.
:::
:::::

### Pydantic

[Pydantic](https://docs.pydantic.dev/latest/) is the most popular
library for matching data against a schema in Python. Its basic use case
is describing how data should be structured in a Pydantic model and
checking an object against this model to confirm that it matches. Model
requirements are expressed using type hints and the matching behaviour
is highly customisable.
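
As a minimal sketch of how row-level validation with Pydantic might look
(the model and field names are made up for illustration):

```python
# Minimal sketch: each record is fed to the model one at a time.
from pydantic import BaseModel, Field

class Row(BaseModel):
    id: int
    age: int = Field(ge=0, le=120)

record = {"id": 1, "age": 34}
row = Row.model_validate(record)  # raises ValidationError if constraints fail
```
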

::::: columns
::: column
#### Benefits

- Supports data validation.
:::

::: column
#### Drawbacks

- No out-of-the-box support for data verification.
- We would need to translate our JSON metadata into Pydantic models.
- Pydantic only accepts dictionary-like objects as input, so data
  files would need to be loaded into Python manually and fed to the
  Pydantic model row by row.
- The above means that support for large datasets would depend on our
  implementation.
- No support for extracting models from data.
:::
:::::

## Decision outcome

::: content-hidden
What decision was made, use the form "We decided on CHOICE because of
REASONS."
:::

We decided to use `pandera` because it is a great match for our use
case, has extensive documentation, and its behaviour is easy to tailor
to our needs. While `frictionless-py` is a direct implementation of the
Data Package standard, it is less mature and less widely used than
`pandera`. We have found some inconsistencies in its
verification/validation behaviour and feel that we would need to
customise it using somewhat brittle and inelegant workarounds for it to
fit into our workflow.

As for the remaining options, we decided not to go with Pydantic because
its use case is not verifying or validating datasets. Although Great
Expectations offers most of the functionality we need, it is a complete
framework with many parts we don't need, is rather complex to set up,
and integrating with it would shape our codebase more than any of the
other tools.

### Consequences

::: content-hidden
List some potential consequences of this decision.
:::

- We will have to write custom logic for transforming JSON metadata
  into `pandera` schemas (a rough sketch of what this could look like
  is shown after this list).
- We will have to find a solution for extracting metadata from data,
  as `pandera` cannot currently infer schemas from `polars` data
  frames.
- If we want to add any checks or behaviours based on file-level
  properties of the data (e.g. file size, hash, encoding, etc.), these
  will have to be implemented outside of `pandera`.
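
As a rough, hypothetical sketch of what the JSON-to-`pandera` translation
could look like (the type map and handled constraints are illustrative
only, not a complete implementation):

```python
# Hypothetical sketch: build a pandera schema from Table Schema field
# descriptors taken from Data Package metadata.
import pandera.polars as pa

TYPE_MAP = {"integer": int, "number": float, "string": str, "boolean": bool}

def schema_from_fields(fields: list[dict]) -> pa.DataFrameSchema:
    """Build a pandera schema from a list of Table Schema field descriptors."""
    columns = {}
    for field in fields:
        constraints = field.get("constraints", {})
        checks = []
        if "maximum" in constraints:
            checks.append(pa.Check.le(constraints["maximum"]))
        if "pattern" in constraints:
            checks.append(pa.Check.str_matches(constraints["pattern"]))
        columns[field["name"]] = pa.Column(
            TYPE_MAP.get(field.get("type", "string"), str),
            checks=checks,
            nullable=not constraints.get("required", False),
        )
    return pa.DataFrameSchema(columns)
```
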
