Commit a7b0945, authored by martonvago, lwjohnst86, and signekb

docs: 📝 Why Pandera for data verification/validation (#135)

Co-authored-by: Luke W. Johnston <[email protected]>
Co-authored-by: Signe Kirk Brødbæk <[email protected]>

1 file changed: why-pandera/index.qmd (+242 lines)
---
title: "Why Pandera"
description: |
  A key functionality of Sprout is checking whether user-supplied data match the metadata that describe them.
  This post explains why we chose to use `pandera` for automating this process.
date: "2024-11-08"
categories:
  - backend
  - develop
  - check
---

::: content-hidden
Use other decision posts as inspiration when writing these. Leave the
content-hidden sections in the text for future reference.
:::

## Context and problem statement

::: content-hidden
State the context and some background on the issue, then write a
statement in the form of a question for the problem.
:::

A key functionality of Sprout is checking whether user-supplied data
match the metadata that describe them. Metadata are stored in JSON files
following the [Data Package](https://datapackage.org/) standard.
Checking data against metadata has two components: verification and
validation. Verification involves checking whether the overall structure
of the data (e.g. number of columns, column data types) is as expected,
while validation involves checking that all individual data items meet
constraints listed in the metadata (e.g. maximum values, specific
formats). We are looking for a tool to automate the data verification
and validation process. The question then is:

*Which data verification and validation tools are available and which
one should we use?*
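
To make the distinction concrete, here is a small, invented Table Schema
fragment (the field name and bound are made up for illustration), written
here as a Python dictionary:

```python
# A made-up Table Schema field descriptor (Data Package standard).
# Verification would check that the data contain an "age" column with
# integer values; validation would check each value against the
# "constraints" entry (here: required, and at most 120).
table_schema = {
    "fields": [
        {
            "name": "age",
            "type": "integer",
            "constraints": {"required": True, "maximum": 120},
        }
    ]
}
```
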
## Decision drivers

::: content-hidden
List some reasons for why we need to make this decision and what things
have arisen that impact work.
:::

- The new tool should support both data verification and validation.
- Ideally, it should support multiple tabular data formats, including
  [`polars`](https://pola.rs/) data frames.
- It should be easy to transform JSON metadata into the representation
  required by the tool.
- The tool should be able to handle relatively large datasets
  efficiently.
- Support for extracting metadata from data would be a plus.

## Considered options

::: content-hidden
List and describe some of the options, as well as some of the benefits
and drawbacks for each option.
:::

- [`frictionless-py`](https://framework.frictionlessdata.io/)
- [`pandera`](https://pandera.readthedocs.io/en/stable/index.html)
- [Great
  Expectations](https://docs.greatexpectations.io/docs/core/introduction/)
- [Pydantic](https://docs.pydantic.dev/latest/)

### `frictionless-py`

[`frictionless-py`](https://framework.frictionlessdata.io/) is the
Python implementation of the Data Package standard by its parent
organisation, and as such it would be the obvious choice for our use
case. As well as functionality for data verification and validation, it
supports checking metadata against the Data Package standard and
building pipelines for transforming data.
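
As a rough sketch of how a check with `frictionless-py` might look (the
file name is hypothetical, and the exact report layout varies between
versions):

```python
# Minimal sketch: validate a whole Data Package so that every resource
# is checked against its Table Schema. Structural and constraint errors
# are reported together.
from frictionless import validate

report = validate("datapackage.json")

if not report.valid:
    print(report)  # human-readable summary of all errors found
```
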

::::: columns
::: column
#### Benefits

- Supports both data verification and validation, although it is not
  possible to run these checks separately.
- Multiple tabular data formats are supported, including `pandas` data
  frames.
- Directly compatible with our JSON metadata, as it implements the
  Data Package standard.
- Supports large data files.
- Supports extracting metadata from data, matching the Data Package
  standard.
:::

::: column
#### Drawbacks

- The API suggests that it is possible to filter for specific errors,
  but this functionality does not seem to work fully.
- There are a number of different entry points to the
  verification/validation flow and it is quite difficult to foresee
  how these differ in behaviour.
- `polars` data frames are not supported.
- So far we've found it a bit difficult to navigate the
  `frictionless-py` codebase and documentation.
:::
:::::

### Pandera

[`pandera`](https://pandera.readthedocs.io/en/stable/index.html) is a
flexible data validation library operating on data frames. Its
validation mechanism is based on the concept of a schema expressing
expectations about the data. It also has capabilities for preprocessing
data and generating synthetic data from `pandera` schemas.
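
As a minimal sketch of how this looks with `polars` data frames (the
column name, bounds, and values are made up for illustration):

```python
# Minimal sketch of a pandera schema checked against a polars data frame.
import pandera.polars as pa
import polars as pl

schema = pa.DataFrameSchema(
    {
        # Verification: the column must exist and hold integers.
        # Validation: every value must fall within the stated range.
        "age": pa.Column(int, pa.Check.in_range(0, 120)),
    }
)

df = pl.DataFrame({"age": [34, 56, 78]})

# lazy=True collects all failures into one report instead of raising
# on the first one.
validated = schema.validate(df, lazy=True)
```
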

::::: columns
::: column
#### Benefits

- Supports both data verification and validation, and can run these
  checks separately.
- Supports `polars` data frames to a large extent (see the drawbacks on
  the right).
- Supports large datasets.
- Offers schema inference, although not with `polars`.
- `pandera` is widely used, extensively tested, and has good
  documentation.
:::

::: column
#### Drawbacks

- Only data frames are accepted as input, so other formats (e.g. CSV)
  have to be loaded into a data frame first.
- While `polars` is supported, the integration is [not yet
  complete](https://pandera.readthedocs.io/en/stable/index.html#supported-features).
  E.g., it cannot yet extract metadata from `polars` data frames.
- We would need to write custom code to translate our table metadata
  from JSON to `pandera` schemas in Python. For its own schemas,
  `pandera` provides JSON conversion out of the box.
:::
:::::

### Great Expectations

[Great
Expectations](https://docs.greatexpectations.io/docs/core/introduction/)
is a larger framework for testing and validating data. It also offers a
range of other functionality, which includes data visualisation, data
collation from remote sources, and statistical summary generation. It is
structured around expectations about the data, which are organised into
expectation suites.
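
As a rough sketch of how a single expectation is declared (assuming the
GX Core 1.x API; running it would additionally require setting up a data
context and a batch, which is omitted here):

```python
# Minimal sketch: declare one expectation about a column.
# Column name and bounds are made up for illustration.
import great_expectations as gx

expectation = gx.expectations.ExpectColumnValuesToBeBetween(
    column="age", min_value=0, max_value=120
)
```
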

::::: columns
::: column
#### Benefits

- Supports both data verification and validation, and can run these
  checks separately.
- Supports a wide range of data formats, although not `polars`.
- Supports large datasets.
- Can generate an expectation suite based on data.
:::

::: column
#### Drawbacks

- No support for `polars`.
- We would need to write custom code to translate our table metadata
  from JSON to expectations in Python. For its own expectation
  suites, Great Expectations provides JSON conversion out of the box.
- The API for declaring expectations matches the structure of the Data
  Package standard less closely than that of the other options.
- Significantly larger and more complex to set up than any of the
  other options.
:::
:::::

### Pydantic

[Pydantic](https://docs.pydantic.dev/latest/) is the most popular
library for matching data against a schema in Python. Its basic use case
is describing how data should be structured in a Pydantic model and
checking an object against this model to confirm that it matches. Model
requirements are expressed using type hints and the matching behaviour
is highly customisable.
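
As a minimal sketch of how row-level validation with Pydantic might look
(the model and field names are made up for illustration):

```python
# Minimal sketch: each record is fed to the model one at a time.
from pydantic import BaseModel, Field

class Row(BaseModel):
    id: int
    age: int = Field(ge=0, le=120)

record = {"id": 1, "age": 34}
row = Row.model_validate(record)  # raises ValidationError if constraints fail
```
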

::::: columns
::: column
#### Benefits

- Supports data validation.
:::

::: column
#### Drawbacks

- No out-of-the-box support for data verification.
- We would need to translate our JSON metadata into Pydantic models.
- Pydantic only accepts dictionary-like objects as input, so data
  files would need to be loaded into Python manually and fed to the
  Pydantic model row by row.
- The above means that support for large datasets would depend on our
  implementation.
- No support for extracting models from data.
:::
:::::

## Decision outcome

::: content-hidden
What decision was made, use the form "We decided on CHOICE because of
REASONS."
:::

We decided to use `pandera` because it is a great match for our use
case, has extensive documentation, and its behaviour is easy to tailor
to our needs. While `frictionless-py` is a direct implementation of the
Data Package standard, it is less mature and less widely used than
`pandera`. We have found some inconsistencies in its
verification/validation behaviour and feel that we would need to
customise it using somewhat brittle and inelegant workarounds for it to
fit into our workflow.

As for the remaining options, we decided not to go with Pydantic because
its use case is not verifying or validating datasets. Although Great
Expectations offers most of the functionality we need, it is a complete
framework with many parts we don't need, is rather complex to set up,
and integrating with it would shape our codebase more than any of the
other tools.

### Consequences

::: content-hidden
List some potential consequences of this decision.
:::

- We will have to write custom logic for transforming JSON metadata
  into `pandera` schemas (a rough sketch of what this could look like
  is shown after this list).
- We will have to find a solution for extracting metadata from data,
  as `pandera` cannot currently infer schemas from `polars` data
  frames.
- If we want to add any checks or behaviours based on file-level
  properties of the data (e.g. file size, hash, encoding, etc.), these
  will have to be implemented outside of `pandera`.
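
As a rough, hypothetical sketch of what the JSON-to-`pandera` translation
could look like (the type map and handled constraints are illustrative
only, not a complete implementation):

```python
# Hypothetical sketch: build a pandera schema from Table Schema field
# descriptors taken from Data Package metadata.
import pandera.polars as pa

TYPE_MAP = {"integer": int, "number": float, "string": str, "boolean": bool}

def schema_from_fields(fields: list[dict]) -> pa.DataFrameSchema:
    """Build a pandera schema from a list of Table Schema field descriptors."""
    columns = {}
    for field in fields:
        constraints = field.get("constraints", {})
        checks = []
        if "maximum" in constraints:
            checks.append(pa.Check.le(constraints["maximum"]))
        if "pattern" in constraints:
            checks.append(pa.Check.str_matches(constraints["pattern"]))
        columns[field["name"]] = pa.Column(
            TYPE_MAP.get(field.get("type", "string"), str),
            checks=checks,
            nullable=not constraints.get("required", False),
        )
    return pa.DataFrameSchema(columns)
```
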
