@@ -2,7 +2,7 @@
 <div align="center"><a href="https://www.union.ai/pandera"><img src="docs/source/_static/pandera-banner.png" width="400"></a></div>
 
 <h1 align="center">
-The Open-source Framework for Precision Data Testing
+The Open-source Framework for Validating DataFrame-like Objects
 </h1>
 
 <p align="center">
@@ -32,246 +32,93 @@
 [](https://anaconda.org/conda-forge/pandera)
 [](https://discord.gg/vyanhWuaKB)
 
-`pandera` is a [Union.ai](https://union.ai/blog-post/pandera-joins-union-ai) open
+Pandera is a [Union.ai](https://union.ai/blog-post/pandera-joins-union-ai) open
 source project that provides a flexible and expressive API for performing data
-validation on dataframe-like objects to make data processing pipelines more readable and robust.
-
-Dataframes contain information that `pandera` explicitly validates at runtime.
-This is useful in production-critical or reproducible research settings. With
-`pandera`, you can:
-
-1. Define a schema once and use it to validate
-   [different dataframe types](https://pandera.readthedocs.io/en/stable/supported_libraries.html)
-   including [pandas](http://pandas.pydata.org), [polars](https://docs.pola.rs/),
-   [dask](https://dask.org), [modin](https://modin.readthedocs.io/),
-   and [pyspark](https://spark.apache.org/docs/3.2.0/api/python/user_guide/pandas_on_spark/index.html).
-1. [Check](https://pandera.readthedocs.io/en/stable/checks.html) the types and
-   properties of columns in a `DataFrame` or values in a `Series`.
-1. Perform more complex statistical validation like
-   [hypothesis testing](https://pandera.readthedocs.io/en/stable/hypothesis.html#hypothesis).
-1. [Parse](https://pandera.readthedocs.io/en/stable/parsers.html) data to standardize
-   the preprocessing steps needed to produce valid data.
-1. Seamlessly integrate with existing data analysis/processing pipelines
-   via [function decorators](https://pandera.readthedocs.io/en/stable/decorators.html#decorators).
-1. Define dataframe models with the
-   [class-based API](https://pandera.readthedocs.io/en/stable/dataframe_models.html#dataframe-models)
-   with pydantic-style syntax and validate dataframes using the typing syntax.
-1. [Synthesize data](https://pandera.readthedocs.io/en/stable/data_synthesis_strategies.html#data-synthesis-strategies)
-   from schema objects for property-based testing with pandas data structures.
-1. [Lazily Validate](https://pandera.readthedocs.io/en/stable/lazy_validation.html)
-   dataframes so that all validation checks are executed before raising an error.
-1. [Integrate](https://pandera.readthedocs.io/en/stable/integrations.html) with
-   a rich ecosystem of python tools like [pydantic](https://pydantic-docs.helpmanual.io),
-   [fastapi](https://fastapi.tiangolo.com/), and [mypy](http://mypy-lang.org/).
-
-## Documentation
-
-The official documentation is hosted here: https://pandera.readthedocs.io
-
+validation on dataframe-like objects. The goal of Pandera is to make data
+processing pipelines more readable and robust with statistically typed
+dataframes.
 
 ## Install
 
-Using pip:
+Pandera supports [multiple dataframe libraries](https://pandera.readthedocs.io/en/stable/supported_libraries.html), including [pandas](http://pandas.pydata.org), [polars](https://docs.pola.rs/), [pyspark](https://spark.apache.org/docs/latest/api/python/index.html), and more. To validate `pandas` DataFrames, install Pandera with the `pandas` extra:
+
+**With `pip`:**
 
 ```
-pip install pandera
+pip install 'pandera[pandas]'
 ```
 
-Using conda:
+**With `uv`:**
 
 ```
-conda install -c conda-forge pandera
+uv pip install 'pandera[pandas]'
 ```
 
-### Extras
-
-Installing additional functionality:
+**With `conda`:**
 
-<details>
-
-<summary><i>pip</i></summary>
-
-```bash
-pip install 'pandera[hypotheses]' # hypothesis checks
-pip install 'pandera[io]' # yaml/script schema io utilities
-pip install 'pandera[strategies]' # data synthesis strategies
-pip install 'pandera[mypy]' # enable static type-linting of pandas
-pip install 'pandera[fastapi]' # fastapi integration
-pip install 'pandera[dask]' # validate dask dataframes
-pip install 'pandera[pyspark]' # validate pyspark dataframes
-pip install 'pandera[modin]' # validate modin dataframes
-pip install 'pandera[modin-ray]' # validate modin dataframes with ray
-pip install 'pandera[modin-dask]' # validate modin dataframes with dask
-pip install 'pandera[geopandas]' # validate geopandas geodataframes
-pip install 'pandera[polars]' # validate polars dataframes
 ```
-
-</details>
-
-<details>
-
-<summary><i>conda</i></summary>
-
-```bash
-conda install -c conda-forge pandera-hypotheses # hypothesis checks
-conda install -c conda-forge pandera-io # yaml/script schema io utilities
-conda install -c conda-forge pandera-strategies # data synthesis strategies
-conda install -c conda-forge pandera-mypy # enable static type-linting of pandas
-conda install -c conda-forge pandera-fastapi # fastapi integration
-conda install -c conda-forge pandera-dask # validate dask dataframes
-conda install -c conda-forge pandera-pyspark # validate pyspark dataframes
-conda install -c conda-forge pandera-modin # validate modin dataframes
-conda install -c conda-forge pandera-modin-ray # validate modin dataframes with ray
-conda install -c conda-forge pandera-modin-dask # validate modin dataframes with dask
-conda install -c conda-forge pandera-geopandas # validate geopandas geodataframes
-conda install -c conda-forge pandera-polars # validate polars dataframes
+conda install -c conda-forge pandera-pandas
 ```
 
-</details>
+## Get started
 
-## Quick Start
+First, create a dataframe:
 
 ```python
 import pandas as pd
-import pandera as pa
-
+import pandera.pandas as pa
 
 # data to validate
 df = pd.DataFrame({
-    "column1": [1, 4, 0, 10, 9],
-    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
-    "column3": ["value_1", "value_2", "value_3", "value_2", "value_1"]
+    "column1": [1, 2, 3],
+    "column2": [1.1, 1.2, 1.3],
+    "column3": ["a", "b", "c"],
 })
+```
+
+Validate the data using the object-based API:
 
-# define schema
+```python
+# define a schema
 schema = pa.DataFrameSchema({
-    "column1": pa.Column(int, checks=pa.Check.le(10)),
-    "column2": pa.Column(float, checks=pa.Check.lt(-1.2)),
-    "column3": pa.Column(str, checks=[
-        pa.Check.str_startswith("value_"),
-        # define custom checks as functions that take a series as input and
-        # outputs a boolean or boolean Series
-        pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
-    ]),
+    "column1": pa.Column(int, pa.Check.ge(0)),
+    "column2": pa.Column(float, pa.Check.lt(10)),
+    "column3": pa.Column(
+        str,
+        [
+            pa.Check.isin([*"abc"]),
+            pa.Check(lambda series: series.str.len() == 1),
+        ]
+    ),
 })
 
-validated_df = schema(df)
-print(validated_df)
-
-#     column1  column2  column3
-# 0        1     -1.3  value_1
-# 1        4     -1.4  value_2
-# 2        0     -2.9  value_3
-# 3       10    -10.1  value_2
-# 4        9    -20.4  value_1
+print(schema.validate(df))
+#    column1  column2 column3
+# 0        1      1.1       a
+# 1        2      1.2       b
+# 2        3      1.3       c
 ```
 
-## DataFrame Model
-
-`pandera` also provides an alternative API for expressing schemas inspired
-by [dataclasses](https://docs.python.org/3/library/dataclasses.html) and
-[pydantic](https://pydantic-docs.helpmanual.io/). The equivalent `DataFrameModel`
-for the above `DataFrameSchema` would be:
-
+Or validate the data using the class-based API:
 
 ```python
-from pandera.typing import Series
-
+# define a schema
 class Schema(pa.DataFrameModel):
-
-    column1: int = pa.Field(le=10)
-    column2: float = pa.Field(lt=-1.2)
-    column3: str = pa.Field(str_startswith="value_")
+    column1: int = pa.Field(ge=0)
+    column2: float = pa.Field(lt=10)
+    column3: str = pa.Field(isin=[*"abc"])
 
     @pa.check("column3")
-    def column_3_check(cls, series: Series[str]) -> Series[bool]:
-        """Check that values have two elements after being split with '_'"""
-        return series.str.split("_", expand=True).shape[1] == 2
-
-Schema.validate(df)
-```
-
-## Development Installation
-
-```
-git clone https://github.com/pandera-dev/pandera.git
-cd pandera
-export PYTHON_VERSION=... # specify desired python version
-pip install -r dev/requirements-${PYTHON_VERSION}.txt
-pip install -e .
+    def custom_check(cls, series: pd.Series) -> pd.Series:
+        return series.str.len() == 1
+
+print(Schema.validate(df))
+#    column1  column2 column3
+# 0        1      1.1       a
+# 1        2      1.2       b
+# 2        3      1.3       c
 ```
 
-## Tests
-
-```
-pip install pytest
-pytest tests
-```
-
-## Contributing to pandera [](https://github.com/pandera-dev/pandera/graphs/contributors)
-
-All contributions, bug reports, bug fixes, documentation improvements,
-enhancements and ideas are welcome.
-
-A detailed overview on how to contribute can be found in the
-[contributing guide](https://github.com/pandera-dev/pandera/blob/main/.github/CONTRIBUTING.md)
-on GitHub.
-
-## Issues
-
-Go [here](https://github.com/pandera-dev/pandera/issues) to submit feature
-requests or bugfixes.
-
-## Need Help?
-
-There are many ways of getting help with your questions. You can ask a question
-on [Github Discussions](https://github.com/pandera-dev/pandera/discussions/categories/q-a)
-page or reach out to the maintainers and pandera community on
-[Discord](https://discord.gg/vyanhWuaKB)
-
-## Why `pandera`?
-
-- [dataframe-centric data types](https://pandera.readthedocs.io/en/stable/dtypes.html),
-  [column nullability](https://pandera.readthedocs.io/en/stable/dataframe_schemas.html#null-values-in-columns),
-  and [uniqueness](https://pandera.readthedocs.io/en/stable/dataframe_schemas.html#validating-the-joint-uniqueness-of-columns)
-  are first-class concepts.
-- Define [dataframe models](https://pandera.readthedocs.io/en/stable/schema_models.html) with the class-based API with
-  [pydantic](https://pydantic-docs.helpmanual.io/)-style syntax and validate dataframes using the typing syntax.
-- `check_input` and `check_output` [decorators](https://pandera.readthedocs.io/en/stable/decorators.html#decorators-for-pipeline-integration)
-  enable seamless integration with existing code.
-- [`Check`s](https://pandera.readthedocs.io/en/stable/checks.html) provide flexibility and performance by providing access to `pandas`
-  API by design and offers built-in checks for common data tests.
-- [`Hypothesis`](https://pandera.readthedocs.io/en/stable/hypothesis.html) class provides a tidy-first interface for statistical hypothesis
-  testing.
-- `Check`s and `Hypothesis` objects support both [tidy and wide data validation](https://pandera.readthedocs.io/en/stable/checks.html#wide-checks).
-- Use schemas as generative contracts to [synthesize data](https://pandera.readthedocs.io/en/stable/data_synthesis_strategies.html) for unit testing.
-- [Schema inference](https://pandera.readthedocs.io/en/stable/schema_inference.html) allows you to bootstrap schemas from data.
-
-## How to Cite
-
-If you use `pandera` in the context of academic or industry research, please
-consider citing the **paper** and/or **software package**.
-
-### [Paper](https://conference.scipy.org/proceedings/scipy2020/niels_bantilan.html)
-
-```
-@InProceedings{ niels_bantilan-proc-scipy-2020,
-    author = { {N}iels {B}antilan },
-    title = { pandera: {S}tatistical {D}ata {V}alidation of {P}andas {D}ataframes },
-    booktitle = { {P}roceedings of the 19th {P}ython in {S}cience {C}onference },
-    pages = { 116 - 124 },
-    year = { 2020 },
-    editor = { {M}eghann {A}garwal and {C}hris {C}alloway and {D}illon {N}iederhut and {D}avid {S}hupe },
-    doi = { 10.25080/Majora-342d178e-010 }
-}
-```
-
-### Software Package
-
-[](https://doi.org/10.5281/zenodo.3385265)
-
-
-## License and Credits
+## Next steps
 
-`pandera` is licensed under the [MIT license](license.txt) and is written and
-maintained by Niels Bantilan ( [email protected])
+See the [official documentation](https://pandera.readthedocs.io) to learn more.