Commit 16dfb8c

Update imports from pandera to pandera.pandas (#1965)
* update pandera imports and docs

  Signed-off-by: cosmicBboy <[email protected]>

* update readme and index docs page

  Signed-off-by: cosmicBboy <[email protected]>

* update imports in docs and tests

  Signed-off-by: cosmicBboy <[email protected]>

* fix lint

  Signed-off-by: cosmicBboy <[email protected]>

* fix import

  Signed-off-by: cosmicBboy <[email protected]>

* fix pandera.polars import

  Signed-off-by: cosmicBboy <[email protected]>

* fix mypy test

  Signed-off-by: cosmicBboy <[email protected]>

---------

Signed-off-by: cosmicBboy <[email protected]>
1 parent d71098a commit 16dfb8c
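
The change repeated across the docs and tests in this commit is a one-line import swap, from `import pandera as pa` to `import pandera.pandas as pa`. As a rough illustration only, the swap could be automated with a small text rewrite like the one below. The helper name `rewrite_pandera_import` and the regular expression are hypothetical; nothing in the commit indicates that the author used a script.

```python
import re

# Hypothetical pattern: a whole-line `import pandera as <alias>` statement.
# It deliberately does not match `import pandera.polars as pa` or
# `from pandera.typing import ...`, which this rewrite leaves untouched.
_OLD_IMPORT = re.compile(r"^([ \t]*)import pandera as (\w+)[ \t]*$", re.MULTILINE)


def rewrite_pandera_import(source: str) -> str:
    """Return `source` with top-level pandera imports pointed at pandera.pandas."""
    # Keep the original indentation (group 1) and alias (group 2).
    return _OLD_IMPORT.sub(r"\1import pandera.pandas as \2", source)


if __name__ == "__main__":
    before = "import pandas as pd\nimport pandera as pa\n"
    print(rewrite_pandera_import(before))
```

A text-level rewrite like this is only a starting point: as the later "fix import" and "fix pandera.polars import" entries in the commit message suggest, some call sites need manual follow-up.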

74 files changed: 314 additions and 431 deletions

README.md

Lines changed: 53 additions & 206 deletions
@@ -2,7 +2,7 @@
 <div align="center"><a href="https://www.union.ai/pandera"><img src="docs/source/_static/pandera-banner.png" width="400"></a></div>
 
 <h1 align="center">
-The Open-source Framework for Precision Data Testing
+The Open-source Framework for Validating DataFrame-like Objects
 </h1>
 
 <p align="center">
@@ -32,246 +32,93 @@
 [![Conda Downloads](https://img.shields.io/conda/dn/conda-forge/pandera?style=for-the-badge)](https://anaconda.org/conda-forge/pandera)
 [![Discord](https://img.shields.io/badge/discord-chat-purple?color=%235765F2&label=discord&logo=discord&style=for-the-badge)](https://discord.gg/vyanhWuaKB)
 
-`pandera` is a [Union.ai](https://union.ai/blog-post/pandera-joins-union-ai) open
+Pandera is a [Union.ai](https://union.ai/blog-post/pandera-joins-union-ai) open
 source project that provides a flexible and expressive API for performing data
-validation on dataframe-like objects to make data processing pipelines more readable and robust.
-
-Dataframes contain information that `pandera` explicitly validates at runtime.
-This is useful in production-critical or reproducible research settings. With
-`pandera`, you can:
-
-1. Define a schema once and use it to validate
-[different dataframe types](https://pandera.readthedocs.io/en/stable/supported_libraries.html)
-including [pandas](http://pandas.pydata.org), [polars](https://docs.pola.rs/),
-[dask](https://dask.org), [modin](https://modin.readthedocs.io/),
-and [pyspark](https://spark.apache.org/docs/3.2.0/api/python/user_guide/pandas_on_spark/index.html).
-1. [Check](https://pandera.readthedocs.io/en/stable/checks.html) the types and
-properties of columns in a `DataFrame` or values in a `Series`.
-1. Perform more complex statistical validation like
-[hypothesis testing](https://pandera.readthedocs.io/en/stable/hypothesis.html#hypothesis).
-1. [Parse](https://pandera.readthedocs.io/en/stable/parsers.html) data to standardize
-the preprocessing steps needed to produce valid data.
-1. Seamlessly integrate with existing data analysis/processing pipelines
-via [function decorators](https://pandera.readthedocs.io/en/stable/decorators.html#decorators).
-1. Define dataframe models with the
-[class-based API](https://pandera.readthedocs.io/en/stable/dataframe_models.html#dataframe-models)
-with pydantic-style syntax and validate dataframes using the typing syntax.
-1. [Synthesize data](https://pandera.readthedocs.io/en/stable/data_synthesis_strategies.html#data-synthesis-strategies)
-from schema objects for property-based testing with pandas data structures.
-1. [Lazily Validate](https://pandera.readthedocs.io/en/stable/lazy_validation.html)
-dataframes so that all validation checks are executed before raising an error.
-1. [Integrate](https://pandera.readthedocs.io/en/stable/integrations.html) with
-a rich ecosystem of python tools like [pydantic](https://pydantic-docs.helpmanual.io),
-[fastapi](https://fastapi.tiangolo.com/), and [mypy](http://mypy-lang.org/).
-
-## Documentation
-
-The official documentation is hosted here: https://pandera.readthedocs.io
-
+validation on dataframe-like objects. The goal of Pandera is to make data
+processing pipelines more readable and robust with statistically typed
+dataframes.
 
 ## Install
 
-Using pip:
+Pandera supports [multiple dataframe libraries](https://pandera.readthedocs.io/en/stable/supported_libraries.html), including [pandas](http://pandas.pydata.org), [polars](https://docs.pola.rs/), [pyspark](https://spark.apache.org/docs/latest/api/python/index.html), and more. To validate `pandas` DataFrames, install Pandera with the `pandas` extra:
+
+**With `pip`:**
 
 ```
-pip install pandera
+pip install 'pandera[pandas]'
 ```
 
-Using conda:
+**With `uv`:**
 
 ```
-conda install -c conda-forge pandera
+uv pip install 'pandera[pandas]'
 ```
 
-### Extras
-
-Installing additional functionality:
+**With `conda`:**
 
-<details>
-
-<summary><i>pip</i></summary>
-
-```bash
-pip install 'pandera[hypotheses]' # hypothesis checks
-pip install 'pandera[io]' # yaml/script schema io utilities
-pip install 'pandera[strategies]' # data synthesis strategies
-pip install 'pandera[mypy]' # enable static type-linting of pandas
-pip install 'pandera[fastapi]' # fastapi integration
-pip install 'pandera[dask]' # validate dask dataframes
-pip install 'pandera[pyspark]' # validate pyspark dataframes
-pip install 'pandera[modin]' # validate modin dataframes
-pip install 'pandera[modin-ray]' # validate modin dataframes with ray
-pip install 'pandera[modin-dask]' # validate modin dataframes with dask
-pip install 'pandera[geopandas]' # validate geopandas geodataframes
-pip install 'pandera[polars]' # validate polars dataframes
 ```
-
-</details>
-
-<details>
-
-<summary><i>conda</i></summary>
-
-```bash
-conda install -c conda-forge pandera-hypotheses # hypothesis checks
-conda install -c conda-forge pandera-io # yaml/script schema io utilities
-conda install -c conda-forge pandera-strategies # data synthesis strategies
-conda install -c conda-forge pandera-mypy # enable static type-linting of pandas
-conda install -c conda-forge pandera-fastapi # fastapi integration
-conda install -c conda-forge pandera-dask # validate dask dataframes
-conda install -c conda-forge pandera-pyspark # validate pyspark dataframes
-conda install -c conda-forge pandera-modin # validate modin dataframes
-conda install -c conda-forge pandera-modin-ray # validate modin dataframes with ray
-conda install -c conda-forge pandera-modin-dask # validate modin dataframes with dask
-conda install -c conda-forge pandera-geopandas # validate geopandas geodataframes
-conda install -c conda-forge pandera-polars # validate polars dataframes
+conda install -c conda-forge pandera-pandas
 ```
 
-</details>
+## Get started
 
-## Quick Start
+First, create a dataframe:
 
 ```python
 import pandas as pd
-import pandera as pa
-
+import pandera.pandas as pa
 
 # data to validate
 df = pd.DataFrame({
-    "column1": [1, 4, 0, 10, 9],
-    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
-    "column3": ["value_1", "value_2", "value_3", "value_2", "value_1"]
+    "column1": [1, 2, 3],
+    "column2": [1.1, 1.2, 1.3],
+    "column3": ["a", "b", "c"],
 })
+```
+
+Validate the data using the object-based API:
 
-# define schema
+```python
+# define a schema
 schema = pa.DataFrameSchema({
-    "column1": pa.Column(int, checks=pa.Check.le(10)),
-    "column2": pa.Column(float, checks=pa.Check.lt(-1.2)),
-    "column3": pa.Column(str, checks=[
-        pa.Check.str_startswith("value_"),
-        # define custom checks as functions that take a series as input and
-        # outputs a boolean or boolean Series
-        pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
-    ]),
+    "column1": pa.Column(int, pa.Check.ge(0)),
+    "column2": pa.Column(float, pa.Check.lt(10)),
+    "column3": pa.Column(
+        str,
+        [
+            pa.Check.isin([*"abc"]),
+            pa.Check(lambda series: series.str.len() == 1),
+        ]
+    ),
 })
 
-validated_df = schema(df)
-print(validated_df)
-
-# column1 column2 column3
-# 0 1 -1.3 value_1
-# 1 4 -1.4 value_2
-# 2 0 -2.9 value_3
-# 3 10 -10.1 value_2
-# 4 9 -20.4 value_1
+print(schema.validate(df))
+# column1 column2 column3
+# 0 1 1.1 a
+# 1 2 1.2 b
+# 2 3 1.3 c
 ```
 
-## DataFrame Model
-
-`pandera` also provides an alternative API for expressing schemas inspired
-by [dataclasses](https://docs.python.org/3/library/dataclasses.html) and
-[pydantic](https://pydantic-docs.helpmanual.io/). The equivalent `DataFrameModel`
-for the above `DataFrameSchema` would be:
-
+Or validate the data using the class-based API:
 
 ```python
-from pandera.typing import Series
-
+# define a schema
 class Schema(pa.DataFrameModel):
-
-    column1: int = pa.Field(le=10)
-    column2: float = pa.Field(lt=-1.2)
-    column3: str = pa.Field(str_startswith="value_")
+    column1: int = pa.Field(ge=0)
+    column2: float = pa.Field(lt=10)
+    column3: str = pa.Field(isin=[*"abc"])
 
     @pa.check("column3")
-    def column_3_check(cls, series: Series[str]) -> Series[bool]:
-        """Check that values have two elements after being split with '_'"""
-        return series.str.split("_", expand=True).shape[1] == 2
-
-Schema.validate(df)
-```
-
-## Development Installation
-
-```
-git clone https://github.com/pandera-dev/pandera.git
-cd pandera
-export PYTHON_VERSION=... # specify desired python version
-pip install -r dev/requirements-${PYTHON_VERSION}.txt
-pip install -e .
+    def custom_check(cls, series: pd.Series) -> pd.Series:
+        return series.str.len() == 1
+
+print(Schema.validate(df))
+# column1 column2 column3
+# 0 1 1.1 a
+# 1 2 1.2 b
+# 2 3 1.3 c
 ```
 
-## Tests
-
-```
-pip install pytest
-pytest tests
-```
-
-## Contributing to pandera [![GitHub contributors](https://img.shields.io/github/contributors/pandera-dev/pandera.svg?style=for-the-badge)](https://github.com/pandera-dev/pandera/graphs/contributors)
-
-All contributions, bug reports, bug fixes, documentation improvements,
-enhancements and ideas are welcome.
-
-A detailed overview on how to contribute can be found in the
-[contributing guide](https://github.com/pandera-dev/pandera/blob/main/.github/CONTRIBUTING.md)
-on GitHub.
-
-## Issues
-
-Go [here](https://github.com/pandera-dev/pandera/issues) to submit feature
-requests or bugfixes.
-
-## Need Help?
-
-There are many ways of getting help with your questions. You can ask a question
-on [Github Discussions](https://github.com/pandera-dev/pandera/discussions/categories/q-a)
-page or reach out to the maintainers and pandera community on
-[Discord](https://discord.gg/vyanhWuaKB)
-
-## Why `pandera`?
-
-- [dataframe-centric data types](https://pandera.readthedocs.io/en/stable/dtypes.html),
-[column nullability](https://pandera.readthedocs.io/en/stable/dataframe_schemas.html#null-values-in-columns),
-and [uniqueness](https://pandera.readthedocs.io/en/stable/dataframe_schemas.html#validating-the-joint-uniqueness-of-columns)
-are first-class concepts.
-- Define [dataframe models](https://pandera.readthedocs.io/en/stable/schema_models.html) with the class-based API with
-[pydantic](https://pydantic-docs.helpmanual.io/)-style syntax and validate dataframes using the typing syntax.
-- `check_input` and `check_output` [decorators](https://pandera.readthedocs.io/en/stable/decorators.html#decorators-for-pipeline-integration)
-enable seamless integration with existing code.
-- [`Check`s](https://pandera.readthedocs.io/en/stable/checks.html) provide flexibility and performance by providing access to `pandas`
-API by design and offers built-in checks for common data tests.
-- [`Hypothesis`](https://pandera.readthedocs.io/en/stable/hypothesis.html) class provides a tidy-first interface for statistical hypothesis
-testing.
-- `Check`s and `Hypothesis` objects support both [tidy and wide data validation](https://pandera.readthedocs.io/en/stable/checks.html#wide-checks).
-- Use schemas as generative contracts to [synthesize data](https://pandera.readthedocs.io/en/stable/data_synthesis_strategies.html) for unit testing.
-- [Schema inference](https://pandera.readthedocs.io/en/stable/schema_inference.html) allows you to bootstrap schemas from data.
-
-## How to Cite
-
-If you use `pandera` in the context of academic or industry research, please
-consider citing the **paper** and/or **software package**.
-
-### [Paper](https://conference.scipy.org/proceedings/scipy2020/niels_bantilan.html)
-
-```
-@InProceedings{ niels_bantilan-proc-scipy-2020,
-author = { {N}iels {B}antilan },
-title = { pandera: {S}tatistical {D}ata {V}alidation of {P}andas {D}ataframes },
-booktitle = { {P}roceedings of the 19th {P}ython in {S}cience {C}onference },
-pages = { 116 - 124 },
-year = { 2020 },
-editor = { {M}eghann {A}garwal and {C}hris {C}alloway and {D}illon {N}iederhut and {D}avid {S}hupe },
-doi = { 10.25080/Majora-342d178e-010 }
-}
-```
-
-### Software Package
-
-[![DOI](https://img.shields.io/badge/DOI-10.5281/zenodo.3385265-blue?style=for-the-badge)](https://doi.org/10.5281/zenodo.3385265)
-
-
-## License and Credits
+## Next steps
 
-`pandera` is licensed under the [MIT license](license.txt) and is written and
-maintained by Niels Bantilan ([email protected])
+See the [official documentation](https://pandera.readthedocs.io) to learn more.

docs/source/checks.md

Lines changed: 6 additions & 6 deletions
@@ -29,7 +29,7 @@ of boolean values. For the check to pass, all of the elements in the boolean
 series must evaluate to `True`, for example:
 
 ```{code-cell} python
-import pandera as pa
+import pandera.pandas as pa
 import pandas as pd
 
 check_lt_10 = pa.Check(lambda s: s <= 10)
@@ -54,7 +54,7 @@ schema = pa.DataFrameSchema({
 For common validation tasks, built-in checks are available in `pandera`.
 
 ```{code-cell} python
-import pandera as pa
+import pandera.pandas as pa
 
 schema = pa.DataFrameSchema({
     "small_values": pa.Column(float, pa.Check.less_than(100)),
@@ -75,7 +75,7 @@ you can provide the `element_wise=True` keyword argument:
 
 ```{code-cell} python
 import pandas as pd
-import pandera as pa
+import pandera.pandas as pa
 
 schema = pa.DataFrameSchema({
     "a": pa.Column(
@@ -140,7 +140,7 @@ fly.
 
 ```{code-cell} python
 import pandas as pd
-import pandera as pa
+import pandera.pandas as pa
 
 schema = pa.DataFrameSchema({
     "height_in_feet": pa.Column(
@@ -195,7 +195,7 @@ columns in a `DataFrame`. For example, if you want to make assertions about
 
 ```{code-cell} python
 import pandas as pd
-import pandera as pa
+import pandera.pandas as pa
 
 
 df = pd.DataFrame({
@@ -267,7 +267,7 @@ import warnings
 
 import numpy as np
 import pandas as pd
-import pandera as pa
+import pandera.pandas as pa
 
 from scipy.stats import normaltest
 

docs/source/dask.md

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@ below we'll use the {ref}`class-based API <dataframe-models>` to define a
 ```{code-cell} python
 import dask.dataframe as dd
 import pandas as pd
-import pandera as pa
+import pandera.pandas as pa
 
 from pandera.typing.dask import DataFrame, Series
 

docs/source/data_format_conversion.md

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@ type that supports this feature.
 Consider this simple example:
 
 ```{code-cell} python
-import pandera as pa
+import pandera.pandas as pa
 from pandera.typing import DataFrame, Series
 
 class InSchema(pa.DataFrameModel):

docs/source/data_synthesis_strategies.md

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@ property-based testing library.
 Once you've defined a schema, it's easy to generate examples:
 
 ```{code-cell} python
-import pandera as pa
+import pandera.pandas as pa
 
 schema = pa.DataFrameSchema(
     {
