Data validation for scientists, engineers, and analysts seeking correctness.
Pandera is a Union.ai open source project that provides a flexible and expressive API for performing data validation on dataframe-like objects. The goal of Pandera is to make data processing pipelines more readable and robust with statistically typed dataframes.
Pandera supports multiple dataframe libraries, including pandas, polars, pyspark, and more. To validate pandas DataFrames, install Pandera with the pandas extra:
With pip:
pip install 'pandera[pandas]'
With uv:
uv pip install 'pandera[pandas]'
With conda:
conda install -c conda-forge pandera-pandas
First, create a dataframe:
import pandas as pd
import pandera.pandas as pa
# data to validate
df = pd.DataFrame({
    "column1": [1, 2, 3],
    "column2": [1.1, 1.2, 1.3],
    "column3": ["a", "b", "c"],
})
Validate the data using the object-based API:
# define a schema
schema = pa.DataFrameSchema({
    "column1": pa.Column(int, pa.Check.ge(0)),
    "column2": pa.Column(float, pa.Check.lt(10)),
    "column3": pa.Column(
        str,
        [
            pa.Check.isin([*"abc"]),
            pa.Check(lambda series: series.str.len() == 1),
        ]
    ),
})
print(schema.validate(df))
#    column1  column2 column3
# 0        1      1.1       a
# 1        2      1.2       b
# 2        3      1.3       c
Or validate the data using the class-based API:
# define a schema
class Schema(pa.DataFrameModel):
    column1: int = pa.Field(ge=0)
    column2: float = pa.Field(lt=10)
    column3: str = pa.Field(isin=[*"abc"])

    @pa.check("column3")
    def custom_check(cls, series: pd.Series) -> pd.Series:
        return series.str.len() == 1

print(Schema.validate(df))
#    column1  column2 column3
# 0        1      1.1       a
# 1        2      1.2       b
# 2        3      1.3       c
Warning
Pandera v0.24.0 introduces the pandera.pandas module, which is now the (highly) recommended way of defining DataFrameSchemas and DataFrameModels for pandas data structures like DataFrames. Defining a dataframe schema from the top-level pandera module will produce a FutureWarning:
import pandera as pa
schema = pa.DataFrameSchema({"col": pa.Column(str)})
Update your import to:
import pandera.pandas as pa
The rest of your pandera code should continue to work. Using the top-level pandera module to access DataFrameSchema and the other pandera classes or functions will be deprecated in a future version.
See the official documentation to learn more.