Error - Pandera Pyspark - Not able to do check for str_matches #2181

@rovertttt

Description

Describe the bug
Calling `PanderaSchema.validate` on a PySpark DataFrame whose schema uses a `str_matches` Field check fails: the check raises `KeyError(<class 'pandera.api.pyspark.types.PysparkDataframeColumnObject'>)` inside pandera's function dispatch, so `str_matches` appears to have no implementation registered for the pyspark backend.

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandera.
  • (optional) I have confirmed this bug exists on the main branch of pandera.

Code Sample, a copy-pastable example

import pandera.pyspark as pa
import pyspark.sql.types as T

from decimal import Decimal
from pyspark.sql import SparkSession
from pandera.pyspark import DataFrameModel

spark = SparkSession.builder.getOrCreate()

class PanderaSchema(DataFrameModel):
    id: T.IntegerType() = pa.Field(gt=5)
    product_name: T.StringType() = pa.Field(str_matches="^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$")
    price: T.DecimalType(20, 5) = pa.Field()

data = [
    # Construct Decimal from str: Decimal(44.4) carries float rounding
    # artifacts, while Decimal("44.4") is exact.
    (5, "fgsdgmaill.com", Decimal("44.4")),
    (15, "[email protected]", Decimal("99.0")),
]

spark_schema = T.StructType(
    [
        T.StructField("id", T.IntegerType(), False),
        T.StructField("product_name", T.StringType(), False),
        T.StructField("price", T.DecimalType(20, 5), False),
    ]
)
df = spark.createDataFrame(data, spark_schema)
df.show()

df_out = PanderaSchema.validate(check_obj=df)
df_out

import json

df_out_errors = df_out.pandera.errors
print(json.dumps(dict(df_out_errors), indent=4))

Expected behavior

Expected `validate` to produce the pyspark error report, showing that the first row's `product_name` ("fgsdgmaill.com") fails the email regex check, instead of raising a KeyError inside the check machinery.

Desktop (please complete the following information):

  • OS: MacOS
  • Browser: chrome
  • Version: 142.0.7444.176 (Official Build) (arm64)

Screenshots

 "CHECK_ERROR": [
            {
                "schema": "PanderaSchema",
                "column": "product_name",
                "check": "str_matches('^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$')",
                "error": "Error while executing check function: KeyError(\"<class 'pandera.api.pyspark.types.PysparkDataframeColumnObject'>\")Traceback (most recent call last):  File \"/usr/local/lib/python3.11/site-packages/pandera/backends/pyspark/components.py\", line 132, in run_checks    self.run_check(  File \"/usr/local/lib/python3.11/site-packages/pandera/backends/pyspark/base.py\", line 82, in run_check    check_result = check(check_obj, *args)                   ^^^^^^^^^^^^^^^^^^^^^^^  File \"/usr/local/lib/python3.11/site-packages/pandera/api/checks.py\", line 237, in __call__    return backend(check_obj, column)           ^^^^^^^^^^^^^^^^^^^^^^^^^^  File \"/usr/local/lib/python3.11/site-packages/pandera/backends/pyspark/checks.py\", line 92, in __call__    check_output = self.apply(check_obj, key, self.check._check_kwargs)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^  File \"/usr/local/lib/python3.11/site-packages/pandera/backends/pyspark/checks.py\", line 67, in apply    return self.check._check_fn(check_obj_and_col_name, **kwargs)           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^  File \"/usr/local/lib/python3.11/site-packages/pandera/api/function_dispatch.py\", line 24, in __call__    fn = self._function_registry[input_data_type]         ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^KeyError: <class 'pandera.api.pyspark.types.PysparkDataframeColumnObject'>"
            }
        ]
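The last frame suggests the failure mode: the check function is dispatched through a registry keyed by input data type, and no entry exists for `PysparkDataframeColumnObject`. A toy sketch of that dispatch pattern (class and method names are illustrative, not pandera's actual internals):

```python
class PysparkDataframeColumnObject:
    """Stand-in for pandera's pyspark column wrapper (illustrative only)."""

class FunctionDispatch:
    """Toy registry-based dispatcher mirroring the traceback's lookup."""

    def __init__(self):
        self._function_registry = {}

    def register(self, input_data_type, fn):
        self._function_registry[input_data_type] = fn

    def __call__(self, check_obj, *args, **kwargs):
        # This mirrors the failing line in the report: a plain dict lookup
        # with no fallback, so an unregistered input type raises KeyError.
        fn = self._function_registry[type(check_obj)]
        return fn(check_obj, *args, **kwargs)

dispatch = FunctionDispatch()
try:
    dispatch(PysparkDataframeColumnObject())
except KeyError as exc:
    print(f"KeyError: {exc}")
```

Which would be consistent with `str_matches` simply never having been registered for the pyspark backend's column object type.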


Labels: bug (Something isn't working)

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions