[BUGFIX] Compare type dict column name with actual column name with casefold. … #11064

Open: wants to merge 1 commit into base: develop

@mihirraj mihirraj commented Apr 6, 2025

…If the column name is passed in all caps or snake case, the column matching fails and raises an IndexError, which is hard for users to debug.
Steps to reproduce:

`import great_expectations as gx
import pandas as pd
from great_expectations.expectations import ExpectColumnValuesToBeOfType

NAME_DATA_SOURCE = "pandas"
NAME_DATA_ASSET = "tutorial_data"
NAME_BATCH_DEF = "pandas_tutorial"
NAME_EXPECTATION_SUITE = "pandas_tutorial"
NAME_VALIDATION_DEF = "pandas_validation"
NAME_CHECKPOINT = "pandas"

# Create a small DataFrame
data = {
    "age": [25, 30, 35, 40]
}
df = pd.DataFrame(data)

context = gx.get_context()

# Register the DataFrame as a pandas data source with a whole-dataframe batch
data_source = context.data_sources.add_pandas(name=NAME_DATA_SOURCE)
data_asset = data_source.add_dataframe_asset(name=NAME_DATA_ASSET)
batch_definition = data_asset.add_batch_definition_whole_dataframe(NAME_BATCH_DEF)

expectation_suite = gx.ExpectationSuite(name="demo_expectation_suite")
expectation_suite = context.suites.add(expectation_suite)

# The DataFrame column is "age", but the expectation asks for "AGE"
expectations = ExpectColumnValuesToBeOfType(
    column="AGE",
    type_="int64",
)

expectation_suite.add_expectation(expectations)

batch_parameters = {"dataframe": df}
batch = batch_definition.get_batch(batch_parameters=batch_parameters)
validation_results = batch.validate(expectation_suite)

print(validation_results)`

  • Description of PR changes above includes a link to an existing GitHub issue
  • PR title is prefixed with one of: [BUGFIX], [FEATURE], [DOCS], [MAINTENANCE], [CONTRIB], [MINORBUMP]
  • Code is linted - run invoke lint (uses ruff format + ruff check)
  • Appropriate tests and docs have been updated

For more information about contributing, visit our community resources.

After you submit your PR, keep the page open and monitor the statuses of the various checks made by our continuous integration process at the bottom of the page. Please fix any issues that come up and reach out on Slack if you need help. Thanks for contributing!

…If the column name is passed in all caps or snake case, the column matching fails and raises an IndexError, which is hard to detect.

netlify bot commented Apr 6, 2025

‼️ Deploy request for niobium-lead-7998 rejected.

| Name | Link |
| --- | --- |
| 🔨 Latest commit | 00118fb |

@@ -566,7 +566,7 @@ def _validate(
         actual_column_type = [
             type_dict["type"]
             for type_dict in actual_column_types_list
-            if type_dict["name"] == column_name
+            if type_dict["name"].casefold() == column_name.casefold()
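
For context, here is a minimal sketch of how a case mismatch could surface as an IndexError. It assumes the filtered list is indexed at position 0 right after the comprehension; that line is not shown in the hunk, so treat it as an assumption. The column metadata is made up for illustration.

```python
# Hypothetical column metadata, shaped like the list used in the diff.
actual_column_types_list = [{"name": "age", "type": "int64"}]
column_name = "AGE"

# Exact, case-sensitive comparison: "age" != "AGE", so nothing matches.
matches = [
    type_dict["type"]
    for type_dict in actual_column_types_list
    if type_dict["name"] == column_name
]

try:
    # Assumed follow-up step: take the first (and only) match.
    actual_column_type = matches[0]
except IndexError as exc:
    # The message says nothing about which column was missing.
    error_message = str(exc)
    print(error_message)
```

The error text carries no column name, which is what makes the failure hard to trace back to a casing problem.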
Contributor

Thanks for contributing. I agree that an IndexError is onerous for a user to debug and we should make this easier. I don't think this is a viable solution though, because a table can have column names where casing matters. For example, a Postgres database may have a table, MyTable, with quoted columns of different types called "MyColumn" and "mycolumn". These column names would be identical under casefold.
Different databases may use different quoting characters for column names, e.g. ", ', `. Also, pandas column names are case sensitive, and Spark can be configured to have case-sensitive column names.
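
The collision described above can be sketched in plain Python. The column names come from the Postgres example; the types and the lookup target are hypothetical.

```python
# Two distinct columns that differ only by case (types are made up).
columns = {"MyColumn": "integer", "mycolumn": "text"}
target = "MYCOLUMN"

# A casefold comparison cannot tell the two columns apart.
matches = [name for name in columns if name.casefold() == target.casefold()]
print(matches)
```

With two matches of different types, picking the first one would silently validate against the wrong column.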

To fix this, improving the error message is probably easier than changing the behavior of the expectation.
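
One possible shape for such an error message, as a rough sketch only: `find_column_type` is a hypothetical helper, not an existing Great Expectations function, and the metadata format is assumed to match the list of dicts in the diff. Matching stays exact and case sensitive; casefold is used only to suggest near misses in the error text.

```python
def find_column_type(column_name, actual_column_types_list):
    """Hypothetical helper: exact match, with a descriptive error on a miss."""
    for type_dict in actual_column_types_list:
        if type_dict["name"] == column_name:
            return type_dict["type"]

    available = [type_dict["name"] for type_dict in actual_column_types_list]
    # Suggest columns that differ only by case, without matching on them.
    near = [name for name in available if name.casefold() == column_name.casefold()]
    hint = f" Did you mean one of {near}?" if near else ""
    raise ValueError(
        f"Column {column_name!r} not found. Available columns: {available}.{hint}"
    )


types = [{"name": "age", "type": "int64"}]
print(find_column_type("age", types))

try:
    find_column_type("AGE", types)
except ValueError as exc:
    message = str(exc)
    print(message)
```

This keeps case-sensitive backends correct while still pointing the user at the likely typo.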
