Description
Describe the bug
The ColumnValuesValueLength metric provider behaves inconsistently across execution engines for the strict_min and strict_max parameters. The Pandas execution engine correctly implements strict (exclusive) bounds, but the Spark execution engine ignores both parameters and treats every comparison as inclusive.
To Reproduce
# This should exclude values with length exactly 2 and 4 (strict bounds)
expectation_config = {
    "expectation_type": "expect_column_value_lengths_to_be_between",
    "kwargs": {
        "column": "text_col",
        "min_value": 2,
        "max_value": 4,
        "strict_min": True,
        "strict_max": True
    }
}
# With Pandas: correctly excludes length 2 and 4, only length 3 passes
# With Spark (before fix): incorrectly includes length 2 and 4 (ignores strict parameters)
Original buggy Spark implementation:
Lines 241 to 248 in b1c632f
if min_value is not None and max_value is not None:
    return (column_lengths >= min_value) & (column_lengths <= max_value)
elif min_value is None and max_value is not None:
    return column_lengths <= max_value
elif min_value is not None and max_value is None:
    return column_lengths >= min_value
Expected behavior
All execution engines (Pandas, Spark, SQLAlchemy) should handle strict_min and strict_max parameters consistently:
- When strict_min=True: use > instead of >= for the minimum bound
- When strict_max=True: use < instead of <= for the maximum bound
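To make the expected semantics concrete, here is a minimal, engine-independent sketch in plain Python. The helper name length_in_bounds and the sample values are hypothetical and are not part of Great Expectations; the sketch only illustrates which string lengths should pass under strict and inclusive bounds.

```python
def length_in_bounds(value, min_value=None, max_value=None,
                     strict_min=False, strict_max=False):
    """Return True if len(value) satisfies the configured bounds.

    Hypothetical helper for illustration only; not a Great Expectations API.
    """
    n = len(value)
    if min_value is not None:
        # Strict lower bound rejects n == min_value as well as n < min_value.
        if n < min_value or (strict_min and n == min_value):
            return False
    if max_value is not None:
        # Strict upper bound rejects n == max_value as well as n > max_value.
        if n > max_value or (strict_max and n == max_value):
            return False
    return True


values = ["a", "ab", "abc", "abcd", "abcde"]

strict = [v for v in values
          if length_in_bounds(v, 2, 4, strict_min=True, strict_max=True)]
inclusive = [v for v in values if length_in_bounds(v, 2, 4)]

print(strict)     # ['abc']
print(inclusive)  # ['ab', 'abc', 'abcd']
```

Every engine should produce the first result for the configuration in the reproduction above; the buggy Spark path produces the second.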
Proposed Fix:
I have implemented a fix that adds proper strict bounds logic to the Spark engine, matching the behavior of the Pandas engine:
# Fixed implementation
if min_value is None:
    if strict_max:
        return column_lengths < max_value
    else:
        return column_lengths <= max_value
elif max_value is None:
    if strict_min:
        return column_lengths > min_value
    else:
        return column_lengths >= min_value
else:
    if strict_min and strict_max:
        return (column_lengths > min_value) & (column_lengths < max_value)
    elif strict_min:
        return (column_lengths > min_value) & (column_lengths <= max_value)
    elif strict_max:
        return (column_lengths >= min_value) & (column_lengths < max_value)
    else:
        return (column_lengths >= min_value) & (column_lengths <= max_value)
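As a possible alternative to the nested branches, the same logic could be expressed by picking one comparison operator per bound. This is only a sketch: the function name build_length_condition is hypothetical, and the ValueError for the case where both bounds are None is an assumption, since the snippet above does not show how that case is handled. It relies only on column_lengths supporting >, >=, <, <= and &, which holds for plain integers and should also hold for Spark Column expressions.

```python
import operator
from functools import reduce


def build_length_condition(column_lengths, min_value=None, max_value=None,
                           strict_min=False, strict_max=False):
    """Combine lower/upper bound checks, honoring strict_min and strict_max.

    Hypothetical helper for illustration; not the actual Great Expectations code.
    """
    if min_value is None and max_value is None:
        # Assumption: this case is treated as an error.
        raise ValueError("min_value and max_value cannot both be None")

    conditions = []
    if min_value is not None:
        lower = operator.gt if strict_min else operator.ge
        conditions.append(lower(column_lengths, min_value))
    if max_value is not None:
        upper = operator.lt if strict_max else operator.le
        conditions.append(upper(column_lengths, max_value))

    # With a single bound this returns that condition unchanged;
    # with two bounds it returns cond_min & cond_max.
    return reduce(operator.and_, conditions)


# Quick check with plain integers standing in for column lengths.
for length in (2, 3, 4):
    print(length, build_length_condition(length, 2, 4, strict_min=True, strict_max=True))
```

This keeps one code path per bound, so the four combinations of strict_min and strict_max do not need to be enumerated by hand.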
Environment (please complete the following information):
- Operating System: macOS
- Great Expectations Version: 1.5.10
- Data Source: Spark
- Cloud environment: Local development
Additional context