
Spark engine ignores strict_min and strict_max parameters in expect_column_value_lengths_to_be_between #11393

@ccccchoyu


Describe the bug
The ColumnValuesValueLength metric provider behaves inconsistently across execution engines for the strict_min and strict_max parameters. The Pandas execution engine implements strict (exclusive) bounds correctly, but the Spark execution engine ignores both parameters and treats every comparison as inclusive.

To Reproduce

# This should exclude values with length exactly 2 and 4 (strict bounds)
expectation_config = {
    "expectation_type": "expect_column_value_lengths_to_be_between",
    "kwargs": {
        "column": "text_col",
        "min_value": 2,
        "max_value": 4,
        "strict_min": True,
        "strict_max": True
    }
}

# With Pandas: correctly excludes length 2 and 4, only length 3 passes
# With Spark (before fix): incorrectly includes length 2 and 4 (ignores strict parameters)
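
To make the expected outcome concrete, here is a minimal pure-Python sketch that applies the same bounds to a few sample strings. It does not use Great Expectations; the `passes` helper and the sample values are hypothetical and only illustrate the intended semantics.

```python
# Hypothetical helper: mirrors the intended semantics, not Great Expectations code.
def passes(value, min_value, max_value, strict_min=False, strict_max=False):
    n = len(value)
    # Strict bounds exclude the boundary length itself.
    above_min = n > min_value if strict_min else n >= min_value
    below_max = n < max_value if strict_max else n <= max_value
    return above_min and below_max


samples = ["a", "ab", "abc", "abcd", "abcde"]  # lengths 1 through 5

# Strict bounds (what Pandas does): only length 3 passes.
strict = [s for s in samples if passes(s, 2, 4, strict_min=True, strict_max=True)]

# Inclusive bounds (what Spark effectively does when it ignores the flags).
inclusive = [s for s in samples if passes(s, 2, 4)]

print(strict)     # ['abc']
print(inclusive)  # ['ab', 'abc', 'abcd']
```

The difference between the two lists is exactly the set of boundary-length values that the Spark engine wrongly accepts.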

Original buggy Spark implementation:

Expected behavior
All execution engines (Pandas, Spark, SQLAlchemy) should handle strict_min and strict_max parameters consistently:

  • When strict_min=True: use > instead of >= for the minimum bound
  • When strict_max=True: use < instead of <= for the maximum bound

Proposed Fix:
I have implemented a fix that adds proper strict bounds logic to the Spark engine, matching the behavior of the Pandas engine:

# Fixed implementation
if min_value is None:
    if strict_max:
        return column_lengths < max_value
    else:
        return column_lengths <= max_value

elif max_value is None:
    if strict_min:
        return column_lengths > min_value
    else:
        return column_lengths >= min_value

else:
    if strict_min and strict_max:
        return (column_lengths > min_value) & (column_lengths < max_value)
    elif strict_min:
        return (column_lengths > min_value) & (column_lengths <= max_value)
    elif strict_max:
        return (column_lengths >= min_value) & (column_lengths < max_value)
    else:
        return (column_lengths >= min_value) & (column_lengths <= max_value)
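
The branches above could also be collapsed by picking each comparison operator once. The following is only a sketch of that alternative, not the code in the proposed fix: `length_condition` is a hypothetical name, it assumes `column_lengths` supports the comparison operators and `&` (as pandas Series and Spark Column objects do), and raising when both bounds are missing is an assumption of this sketch.

```python
import operator


def length_condition(column_lengths, min_value=None, max_value=None,
                     strict_min=False, strict_max=False):
    """Build a bounds condition, honoring strict_min and strict_max."""
    # Assumption of this sketch: at least one bound must be supplied.
    if min_value is None and max_value is None:
        raise ValueError("min_value and max_value cannot both be None")

    conditions = []
    if min_value is not None:
        lower = operator.gt if strict_min else operator.ge
        conditions.append(lower(column_lengths, min_value))
    if max_value is not None:
        upper = operator.lt if strict_max else operator.le
        conditions.append(upper(column_lengths, max_value))

    # Combine with & so the same code works for element-wise column objects.
    result = conditions[0]
    for cond in conditions[1:]:
        result = result & cond
    return result


# Plain integers stand in for a column of lengths here.
print(length_condition(3, 2, 4, strict_min=True, strict_max=True))  # True
print(length_condition(2, 2, 4, strict_min=True, strict_max=True))  # False
```

Whether the compact form is preferable is a style question; the explicit branches are easier to compare line by line against the Pandas implementation.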

Environment (please complete the following information):

  • Operating System: macOS
  • Great Expectations Version: 1.5.10
  • Data Source: Spark
  • Cloud environment: Local development

Additional context
