[BUGFIX] ExpectColumnDistinctValuesToEqualSet with database-pushed comparison#11616

Open
NathanFarmer wants to merge 76 commits into develop from m/gx-2374/distinct-values-equal-set

NathanFarmer (Contributor) commented Jan 23, 2026

Refactor ExpectColumnDistinctValuesToEqualSet to use database-pushdown metrics

This PR optimizes ExpectColumnDistinctValuesToEqualSet for high-cardinality columns by pushing set comparison logic to the database instead of fetching all distinct values into Python memory.

We no longer return all distinct values as the observed value: fetching every distinct value does not scale to high-cardinality columns, and the full list is rarely actionable.
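
To illustrate what pushing the set comparison into the database can look like, here is a minimal sketch using Python's built-in sqlite3 module. The table name `t`, the helper names, and the SQL text are assumptions made for illustration and are not the queries this PR generates; only the column name `my_col` and the cap of 20 (`MAX_DISTINCT_VALUES`) come from the description. The sketch assumes a non-empty `value_set` small enough to pass as bound parameters.

```python
import sqlite3

MAX_DISTINCT_VALUES = 20  # cap on returned examples, as described in this PR


def distinct_values_not_in_set(conn, value_set, limit=MAX_DISTINCT_VALUES):
    """Count distinct column values absent from value_set and return a capped sample."""
    placeholders = ", ".join("?" for _ in value_set)
    where = f"my_col IS NOT NULL AND my_col NOT IN ({placeholders})"
    params = list(value_set)
    # The database does the counting; only one integer crosses the wire.
    count = conn.execute(
        f"SELECT COUNT(DISTINCT my_col) FROM t WHERE {where}", params
    ).fetchone()[0]
    # At most `limit` example values are fetched, regardless of cardinality.
    rows = conn.execute(
        f"SELECT DISTINCT my_col FROM t WHERE {where} ORDER BY my_col LIMIT ?",
        params + [limit],
    ).fetchall()
    return count, [row[0] for row in rows]


def values_missing_from_column(conn, value_set, limit=MAX_DISTINCT_VALUES):
    """Count expected values that never occur in the column and return a capped sample."""
    placeholders = ", ".join("?" for _ in value_set)
    # Only values that are in value_set come back, so the result is bounded
    # by len(value_set), not by the column's cardinality.
    rows = conn.execute(
        f"SELECT DISTINCT my_col FROM t WHERE my_col IN ({placeholders})",
        list(value_set),
    ).fetchall()
    present = {row[0] for row in rows}
    missing = [value for value in value_set if value not in present]
    return len(missing), missing[:limit]


# Hypothetical demo data: 1,000 distinct values, an expected set of 11.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (my_col TEXT)")
conn.executemany(
    "INSERT INTO t (my_col) VALUES (?)",
    [(f"v{i:04d}",) for i in range(1000)],
)
value_set = [f"v{i:04d}" for i in range(10)] + ["not_there"]

unexpected_count, partial_unexpected_list = distinct_values_not_in_set(conn, value_set)
missing_count, partial_missing_list = values_missing_from_column(conn, value_set)
```

In this sketch the Python process never holds more than `limit` unexpected values, which is the property the description is after for high-cardinality columns.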

Changes:

  • Refactored _validate to use column.distinct_values.not_in_set.count, column.distinct_values.not_in_set, column.distinct_values.missing_from_column.count, and column.distinct_values.missing_from_column metrics
  • Updated result format to return observed_value: None with unexpected_count, partial_unexpected_list, missing_count, and partial_missing_list at the top level
  • Updated _atomic_diagnostic_observed_value renderer to display unexpected values with UNEXPECTED state and missing values with MISSING state
  • Capped partial_unexpected_list and partial_missing_list at MAX_DISTINCT_VALUES (20), regardless of the global partial_unexpected_count default
  • Updated documentation in format_results.md to add partial_missing_list field
  • Added all four new metrics to great_expectations/metrics/__init__.py public API exports
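
The result fields listed above can be sketched in plain Python as follows. This is an in-memory illustration of the semantics only: `build_result` is a hypothetical helper, the nesting under a `"result"` key is an assumption, and the PR itself computes these values through the four metrics rather than with Python sets. Sorting assumes the values are mutually comparable.

```python
MAX_DISTINCT_VALUES = 20  # cap named in the change list above


def build_result(column_values, value_set, limit=MAX_DISTINCT_VALUES):
    """Hypothetical helper showing the shape of the new result fields."""
    observed = {value for value in column_values if value is not None}
    expected = set(value_set)
    unexpected = sorted(observed - expected)  # in the column, not in value_set
    missing = sorted(expected - observed)     # in value_set, not in the column
    return {
        "success": not unexpected and not missing,
        "result": {
            "observed_value": None,  # no longer the full list of distinct values
            "unexpected_count": len(unexpected),
            "partial_unexpected_list": unexpected[:limit],
            "missing_count": len(missing),
            "partial_missing_list": missing[:limit],
        },
    }


failing = build_result(["a", "b", "c", None], ["a", "b", "d"])
passing = build_result(["a", "b", "b"], ["a", "b"])
```

Here `failing` reports `"c"` as unexpected and `"d"` as missing, while `passing` succeeds with both counts at zero.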

Manual Test

Expectation

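import great_expectations.expectations as gxe

# first_10_values and non_existent_value are defined earlier in the test script
gxe.ExpectColumnDistinctValuesToEqualSet(
    column="my_col",
    value_set=first_10_values + [non_existent_value],  # 11 expected values vs. 100K distinct values in the column
)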

Test Script Output

Generating 100,000 unique values...
Generated 100,000 rows with 100,000 unique values

Connecting to GX Cloud...
Using existing data source: test_distinct_values_datasource
Using existing data asset: test_distinct_values_asset
Using existing batch definition: test_batch
Creating fresh suite: test_distinct_values_suite

============================================================
Adding Expectations to Suite
============================================================
Added: ExpectColumnDistinctValuesToEqualSet (FAILURE expected - huge mismatch)

Saved suite with 1 expectations
Created/updated validation definition: test_distinct_values_validation
Created/updated checkpoint: test_distinct_values_checkpoint

============================================================
Running Checkpoint...
============================================================

Validating expectation directly to check payload size...

Expectation: expect_column_distinct_values_to_equal_set
Payload size: 108,594 bytes (0.10 MB)

============================================================
Running Checkpoint (this will attempt to save to GX Cloud)...
============================================================

Checkpoint completed in 13.45s
Overall success: False
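
The test script itself is not included here, so how it measured the payload is unknown. One plausible way to produce the "Payload size" line, sketched under that assumption, is to serialize the validation result to JSON and count UTF-8 bytes. The helper names are hypothetical. The logged 0.10 MB is consistent with dividing by 1024 * 1024; dividing by 1,000,000 would round to 0.11.

```python
import json


def payload_size_bytes(result: dict) -> int:
    """Size of a JSON-serializable result when encoded as UTF-8."""
    return len(json.dumps(result).encode("utf-8"))


def describe_payload(size: int) -> str:
    """Format a byte count the way the log line above is formatted."""
    return f"Payload size: {size:,} bytes ({size / (1024 * 1024):.2f} MB)"


line = describe_payload(108_594)
tiny = payload_size_bytes({"a": 1})
```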

…hed comparison

- Add ColumnDistinctValuesNotEqualSet metric to find bidirectional differences in DB
- Refactor expectation to use new metric instead of fetching all distinct values
- Add Metrics API wrapper class for new metric
- Respects result_format and partial_unexpected_count settings

netlify bot commented Jan 23, 2026

Deploy Preview for niobium-lead-7998 ready!

Name Link
🔨 Latest commit 88a713b
🔍 Latest deploy log https://app.netlify.com/projects/niobium-lead-7998/deploys/698b4cfe46fc400008439223
😎 Deploy Preview https://deploy-preview-11616.docs.greatexpectations.io


codecov bot commented Jan 23, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 84.22%. Comparing base (44454b6) to head (88a713b).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop   #11616      +/-   ##
===========================================
+ Coverage    84.19%   84.22%   +0.03%     
===========================================
  Files          471      471              
  Lines        39591    39606      +15     
===========================================
+ Hits         33332    33357      +25     
+ Misses        6259     6249      -10     
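
As a quick check of the figures in the diff above, coverage is hits divided by lines. The helper name is hypothetical; the inputs are the numbers from the report.

```python
def coverage_pct(hits: int, lines: int) -> float:
    """Line coverage as a percentage, rounded to two decimals."""
    return round(100 * hits / lines, 2)


base = coverage_pct(33332, 39591)  # develop
head = coverage_pct(33357, 39606)  # this PR
delta = round(head - base, 2)
```

This gives 84.19 for the base, 84.22 for the head, and a delta of 0.03, matching the report.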
Flag Coverage Δ
3.10 72.70% <91.83%> (+0.01%) ⬆️
3.10 athena 41.52% <38.77%> (+0.01%) ⬆️
3.10 aws_deps 45.87% <38.77%> (+0.01%) ⬆️
3.10 big 55.20% <38.77%> (+0.01%) ⬆️
3.10 bigquery 50.70% <67.34%> (+0.03%) ⬆️
3.10 clickhouse 41.53% <38.77%> (+0.01%) ⬆️
3.10 databricks 52.46% <67.34%> (+0.03%) ⬆️
3.10 filesystem 63.98% <63.26%> (+0.02%) ⬆️
3.10 gx-redshift 50.82% <67.34%> (+0.15%) ⬆️
3.10 mssql 50.94% <67.34%> (+0.15%) ⬆️
3.10 mysql 51.22% <67.34%> (+0.15%) ⬆️
3.10 openpyxl or pyarrow or project or sqlite or aws_creds 59.26% <67.34%> (+0.02%) ⬆️
3.10 postgresql 54.75% <67.34%> (+0.02%) ⬆️
3.10 snowflake 53.28% <67.34%> (+0.03%) ⬆️
3.10 spark 55.36% <67.34%> (+0.02%) ⬆️
3.10 spark_connect 46.38% <38.77%> (+0.01%) ⬆️
3.10 trino 48.19% <38.77%> (+0.01%) ⬆️
3.11 72.71% <91.83%> (+0.03%) ⬆️
3.11 athena ?
3.11 aws_deps ?
3.11 big ?
3.11 clickhouse ?
3.11 filesystem ?
3.11 mssql ?
3.11 mysql ?
3.11 openpyxl or pyarrow or project or sqlite or aws_creds ?
3.11 spark_connect ?
3.12 72.72% <91.83%> (+0.04%) ⬆️
3.12 athena ?
3.12 aws_deps ?
3.12 big ?
3.12 mysql ?
3.12 openpyxl or pyarrow or project or sqlite or aws_creds ?
3.12 spark_connect ?
3.13 72.72% <91.83%> (+0.03%) ⬆️
3.13 athena 41.52% <38.77%> (+0.01%) ⬆️
3.13 aws_deps 45.87% <38.77%> (+0.01%) ⬆️
3.13 big 55.20% <38.77%> (+0.01%) ⬆️
3.13 bigquery 50.70% <67.34%> (+0.03%) ⬆️
3.13 clickhouse 41.53% <38.77%> (+0.01%) ⬆️
3.13 databricks 52.46% <67.34%> (+0.03%) ⬆️
3.13 filesystem 63.98% <63.26%> (+0.02%) ⬆️
3.13 gx-redshift 50.82% <67.34%> (+0.15%) ⬆️
3.13 mssql 50.94% <67.34%> (+0.15%) ⬆️
3.13 mysql 51.22% <67.34%> (+0.15%) ⬆️
3.13 openpyxl or pyarrow or project or sqlite or aws_creds 59.27% <67.34%> (+0.02%) ⬆️
3.13 postgresql 54.75% <67.34%> (+0.02%) ⬆️
3.13 snowflake 53.28% <67.34%> (+0.03%) ⬆️
3.13 spark 55.37% <67.34%> (+0.02%) ⬆️
3.13 spark_connect 46.38% <38.77%> (+0.01%) ⬆️
3.13 trino 48.19% <38.77%> (+0.01%) ⬆️
cloud 0.00% <0.00%> (ø)
docs-basic 58.60% <38.77%> (+0.08%) ⬆️
docs-creds-needed 57.40% <38.77%> (+0.08%) ⬆️
docs-spark 56.71% <38.77%> (+0.08%) ⬆️


@NathanFarmer NathanFarmer changed the title from "Optimize expect_column_distinct_values_to_equal_set with database-pushed comparison" to "[MAINTENANCE] Optimize ExpectColumnDistinctValuesToEqualSet with database-pushed comparison" Jan 23, 2026
NathanFarmer and others added 22 commits January 23, 2026 13:29
- Add ValidationDependencies to TYPE_CHECKING for type annotations
- Import inside method for runtime use
- Avoids circular import while keeping unquoted type annotation
…rcion

- Always fetch column.distinct_values for type coercion and observed_value
- Coerce value_set to match column value types before comparison
- Return all distinct values in observed_value when successful (backward compatible)
- Return value_counts in details when result_format is COMPLETE
- Handle type coercion for date/string mismatches
- Fix TypeError when sorting mixed types by catching exception
…unt on failure, filter nulls

- Cast partial_unexpected_count to int() to fix type checker error
- Only include unexpected_count in result when there are violations
- Filter out null/NaN values from observed_value_set before comparison
- Add PLR0915 noqa comment for complexity
- Fixes test failures for result format, type checking, and null handling
- Check type before converting to int to satisfy type checker
- Handle case where partial_unexpected_count might not be int or str
- Test ColumnDistinctValuesNotEqualSet for all data sources
- Tests cover: sets equal, column has extra, set has extra, both have extra, completely different
- Tests verify limit parameter works correctly
…tric

- Add test_dates_equal for date objects in value_set
- Add test_dates_with_str_value_set for string dates (tests BigQuery DATE coercion)
- Uses DATA_SOURCES_THAT_SUPPORT_DATE_COMPARISONS which includes BigQuery
Metrics don't do type coercion - that's an expectation-level feature.
The test_dates_equal test with proper date objects is sufficient for metrics.
observed_value now contains only differences (values in column but not in set,
or values in set but not in column), not all distinct values. This is a breaking
change that avoids fetching all distinct values into memory for high-cardinality
columns.

Also includes details.in_column_not_in_set and details.in_set_not_in_column
when the expectation fails.
…_equal_set

- observed_value now contains only differences, not all distinct values
- details now contains in_column_not_in_set and in_set_not_in_column
- No more details.value_counts in results
- String values are no longer auto-coerced to match date columns
@NathanFarmer NathanFarmer self-assigned this Jan 28, 2026
@NathanFarmer NathanFarmer changed the title from "[MAINTENANCE] Optimize ExpectColumnDistinctValuesToEqualSet with database-pushed comparison" to "[BUGFIX] ExpectColumnDistinctValuesToEqualSet with database-pushed comparison" Feb 4, 2026
@wookasz wookasz disabled auto-merge February 4, 2026 20:13
wookasz (Contributor) left a comment

Blocking as a breaking change

@NathanFarmer NathanFarmer removed their assignment Feb 10, 2026