[BUGFIX] `ExpectColumnDistinctValuesToEqualSet` with database-pushed comparison by NathanFarmer · Pull Request #11616 · great-expectations/great_expectations

NathanFarmer · 2026-01-23T20:21:05Z

Refactor ExpectColumnDistinctValuesToEqualSet to use database-pushdown metrics

This PR optimizes ExpectColumnDistinctValuesToEqualSet for high-cardinality columns by pushing set comparison logic to the database instead of fetching all distinct values into Python memory.

We no longer return an observed value of all distinct values, because this was bad practice and is not particularly actionable.

Changes:

- Refactored `_validate` to use the `column.distinct_values.not_in_set.count`, `column.distinct_values.not_in_set`, `column.distinct_values.missing_from_column.count`, and `column.distinct_values.missing_from_column` metrics
- Updated result format to return `observed_value: None` with `unexpected_count`, `partial_unexpected_list`, `missing_count`, and `partial_missing_list` at the top level
- Updated `_atomic_diagnostic_observed_value` renderer to display unexpected values with UNEXPECTED state and missing values with MISSING state
- Always uses `MAX_DISTINCT_VALUES` (20) to limit `partial_unexpected_list` and `partial_missing_list` size, regardless of the global `partial_unexpected_count` default
- Updated documentation in `format_results.md` to add the `partial_missing_list` field
- Added all four new metrics to the `great_expectations/metrics/__init__.py` public API exports
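
The pushdown idea in the list above can be illustrated outside of GX. The following is a minimal sketch, assuming a SQLite database and a hypothetical helper `compare_distinct_values`; it is not the PR's implementation and does not use the GX metrics API. It only shows how both directions of the set comparison, and the capped partial lists, can be computed in SQL so that the full list of distinct values never reaches Python. The result keys mirror the field names in the description; the actual GX result object may be structured differently.

```python
import sqlite3

MAX_DISTINCT_VALUES = 20  # cap named in the PR description


def compare_distinct_values(conn, table, column, value_set, limit=MAX_DISTINCT_VALUES):
    """Hypothetical helper: compare a column's distinct values with value_set in SQL.

    Assumes `table` and `column` are trusted identifiers and that value_set
    contains no None values.
    """
    cur = conn.cursor()
    # Ship the (small) expected set to the database instead of pulling the
    # (large) set of distinct column values into Python.
    cur.execute("DROP TABLE IF EXISTS temp.expected_values")
    cur.execute("CREATE TEMP TABLE expected_values (v)")
    cur.executemany("INSERT INTO expected_values VALUES (?)", [(v,) for v in value_set])

    # Values present in the column but absent from the expected set.
    not_in_set = (
        f'SELECT DISTINCT "{column}" FROM "{table}" '
        f'WHERE "{column}" IS NOT NULL '
        f'AND "{column}" NOT IN (SELECT v FROM expected_values)'
    )
    # Values present in the expected set but absent from the column.
    missing = (
        f"SELECT v FROM expected_values WHERE v NOT IN "
        f'(SELECT "{column}" FROM "{table}" WHERE "{column}" IS NOT NULL)'
    )

    unexpected_count = cur.execute(f"SELECT COUNT(*) FROM ({not_in_set})").fetchone()[0]
    missing_count = cur.execute(f"SELECT COUNT(*) FROM ({missing})").fetchone()[0]
    # Only a capped sample of each difference is fetched.
    partial_unexpected = [
        row[0] for row in cur.execute(f"{not_in_set} ORDER BY 1 LIMIT ?", (limit,))
    ]
    partial_missing = [
        row[0] for row in cur.execute(f"{missing} ORDER BY 1 LIMIT ?", (limit,))
    ]

    return {
        "success": unexpected_count == 0 and missing_count == 0,
        "result": {
            "observed_value": None,
            "unexpected_count": unexpected_count,
            "partial_unexpected_list": partial_unexpected,
            "missing_count": missing_count,
            "partial_missing_list": partial_missing,
        },
    }


# Small demo: 1,000 unique values in the column, 11 values in the expected set.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (my_col TEXT)")
conn.executemany("INSERT INTO demo VALUES (?)", [(f"v{i}",) for i in range(1000)])
value_set = [f"v{i}" for i in range(10)] + ["not_there"]
outcome = compare_distinct_values(conn, "demo", "my_col", value_set)
print(outcome["success"], outcome["result"]["unexpected_count"], outcome["result"]["missing_count"])
```

In this sketch the amount of data returned to Python is bounded by `limit` in each direction, whatever the column's cardinality.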

Manual Test

Expectation

gxe.ExpectColumnDistinctValuesToEqualSet(
    column="my_col",
    value_set=first_10_values + [non_existent_value],  # 11 values vs 100K
)

Test Script Output

Generating 100,000 unique values...
Generated 100,000 rows with 100,000 unique values

Connecting to GX Cloud...
Using existing data source: test_distinct_values_datasource
Using existing data asset: test_distinct_values_asset
Using existing batch definition: test_batch
Creating fresh suite: test_distinct_values_suite

============================================================
Adding Expectations to Suite
============================================================
Added: ExpectColumnDistinctValuesToEqualSet (FAILURE expected - huge mismatch)

Saved suite with 1 expectations
Created/updated validation definition: test_distinct_values_validation
Created/updated checkpoint: test_distinct_values_checkpoint

============================================================
Running Checkpoint...
============================================================

Validating expectation directly to check payload size...

Expectation: expect_column_distinct_values_to_equal_set
Payload size: 108,594 bytes (0.10 MB)

============================================================
Running Checkpoint (this will attempt to save to GX Cloud)...
============================================================

Checkpoint completed in 13.45s
Overall success: False
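
The test script itself is not included in the PR, so how it measured the payload is unknown. One plausible way to produce a line like `Payload size: 108,594 bytes (0.10 MB)` is to serialize the result to JSON and count the UTF-8 bytes. The sketch below assumes that approach; `payload_size` and the sample dictionary are hypothetical. Note that 108,594 bytes rounds to 0.10 MB only when 1 MB is taken as 1024 * 1024 bytes.

```python
import json


def payload_size(result: dict) -> tuple[int, float]:
    """Return the size of the JSON-serialized result in bytes and in MB (1024 * 1024 bytes)."""
    encoded = json.dumps(result, default=str).encode("utf-8")
    n_bytes = len(encoded)
    return n_bytes, n_bytes / (1024 * 1024)


# Hypothetical result payload, using the field names from the PR description.
sample = {
    "observed_value": None,
    "unexpected_count": 99_990,
    "partial_unexpected_list": [f"value_{i}" for i in range(20)],
    "missing_count": 1,
    "partial_missing_list": ["non_existent_value"],
}
size_bytes, size_mb = payload_size(sample)
print(f"Payload size: {size_bytes:,} bytes ({size_mb:.2f} MB)")
```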

…hed comparison - Add ColumnDistinctValuesNotEqualSet metric to find bidirectional differences in DB - Refactor expectation to use new metric instead of fetching all distinct values - Add Metrics API wrapper class for new metric - Respects result_format and partial_unexpected_count settings

netlify · 2026-01-23T20:21:55Z

✅ Deploy Preview for niobium-lead-7998 ready!

Name	Link
🔨 Latest commit	`88a713b`
🔍 Latest deploy log	https://app.netlify.com/projects/niobium-lead-7998/deploys/698b4cfe46fc400008439223
😎 Deploy Preview	https://deploy-preview-11616.docs.greatexpectations.io

codecov · 2026-01-23T20:22:24Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 84.22%. Comparing base (44454b6) to head (88a713b).
✅ All tests successful. No failed tests found.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop   #11616      +/-   ##
===========================================
+ Coverage    84.19%   84.22%   +0.03%     
===========================================
  Files          471      471              
  Lines        39591    39606      +15     
===========================================
+ Hits         33332    33357      +25     
+ Misses        6259     6249      -10

Flag	Coverage Δ
3.10	`72.70% <91.83%> (+0.01%)`	⬆️
3.10 athena	`41.52% <38.77%> (+0.01%)`	⬆️
3.10 aws_deps	`45.87% <38.77%> (+0.01%)`	⬆️
3.10 big	`55.20% <38.77%> (+0.01%)`	⬆️
3.10 bigquery	`50.70% <67.34%> (+0.03%)`	⬆️
3.10 clickhouse	`41.53% <38.77%> (+0.01%)`	⬆️
3.10 databricks	`52.46% <67.34%> (+0.03%)`	⬆️
3.10 filesystem	`63.98% <63.26%> (+0.02%)`	⬆️
3.10 gx-redshift	`50.82% <67.34%> (+0.15%)`	⬆️
3.10 mssql	`50.94% <67.34%> (+0.15%)`	⬆️
3.10 mysql	`51.22% <67.34%> (+0.15%)`	⬆️
3.10 openpyxl or pyarrow or project or sqlite or aws_creds	`59.26% <67.34%> (+0.02%)`	⬆️
3.10 postgresql	`54.75% <67.34%> (+0.02%)`	⬆️
3.10 snowflake	`53.28% <67.34%> (+0.03%)`	⬆️
3.10 spark	`55.36% <67.34%> (+0.02%)`	⬆️
3.10 spark_connect	`46.38% <38.77%> (+0.01%)`	⬆️
3.10 trino	`48.19% <38.77%> (+0.01%)`	⬆️
3.11	`72.71% <91.83%> (+0.03%)`	⬆️
3.11 athena	`?`
3.11 aws_deps	`?`
3.11 big	`?`
3.11 clickhouse	`?`
3.11 filesystem	`?`
3.11 mssql	`?`
3.11 mysql	`?`
3.11 openpyxl or pyarrow or project or sqlite or aws_creds	`?`
3.11 spark_connect	`?`
3.12	`72.72% <91.83%> (+0.04%)`	⬆️
3.12 athena	`?`
3.12 aws_deps	`?`
3.12 big	`?`
3.12 mysql	`?`
3.12 openpyxl or pyarrow or project or sqlite or aws_creds	`?`
3.12 spark_connect	`?`
3.13	`72.72% <91.83%> (+0.03%)`	⬆️
3.13 athena	`41.52% <38.77%> (+0.01%)`	⬆️
3.13 aws_deps	`45.87% <38.77%> (+0.01%)`	⬆️
3.13 big	`55.20% <38.77%> (+0.01%)`	⬆️
3.13 bigquery	`50.70% <67.34%> (+0.03%)`	⬆️
3.13 clickhouse	`41.53% <38.77%> (+0.01%)`	⬆️
3.13 databricks	`52.46% <67.34%> (+0.03%)`	⬆️
3.13 filesystem	`63.98% <63.26%> (+0.02%)`	⬆️
3.13 gx-redshift	`50.82% <67.34%> (+0.15%)`	⬆️
3.13 mssql	`50.94% <67.34%> (+0.15%)`	⬆️
3.13 mysql	`51.22% <67.34%> (+0.15%)`	⬆️
3.13 openpyxl or pyarrow or project or sqlite or aws_creds	`59.27% <67.34%> (+0.02%)`	⬆️
3.13 postgresql	`54.75% <67.34%> (+0.02%)`	⬆️
3.13 snowflake	`53.28% <67.34%> (+0.03%)`	⬆️
3.13 spark	`55.37% <67.34%> (+0.02%)`	⬆️
3.13 spark_connect	`46.38% <38.77%> (+0.01%)`	⬆️
3.13 trino	`48.19% <38.77%> (+0.01%)`	⬆️
cloud	`0.00% <0.00%> (ø)`
docs-basic	`58.60% <38.77%> (+0.08%)`	⬆️
docs-creds-needed	`57.40% <38.77%> (+0.08%)`	⬆️
docs-spark	`56.71% <38.77%> (+0.08%)`	⬆️


…rcion - Always fetch column.distinct_values for type coercion and observed_value - Coerce value_set to match column value types before comparison - Return all distinct values in observed_value when successful (backward compatible) - Return value_counts in details when result_format is COMPLETE - Handle type coercion for date/string mismatches - Fix TypeError when sorting mixed types by catching exception

…unt on failure, filter nulls - Cast partial_unexpected_count to int() to fix type checker error - Only include unexpected_count in result when there are violations - Filter out null/NaN values from observed_value_set before comparison - Add PLR0915 noqa comment for complexity - Fixes test failures for result format, type checking, and null handling

…tric - Add test_dates_equal for date objects in value_set - Add test_dates_with_str_value_set for string dates (tests BigQuery DATE coercion) - Uses DATA_SOURCES_THAT_SUPPORT_DATE_COMPARISONS which includes BigQuery

observed_value now contains only differences (values in column but not in set, or values in set but not in column), not all distinct values. This is a breaking change that avoids fetching all distinct values into memory for high-cardinality columns. Also includes details.in_column_not_in_set and details.in_set_not_in_column when the expectation fails.

…_equal_set - observed_value now contains only differences, not all distinct values - details now contains in_column_not_in_set and in_set_not_in_column - No more details.value_counts in results - String values are no longer auto-coerced to match date columns

docs/docusaurus/docs/cloud/validations/format_results.md

wookasz

Blocking as a breaking change

NathanFarmer changed the title ~~Optimize expect_column_distinct_values_to_equal_set with database-pushed comparison~~ [MAINTENANCE] Optimize ExpectColumnDistinctValuesToEqualSet with database-pushed comparison Jan 23, 2026

NathanFarmer and others added 22 commits January 23, 2026 13:29

Fix circular import: move ValidationDependencies to TYPE_CHECKING block

28482f9

- Add ValidationDependencies to TYPE_CHECKING for type annotations - Import inside method for runtime use - Avoids circular import while keeping unquoted type annotation

Fix type error: handle partial_unexpected_count type safely

65dbe17

- Check type before converting to int to satisfy type checker - Handle case where partial_unexpected_count might not be int or str

Add integration tests for column.distinct_values.not_equal_set metric

3a4eeb9

- Test ColumnDistinctValuesNotEqualSet for all data sources - Tests cover: sets equal, column has extra, set has extra, both have extra, completely different - Tests verify limit parameter works correctly

Remove test_dates_with_str_value_set from metric tests

a3f653c

Metrics don't do type coercion - that's an expectation-level feature. The test_dates_equal test with proper date objects is sufficient for metrics.

Add result format integration tests for ExpectColumnDistinctValuesToE…

de6d219

…qualSet

Merge branch 'develop' into m/gx-2374/distinct-values-equal-set

08843f7

Fix type errors: use result.result instead of to_json_dict

c150c43

Fix value_counts comparison: use to_json_dict for proper serialization

8645225

Fix type errors: compare full result dict instead of nested access

1ab245a

Merge branch 'develop' into m/gx-2374/distinct-values-equal-set

88d0001

Remove unnecessary fallback to column.value_counts metric

62cf62c

[pre-commit.ci] auto fixes from pre-commit.com hooks

1857d71

for more information, see https://pre-commit.ci

Trigger build

6abfb6c

Change result format: observed_value=None, violations in partial_unex…

92350b3

…pected_list

Restore renderer to show unexpected and missing values from details

114216f

Revert expectation to original column.value_counts implementation

2737215

Revert test expectations to match original behavior

4dfdf4f

NathanFarmer self-assigned this Jan 28, 2026

NathanFarmer and others added 3 commits January 28, 2026 16:15

Merge branch 'develop' into m/gx-2374/distinct-values-equal-set

8a42013

Merge branch 'develop' into m/gx-2374/distinct-values-equal-set

e0eb783

Limit observed_value to 1000 values to prevent 413 payload errors

c199bc8

NathanFarmer and others added 14 commits February 3, 2026 11:54

fix: handle None limit in Spark/SQL implementations to prevent py4j e…

d32e57d

…rror

chore: remove unused distinct_values_not_equal_set metric

1ad3779

feat: add missing_count/partial_missing_list and unexpected_count/par…

93dcea6

…tial_unexpected_list to top-level result for equal_set

fix: update integration test result format for equal_set

6031ab5

docs: add partial_missing_list to result format documentation

5cc55fb

docs: clarify missing_count and partial_missing_list for distinct val…

cce2d28

…ues Expectations

test: add result_format unit tests for equal_set

f532044

Merge branch 'develop' into m/gx-2374/distinct-values-equal-set

b6fc1bd

feat: respect partial_unexpected_count setting for partial_missing_list

f1ea710

feat: use MAX_DISTINCT_VALUES (500) for partial lists limit in equal_set

f5100b9

fix: use default partial_unexpected_count of 20, not 500

80967ab

fix: change metric default limit to MAX_DISTINCT_VALUES (500) and rem…

beef742

…ove redundant None handling

feat: default partial_unexpected_count to MAX_DISTINCT_VALUES (500) f…

ecd978b

…or distinct values Expectations

docs: update partial_unexpected_count default to 500 for distinct val…

8df73c4

…ues Expectations

NathanFarmer changed the title ~~[MAINTENANCE] Optimize ExpectColumnDistinctValuesToEqualSet with database-pushed comparison~~ [BUGFIX] ExpectColumnDistinctValuesToEqualSet with database-pushed comparison Feb 4, 2026

NathanFarmer added 6 commits February 4, 2026 09:08

fix: add all four new metrics to public API exports

ba16e60

fix: reduce MAX_DISTINCT_VALUES from 500 to 200 to prevent 413 errors

8b927d9

docs: fix partial_missing_list default to 200 for distinct values Exp…

2c88c60

…ectations

fix: actually slice partial lists to partial_unexpected_count

a840e11

fix: always use MAX_DISTINCT_VALUES for distinct values expectations

7ac3102

fix: change MAX_DISTINCT_VALUES back to 500

81d81c4

wookasz reviewed Feb 4, 2026

docs/docusaurus/docs/cloud/validations/format_results.md

wookasz approved these changes Feb 4, 2026

NathanFarmer added 2 commits February 4, 2026 11:43

Change MAX_DISTINCT_VALUES from 500 to 20

42163a7

Fix redundant '20 or 20' in docs to just '20'

f6fe7e4

NathanFarmer enabled auto-merge February 4, 2026 18:48

wookasz disabled auto-merge February 4, 2026 20:13

wookasz requested changes Feb 4, 2026

Merge branch 'develop' into m/gx-2374/distinct-values-equal-set

88a713b

NathanFarmer removed their assignment Feb 10, 2026

NathanFarmer wants to merge 76 commits into develop from m/gx-2374/distinct-values-equal-set
