[BUGFIX] `ExpectColumnDistinctValuesToContainSet` with database-pushed comparison by NathanFarmer · Pull Request #11615 · great-expectations/great_expectations

NathanFarmer · 2026-01-23T20:21:03Z

Refactor ExpectColumnDistinctValuesToContainSet to use database-pushdown metrics

This PR optimizes ExpectColumnDistinctValuesToContainSet for high-cardinality columns by pushing set comparison logic to the database instead of fetching all distinct values into Python memory.

We no longer return all distinct values as the observed value: for high-cardinality columns that list can be very large, and it is not particularly actionable.
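To illustrate the pushdown idea, here is a minimal sketch using Python's built-in `sqlite3` module. It is not the SQL that GX generates: the table name, the column name, and the `find_missing` helper are all hypothetical. The point is only that the database returns the expected values absent from the column, plus a count, so Python never materializes the full set of distinct values.

```python
import sqlite3


def find_missing(conn, table, column, value_set, limit=20):
    """Return (missing_count, partial_missing_list), computed inside the database.

    Hypothetical helper; `table` and `column` are assumed to be trusted identifiers.
    """
    values = list(dict.fromkeys(value_set))  # de-duplicate while keeping order
    if not values:
        return 0, []
    # Load the expected values into a temp table so the anti-join runs in SQL.
    conn.execute("DROP TABLE IF EXISTS temp.expected_values")
    conn.execute("CREATE TEMP TABLE expected_values (v)")
    conn.executemany("INSERT INTO expected_values (v) VALUES (?)", [(v,) for v in values])
    anti_join = (
        "FROM expected_values e "
        f"WHERE NOT EXISTS (SELECT 1 FROM {table} t WHERE t.{column} = e.v)"
    )
    missing_count = conn.execute(f"SELECT COUNT(*) {anti_join}").fetchone()[0]
    rows = conn.execute(f"SELECT e.v {anti_join} ORDER BY e.v LIMIT ?", (limit,)).fetchall()
    return missing_count, [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (my_col TEXT)")
conn.executemany("INSERT INTO my_table VALUES (?)", [(f"value_{i}",) for i in range(1000)])

count, partial = find_missing(
    conn, "my_table", "my_col", ["value_1", "value_2", "fake_a", "fake_b"], limit=1
)
print(count, partial)  # prints: 2 ['fake_a']
```

Only the count and at most `limit` missing values cross the database boundary, regardless of how many distinct values the column holds.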

Changes:

- Refactored _validate to use column.distinct_values.missing_from_column.count and column.distinct_values.missing_from_column metrics
- Updated result format to return observed_value: None with missing_count and partial_missing_list at the top level
- Updated _atomic_diagnostic_observed_value renderer to display missing values with MISSING state
- Always uses MAX_DISTINCT_VALUES (20) to limit partial_missing_list size, regardless of the global partial_unexpected_count default
- Updated documentation in format_results.md to add partial_missing_list field
- Added all four new metrics to great_expectations/metrics/__init__.py public API exports
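As a rough sketch of the result shape the list above describes, the snippet below builds a dictionary with `observed_value` set to `None` and the missing values capped at `MAX_DISTINCT_VALUES`. The `build_result` helper is hypothetical, and it assumes "top level" means the top level of the `result` dictionary; the actual GX code may be structured differently.

```python
MAX_DISTINCT_VALUES = 20  # value stated in the PR description


def build_result(missing_values, missing_count):
    """Hypothetical helper mirroring the described result shape."""
    return {
        "success": missing_count == 0,
        "result": {
            "observed_value": None,  # no longer the full list of distinct values
            "missing_count": missing_count,
            # Always capped at MAX_DISTINCT_VALUES, per the PR description.
            "partial_missing_list": list(missing_values)[:MAX_DISTINCT_VALUES],
        },
    }


result = build_result([f"fake_{i}" for i in range(600)], missing_count=600)
print(result["success"])                              # prints: False
print(result["result"]["missing_count"])              # prints: 600
print(len(result["result"]["partial_missing_list"]))  # prints: 20
```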

Manual Test

Expectation

gxe.ExpectColumnDistinctValuesToContainSet(
    column="my_col",
    value_set=first_10_values + fake_values,  # 610 values, 600 missing
)

Test Script Output

Generating 100,000 unique values...
Generated 100,000 rows with 100,000 unique values

Connecting to GX Cloud...
Using existing data source: test_distinct_values_datasource
Using existing data asset: test_distinct_values_asset
Using existing batch definition: test_batch
Creating fresh suite: test_distinct_values_suite

============================================================
Adding Expectations to Suite
============================================================
Added: ExpectColumnDistinctValuesToContainSet (FAILURE expected - 600 missing)

Saved suite with 1 expectations
Created/updated validation definition: test_distinct_values_validation
Created/updated checkpoint: test_distinct_values_checkpoint

============================================================
Running Checkpoint...
============================================================

Validating expectation directly to check payload size...

Expectation: expect_column_distinct_values_to_contain_set
Payload size: 140,723 bytes (0.13 MB)

============================================================
Running Checkpoint (this will attempt to save to GX Cloud)...
============================================================

Checkpoint completed in 19.97s
Overall success: False

…ushed comparison
- Add ColumnDistinctValuesMissingFromSet metric to find missing values in DB
- Refactor expectation to use new metric instead of fetching all distinct values
- Add Metrics API wrapper class for new metric
- Respects result_format and partial_unexpected_count settings

netlify · 2026-01-23T20:21:41Z

✅ Deploy Preview for niobium-lead-7998 ready!

Name	Link
🔨 Latest commit	`44909a7`
🔍 Latest deploy log	https://app.netlify.com/projects/niobium-lead-7998/deploys/698b4d08b98ca4000843755d
😎 Deploy Preview	https://deploy-preview-11615.docs.greatexpectations.io

codecov · 2026-01-23T20:22:36Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 84.21%. Comparing base (44454b6) to head (44909a7).
✅ All tests successful. No failed tests found.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop   #11615      +/-   ##
===========================================
+ Coverage    84.19%   84.21%   +0.02%     
===========================================
  Files          471      471              
  Lines        39591    39591              
===========================================
+ Hits         33332    33343      +11     
+ Misses        6259     6248      -11

Flag	Coverage Δ
3.10	`72.70% <88.88%> (+0.02%)`	⬆️
3.10 athena	`41.54% <52.77%> (+0.03%)`	⬆️
3.10 aws_deps	`45.89% <52.77%> (+0.03%)`	⬆️
3.10 big	`55.22% <52.77%> (+0.03%)`	⬆️
3.10 bigquery	`50.70% <83.33%> (+0.03%)`	⬆️
3.10 clickhouse	`41.55% <52.77%> (+0.03%)`	⬆️
3.10 databricks	`52.46% <83.33%> (+0.03%)`	⬆️
3.10 filesystem	`63.99% <77.77%> (+0.03%)`	⬆️
3.10 gx-redshift	`50.78% <80.55%> (+0.11%)`	⬆️
3.10 mssql	`50.90% <80.55%> (+0.11%)`	⬆️
3.10 mysql	`51.18% <80.55%> (+0.11%)`	⬆️
3.10 openpyxl or pyarrow or project or sqlite or aws_creds	`59.27% <83.33%> (+0.03%)`	⬆️
3.10 postgresql	`54.76% <83.33%> (+0.03%)`	⬆️
3.10 snowflake	`53.28% <83.33%> (+0.03%)`	⬆️
3.10 spark	`55.37% <83.33%> (+0.03%)`	⬆️
3.10 spark_connect	`46.40% <52.77%> (+0.03%)`	⬆️
3.10 trino	`48.21% <52.77%> (+0.03%)`	⬆️
3.11	`72.69% <88.88%> (+0.01%)`	⬆️
3.11 athena	`?`
3.11 aws_deps	`?`
3.11 big	`?`
3.11 clickhouse	`?`
3.11 filesystem	`?`
3.11 mssql	`?`
3.11 mysql	`?`
3.11 openpyxl or pyarrow or project or sqlite or aws_creds	`?`
3.11 spark_connect	`?`
3.12	`72.71% <88.88%> (+0.04%)`	⬆️
3.12 athena	`?`
3.12 aws_deps	`?`
3.12 big	`?`
3.12 mysql	`?`
3.12 openpyxl or pyarrow or project or sqlite or aws_creds	`?`
3.12 spark_connect	`?`
3.13	`72.70% <88.88%> (+0.01%)`	⬆️
3.13 athena	`41.54% <52.77%> (+0.03%)`	⬆️
3.13 aws_deps	`45.89% <52.77%> (+0.03%)`	⬆️
3.13 big	`55.22% <52.77%> (+0.03%)`	⬆️
3.13 bigquery	`50.71% <83.33%> (+0.03%)`	⬆️
3.13 clickhouse	`41.55% <52.77%> (+0.03%)`	⬆️
3.13 databricks	`?`
3.13 filesystem	`63.99% <77.77%> (+0.03%)`	⬆️
3.13 gx-redshift	`50.78% <80.55%> (+0.11%)`	⬆️
3.13 mssql	`50.90% <80.55%> (+0.11%)`	⬆️
3.13 mysql	`51.18% <80.55%> (+0.11%)`	⬆️
3.13 openpyxl or pyarrow or project or sqlite or aws_creds	`59.28% <83.33%> (+0.03%)`	⬆️
3.13 postgresql	`54.76% <83.33%> (+0.03%)`	⬆️
3.13 snowflake	`53.29% <83.33%> (+0.04%)`	⬆️
3.13 spark	`55.38% <83.33%> (+0.03%)`	⬆️
3.13 spark_connect	`46.40% <52.77%> (+0.03%)`	⬆️
3.13 trino	`48.21% <52.77%> (+0.03%)`	⬆️
cloud	`0.00% <0.00%> (ø)`
docs-basic	`58.62% <52.77%> (+0.10%)`	⬆️
docs-creds-needed	`57.42% <52.77%> (+0.10%)`	⬆️
docs-spark	`56.73% <52.77%> (+0.10%)`	⬆️

…rcion
- Always fetch column.distinct_values for type coercion and observed_value
- Coerce value_set to match column value types before comparison
- Return all distinct values in observed_value when successful (backward compatible)
- Return value_counts in details when result_format is COMPLETE
- Handle type coercion for date/string mismatches

…_count on failure
- Cast partial_unexpected_count to int to fix type checker error
- Only include unexpected_count in result when there are violations
- Filter out nulls from observed values before comparison
- Fixes test failures for result format and type checking

… metric
- Add test_dates_all_present for date objects in value_set
- Add test_dates_with_str_value_set for string dates (tests BigQuery DATE coercion)
- Uses DATA_SOURCES_THAT_SUPPORT_DATE_COMPARISONS which includes BigQuery

observed_value now contains only missing values (values from value_set not found in column), not all distinct values. This is a breaking change that avoids fetching all distinct values into memory for high-cardinality columns.

…pected_list
- observed_value is now None (semantically correct - it's not observed values)
- Missing values go in partial_unexpected_list (limited by partial_unexpected_count)
- unexpected_count always included when not BOOLEAN_ONLY
- Renderer returns '--' for observed value since it's None

wookasz

Blocking as a breaking change

NathanFarmer changed the title ~~Optimize expect_column_distinct_values_to_contain_set with database-pushed comparison~~ [MAINTENANCE] Optimize ExpectColumnDistinctValuesToContainSet with database-pushed comparison Jan 23, 2026

NathanFarmer and others added 22 commits January 23, 2026 13:28

Fix circular import: move ValidationDependencies to TYPE_CHECKING block

a28b228

- Add ValidationDependencies to TYPE_CHECKING for type annotations
- Import inside method for runtime use
- Avoids circular import while keeping unquoted type annotation
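The pattern this commit describes can be sketched generically as follows. This is not the code from the commit: `fractions.Fraction` stands in for `ValidationDependencies`, the `Example` class is hypothetical, and the use of `from __future__ import annotations` to allow the unquoted annotation is an assumption.

```python
from __future__ import annotations  # annotations are not evaluated at runtime

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by the type checker; never executed, so it cannot create an import cycle.
    from fractions import Fraction  # stand-in for ValidationDependencies


class Example:
    def half(self) -> Fraction:  # unquoted annotation works under the __future__ import
        # Deferred runtime import: resolved when the method is called, not at module load.
        from fractions import Fraction

        return Fraction(1, 2)


print(Example().half())  # prints: 1/2
```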

Add null filtering to expect_column_distinct_values_to_contain_set

9214c06

- Filter out null/NaN values from observed_value_set before comparison
- Ensures nulls are ignored as expected by the expectation
- Add PLR0912 noqa comment for complexity
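A minimal sketch of the null/NaN filtering this commit describes, assuming the observed values arrive as a plain Python iterable. The `drop_nulls` helper is hypothetical and is not taken from the GX codebase.

```python
import math


def drop_nulls(values):
    """Return a set without None or float NaN entries (hypothetical helper)."""
    return {
        v
        for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    }


observed_value_set = drop_nulls(["a", None, float("nan"), "b", 1.5])
value_set = {"a", "b"}
print(value_set.issubset(observed_value_set))  # prints: True
```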

Fix type error: handle partial_unexpected_count type safely

e5ba3a4

- Check type before converting to int to satisfy type checker
- Handle case where partial_unexpected_count might not be int or str
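One way to check the type before converting, as this commit describes, is sketched below. The helper name, the fallback default of 20, and the decision to reject booleans are assumptions made for illustration, not details from the commit.

```python
def coerce_partial_unexpected_count(value, default=20):
    """Hypothetical helper: return an int, falling back to `default` for other types."""
    if isinstance(value, bool):  # bool is a subclass of int; treated as invalid here
        return default
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        try:
            return int(value)
        except ValueError:
            return default
    return default  # e.g. None, float, list


print(coerce_partial_unexpected_count("5"))   # prints: 5
print(coerce_partial_unexpected_count(None))  # prints: 20
```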

Add integration tests for column.distinct_values.missing_from_set metric

198740f

- Test ColumnDistinctValuesMissingFromSet for all data sources
- Tests cover: all set values present, some missing, all missing, subset
- Tests verify limit parameter works correctly

Remove test_dates_with_str_value_set from metric tests

105d6fb

Metrics don't do type coercion - that's an expectation-level feature. The test_dates_all_present test with proper date objects is sufficient for metrics.

Add result format integration tests for ExpectColumnDistinctValuesToC…

7e796bc

…ontainSet

Fix type errors: use result.result instead of to_json_dict

22d1a7f

Merge branch 'develop' into m/gx-2374/distinct-values-contain-set

311d1dd

Fix value_counts comparison: use to_json_dict for proper serialization

4d57d0f

Fix type errors: compare full result dict instead of nested access

5fbcf4c

Merge branch 'develop' into m/gx-2374/distinct-values-contain-set

ddc4386

Remove unnecessary fallback to column.value_counts metric

dceb3ac

[pre-commit.ci] auto fixes from pre-commit.com hooks

d34b590

for more information, see https://pre-commit.ci

Trigger build

55d10a0

Update tests for breaking changes in expect_column_distinct_values_to…

b3aff08

…_contain_set
- observed_value now contains only missing values, not all distinct values
- No more details.value_counts in results
- String values are no longer auto-coerced to match date columns

Restore renderer to show expected values with missing marked

f661b66

Revert expectation and tests to original column.value_counts implemen…

4543fdd

…tation

NathanFarmer self-assigned this Jan 28, 2026

NathanFarmer and others added 3 commits January 28, 2026 16:16

Merge branch 'develop' into m/gx-2374/distinct-values-contain-set

b65047a

Merge branch 'develop' into m/gx-2374/distinct-values-contain-set

f2abf7e

Limit observed_value to 1000 values to prevent 413 payload errors

4818085

NathanFarmer added 2 commits February 3, 2026 10:52

fix: update import for renamed missing_from_column metric and remove …

33a6153

…old test file

refactor: remove unnecessary get_validation_dependencies override - k…

283cf50

…wargs flow automatically

Base automatically changed from m/gx-2374/distinct-values-metrics to develop February 3, 2026 18:31

NathanFarmer and others added 12 commits February 3, 2026 11:50

Merge branch 'develop' into m/gx-2374/distinct-values-contain-set

ef8d617

fix: handle None limit in Spark/SQL implementations to prevent py4j e…

d37a4fb

…rror

feat: add missing_count and partial_missing_list to top-level result …

1f87c88

…for contain_set

fix: update integration test result format for contain_set

89d448e

docs: add partial_missing_list to result format documentation

7e918c3

docs: clarify missing_count and partial_missing_list for distinct val…

c6f4b7e

…ues Expectations

test: add result_format unit tests for contain_set

a6199a8

Merge branch 'develop' into m/gx-2374/distinct-values-contain-set

f142a92

feat: respect partial_unexpected_count setting for partial_missing_list

60faa28

fix: change metric default limit to MAX_DISTINCT_VALUES (500) and rem…

e85d680

…ove redundant None handling

feat: default partial_unexpected_count to MAX_DISTINCT_VALUES (500) f…

61ebd17

…or distinct values Expectations

docs: update partial_unexpected_count default to 500 for distinct val…

797daa4

…ues Expectations

NathanFarmer changed the title ~~[MAINTENANCE] Optimize ExpectColumnDistinctValuesToContainSet with database-pushed comparison~~ [BUGFIX] ExpectColumnDistinctValuesToContainSet with database-pushed comparison Feb 4, 2026

NathanFarmer added 6 commits February 4, 2026 09:08

fix: add all four new metrics to public API exports

6ca7fec

fix: reduce MAX_DISTINCT_VALUES from 500 to 200 to prevent 413 errors

252c586

docs: fix partial_missing_list default to 200 for distinct values Exp…

7a08c21

…ectations

fix: actually slice partial_missing_list to partial_unexpected_count

0833f06

fix: always use MAX_DISTINCT_VALUES for distinct values expectations

b5b77b4

fix: change MAX_DISTINCT_VALUES back to 500

26e4c13

wookasz approved these changes Feb 4, 2026

Change MAX_DISTINCT_VALUES from 500 to 20

1cd2207

NathanFarmer enabled auto-merge February 4, 2026 18:44

Fix redundant '20 or 20' in docs to just '20'

2d4c04a

wookasz disabled auto-merge February 4, 2026 20:13

wookasz requested changes Feb 4, 2026

Merge branch 'develop' into m/gx-2374/distinct-values-contain-set

44909a7

NathanFarmer removed their assignment Feb 10, 2026

NathanFarmer wants to merge 75 commits into develop from m/gx-2374/distinct-values-contain-set
