[BUGFIX] ExpectColumnDistinctValuesToContainSet with database-pushed comparison #11615

Open
NathanFarmer wants to merge 75 commits into develop from m/gx-2374/distinct-values-contain-set

@NathanFarmer NathanFarmer (Contributor) commented Jan 23, 2026

Refactor ExpectColumnDistinctValuesToContainSet to use database-pushdown metrics

This PR optimizes ExpectColumnDistinctValuesToContainSet for high-cardinality columns by pushing set comparison logic to the database instead of fetching all distinct values into Python memory.

We no longer return all distinct values as the observed value: doing so was bad practice, and the full list is not particularly actionable.
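To illustrate the idea of pushing the set comparison into the database, here is a minimal sketch using SQLite from the standard library. It is not the implementation in this PR: the function name `missing_from_column`, the table `t`, and the SQL shape are assumptions made for illustration. The point is that only the count and a capped list of missing values leave the database, never the full set of distinct values.

```python
import sqlite3


def missing_from_column(conn, table, column, value_set, limit=20):
    """Return (missing_count, partial_missing_list), both computed by the database.

    Hypothetical helper for illustration only. `table` and `column` are
    interpolated into the SQL, so they must be trusted identifiers; the
    values themselves are bound as parameters.
    """
    values = list(value_set)
    if not values:
        return 0, []
    placeholders = ", ".join("(?)" for _ in values)
    # The expected values become a small in-query relation.
    cte = f"WITH expected(v) AS (VALUES {placeholders}) "
    # Keep only expected values that never appear in the column (nulls ignored).
    where = (
        f"FROM expected WHERE v NOT IN "
        f"(SELECT {column} FROM {table} WHERE {column} IS NOT NULL)"
    )
    count = conn.execute(cte + "SELECT COUNT(DISTINCT v) " + where, values).fetchone()[0]
    rows = conn.execute(
        cte + "SELECT DISTINCT v " + where + " ORDER BY v LIMIT ?", [*values, limit]
    ).fetchall()
    return count, [row[0] for row in rows]


# Tiny demo table standing in for a high-cardinality column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (my_col TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("a",), ("b",), ("c",), (None,)])

missing_count, partial_missing_list = missing_from_column(
    conn, "t", "my_col", ["a", "x", "y"], limit=1
)
```

The count is computed over all missing values, while the list is truncated by `limit`, which mirrors the split between a full count and a partial list described in this PR.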

Changes:

  • Refactored _validate to use column.distinct_values.missing_from_column.count and column.distinct_values.missing_from_column metrics
  • Updated result format to return observed_value: None with missing_count and partial_missing_list at the top level
  • Updated _atomic_diagnostic_observed_value renderer to display missing values with MISSING state
  • Always uses MAX_DISTINCT_VALUES (20) to limit partial_missing_list size, regardless of the global partial_unexpected_count default
  • Updated documentation in format_results.md to add partial_missing_list field
  • Added all four new metrics to great_expectations/metrics/__init__.py public API exports
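As a rough sketch of the result shape described in the list above, the snippet below builds a plain dictionary with `observed_value` set to `None` and the missing values capped at `MAX_DISTINCT_VALUES`. The helper `build_result` and the exact nesting of the keys are assumptions for illustration; the actual structure is defined by the expectation's `_validate` method in this PR.

```python
MAX_DISTINCT_VALUES = 20  # cap named in the PR description


def build_result(missing_count, missing_values):
    """Hypothetical helper: assemble a result dict in the shape described above."""
    return {
        "success": missing_count == 0,
        "result": {
            # No longer the full list of distinct values.
            "observed_value": None,
            "missing_count": missing_count,
            # Capped regardless of any global partial_unexpected_count default.
            "partial_missing_list": list(missing_values)[:MAX_DISTINCT_VALUES],
        },
    }


# 600 missing values, as in the manual test below; only the first 20 are listed.
example = build_result(600, [f"fake_{i}" for i in range(600)])
```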

Manual Test

Expectation

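import great_expectations.expectations as gxe

# first_10_values and fake_values are built earlier in the test script (not shown).
gxe.ExpectColumnDistinctValuesToContainSet(
    column="my_col",
    value_set=first_10_values + fake_values,  # 610 values, 600 missing
)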

Test Script Output

Generating 100,000 unique values...
Generated 100,000 rows with 100,000 unique values

Connecting to GX Cloud...
Using existing data source: test_distinct_values_datasource
Using existing data asset: test_distinct_values_asset
Using existing batch definition: test_batch
Creating fresh suite: test_distinct_values_suite

============================================================
Adding Expectations to Suite
============================================================
Added: ExpectColumnDistinctValuesToContainSet (FAILURE expected - 600 missing)

Saved suite with 1 expectations
Created/updated validation definition: test_distinct_values_validation
Created/updated checkpoint: test_distinct_values_checkpoint

============================================================
Running Checkpoint...
============================================================

Validating expectation directly to check payload size...

Expectation: expect_column_distinct_values_to_contain_set
Payload size: 140,723 bytes (0.13 MB)

============================================================
Running Checkpoint (this will attempt to save to GX Cloud)...
============================================================

Checkpoint completed in 19.97s
Overall success: False
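The test script itself is not shown, so the following is only a guess at how a payload size like the one above could be measured: serialize the result to JSON and count the UTF-8 bytes. The helper name `payload_size` is hypothetical. The arithmetic does confirm that the reported 0.13 MB corresponds to 140,723 bytes divided by 1024².

```python
import json


def payload_size(result: dict) -> tuple[int, float]:
    """Hypothetical helper: size of a JSON-serialized result in bytes and MB."""
    size_bytes = len(json.dumps(result, default=str).encode("utf-8"))
    return size_bytes, round(size_bytes / 1024**2, 2)


# Check the figure reported in the output above.
reported_mb = round(140_723 / 1024**2, 2)
```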

…ushed comparison

- Add ColumnDistinctValuesMissingFromSet metric to find missing values in DB
- Refactor expectation to use new metric instead of fetching all distinct values
- Add Metrics API wrapper class for new metric
- Respects result_format and partial_unexpected_count settings

netlify bot commented Jan 23, 2026

Deploy Preview for niobium-lead-7998 ready!

Name Link
🔨 Latest commit 44909a7
🔍 Latest deploy log https://app.netlify.com/projects/niobium-lead-7998/deploys/698b4d08b98ca4000843755d
😎 Deploy Preview https://deploy-preview-11615.docs.greatexpectations.io

@NathanFarmer NathanFarmer changed the title from "Optimize expect_column_distinct_values_to_contain_set with database-pushed comparison" to "[MAINTENANCE] Optimize ExpectColumnDistinctValuesToContainSet with database-pushed comparison" Jan 23, 2026

codecov bot commented Jan 23, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 84.21%. Comparing base (44454b6) to head (44909a7).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop   #11615      +/-   ##
===========================================
+ Coverage    84.19%   84.21%   +0.02%     
===========================================
  Files          471      471              
  Lines        39591    39591              
===========================================
+ Hits         33332    33343      +11     
+ Misses        6259     6248      -11     
Flag Coverage Δ
3.10 72.70% <88.88%> (+0.02%) ⬆️
3.10 athena 41.54% <52.77%> (+0.03%) ⬆️
3.10 aws_deps 45.89% <52.77%> (+0.03%) ⬆️
3.10 big 55.22% <52.77%> (+0.03%) ⬆️
3.10 bigquery 50.70% <83.33%> (+0.03%) ⬆️
3.10 clickhouse 41.55% <52.77%> (+0.03%) ⬆️
3.10 databricks 52.46% <83.33%> (+0.03%) ⬆️
3.10 filesystem 63.99% <77.77%> (+0.03%) ⬆️
3.10 gx-redshift 50.78% <80.55%> (+0.11%) ⬆️
3.10 mssql 50.90% <80.55%> (+0.11%) ⬆️
3.10 mysql 51.18% <80.55%> (+0.11%) ⬆️
3.10 openpyxl or pyarrow or project or sqlite or aws_creds 59.27% <83.33%> (+0.03%) ⬆️
3.10 postgresql 54.76% <83.33%> (+0.03%) ⬆️
3.10 snowflake 53.28% <83.33%> (+0.03%) ⬆️
3.10 spark 55.37% <83.33%> (+0.03%) ⬆️
3.10 spark_connect 46.40% <52.77%> (+0.03%) ⬆️
3.10 trino 48.21% <52.77%> (+0.03%) ⬆️
3.11 72.69% <88.88%> (+0.01%) ⬆️
3.11 athena ?
3.11 aws_deps ?
3.11 big ?
3.11 clickhouse ?
3.11 filesystem ?
3.11 mssql ?
3.11 mysql ?
3.11 openpyxl or pyarrow or project or sqlite or aws_creds ?
3.11 spark_connect ?
3.12 72.71% <88.88%> (+0.04%) ⬆️
3.12 athena ?
3.12 aws_deps ?
3.12 big ?
3.12 mysql ?
3.12 openpyxl or pyarrow or project or sqlite or aws_creds ?
3.12 spark_connect ?
3.13 72.70% <88.88%> (+0.01%) ⬆️
3.13 athena 41.54% <52.77%> (+0.03%) ⬆️
3.13 aws_deps 45.89% <52.77%> (+0.03%) ⬆️
3.13 big 55.22% <52.77%> (+0.03%) ⬆️
3.13 bigquery 50.71% <83.33%> (+0.03%) ⬆️
3.13 clickhouse 41.55% <52.77%> (+0.03%) ⬆️
3.13 databricks ?
3.13 filesystem 63.99% <77.77%> (+0.03%) ⬆️
3.13 gx-redshift 50.78% <80.55%> (+0.11%) ⬆️
3.13 mssql 50.90% <80.55%> (+0.11%) ⬆️
3.13 mysql 51.18% <80.55%> (+0.11%) ⬆️
3.13 openpyxl or pyarrow or project or sqlite or aws_creds 59.28% <83.33%> (+0.03%) ⬆️
3.13 postgresql 54.76% <83.33%> (+0.03%) ⬆️
3.13 snowflake 53.29% <83.33%> (+0.04%) ⬆️
3.13 spark 55.38% <83.33%> (+0.03%) ⬆️
3.13 spark_connect 46.40% <52.77%> (+0.03%) ⬆️
3.13 trino 48.21% <52.77%> (+0.03%) ⬆️
cloud 0.00% <0.00%> (ø)
docs-basic 58.62% <52.77%> (+0.10%) ⬆️
docs-creds-needed 57.42% <52.77%> (+0.10%) ⬆️
docs-spark 56.73% <52.77%> (+0.10%) ⬆️


NathanFarmer and others added 22 commits January 23, 2026 13:28
- Add ValidationDependencies to TYPE_CHECKING for type annotations
- Import inside method for runtime use
- Avoids circular import while keeping unquoted type annotation
…rcion

- Always fetch column.distinct_values for type coercion and observed_value
- Coerce value_set to match column value types before comparison
- Return all distinct values in observed_value when successful (backward compatible)
- Return value_counts in details when result_format is COMPLETE
- Handle type coercion for date/string mismatches
…_count on failure

- Cast partial_unexpected_count to int to fix type checker error
- Only include unexpected_count in result when there are violations
- Filter out nulls from observed values before comparison
- Fixes test failures for result format and type checking
- Filter out null/NaN values from observed_value_set before comparison
- Ensures nulls are ignored as expected by the expectation
- Add PLR0912 noqa comment for complexity
- Check type before converting to int to satisfy type checker
- Handle case where partial_unexpected_count might not be int or str
- Test ColumnDistinctValuesMissingFromSet for all data sources
- Tests cover: all set values present, some missing, all missing, subset
- Tests verify limit parameter works correctly
… metric

- Add test_dates_all_present for date objects in value_set
- Add test_dates_with_str_value_set for string dates (tests BigQuery DATE coercion)
- Uses DATA_SOURCES_THAT_SUPPORT_DATE_COMPARISONS which includes BigQuery
Metrics don't do type coercion - that's an expectation-level feature.
The test_dates_all_present test with proper date objects is sufficient for metrics.
observed_value now contains only missing values (values from value_set not found
in column), not all distinct values. This is a breaking change that avoids
fetching all distinct values into memory for high-cardinality columns.
…_contain_set

- observed_value now contains only missing values, not all distinct values
- No more details.value_counts in results
- String values are no longer auto-coerced to match date columns
…pected_list

- observed_value is now None (semantically correct - it's not observed values)
- Missing values go in partial_unexpected_list (limited by partial_unexpected_count)
- unexpected_count always included when not BOOLEAN_ONLY
- Renderer returns '--' for observed value since it's None
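The commit notes above state that string values are no longer auto-coerced to match date columns. Under that behavior, a caller holding ISO-formatted strings would presumably convert them to `date` objects before building `value_set`. A minimal sketch, with made-up sample values:

```python
from datetime import date

# Hypothetical input: dates arriving as ISO 8601 strings.
raw_values = ["2026-01-23", "2026-02-03", "2026-02-04"]

# Convert up front so value_set matches the column's date type.
value_set = [date.fromisoformat(v) for v in raw_values]
```

The resulting `value_set` can then be passed to the expectation in place of the raw strings.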
@NathanFarmer NathanFarmer self-assigned this Jan 28, 2026
Base automatically changed from m/gx-2374/distinct-values-metrics to develop February 3, 2026 18:31
@NathanFarmer NathanFarmer changed the title from "[MAINTENANCE] Optimize ExpectColumnDistinctValuesToContainSet with database-pushed comparison" to "[BUGFIX] ExpectColumnDistinctValuesToContainSet with database-pushed comparison" Feb 4, 2026
@wookasz wookasz disabled auto-merge February 4, 2026 20:13
@wookasz wookasz (Contributor) left a comment

Blocking as a breaking change

@NathanFarmer NathanFarmer removed their assignment Feb 10, 2026