Skip to content

Conversation

@tdikland
Copy link
Contributor

Changes

Add new geo check: is_not_null_island.

Linked issues

Resolves #603

Tests

  • manually tested
  • added unit tests
  • added integration tests
  • added end-to-end tests
  • added performance tests

@tdikland tdikland requested a review from a team as a code owner October 14, 2025 08:00
@tdikland tdikland requested review from pratikk-databricks and removed request for a team October 14, 2025 08:00
Comment on lines +376 to +380
geom_cond = F.expr(f"try_to_geometry({col_str_norm}) IS NULL")
is_point_cond = F.expr(f"st_geometrytype(try_to_geometry({col_str_norm})) = '{POINT_TYPE}'")
is_zero_zero = F.expr(f"st_x(try_to_geometry({col_str_norm})) = 0.0 AND st_y(try_to_geometry({col_str_norm})) = 0.0")
condition = F.when(col_expr.isNull(), F.lit(None)).otherwise(~geom_cond & is_point_cond & is_zero_zero)
condition_str = f"column `{col_expr_str}` contains a null island"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about POINTZ and POINTZM types?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would you expect? When people talk about the null island, they mean the point at 0 lat, 0 long. I don't know if POINTZ/POINTZM makes sense in this setting, and was therefore a bit "liberal" in my definition.

Copy link
Contributor

@ghanse ghanse Oct 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think something like POINT Z (0 0 0) and POINT ZM (0 0 0 0) would be equivalent? Just to validate that the data is not a placeholder geometry or generated by some process that defaults values to 0. We might be able to use ST_Z and ST_M.

def test_is_not_null_island(skip_if_runtime_not_geo_compatible, spark):
input_schema = "geom: string"
test_df = spark.createDataFrame(
[["POINT(1 1)"], ["POINT(0 0)"], ["LINESTRING(0 0, 1 1)"], ["nonsense"], [None]],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we test GeoJSON or WKB as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we would be able to construct geometries directly in this constructor instead of starting with strings and converting them. The other formats are already checked in adjacent tests, so I think it is a bit overkill here.

@mwojtyczka mwojtyczka requested a review from Copilot October 16, 2025 09:12
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Adds a new geospatial row check is_not_null_island to flag geometries at the “Null Island” (POINT(0 0)), with corresponding tests, benchmarks, and documentation updates.

  • Introduces is_not_null_island in geo/check_funcs with rule registration
  • Adds integration and performance tests, and YAML/MDX examples
  • Updates docs to list and demonstrate the new check

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/resources/all_row_geo_checks.yaml Adds is_not_null_island to the all-geo-checks YAML.
tests/perf/test_apply_checks.py Adds a benchmark for is_not_null_island.
tests/integration/test_row_checks_geo.py Adds an integration test verifying behavior and error message for is_not_null_island.
tests/integration/test_apply_checks.py Adds is_not_null_island to the “all geo checks” application test (string and Column inputs).
src/databricks/labs/dqx/geo/check_funcs.py Implements and registers is_not_null_island.
docs/dqx/docs/reference/quality_checks.mdx Documents the new check and adds examples; fixes a table row formatting issue.

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment on lines +361 to +372
def is_not_null_island(column: str | Column) -> Column:
"""Checks whether the values in the input column are NULL island geometries (POINT(0 0)).
Args:
column: column to check; can be a string column name or a column expression
Returns:
Column object indicating whether the values in the input column are NULL island geometries
Note:
This function requires Databricks serverless compute or runtime 17.1 or above.
"""
Copy link

Copilot AI Oct 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring contradicts the function name and behavior. The check ensures values are not at Null Island and flags rows that are POINT(0 0). Suggest updating to: Summary: 'Ensures values are not at the Null Island (POINT(0 0)); flags rows at (0,0)'. Returns: 'Column with error message for rows at (0,0), otherwise null'.

Copilot uses AI. Check for mistakes.
| `is_geometrycollection` | Checks whether the values in the input column are geometrycollection geometries/geographies. This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression) |
| `is_ogc_valid` | Checks whether the values in the input column are valid geometries in the OGC sense. I.e a bowtie polygon is invalid because it has a self intersection. This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression) |
| `is_non_empty_geometry` | Checks whether the values in the input column are non-empty geometries. This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression) |
| `is_not_null_island` | Checks whether the values in the input column are null island geometries (POINT(0 0)). This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression) |
Copy link

Copilot AI Oct 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description is inconsistent with the check’s intent and message. Recommend: 'Ensures values are not at the Null Island (POINT(0 0)); flags rows at (0,0).'

Suggested change
| `is_not_null_island` | Checks whether the values in the input column are null island geometries (POINT(0 0)). This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression) |
| `is_not_null_island` | Ensures values are not at the Null Island (POINT(0 0)); flags rows at (0,0). This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression) |

Copilot uses AI. Check for mistakes.
Comment on lines +376 to +378
geom_cond = F.expr(f"try_to_geometry({col_str_norm}) IS NULL")
is_point_cond = F.expr(f"st_geometrytype(try_to_geometry({col_str_norm})) = '{POINT_TYPE}'")
is_zero_zero = F.expr(f"st_x(try_to_geometry({col_str_norm})) = 0.0 AND st_y(try_to_geometry({col_str_norm})) = 0.0")
Copy link

Copilot AI Oct 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expression repeats try_to_geometry three times, which makes the code harder to read and update. Consider factoring the subexpression into a variable to DRY it up, e.g.: try_geom = f"try_to_geometry({col_str_norm})" then reuse in the subsequent F.expr calls.

Suggested change
geom_cond = F.expr(f"try_to_geometry({col_str_norm}) IS NULL")
is_point_cond = F.expr(f"st_geometrytype(try_to_geometry({col_str_norm})) = '{POINT_TYPE}'")
is_zero_zero = F.expr(f"st_x(try_to_geometry({col_str_norm})) = 0.0 AND st_y(try_to_geometry({col_str_norm})) = 0.0")
try_geom = f"try_to_geometry({col_str_norm})"
geom_cond = F.expr(f"{try_geom} IS NULL")
is_point_cond = F.expr(f"st_geometrytype({try_geom}) = '{POINT_TYPE}'")
is_zero_zero = F.expr(f"st_x({try_geom}) = 0.0 AND st_y({try_geom}) = 0.0")

Copilot uses AI. Check for mistakes.
@mwojtyczka mwojtyczka requested a review from Copilot November 6, 2025 17:39
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +362 to +368
"""Checks whether the values in the input column are NULL island geometries (POINT(0 0)).
Args:
column: column to check; can be a string column name or a column expression
Returns:
Column object indicating whether the values in the input column are NULL island geometries
Copy link

Copilot AI Nov 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring states this checks whether values are NULL island geometries, but based on the function name is_not_null_island and the condition logic, it should state this checks whether values are not NULL island geometries.

Suggested change
"""Checks whether the values in the input column are NULL island geometries (POINT(0 0)).
Args:
column: column to check; can be a string column name or a column expression
Returns:
Column object indicating whether the values in the input column are NULL island geometries
"""Checks whether the values in the input column are *not* NULL island geometries (i.e., not POINT(0 0)).
Args:
column: column to check; can be a string column name or a column expression
Returns:
Column object indicating whether the values in the input column are *not* NULL island geometries

Copilot uses AI. Check for mistakes.
column: column to check; can be a string column name or a column expression
Returns:
Column object indicating whether the values in the input column are NULL island geometries
Copy link

Copilot AI Nov 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return description should indicate the column contains error messages for values that are NULL island geometries, not whether they are NULL island geometries. This is inconsistent with the function's actual behavior.

Suggested change
Column object indicating whether the values in the input column are NULL island geometries
Column object containing error messages for values that are NULL island geometries (POINT(0 0)), and null otherwise

Copilot uses AI. Check for mistakes.
| `is_geometrycollection` | Checks whether the values in the input column are geometrycollection geometries/geographies. This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression) |
| `is_ogc_valid` | Checks whether the values in the input column are valid geometries in the OGC sense. I.e a bowtie polygon is invalid because it has a self intersection. This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression) |
| `is_non_empty_geometry` | Checks whether the values in the input column are non-empty geometries. This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression) |
| `is_not_null_island` | Checks whether the values in the input column are null island geometries (POINT(0 0)). This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression) |
Copy link

Copilot AI Nov 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description states the check validates whether values are null island geometries, but the function name is_not_null_island suggests it should check whether values are not null island geometries (i.e., it flags values that ARE null island as failures).

Suggested change
| `is_not_null_island` | Checks whether the values in the input column are null island geometries (POINT(0 0)). This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression) |
| `is_not_null_island` | Checks whether the values in the input column are **not** null island geometries (POINT(0 0)). This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression) |

Copilot uses AI. Check for mistakes.
| `has_dimension` | Checks whether the values in the input column are geometries of the specified dimension (2D projected dimension). This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression); `dimension`: dimension to check |
| `has_x_coordinate_between` | Checks whether the values in the input column are geometries with x coordinate between the provided boundaries. This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression); `min_value`: minimum value; `max_value`: maximum value |
| `has_y_coordinate_between` | Checks whether the values in the input column are geometries with y coordinate between the provided boundaries. This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression); `min_value`: minimum value; `max_value`: maximum value | | `column`: column to check (can be a string column name or a column expression); `min_value`: minimum value; `max_value`: maximum value |
| `has_y_coordinate_between` | Checks whether the values in the input column are geometries with y coordinate between the provided boundaries. This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression); `min_value`: minimum value; `max_value`: maximum value |
Copy link

Copilot AI Nov 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line 70 appears to contain duplicate content at the end that was inadvertently carried over from the previous line. The line should end after the first complete description without the repeated parameters section.

Suggested change
| `has_y_coordinate_between` | Checks whether the values in the input column are geometries with y coordinate between the provided boundaries. This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression); `min_value`: minimum value; `max_value`: maximum value |
| `has_y_coordinate_between` | Checks whether the values in the input column are geometries with y coordinate between the provided boundaries. This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression); `min_value`: minimum value; `max_value`: maximum value |

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE]: Add is_not_null_island check

3 participants