19 changes: 17 additions & 2 deletions docs/dqx/docs/reference/quality_checks.mdx
@@ -63,9 +63,10 @@ You can also define your own custom checks (see [Creating custom checks](#creati
| `is_geometrycollection` | Checks whether the values in the input column are geometrycollection geometries/geographies. This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression) |
| `is_ogc_valid` | Checks whether the values in the input column are valid geometries in the OGC sense, i.e., a bowtie polygon is invalid because it has a self-intersection. This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression) |
| `is_non_empty_geometry` | Checks whether the values in the input column are non-empty geometries. This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression) |
| `is_not_null_island` | Checks whether the values in the input column are null island geometries (POINT(0 0)). This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression) |
Copilot AI Oct 16, 2025

The description is inconsistent with the check’s intent and message. Recommend: 'Ensures values are not at the Null Island (POINT(0 0)); flags rows at (0,0).'

Suggested change
| `is_not_null_island` | Checks whether the values in the input column are null island geometries (POINT(0 0)). This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression) |
| `is_not_null_island` | Ensures values are not at the Null Island (POINT(0 0)); flags rows at (0,0). This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression) |

Copilot AI Nov 6, 2025

The description states the check validates whether values are null island geometries, but the function name is_not_null_island suggests it should check whether values are not null island geometries (i.e., it flags values that ARE null island as failures).

Suggested change
| `is_not_null_island` | Checks whether the values in the input column are null island geometries (POINT(0 0)). This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression) |
| `is_not_null_island` | Checks whether the values in the input column are **not** null island geometries (POINT(0 0)). This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression) |

| `has_dimension` | Checks whether the values in the input column are geometries of the specified dimension (2D projected dimension). This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression); `dimension`: dimension to check |
| `has_x_coordinate_between` | Checks whether the values in the input column are geometries with x coordinate between the provided boundaries. This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression); `min_value`: minimum value; `max_value`: maximum value |
| `has_y_coordinate_between` | Checks whether the values in the input column are geometries with y coordinate between the provided boundaries. This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression); `min_value`: minimum value; `max_value`: maximum value | | `column`: column to check (can be a string column name or a column expression); `min_value`: minimum value; `max_value`: maximum value |
| `has_y_coordinate_between` | Checks whether the values in the input column are geometries with y coordinate between the provided boundaries. This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression); `min_value`: minimum value; `max_value`: maximum value |
Copilot AI Nov 6, 2025

Line 70 appears to contain duplicate content at the end that was inadvertently carried over from the previous line. The line should end after the first complete description without the repeated parameters section.

Suggested change
| `has_y_coordinate_between` | Checks whether the values in the input column are geometries with y coordinate between the provided boundaries. This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression); `min_value`: minimum value; `max_value`: maximum value | | `column`: column to check (can be a string column name or a column expression); `min_value`: minimum value; `max_value`: maximum value |
| `has_y_coordinate_between` | Checks whether the values in the input column are geometries with y coordinate between the provided boundaries. This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression); `min_value`: minimum value; `max_value`: maximum value |


<Admonition type="warning" title="Applicability">
@@ -580,7 +581,14 @@ For brevity, the `name` field in the examples is omitted and it will be auto-gen
function: is_non_empty_geometry
arguments:
column: point_geom


# is_not_null_island check
- criticality: error
check:
function: is_not_null_island
arguments:
column: point_geom

# has_dimension check
- criticality: error
check:
@@ -1044,6 +1052,13 @@ checks = [
column="point_geom"
),

# is_not_null_island check
DQRowRule(
criticality="error",
check_func=geo_check_funcs.is_not_null_island,
column="point_geom"
),

# has_dimension check
DQRowRule(
criticality="error",
29 changes: 29 additions & 0 deletions src/databricks/labs/dqx/geo/check_funcs.py
@@ -357,6 +357,35 @@ def is_non_empty_geometry(column: str | Column) -> Column:
)


@register_rule("row")
def is_not_null_island(column: str | Column) -> Column:
"""Checks whether the values in the input column are NULL island geometries (POINT(0 0)).

Args:
column: column to check; can be a string column name or a column expression

Returns:
Column object indicating whether the values in the input column are NULL island geometries
Comment on lines +362 to +368
Copilot AI Nov 6, 2025

The docstring states this checks whether values are NULL island geometries, but based on the function name is_not_null_island and the condition logic, it should state this checks whether values are not NULL island geometries.

Suggested change
"""Checks whether the values in the input column are NULL island geometries (POINT(0 0)).
Args:
column: column to check; can be a string column name or a column expression
Returns:
Column object indicating whether the values in the input column are NULL island geometries
"""Checks whether the values in the input column are *not* NULL island geometries (i.e., not POINT(0 0)).
Args:
column: column to check; can be a string column name or a column expression
Returns:
Column object indicating whether the values in the input column are *not* NULL island geometries

Copilot AI Nov 6, 2025

The return description should indicate the column contains error messages for values that are NULL island geometries, not whether they are NULL island geometries. This is inconsistent with the function's actual behavior.

Suggested change
Column object indicating whether the values in the input column are NULL island geometries
Column object containing error messages for values that are NULL island geometries (POINT(0 0)), and null otherwise


Note:
This function requires Databricks serverless compute or runtime 17.1 or above.
"""
Comment on lines +361 to +372
Copilot AI Oct 16, 2025

The docstring contradicts the function name and behavior. The check ensures values are not at Null Island and flags rows that are POINT(0 0). Suggest updating to: Summary: 'Ensures values are not at the Null Island (POINT(0 0)); flags rows at (0,0)'. Returns: 'Column with error message for rows at (0,0), otherwise null'.

col_str_norm, col_expr_str, col_expr = _get_normalized_column_and_expr(column)
# NOTE: This function is currently only available in Databricks runtime 17.1 or above or in
# Databricks SQL, due to the use of the `try_to_geometry`, `st_geometrytype`, `st_x`, and `st_y` functions.
geom_cond = F.expr(f"try_to_geometry({col_str_norm}) IS NULL")
is_point_cond = F.expr(f"st_geometrytype(try_to_geometry({col_str_norm})) = '{POINT_TYPE}'")
is_zero_zero = F.expr(f"st_x(try_to_geometry({col_str_norm})) = 0.0 AND st_y(try_to_geometry({col_str_norm})) = 0.0")
Comment on lines +376 to +378
Copilot AI Oct 16, 2025

The expression repeats try_to_geometry three times, which makes the code harder to read and update. Consider factoring the subexpression into a variable to DRY it up, e.g.: try_geom = f"try_to_geometry({col_str_norm})" then reuse in the subsequent F.expr calls.

Suggested change
geom_cond = F.expr(f"try_to_geometry({col_str_norm}) IS NULL")
is_point_cond = F.expr(f"st_geometrytype(try_to_geometry({col_str_norm})) = '{POINT_TYPE}'")
is_zero_zero = F.expr(f"st_x(try_to_geometry({col_str_norm})) = 0.0 AND st_y(try_to_geometry({col_str_norm})) = 0.0")
try_geom = f"try_to_geometry({col_str_norm})"
geom_cond = F.expr(f"{try_geom} IS NULL")
is_point_cond = F.expr(f"st_geometrytype({try_geom}) = '{POINT_TYPE}'")
is_zero_zero = F.expr(f"st_x({try_geom}) = 0.0 AND st_y({try_geom}) = 0.0")

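    # NULL inputs are not flagged (the condition stays NULL); an error is produced
    # only for values that parse to a POINT located exactly at (0, 0).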
condition = F.when(col_expr.isNull(), F.lit(None)).otherwise(~geom_cond & is_point_cond & is_zero_zero)
condition_str = f"column `{col_expr_str}` contains a null island"
Comment on lines +376 to +380
Contributor

What about POINTZ and POINTZM types?

Contributor Author

What would you expect? When people talk about the null island, they mean the point at 0 lat, 0 long. I don't know if POINTZ/POINTZM makes sense in this setting, and was therefore a bit "liberal" in my definition.

ghanse (Contributor) Oct 15, 2025

I think something like POINT Z (0 0 0) and POINT ZM (0 0 0 0) would be equivalent? Just to validate that the data is not a placeholder geometry or generated by some process that defaults values to 0. We might be able to use ST_Z and ST_M.
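
A minimal sketch of what that could look like, assuming `st_z` and `st_m` are available in the runtime and return NULL when the ordinate is absent (a hypothetical follow-up, not part of this change):

    # Hypothetical: also treat POINT Z (0 0 0) and POINT ZM (0 0 0 0) as null-island
    # placeholders. coalesce(..., 0.0) lets plain POINT(0 0) keep matching when the
    # Z/M ordinates are absent (assumed to surface as NULL).
    try_geom = f"try_to_geometry({col_str_norm})"
    is_zero_zero = F.expr(
        f"st_x({try_geom}) = 0.0 AND st_y({try_geom}) = 0.0"
        f" AND coalesce(st_z({try_geom}), 0.0) = 0.0"
        f" AND coalesce(st_m({try_geom}), 0.0) = 0.0"
    )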


return make_condition(
condition,
F.lit(condition_str),
f"{col_str_norm}_contains_null_island",
)


@register_rule("row")
def has_dimension(column: str | Column, dimension: int) -> Column:
"""Checks whether the geometries/geographies in the input column have a given dimension.
11 changes: 11 additions & 0 deletions tests/integration/test_apply_checks.py
@@ -5819,6 +5819,17 @@ def test_apply_checks_all_geo_checks_using_classes(skip_if_runtime_not_geo_compa
check_func=geo_check_funcs.is_non_empty_geometry,
column=F.col("point_geom"),
),
# is_not_null_island check
DQRowRule(
criticality="error",
check_func=geo_check_funcs.is_not_null_island,
column="point_geom",
),
DQRowRule(
criticality="error",
check_func=geo_check_funcs.is_not_null_island,
column=F.col("point_geom"),
),
# has_dimension check
DQRowRule(
criticality="error",
24 changes: 24 additions & 0 deletions tests/integration/test_row_checks_geo.py
@@ -13,6 +13,7 @@
is_multilinestring,
is_multipoint,
is_multipolygon,
is_not_null_island,
is_point,
is_polygon,
is_ogc_valid,
@@ -333,6 +334,29 @@ def test_is_non_empty_geometry(skip_if_runtime_not_geo_compatible, spark):
assert_df_equality(actual, expected, ignore_nullable=True)


def test_is_not_null_island(skip_if_runtime_not_geo_compatible, spark):
input_schema = "geom: string"
test_df = spark.createDataFrame(
[["POINT(1 1)"], ["POINT(0 0)"], ["LINESTRING(0 0, 1 1)"], ["nonsense"], [None]],
Contributor

Should we test GeoJSON or WKB as well?

Contributor Author

Ideally we would be able to construct geometries directly in this constructor instead of starting with strings and converting them. The other formats are already checked in adjacent tests, so I think it is a bit overkill here.

input_schema,
)

actual = test_df.select(is_not_null_island("geom"))

checked_schema = "geom_contains_null_island: string"
expected = spark.createDataFrame(
[
[None],
["column `geom` contains a null island"],
[None],
[None],
[None],
],
checked_schema,
)
assert_df_equality(actual, expected, ignore_nullable=True)
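
Following the WKB/GeoJSON question in the thread above, a minimal sketch of a WKB variant, assuming `try_to_geometry` also accepts hex-encoded WKB strings (the literal below is the little-endian WKB encoding of POINT(0 0)):

    def test_is_not_null_island_wkb(skip_if_runtime_not_geo_compatible, spark):
        # Hex-encoded little-endian WKB for POINT(0 0); assumes try_to_geometry parses WKB hex.
        test_df = spark.createDataFrame(
            [["010100000000000000000000000000000000000000"], [None]],
            "geom: string",
        )
        actual = test_df.select(is_not_null_island("geom"))
        expected = spark.createDataFrame(
            [["column `geom` contains a null island"], [None]],
            "geom_contains_null_island: string",
        )
        assert_df_equality(actual, expected, ignore_nullable=True)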


def test_has_dimension(skip_if_runtime_not_geo_compatible, spark):
input_schema = "geom: string"
test_df = spark.createDataFrame(
12 changes: 12 additions & 0 deletions tests/perf/test_apply_checks.py
@@ -1509,6 +1509,18 @@ def test_benchmark_is_non_empty_geometry(skip_if_runtime_not_geo_compatible, ben
actual_count = benchmark(lambda: checked.count())
assert actual_count == EXPECTED_ROWS

def test_benchmark_is_not_null_island(skip_if_runtime_not_geo_compatible, benchmark, ws, generated_geo_df):
dq_engine = DQEngine(workspace_client=ws, extra_params=EXTRA_PARAMS)
checks = [
DQRowRule(
criticality="error",
check_func=geo_check_funcs.is_not_null_island,
column="point_geom",
)
]
checked = dq_engine.apply_checks(generated_geo_df, checks)
actual_count = benchmark(lambda: checked.count())
assert actual_count == EXPECTED_ROWS

def test_benchmark_has_dimension(skip_if_runtime_not_geo_compatible, benchmark, ws, generated_geo_df):
dq_engine = DQEngine(workspace_client=ws, extra_params=EXTRA_PARAMS)
7 changes: 7 additions & 0 deletions tests/resources/all_row_geo_checks.yaml
@@ -92,6 +92,13 @@
arguments:
column: point_geom

# is_not_null_island check
- criticality: error
check:
function: is_not_null_island
arguments:
column: point_geom

# has_dimension check
- criticality: error
check: