-
Notifications
You must be signed in to change notification settings - Fork 69
add null island check #613
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| geom_cond = F.expr(f"try_to_geometry({col_str_norm}) IS NULL") | ||
| is_point_cond = F.expr(f"st_geometrytype(try_to_geometry({col_str_norm})) = '{POINT_TYPE}'") | ||
| is_zero_zero = F.expr(f"st_x(try_to_geometry({col_str_norm})) = 0.0 AND st_y(try_to_geometry({col_str_norm})) = 0.0") | ||
| condition = F.when(col_expr.isNull(), F.lit(None)).otherwise(~geom_cond & is_point_cond & is_zero_zero) | ||
| condition_str = f"column `{col_expr_str}` contains a null island" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about POINTZ and POINTZM types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would you expect? When people talk about the null island, they mean the point at 0 lat, 0 long. I don't know if POINTZ/POINTZM makes sense in this setting, and was therefore a bit "liberal" in my definition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think something like POINT Z (0 0 0) and POINT ZM (0 0 0 0) would be equivalent? Just to validate that the data is not a placeholder geometry or generated by some process that defaults values to 0. We might be able to use ST_Z and ST_M.
| def test_is_not_null_island(skip_if_runtime_not_geo_compatible, spark): | ||
| input_schema = "geom: string" | ||
| test_df = spark.createDataFrame( | ||
| [["POINT(1 1)"], ["POINT(0 0)"], ["LINESTRING(0 0, 1 1)"], ["nonsense"], [None]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we test GeoJSON or WKB as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we would be able to construct geometries directly in this constructor instead of starting with strings and converting them. The other formats are already checked in adjacent tests, so I think it is a bit overkill here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Adds a new geospatial row check is_not_null_island to flag geometries at the “Null Island” (POINT(0 0)), with corresponding tests, benchmarks, and documentation updates.
- Introduces is_not_null_island in geo/check_funcs with rule registration
- Adds integration and performance tests, and YAML/MDX examples
- Updates docs to list and demonstrate the new check
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/resources/all_row_geo_checks.yaml | Adds is_not_null_island to the all-geo-checks YAML. |
| tests/perf/test_apply_checks.py | Adds a benchmark for is_not_null_island. |
| tests/integration/test_row_checks_geo.py | Adds an integration test verifying behavior and error message for is_not_null_island. |
| tests/integration/test_apply_checks.py | Adds is_not_null_island to the “all geo checks” application test (string and Column inputs). |
| src/databricks/labs/dqx/geo/check_funcs.py | Implements and registers is_not_null_island. |
| docs/dqx/docs/reference/quality_checks.mdx | Documents the new check and adds examples; fixes a table row formatting issue. |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| def is_not_null_island(column: str | Column) -> Column: | ||
| """Checks whether the values in the input column are NULL island geometries (POINT(0 0)). | ||
| Args: | ||
| column: column to check; can be a string column name or a column expression | ||
| Returns: | ||
| Column object indicating whether the values in the input column are NULL island geometries | ||
| Note: | ||
| This function requires Databricks serverless compute or runtime 17.1 or above. | ||
| """ |
Copilot
AI
Oct 16, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docstring contradicts the function name and behavior. The check ensures values are not at Null Island and flags rows that are POINT(0 0). Suggest updating to: Summary: 'Ensures values are not at the Null Island (POINT(0 0)); flags rows at (0,0)'. Returns: 'Column with error message for rows at (0,0), otherwise null'.
| | `is_geometrycollection` | Checks whether the values in the input column are geometrycollection geometries/geographies. This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression) | | ||
| | `is_ogc_valid` | Checks whether the values in the input column are valid geometries in the OGC sense. I.e a bowtie polygon is invalid because it has a self intersection. This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression) | | ||
| | `is_non_empty_geometry` | Checks whether the values in the input column are non-empty geometries. This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression) | | ||
| | `is_not_null_island` | Checks whether the values in the input column are null island geometries (POINT(0 0)). This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression) | |
Copilot
AI
Oct 16, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The description is inconsistent with the check’s intent and message. Recommend: 'Ensures values are not at the Null Island (POINT(0 0)); flags rows at (0,0).'
| | `is_not_null_island` | Checks whether the values in the input column are null island geometries (POINT(0 0)). This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression) | | |
| | `is_not_null_island` | Ensures values are not at the Null Island (POINT(0 0)); flags rows at (0,0). This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression) | |
| geom_cond = F.expr(f"try_to_geometry({col_str_norm}) IS NULL") | ||
| is_point_cond = F.expr(f"st_geometrytype(try_to_geometry({col_str_norm})) = '{POINT_TYPE}'") | ||
| is_zero_zero = F.expr(f"st_x(try_to_geometry({col_str_norm})) = 0.0 AND st_y(try_to_geometry({col_str_norm})) = 0.0") |
Copilot
AI
Oct 16, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The expression repeats try_to_geometry three times, which makes the code harder to read and update. Consider factoring the subexpression into a variable to DRY it up, e.g.: try_geom = f"try_to_geometry({col_str_norm})" then reuse in the subsequent F.expr calls.
| geom_cond = F.expr(f"try_to_geometry({col_str_norm}) IS NULL") | |
| is_point_cond = F.expr(f"st_geometrytype(try_to_geometry({col_str_norm})) = '{POINT_TYPE}'") | |
| is_zero_zero = F.expr(f"st_x(try_to_geometry({col_str_norm})) = 0.0 AND st_y(try_to_geometry({col_str_norm})) = 0.0") | |
| try_geom = f"try_to_geometry({col_str_norm})" | |
| geom_cond = F.expr(f"{try_geom} IS NULL") | |
| is_point_cond = F.expr(f"st_geometrytype({try_geom}) = '{POINT_TYPE}'") | |
| is_zero_zero = F.expr(f"st_x({try_geom}) = 0.0 AND st_y({try_geom}) = 0.0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| """Checks whether the values in the input column are NULL island geometries (POINT(0 0)). | ||
| Args: | ||
| column: column to check; can be a string column name or a column expression | ||
| Returns: | ||
| Column object indicating whether the values in the input column are NULL island geometries |
Copilot
AI
Nov 6, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docstring states this checks whether values are NULL island geometries, but based on the function name is_not_null_island and the condition logic, it should state this checks whether values are not NULL island geometries.
| """Checks whether the values in the input column are NULL island geometries (POINT(0 0)). | |
| Args: | |
| column: column to check; can be a string column name or a column expression | |
| Returns: | |
| Column object indicating whether the values in the input column are NULL island geometries | |
| """Checks whether the values in the input column are *not* NULL island geometries (i.e., not POINT(0 0)). | |
| Args: | |
| column: column to check; can be a string column name or a column expression | |
| Returns: | |
| Column object indicating whether the values in the input column are *not* NULL island geometries |
| column: column to check; can be a string column name or a column expression | ||
| Returns: | ||
| Column object indicating whether the values in the input column are NULL island geometries |
Copilot
AI
Nov 6, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The return description should indicate the column contains error messages for values that are NULL island geometries, not whether they are NULL island geometries. This is inconsistent with the function's actual behavior.
| Column object indicating whether the values in the input column are NULL island geometries | |
| Column object containing error messages for values that are NULL island geometries (POINT(0 0)), and null otherwise |
| | `is_geometrycollection` | Checks whether the values in the input column are geometrycollection geometries/geographies. This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression) | | ||
| | `is_ogc_valid` | Checks whether the values in the input column are valid geometries in the OGC sense. I.e a bowtie polygon is invalid because it has a self intersection. This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression) | | ||
| | `is_non_empty_geometry` | Checks whether the values in the input column are non-empty geometries. This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression) | | ||
| | `is_not_null_island` | Checks whether the values in the input column are null island geometries (POINT(0 0)). This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression) | |
Copilot
AI
Nov 6, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The description states the check validates whether values are null island geometries, but the function name is_not_null_island suggests it should check whether values are not null island geometries (i.e., it flags values that ARE null island as failures).
| | `is_not_null_island` | Checks whether the values in the input column are null island geometries (POINT(0 0)). This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression) | | |
| | `is_not_null_island` | Checks whether the values in the input column are **not** null island geometries (POINT(0 0)). This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression) | |
| | `has_dimension` | Checks whether the values in the input column are geometries of the specified dimension (2D projected dimension). This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression); `dimension`: dimension to check | | ||
| | `has_x_coordinate_between` | Checks whether the values in the input column are geometries with x coordinate between the provided boundaries. This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression); `min_value`: minimum value; `max_value`: maximum value | | ||
| | `has_y_coordinate_between` | Checks whether the values in the input column are geometries with y coordinate between the provided boundaries. This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression); `min_value`: minimum value; `max_value`: maximum value | | `column`: column to check (can be a string column name or a column expression); `min_value`: minimum value; `max_value`: maximum value | | ||
| | `has_y_coordinate_between` | Checks whether the values in the input column are geometries with y coordinate between the provided boundaries. This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression); `min_value`: minimum value; `max_value`: maximum value | |
Copilot
AI
Nov 6, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line 70 appears to contain duplicate content at the end that was inadvertently carried over from the previous line. The line should end after the first complete description without the repeated parameters section.
| | `has_y_coordinate_between` | Checks whether the values in the input column are geometries with y coordinate between the provided boundaries. This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression); `min_value`: minimum value; `max_value`: maximum value | | |
| | `has_y_coordinate_between` | Checks whether the values in the input column are geometries with y coordinate between the provided boundaries. This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression); `min_value`: minimum value; `max_value`: maximum value | |
Changes
Add new geo check:
is_not_null_island.Linked issues
Resolves #603
Tests