-
Notifications
You must be signed in to change notification settings - Fork 69
add null island check #613
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -64,9 +64,10 @@ You can also define your own custom checks (see [Creating custom checks](#creati | |||||
| | `is_geometrycollection` | Checks whether the values in the input column are geometrycollection geometries/geographies. This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression) | | ||||||
| | `is_ogc_valid` | Checks whether the values in the input column are valid geometries in the OGC sense. I.e a bowtie polygon is invalid because it has a self intersection. This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression) | | ||||||
| | `is_non_empty_geometry` | Checks whether the values in the input column are non-empty geometries. This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression) | | ||||||
| | `is_not_null_island` | Checks whether the values in the input column are null island geometries (POINT(0 0)). This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression) | | ||||||
|
||||||
| | `is_not_null_island` | Checks whether the values in the input column are null island geometries (POINT(0 0)). This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression) | | |
| | `is_not_null_island` | Checks whether the values in the input column are **not** null island geometries (POINT(0 0)). This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression) | |
Copilot
AI
Nov 6, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line 70 appears to contain duplicate content at the end that was inadvertently carried over from the previous line. The line should end after the first complete description without the repeated parameters section.
| | `has_y_coordinate_between` | Checks whether the values in the input column are geometries with y coordinate between the provided boundaries. This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression); `min_value`: minimum value; `max_value`: maximum value | | |
| | `has_y_coordinate_between` | Checks whether the values in the input column are geometries with y coordinate between the provided boundaries. This function requires Databricks serverless compute or runtime >= 17.1. | `column`: column to check (can be a string column name or a column expression); `min_value`: minimum value; `max_value`: maximum value | |
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -357,6 +357,35 @@ def is_non_empty_geometry(column: str | Column) -> Column: | |||||||||||||||||||||||||||||||||
| ) | ||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||
| @register_rule("row") | ||||||||||||||||||||||||||||||||||
| def is_not_null_island(column: str | Column) -> Column: | ||||||||||||||||||||||||||||||||||
| """Checks whether the values in the input column are NULL island geometries (POINT(0 0)). | ||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||
| Args: | ||||||||||||||||||||||||||||||||||
| column: column to check; can be a string column name or a column expression | ||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||
| Returns: | ||||||||||||||||||||||||||||||||||
| Column object indicating whether the values in the input column are NULL island geometries | ||||||||||||||||||||||||||||||||||
|
Comment on lines
+362
to
+368
|
||||||||||||||||||||||||||||||||||
| """Checks whether the values in the input column are NULL island geometries (POINT(0 0)). | |
| Args: | |
| column: column to check; can be a string column name or a column expression | |
| Returns: | |
| Column object indicating whether the values in the input column are NULL island geometries | |
| """Checks whether the values in the input column are *not* NULL island geometries (i.e., not POINT(0 0)). | |
| Args: | |
| column: column to check; can be a string column name or a column expression | |
| Returns: | |
| Column object indicating whether the values in the input column are *not* NULL island geometries |
Copilot
AI
Nov 6, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The return description should indicate the column contains error messages for values that are NULL island geometries, not whether they are NULL island geometries. This is inconsistent with the function's actual behavior.
| Column object indicating whether the values in the input column are NULL island geometries | |
| Column object containing error messages for values that are NULL island geometries (POINT(0 0)), and null otherwise |
Copilot
AI
Oct 16, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docstring contradicts the function name and behavior. The check ensures values are not at Null Island and flags rows that are POINT(0 0). Suggest updating to: Summary: 'Ensures values are not at the Null Island (POINT(0 0)); flags rows at (0,0)'. Returns: 'Column with error message for rows at (0,0), otherwise null'.
Copilot
AI
Oct 16, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The expression repeats try_to_geometry three times, which makes the code harder to read and update. Consider factoring the subexpression into a variable to DRY it up, e.g.: try_geom = f"try_to_geometry({col_str_norm})" then reuse in the subsequent F.expr calls.
| geom_cond = F.expr(f"try_to_geometry({col_str_norm}) IS NULL") | |
| is_point_cond = F.expr(f"st_geometrytype(try_to_geometry({col_str_norm})) = '{POINT_TYPE}'") | |
| is_zero_zero = F.expr(f"st_x(try_to_geometry({col_str_norm})) = 0.0 AND st_y(try_to_geometry({col_str_norm})) = 0.0") | |
| try_geom = f"try_to_geometry({col_str_norm})" | |
| geom_cond = F.expr(f"{try_geom} IS NULL") | |
| is_point_cond = F.expr(f"st_geometrytype({try_geom}) = '{POINT_TYPE}'") | |
| is_zero_zero = F.expr(f"st_x({try_geom}) = 0.0 AND st_y({try_geom}) = 0.0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about POINTZ and POINTZM types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would you expect? When people talk about the null island, they mean the point at 0 lat, 0 long. I don't know if POINTZ/POINTZM makes sense in this setting, and was therefore a bit "liberal" in my definition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think something like POINT Z (0 0 0) and POINT ZM (0 0 0 0) would be equivalent? Just to validate that the data is not a placeholder geometry or generated by some process that defaults values to 0. We might be able to use ST_Z and ST_M.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -13,6 +13,7 @@ | |
| is_multilinestring, | ||
| is_multipoint, | ||
| is_multipolygon, | ||
| is_not_null_island, | ||
| is_point, | ||
| is_polygon, | ||
| is_ogc_valid, | ||
|
|
@@ -333,6 +334,29 @@ def test_is_non_empty_geometry(skip_if_runtime_not_geo_compatible, spark): | |
| assert_df_equality(actual, expected, ignore_nullable=True) | ||
|
|
||
|
|
||
| def test_is_not_null_island(skip_if_runtime_not_geo_compatible, spark): | ||
| input_schema = "geom: string" | ||
| test_df = spark.createDataFrame( | ||
| [["POINT(1 1)"], ["POINT(0 0)"], ["LINESTRING(0 0, 1 1)"], ["nonsense"], [None]], | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should we test GeoJSON or WKB as well? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ideally we would be able to construct geometries directly in this constructor instead of starting with strings and converting them. The other formats are already checked in adjacent tests, so I think it is a bit overkill here. |
||
| input_schema, | ||
| ) | ||
|
|
||
| actual = test_df.select(is_not_null_island("geom")) | ||
|
|
||
| checked_schema = "geom_contains_null_island: string" | ||
| expected = spark.createDataFrame( | ||
| [ | ||
| [None], | ||
| ["column `geom` contains a null island"], | ||
| [None], | ||
| [None], | ||
| [None], | ||
| ], | ||
| checked_schema, | ||
| ) | ||
| assert_df_equality(actual, expected, ignore_nullable=True) | ||
|
|
||
|
|
||
| def test_has_dimension(skip_if_runtime_not_geo_compatible, spark): | ||
| input_schema = "geom: string" | ||
| test_df = spark.createDataFrame( | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The description is inconsistent with the check’s intent and message. Recommend: 'Ensures values are not at the Null Island (POINT(0 0)); flags rows at (0,0).'