-
Notifications
You must be signed in to change notification settings - Fork 69
add null island check #613
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from 1 commit
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -357,6 +357,35 @@ def is_non_empty_geometry(column: str | Column) -> Column: | |||||||||||||||
| ) | ||||||||||||||||
|
|
||||||||||||||||
|
|
||||||||||||||||
| @register_rule("row") | ||||||||||||||||
| def is_not_null_island(column: str | Column) -> Column: | ||||||||||||||||
| """Checks whether the values in the input column are NULL island geometries (POINT(0 0)). | ||||||||||||||||
|
|
||||||||||||||||
| Args: | ||||||||||||||||
| column: column to check; can be a string column name or a column expression | ||||||||||||||||
|
|
||||||||||||||||
| Returns: | ||||||||||||||||
| Column object indicating whether the values in the input column are NULL island geometries | ||||||||||||||||
|
||||||||||||||||
| Column object indicating whether the values in the input column are NULL island geometries | |
| Column object containing error messages for values that are NULL island geometries (POINT(0 0)), and null otherwise |
Copilot
AI
Oct 16, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docstring contradicts the function name and behavior. The check ensures values are not at Null Island and flags rows that are POINT(0 0). Suggest updating to: Summary: 'Ensures values are not at the Null Island (POINT(0 0)); flags rows at (0,0)'. Returns: 'Column with error message for rows at (0,0), otherwise null'.
Copilot
AI
Oct 16, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The expression repeats try_to_geometry three times, which makes the code harder to read and update. Consider factoring the subexpression into a variable to DRY it up, e.g.: try_geom = f"try_to_geometry({col_str_norm})" then reuse in the subsequent F.expr calls.
| geom_cond = F.expr(f"try_to_geometry({col_str_norm}) IS NULL") | |
| is_point_cond = F.expr(f"st_geometrytype(try_to_geometry({col_str_norm})) = '{POINT_TYPE}'") | |
| is_zero_zero = F.expr(f"st_x(try_to_geometry({col_str_norm})) = 0.0 AND st_y(try_to_geometry({col_str_norm})) = 0.0") | |
| try_geom = f"try_to_geometry({col_str_norm})" | |
| geom_cond = F.expr(f"{try_geom} IS NULL") | |
| is_point_cond = F.expr(f"st_geometrytype({try_geom}) = '{POINT_TYPE}'") | |
| is_zero_zero = F.expr(f"st_x({try_geom}) = 0.0 AND st_y({try_geom}) = 0.0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about POINTZ and POINTZM types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would you expect? When people talk about the null island, they mean the point at 0 lat, 0 long. I don't know if POINTZ/POINTZM makes sense in this setting, and was therefore a bit "liberal" in my definition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think something like POINT Z (0 0 0) and POINT ZM (0 0 0 0) would be equivalent? Just to validate that the data is not a placeholder geometry or generated by some process that defaults values to 0. We might be able to use ST_Z and ST_M.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -13,6 +13,7 @@ | |
| is_multilinestring, | ||
| is_multipoint, | ||
| is_multipolygon, | ||
| is_not_null_island, | ||
| is_point, | ||
| is_polygon, | ||
| is_ogc_valid, | ||
|
|
@@ -333,6 +334,29 @@ def test_is_non_empty_geometry(skip_if_runtime_not_geo_compatible, spark): | |
| assert_df_equality(actual, expected, ignore_nullable=True) | ||
|
|
||
|
|
||
| def test_is_not_null_island(skip_if_runtime_not_geo_compatible, spark): | ||
| input_schema = "geom: string" | ||
| test_df = spark.createDataFrame( | ||
| [["POINT(1 1)"], ["POINT(0 0)"], ["LINESTRING(0 0, 1 1)"], ["nonsense"], [None]], | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should we test GeoJSON or WKB as well? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ideally we would be able to construct geometries directly in this constructor instead of starting with strings and converting them. The other formats are already checked in adjacent tests, so I think it is a bit overkill here. |
||
| input_schema, | ||
| ) | ||
|
|
||
| actual = test_df.select(is_not_null_island("geom")) | ||
|
|
||
| checked_schema = "geom_contains_null_island: string" | ||
| expected = spark.createDataFrame( | ||
| [ | ||
| [None], | ||
| ["column `geom` contains a null island"], | ||
| [None], | ||
| [None], | ||
| [None], | ||
| ], | ||
| checked_schema, | ||
| ) | ||
| assert_df_equality(actual, expected, ignore_nullable=True) | ||
|
|
||
|
|
||
| def test_has_dimension(skip_if_runtime_not_geo_compatible, spark): | ||
| input_schema = "geom: string" | ||
| test_df = spark.createDataFrame( | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docstring states this checks whether values are NULL island geometries, but based on the function name
is_not_null_islandand the condition logic, it should state this checks whether values are not NULL island geometries.