Feat add json validation checks #616
base: main
Changes from 36 commits
    @@ -1747,6 +1747,117 @@ def apply(df: DataFrame) -> DataFrame:
        return condition, apply


    @register_rule("row")
    def is_valid_json(column: str | Column) -> Column:
        """
        Checks whether the values in the input column are valid JSON strings.

        Args:
            column: Column name (str) or Column expression to check for valid JSON.

        Returns:
            A Spark Column representing the condition for invalid JSON strings.
        """
        col_str_norm, col_expr_str, col_expr = _get_normalized_column_and_expr(column)
        return make_condition(
            ~F.when(F.col(col_expr_str).isNotNull(), F.try_parse_json(col_expr_str).isNotNull()),
ghanse marked this conversation as resolved.
            F.concat_ws(
                "",
                F.lit("Value '"),
                col_expr.cast("string"),
                F.lit(f"' in Column '{col_expr_str}' is not a valid JSON string"),
            ),
            f"{col_str_norm}_is_not_valid_json",
        )


    @register_rule("row")
    def has_json_keys(column: str | Column, keys: list[str], require_all: bool = True) -> Column:
        """
        Checks whether the values in the input column contain specific JSON keys.

        Args:
            column: The name of the column or the column expression to check for JSON keys.
            keys: The list of JSON keys to check for.
            require_all: If True, all specified keys must be present. If False, at least one key must be present.

        Returns:
            Column: A Spark Column representing the condition for missing JSON keys.
        """
        if not keys:
            raise InvalidParameterError("The 'keys' parameter must be a non-empty list of strings.")
        if any(not isinstance(k, str) for k in keys):
            raise InvalidParameterError("All keys must be of type string.")

        col_str_norm, col_expr_str, col_expr = _get_normalized_column_and_expr(column)
        json_keys_array = F.json_object_keys(col_expr)
Suggested change:
    -    json_keys_array = F.json_object_keys(col_expr)
    +    json_keys_array = F.expr(f"json_object_keys({col_expr_str})")
Interestingly, json_object_keys will only return keys of the outer object. This is probably fine, but we should document it explicitly.
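To make the outer-keys behaviour concrete, here is a plain-Python analogue rather than the Spark call itself. It is only a sketch: it uses the standard `json` module, the helper name `top_level_keys` is hypothetical, and returning `None` for non-object or unparseable input is meant to mirror how Spark's `json_object_keys` returns NULL in those cases.

```python
import json


def top_level_keys(value):
    """Return the keys of the outermost JSON object, or None otherwise.

    Hypothetical helper: a stand-in for Spark's json_object_keys,
    not the implementation used in this PR.
    """
    if value is None:
        return None
    try:
        parsed = json.loads(value)
    except (TypeError, ValueError):
        # Unparseable input: no keys can be reported.
        return None
    if not isinstance(parsed, dict):
        # Valid JSON that is not an object (array, number, string, ...).
        return None
    return list(parsed.keys())


doc = '{"user": {"id": 1, "name": "a"}, "ts": 5}'
print(top_level_keys(doc))  # ['user', 'ts'] -- nested "id" and "name" are not listed
```

A check for the key `id` against this value would therefore report it as missing, which is the case the documentation should call out.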
Please group the examples for has_json_keys and is_valid_json together.
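As a possible starting point for grouped examples, the row-level semantics of both checks can be sketched in plain Python. This is an approximation under stated assumptions, not the PR's Spark code: `json.loads` stands in for `try_parse_json` (the two parsers may disagree on edge cases), the names `is_invalid_json` and `is_missing_keys` are hypothetical, `ValueError` stands in for `InvalidParameterError`, and the treatment of null or unparseable values in `is_missing_keys` is assumed because that part of the diff is not shown.

```python
import json


def is_invalid_json(value):
    """True when the row would be flagged, None for null input.

    Mirrors ~when(col.isNotNull(), parsed.isNotNull()): with no otherwise(),
    a null input yields NULL, and NOT NULL is still NULL, so nulls are not flagged.
    """
    if value is None:
        return None
    try:
        json.loads(value)
    except ValueError:
        return True
    return False


def is_missing_keys(value, keys, require_all=True):
    """True when the required top-level keys are absent (assumed semantics)."""
    if not keys:
        raise ValueError("The 'keys' parameter must be a non-empty list of strings.")
    if value is None:
        return None  # assumption: nulls are not flagged
    try:
        parsed = json.loads(value)
    except ValueError:
        return True  # assumption: unparseable values count as missing keys
    if not isinstance(parsed, dict):
        return True  # assumption: non-object JSON has no keys
    present = [k in parsed for k in keys]
    satisfied = all(present) if require_all else any(present)
    return not satisfied


rows = ['{"a": 1, "b": 2}', '{"a": 1}', '{bad', None]
for row in rows:
    print(
        row,
        is_invalid_json(row),
        is_missing_keys(row, ["a", "b"]),
        is_missing_keys(row, ["a", "b"], require_all=False),
    )
```

The second row is the interesting one: it is valid JSON, fails the check when `require_all=True`, and passes when `require_all=False`.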