Feat add json validation checks #616
Conversation
Pull Request Overview
Adds row-level JSON validation checks and integrates them into examples and tests.
- Introduces `is_valid_json` and `has_json_keys` row checks.
- Updates YAML examples, reference docs, and integration/unit tests to cover the new checks.
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| src/databricks/labs/dqx/check_funcs.py | Adds JSON validation/check functions; core logic for new checks. |
| tests/unit/test_build_rules.py | Extends metadata conversion tests to include new JSON checks. |
| tests/integration/test_apply_checks.py | Adds col_json_str to test schemas and values; exercises new checks in streaming and class-based tests. |
| tests/resources/all_row_checks.yaml | Includes is_valid_json check in the “all row checks” YAML. |
| src/databricks/labs/dqx/llm/resources/yaml_checks_examples.yml | Adds examples for is_valid_json and has_json_keys. |
| docs/dqx/docs/reference/quality_checks.mdx | Documents new checks and shows usage examples. |
Comments suppressed due to low confidence (1)
docs/dqx/docs/reference/quality_checks.mdx:1
- Both examples use the same name 'col_json_str_has_json_keys', which is confusing and may collide in practice. Use distinct, descriptive names (e.g., 'col_json_str_has_no_json_key1' and 'col_json_str_has_no_json_key1_key2').
```python
col_str_norm, col_expr_str, col_expr = _get_normalized_column_and_expr(column)
return make_condition(
    ~F.when(F.col(col_expr_str).isNotNull(), F.try_parse_json(col_expr_str).isNotNull()),
    F.concat_ws(
        "", F.lit("Value '"), col_expr.cast("string"), F.lit(f"' in Column '{col_expr_str}' is not a valid JSON")
    ),
    f"{col_str_norm}_is_not_valid_json",
)
```
**Copilot AI** commented on Oct 16, 2025:
is_valid_json dereferences the column by name (F.col(col_expr_str)) rather than using the resolved expression (col_expr). This breaks when the caller supplies a column expression (e.g., F.trim('c')), which may not exist as a named column. Use col_expr consistently in both the null and try_parse_json checks.
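For illustration, a minimal sketch of the suggested fix, using `col_expr` consistently (helper names as in `check_funcs.py`):

```python
col_str_norm, col_expr_str, col_expr = _get_normalized_column_and_expr(column)
return make_condition(
    # Use the resolved expression rather than a by-name lookup, so callers can
    # pass expressions like F.trim('c') that do not exist as named columns.
    ~F.when(col_expr.isNotNull(), F.try_parse_json(col_expr).isNotNull()),
    F.concat_ws(
        "", F.lit("Value '"), col_expr.cast("string"), F.lit(f"' in Column '{col_expr_str}' is not a valid JSON")
    ),
    f"{col_str_norm}_is_not_valid_json",
)
```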
```python
unique_keys_lit = F.lit(list(set(keys)))
col_str_norm, col_expr_str, col_expr = _get_normalized_column_and_expr(column)
json_keys_array = F.json_object_keys(col_expr)
```
**Copilot AI** commented on Oct 16, 2025:
pyspark.sql.functions does not expose json_object_keys; this will raise AttributeError at runtime. Use F.expr to call the SQL function or derive keys via from_json + map_keys. For example: json_keys_array = F.expr(f"json_object_keys({col_expr_str})").
Suggested change:

```diff
- json_keys_array = F.json_object_keys(col_expr)
+ json_keys_array = F.expr(f"json_object_keys({col_expr_str})")
```
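Alternatively, the `from_json` + `map_keys` route mentioned above could look like this (a sketch; it assumes the column holds a JSON object whose top-level values can be read as strings):

```python
# Parse the outer object as a map and take its keys; only the keys are used,
# so the exact value type mostly does not matter here.
parsed = F.from_json(col_expr, "map<string,string>")
json_keys_array = F.map_keys(parsed)
```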
```python
if require_all:
    condition = F.size(F.array_except(unique_keys_lit, json_keys_array)) == 0
else:
    condition = F.when(is_valid_json(col_str_norm).isNull(), F.arrays_overlap(json_keys_array, unique_keys_lit))
```
**Copilot AI** commented on Oct 16, 2025:
has_json_keys returns a null condition (and thus passes) for invalid JSON or null values, which means it won't flag missing keys unless a separate is_valid_json rule is also configured. To make has_json_keys self-contained, include a JSON-validity guard in the condition (so invalid JSON fails this check as well). For example: compute json_valid = F.try_parse_json(col_expr).isNotNull() and combine it with the key presence logic.
Suggested change:

```diff
- if require_all:
-     condition = F.size(F.array_except(unique_keys_lit, json_keys_array)) == 0
- else:
-     condition = F.when(is_valid_json(col_str_norm).isNull(), F.arrays_overlap(json_keys_array, unique_keys_lit))
+ json_valid = F.try_parse_json(col_expr).isNotNull()
+ if require_all:
+     condition = json_valid & (F.size(F.array_except(unique_keys_lit, json_keys_array)) == 0)
+ else:
+     condition = json_valid & F.arrays_overlap(json_keys_array, unique_keys_lit)
```
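To see why the guard matters, a quick illustration (assumes a Spark 4.0+/Databricks runtime where `F.try_parse_json` exists, plus an active `spark` session):

```python
df = spark.createDataFrame([('{"key1": 1}',), ("not json",), (None,)], ["col_json_str"])
# try_parse_json returns NULL for unparseable input, so isNotNull() acts as a validity flag
df.select(F.try_parse_json("col_json_str").isNotNull().alias("json_valid")).show()
# json_valid: true, false, false
```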
```python
if not isinstance(key, str):
    raise InvalidParameterError("All keys must be of type string.")

unique_keys_lit = F.lit(list(set(keys)))
```
**Copilot AI** commented on Oct 16, 2025:
Using set(keys) loses ordering and can lead to non-deterministic error messages (and test flakiness). Prefer a stable order: unique_keys_lit = F.lit(sorted(set(keys))).
Suggested change:

```diff
- unique_keys_lit = F.lit(list(set(keys)))
+ unique_keys_lit = F.lit(sorted(set(keys)))
```
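The flakiness is easy to see: set iteration order can change between interpreter runs because of hash randomization, while `sorted` pins it down.

```python
keys = ["key2", "key1", "key2"]
list(set(keys))    # order may vary across runs (PYTHONHASHSEED)
sorted(set(keys))  # always ['key1', 'key2']
```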
```python
        F.concat_ws(", ", F.lit(keys)),
        F.lit("]"),
    ),
    f"{col_str_norm}_has_no_json_keys",
```
**Copilot AI** commented on Oct 16, 2025:
The alias for has_json_keys does not include the target keys. If this rule is applied multiple times to the same column with different key sets, the resulting alias collides (e.g., duplicate 'col_json_str_has_no_json_keys'). Include the keys (and optionally require_all) in the alias to ensure uniqueness, e.g., `alias_name = f"{col_str_norm}_has_no_json_{'_'.join(sorted(set(keys)))}"`.
| f"{col_str_norm}_has_no_json_keys", | |
| f"{col_str_norm}_has_no_json_{'_'.join(sorted(set(keys)))}{'_all' if require_all else '_any'}", |
```md
| `is_valid_json` | Checks whether the values in the input column are valid JSON objects. | `column`: column to check (can be a string column name or a column expression) |
| `has_json_keys` | Checks whether the values in the input column contain specific JSON keys. | `column`: column to check (can be a string column name or a column expression); `keys`: list of JSON keys to check for |
```
**Copilot AI** commented on Oct 16, 2025:
Documentation gaps: (1) is_valid_json validates any JSON (object/array/etc.), not only 'JSON objects'—please reword. (2) has_json_keys supports a require_all flag (default True) but it's not documented—add it to the parameter list.
Suggested change:

```diff
- | `is_valid_json` | Checks whether the values in the input column are valid JSON objects. | `column`: column to check (can be a string column name or a column expression) |
- | `has_json_keys` | Checks whether the values in the input column contain specific JSON keys. | `column`: column to check (can be a string column name or a column expression); `keys`: list of JSON keys to check for |
+ | `is_valid_json` | Checks whether the values in the input column are valid JSON (objects, arrays, strings, numbers, etc.), not just JSON objects. | `column`: column to check (can be a string column name or a column expression) |
+ | `has_json_keys` | Checks whether the values in the input column contain specific JSON keys. | `column`: column to check (can be a string column name or a column expression); `keys`: list of JSON keys to check for; `require_all`: whether all keys must be present (default: True) |
```
```yaml
      keys:
        - key1
- criticality: error
  name: col_json_str_has_no_json_keys2
```
**Copilot AI** commented on Oct 16, 2025:
The example name 'col_json_str_has_no_json_keys2' is ambiguous. Consider aligning with the keys being checked (e.g., 'col_json_str_has_no_json_key1_key2') for clarity and consistency with other examples.
Suggested change:

```diff
- name: col_json_str_has_no_json_keys2
+ name: col_json_str_has_no_json_key1_key2
```
Please add performance and integration tests for the new check functions. The original ticket also mentioned `has_json_schema`.
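Something along these lines could serve as a starting point (a sketch; the fixture and assertion details depend on the existing integration-test setup, and it assumes `make_condition` yields NULL for passing rows):

```python
import pyspark.sql.functions as F
from databricks.labs.dqx.check_funcs import is_valid_json

def test_is_valid_json_flags_invalid_rows(spark):
    df = spark.createDataFrame([('{"key1": 1}',), ("not json",)], ["col_json_str"])
    result = df.select(is_valid_json("col_json_str"))
    # failing rows carry a message, passing rows are NULL
    assert result.filter(F.col("col_json_str_is_not_valid_json").isNotNull()).count() == 1
```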
```md
| `is_valid_date` | Checks whether the values in the input column have valid date formats. | `column`: column to check (can be a string column name or a column expression); `date_format`: optional date format (e.g. 'yyyy-mm-dd') |
| `is_valid_timestamp` | Checks whether the values in the input column have valid timestamp formats. | `column`: column to check (can be a string column name or a column expression); `timestamp_format`: optional timestamp format (e.g. 'yyyy-mm-dd HH:mm:ss') |
| `is_valid_json` | Checks whether the values in the input column are valid JSON objects. | `column`: column to check (can be a string column name or a column expression) |
| `has_json_keys` | Checks whether the values in the input column contain specific JSON keys. | `column`: column to check (can be a string column name or a column expression); `keys`: list of JSON keys to check for |
```
`require_all` is missing from the parameter list.
| name="col6_is_not_valid_timestamp2" | ||
| ), | ||
|
|
||
| # is_valid_json check |
Please group the examples for `has_json_keys` and `is_valid_json` together.
```python
Checks whether the values in the input column contain specific JSON keys.

Args:
    column (str | Column): The name of the column or the column itself to check for JSON keys.
```
Suggested change:

```diff
- column (str | Column): The name of the column or the column itself to check for JSON keys.
+ column (str | Column): The name of the column or the column expression to check for JSON keys.
```
```yaml
    function: is_valid_json
    arguments:
      column: col_json_str
```
Please add a `has_json_keys` example as well.
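For example, mirroring the entry above (a sketch; key names are illustrative and the layout assumes the same structure as the surrounding examples):

```yaml
- criticality: error
  check:
    function: has_json_keys
    arguments:
      column: col_json_str
      keys:
        - key1
        - key2
```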
If you need a Spark session, you have to use integration tests.
```python
    raise InvalidParameterError("All keys must be of type string.")

col_str_norm, col_expr_str, col_expr = _get_normalized_column_and_expr(column)
json_keys_array = F.json_object_keys(col_expr)
```
Interestingly, `json_object_keys` will only return keys of the outer object. This is probably fine, but we should document it explicitly.
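A one-liner that could make this explicit in the docs (illustrative):

```python
spark.sql("""SELECT json_object_keys('{"a": 1, "b": {"c": 2}}')""").show(truncate=False)
# returns ["a", "b"]; the nested key "c" is not included
```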
…ack/dqx into feat-add-json-validation-checks
**Changes**

**Linked issues**

Resolves #595

**Tests**