
Conversation

Contributor

@cornzyblack cornzyblack commented Oct 15, 2025

Changes

Linked issues

Resolves #595

Tests

  • manually tested
  • added unit tests
  • added integration tests
  • added end-to-end tests
  • added performance tests

@mwojtyczka mwojtyczka requested a review from Copilot October 16, 2025 09:14
Contributor

Copilot AI left a comment


Pull Request Overview

Adds row-level JSON validation checks and integrates them into examples and tests.

  • Introduces is_valid_json and has_json_keys row checks.
  • Updates YAML examples, reference docs, and integration/unit tests to cover the new checks.

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.

Summary per file

  • src/databricks/labs/dqx/check_funcs.py: Adds JSON validation/check functions; core logic for the new checks.
  • tests/unit/test_build_rules.py: Extends metadata conversion tests to include the new JSON checks.
  • tests/integration/test_apply_checks.py: Adds col_json_str to test schemas and values; exercises the new checks in streaming and class-based tests.
  • tests/resources/all_row_checks.yaml: Includes the is_valid_json check in the “all row checks” YAML.
  • src/databricks/labs/dqx/llm/resources/yaml_checks_examples.yml: Adds examples for is_valid_json and has_json_keys.
  • docs/dqx/docs/reference/quality_checks.mdx: Documents the new checks and shows usage examples.
Comments suppressed due to low confidence (1)

docs/dqx/docs/reference/quality_checks.mdx:1

  • Both examples use the same name 'col_json_str_has_json_keys', which is confusing and may collide in practice. Use distinct, descriptive names (e.g., 'col_json_str_has_no_json_key1' and 'col_json_str_has_no_json_key1_key2').
---


Comment on lines 1761 to 1768
col_str_norm, col_expr_str, col_expr = _get_normalized_column_and_expr(column)
return make_condition(
~F.when(F.col(col_expr_str).isNotNull(), F.try_parse_json(col_expr_str).isNotNull()),
F.concat_ws(
"", F.lit("Value '"), col_expr.cast("string"), F.lit(f"' in Column '{col_expr_str}' is not a valid JSON")
),
f"{col_str_norm}_is_not_valid_json",
)

Copilot AI Oct 16, 2025


is_valid_json dereferences the column by name (F.col(col_expr_str)) rather than using the resolved expression (col_expr). This breaks when the caller supplies a column expression (e.g., F.trim('c')), which may not exist as a named column. Use col_expr consistently in both the null and try_parse_json checks.
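A minimal sketch of that fix, reusing the helpers from the snippet above (make_condition and _get_normalized_column_and_expr are assumed to behave as in the PR):

col_str_norm, col_expr_str, col_expr = _get_normalized_column_and_expr(column)
return make_condition(
    # Use the resolved expression (col_expr) so callers can pass expressions
    # such as F.trim('c') that do not exist as named columns.
    ~F.when(col_expr.isNotNull(), F.try_parse_json(col_expr).isNotNull()),
    F.concat_ws(
        "", F.lit("Value '"), col_expr.cast("string"), F.lit(f"' in Column '{col_expr_str}' is not a valid JSON")
    ),
    f"{col_str_norm}_is_not_valid_json",
)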


unique_keys_lit = F.lit(list(set(keys)))
col_str_norm, col_expr_str, col_expr = _get_normalized_column_and_expr(column)
json_keys_array = F.json_object_keys(col_expr)

Copilot AI Oct 16, 2025


pyspark.sql.functions does not expose json_object_keys; this will raise AttributeError at runtime. Use F.expr to call the SQL function or derive keys via from_json + map_keys. For example: json_keys_array = F.expr(f"json_object_keys({col_expr_str})").

Suggested change
json_keys_array = F.json_object_keys(col_expr)
json_keys_array = F.expr(f"json_object_keys({col_expr_str})")
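For reference, the from_json + map_keys alternative mentioned above could look like the following sketch; it assumes the column holds a JSON object (from_json re-serializes nested values as strings in a map<string,string>):

# from_json returns null for unparsable input, so map_keys then yields null too
json_keys_array = F.map_keys(F.from_json(col_expr, "map<string,string>"))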

Comment on lines 1793 to 1797

if require_all:
condition = F.size(F.array_except(unique_keys_lit, json_keys_array)) == 0
else:
condition = F.when(is_valid_json(col_str_norm).isNull(), F.arrays_overlap(json_keys_array, unique_keys_lit))

Copilot AI Oct 16, 2025


has_json_keys returns a null condition (and thus passes) for invalid JSON or null values, which means it won't flag missing keys unless a separate is_valid_json rule is also configured. To make has_json_keys self-contained, include a JSON-validity guard in the condition (so invalid JSON fails this check as well). For example: compute json_valid = F.try_parse_json(col_expr).isNotNull() and combine it with the key presence logic.

Suggested change
if require_all:
condition = F.size(F.array_except(unique_keys_lit, json_keys_array)) == 0
else:
condition = F.when(is_valid_json(col_str_norm).isNull(), F.arrays_overlap(json_keys_array, unique_keys_lit))
json_valid = F.try_parse_json(col_expr).isNotNull()
if require_all:
condition = json_valid & (F.size(F.array_except(unique_keys_lit, json_keys_array)) == 0)
else:
condition = json_valid & F.arrays_overlap(json_keys_array, unique_keys_lit)
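With this guard, null or unparsable values yield a false condition, so the check reports them as failures on its own rather than passing silently until a separate is_valid_json rule is configured.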

if not isinstance(key, (str)):
raise InvalidParameterError("All keys must be of type string.")

unique_keys_lit = F.lit(list(set(keys)))

Copilot AI Oct 16, 2025


Using set(keys) loses ordering and can lead to non-deterministic error messages (and test flakiness). Prefer a stable order: unique_keys_lit = F.lit(sorted(set(keys))).

Suggested change
unique_keys_lit = F.lit(list(set(keys)))
unique_keys_lit = F.lit(sorted(set(keys)))

F.concat_ws(", ", F.lit(keys)),
F.lit("]"),
),
f"{col_str_norm}_has_no_json_keys",

Copilot AI Oct 16, 2025


The alias for has_json_keys does not include the target keys. If this rule is applied multiple times to the same column with different key sets, the resulting alias collides (e.g., duplicate 'col_json_str_has_no_json_keys'). Include the keys (and optionally require_all) in the alias to ensure uniqueness, e.g., alias_name = f"{col_str_norm}_has_no_json_{'_'.join(sorted(set(keys)))}".

Suggested change
f"{col_str_norm}_has_no_json_keys",
f"{col_str_norm}_has_no_json_{'_'.join(sorted(set(keys)))}{'_all' if require_all else '_any'}",

Comment on lines 41 to 42
| `is_valid_json` | Checks whether the values in the input column are valid JSON objects. | `column`: column to check (can be a string column name or a column expression) |
| `has_json_keys` | Checks whether the values in the input column contain specific JSON keys. | `column`: column to check (can be a string column name or a column expression); `keys`: list of JSON keys to check for |

Copilot AI Oct 16, 2025


Documentation gaps: (1) is_valid_json validates any JSON (object/array/etc.), not only 'JSON objects'—please reword. (2) has_json_keys supports a require_all flag (default True) but it's not documented—add it to the parameter list.

Suggested change
| `is_valid_json` | Checks whether the values in the input column are valid JSON objects. | `column`: column to check (can be a string column name or a column expression) |
| `has_json_keys` | Checks whether the values in the input column contain specific JSON keys. | `column`: column to check (can be a string column name or a column expression); `keys`: list of JSON keys to check for |
| `is_valid_json` | Checks whether the values in the input column are valid JSON (objects, arrays, strings, numbers, etc.), not just JSON objects. | `column`: column to check (can be a string column name or a column expression) |
| `has_json_keys` | Checks whether the values in the input column contain specific JSON keys. | `column`: column to check (can be a string column name or a column expression); `keys`: list of JSON keys to check for; `require_all`: whether all keys must be present (default: True) |

keys:
- key1
- criticality: error
name: col_json_str_has_no_json_keys2

Copilot AI Oct 16, 2025


The example name 'col_json_str_has_no_json_keys2' is ambiguous. Consider aligning with the keys being checked (e.g., 'col_json_str_has_no_json_key1_key2') for clarity and consistency with other examples.

Suggested change
name: col_json_str_has_no_json_keys2
name: col_json_str_has_no_json_key1_key2

Contributor

@mwojtyczka mwojtyczka left a comment


Please add performance and integration tests for the new check functions. The original ticket also mentioned has_json_schema.

| `is_valid_date` | Checks whether the values in the input column have valid date formats. | `column`: column to check (can be a string column name or a column expression); `date_format`: optional date format (e.g. 'yyyy-mm-dd') |
| `is_valid_timestamp` | Checks whether the values in the input column have valid timestamp formats. | `column`: column to check (can be a string column name or a column expression); `timestamp_format`: optional timestamp format (e.g. 'yyyy-mm-dd HH:mm:ss') |
| `is_valid_json` | Checks whether the values in the input column are valid JSON objects. | `column`: column to check (can be a string column name or a column expression) |
| `has_json_keys` | Checks whether the values in the input column contain specific JSON keys. | `column`: column to check (can be a string column name or a column expression); `keys`: list of JSON keys to check for |
Contributor


require_all is missing.

name="col6_is_not_valid_timestamp2"
),

# is_valid_json check
Contributor


Please group the examples for has_json_keys and is_valid_json together.

Checks whether the values in the input column contain specific JSON keys.
Args:
column (str | Column): The name of the column or the column itself to check for JSON keys.
Contributor


Suggested change
column (str | Column): The name of the column or the column itself to check for JSON keys.
column (str | Column): The name of the column or the column expression to check for JSON keys.

function: is_valid_json
arguments:
column: col_json_str

Contributor


Add has_json_keys as well.

Contributor Author

cornzyblack commented Oct 17, 2025

Updates:

  • WIP: Still working on has_json_schema
  • Added unit, integration and performance tests for is_valid_json and has_json_keys (grouped them)

Notes

  • Unit tests for has_valid_json_schema might not be possible (see the screenshot below): it raised an error when I ran it, likely because _expected_schema = _get_schema(schema) uses types.StructType.fromDDL(input_schema), which requires a Spark session.
  • I renamed has_json_schema to has_valid_json_schema as I feel it is more descriptive. (Happy to revert once it is done, if needed.)
(screenshot of the error)
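For context, a minimal sketch of the session dependency described above (a hypothetical snippet, not code from the PR):

from pyspark.sql import SparkSession, types

# StructType.fromDDL parses the DDL string via Spark's parser, so it needs
# an active Spark session; in a bare unit-test process this call errors out.
spark = SparkSession.builder.getOrCreate()
schema = types.StructType.fromDDL("id INT, name STRING")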

Contributor

mwojtyczka commented Oct 18, 2025

  • has_valid_json_schema

If you need a Spark session, you have to use integration tests.
Agree, has_valid_json_schema is better.

raise InvalidParameterError("All keys must be of type string.")

col_str_norm, col_expr_str, col_expr = _get_normalized_column_and_expr(column)
json_keys_array = F.json_object_keys(col_expr)
Contributor


Interestingly, json_object_keys will only return keys of the outer object. This is probably fine, but we should document it explicitly.
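A quick sketch illustrating that behavior (assumes a running Spark session):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# json_object_keys returns only the keys of the outermost JSON object;
# the nested key "c" below is not included.
spark.sql("""SELECT json_object_keys('{"a": 1, "b": {"c": 2}}') AS keys""").show(truncate=False)
# keys: [a, b]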
