Draft
48 commits
6cd9f02
move criticality of rule into _validate_attributes
cornzyblack Jul 17, 2025
98371b7
since criticality is validated after creation, filter by criticality …
cornzyblack Jul 17, 2025
1a58e16
Merge branch 'main' of github.com:cornzyblack/dqx
cornzyblack Jul 18, 2025
98803bc
Merge branch 'main' of github.com:cornzyblack/dqx
cornzyblack Jul 23, 2025
acf3767
Merge branch 'main' of github.com:cornzyblack/dqx
cornzyblack Jul 24, 2025
0766f2b
Merge branch 'main' of github.com:cornzyblack/dqx
cornzyblack Aug 2, 2025
3d0fd34
Merge branch 'main' of github.com:cornzyblack/dqx
cornzyblack Aug 7, 2025
e5712fc
Merge branch 'main' of github.com:cornzyblack/dqx
cornzyblack Aug 7, 2025
9bf6d98
Merge branch 'main' of github.com:cornzyblack/dqx
cornzyblack Aug 13, 2025
1e4d783
Merge branch 'main' of github.com:cornzyblack/dqx
cornzyblack Sep 1, 2025
fcdb1ce
Merge branch 'main' of github.com:cornzyblack/dqx
cornzyblack Sep 8, 2025
2393404
Merge branch 'main' of github.com:cornzyblack/dqx
cornzyblack Sep 16, 2025
eddc874
Merge branch 'main' of github.com:cornzyblack/dqx
cornzyblack Sep 19, 2025
c378b6d
Merge branch 'main' of github.com:cornzyblack/dqx
cornzyblack Sep 25, 2025
cb6f9ef
Merge branch 'main' of github.com:cornzyblack/dqx
cornzyblack Oct 15, 2025
82c7a22
feat: add check for valid json
cornzyblack Oct 15, 2025
f1ec4af
feat: add checks for is_valid_json
cornzyblack Oct 15, 2025
dfa9649
feat: add is_valid_json
cornzyblack Oct 15, 2025
89f2811
feat: add has_json_keys
cornzyblack Oct 15, 2025
02466c1
refactor: change logic
cornzyblack Oct 15, 2025
ccb6e05
refactor: invert
cornzyblack Oct 15, 2025
156a9c2
refactor: negate
cornzyblack Oct 15, 2025
8d30ff6
refactor: update
cornzyblack Oct 15, 2025
1873d72
refactor: update
cornzyblack Oct 15, 2025
5109c27
refactor: update
cornzyblack Oct 15, 2025
ceecf7d
refactor: updates
cornzyblack Oct 15, 2025
0c94089
refactor: change and update
cornzyblack Oct 15, 2025
246833b
Merge branch 'main' into feat-add-json-validation-checks
mwojtyczka Oct 16, 2025
05365e0
refactor: fix docs
cornzyblack Oct 16, 2025
7be64e6
refactor: updates
cornzyblack Oct 16, 2025
c7d8406
refactor: update logic
cornzyblack Oct 16, 2025
66cbb13
refactor: explicit True
cornzyblack Oct 16, 2025
70e19bd
refactor: remove repetition
cornzyblack Oct 16, 2025
a168d64
refactor: remove as it depends on spark
cornzyblack Oct 16, 2025
c3c23e7
feat: add perf test for 2 tests (remaining 1)
cornzyblack Oct 17, 2025
984bbb8
Merge branch 'main' into feat-add-json-validation-checks
mwojtyczka Oct 18, 2025
3e63312
refactor: switch back to has_json_schema
cornzyblack Oct 21, 2025
9ed893a
Merge branch 'feat-add-json-validation-checks' of github.com:cornzybl…
cornzyblack Oct 21, 2025
a72bdb1
docs: document properly that function only checks outside keys
cornzyblack Oct 21, 2025
b8505e4
refactor: comment out to test
cornzyblack Oct 21, 2025
a177c01
refactor: try using transform for strict comparison
cornzyblack Oct 21, 2025
e0c3438
feat: implement changes
cornzyblack Oct 21, 2025
853c8c0
Merge branch 'main' into feat-add-json-validation-checks
cornzyblack Oct 22, 2025
3b0fd52
format and add tests
cornzyblack Oct 22, 2025
44881fe
Merge branch 'feat-add-json-validation-checks' of github.com:cornzybl…
cornzyblack Oct 22, 2025
7b19d00
refactor: add to markdown
cornzyblack Oct 22, 2025
96cbc8e
updates
cornzyblack Oct 22, 2025
0ff6ccb
Merge branch 'main' into feat-add-json-validation-checks
mwojtyczka Oct 31, 2025
107 changes: 89 additions & 18 deletions docs/dqx/docs/reference/quality_checks.mdx
@@ -39,6 +39,9 @@ You can also define your own custom checks (see [Creating custom checks](#creati
| `is_not_greater_than` | Checks whether the values in the input column are not greater than the provided limit. | `column`: column to check (can be a string column name or a column expression); `limit`: limit as number, date, timestamp, column name or sql expression |
| `is_valid_date` | Checks whether the values in the input column have valid date formats. | `column`: column to check (can be a string column name or a column expression); `date_format`: optional date format (e.g. 'yyyy-MM-dd') |
| `is_valid_timestamp` | Checks whether the values in the input column have valid timestamp formats. | `column`: column to check (can be a string column name or a column expression); `timestamp_format`: optional timestamp format (e.g. 'yyyy-MM-dd HH:mm:ss') |
| `is_valid_json` | Checks whether the values in the input column are valid JSON objects. | `column`: column to check (can be a string column name or a column expression) |
| `has_json_keys` | Checks whether the values in the input column contain specific keys in the outermost JSON object. | `column`: column to check (can be a string column name or a column expression); `keys`: a list of JSON keys to verify within the outermost JSON object; `require_all`: optional boolean flag to require all keys to be present |
| `has_valid_json_schema` | Checks whether the values in the specified column, which contain JSON strings, conform to the expected schema. | `column`: column to check (can be a string column name or a column expression); `schema`: the schema as a DDL string (e.g., "id INT, name STRING") or StructType object |
| `is_not_in_future` | Checks whether the values in the input column contain a timestamp that is not in the future, where 'future' is defined as current_timestamp + offset (in seconds). | `column`: column to check (can be a string column name or a column expression); `offset`: offset to use; `curr_timestamp`: current timestamp, if not provided current_timestamp() function is used |
| `is_not_in_near_future` | Checks whether the values in the input column contain a timestamp that is not in the near future, where 'near future' is defined as greater than the current timestamp but less than the current_timestamp + offset (in seconds). | `column`: column to check (can be a string column name or a column expression); `offset`: offset to use; `curr_timestamp`: current timestamp, if not provided current_timestamp() function is used |
| `is_older_than_n_days` | Checks whether the values in one input column are at least N days older than the values in another column. | `column`: column to check (can be a string column name or a column expression); `days`: number of days; `curr_date`: current date, if not provided current_date() function is used; `negate`: if the condition should be negated |
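For intuition, the three new JSON checks above map closely onto stock PySpark JSON functions. Below is a minimal sketch, not the DQX implementation: it assumes Spark 3.5+ for `json_object_keys`, the column and key names are invented, and the real `has_valid_json_schema` is stricter than the plain parse test shown here.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('{"key1": 1, "key2": 2}',), ("not json",)],
    ["col_json_str"],
)

# json_object_keys returns the keys of the outermost JSON object, or NULL when
# the value is not a valid JSON object; that approximates is_valid_json and
# has_json_keys in one pass.
keys = F.json_object_keys("col_json_str")
df = (
    df.withColumn("is_valid_json_like", keys.isNotNull())
      .withColumn(
          "has_key1",
          F.coalesce(F.array_contains(keys, "key1"), F.lit(False)),
      )
)

# has_valid_json_schema, roughly: malformed JSON parses to NULL under the
# expected schema. The actual check also compares structure, not just
# parseability.
df = df.withColumn(
    "parses_under_schema",
    F.from_json("col_json_str", "a BIGINT, b BIGINT").isNotNull(),
)
df.show(truncate=False)
```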
@@ -323,6 +326,41 @@ For brevity, the `name` field in the examples is omitted and it will be auto-gen
      column: col5
      date_format: yyyy-MM-dd

# is_valid_json check
- criticality: error
  check:
    function: is_valid_json
    arguments:
      column: col_json_str

# has_json_keys check
- criticality: error
  check:
    function: has_json_keys
    arguments:
      column: col_json_str
      keys:
        - key1

- criticality: error
  name: col_json_str_does_not_have_json_keys2
  check:
    function: has_json_keys
    arguments:
      column: col_json_str
      keys:
        - key1
        - key2
      require_all: False

- criticality: error
  name: col_json_str2_has_invalid_json_schema
  check:
    function: has_valid_json_schema
    arguments:
      column: col_json_str2
      schema: "STRUCT<a: BIGINT, b: BIGINT>"

# is_valid_timestamp check
- criticality: error
  check:
@@ -532,42 +570,42 @@ For brevity, the `name` field in the examples is omitted and it will be auto-gen
    function: is_linestring
    arguments:
      column: linestring_geom

# is_polygon check
- criticality: error
  check:
    function: is_polygon
    arguments:
      column: polygon_geom

# is_multipoint check
- criticality: error
  check:
    function: is_multipoint
    arguments:
      column: multipoint_geom

# is_multilinestring check
- criticality: error
  check:
    function: is_multilinestring
    arguments:
      column: multilinestring_geom

# is_multipolygon check
- criticality: error
  check:
    function: is_multipolygon
    arguments:
      column: multipolygon_geom

# is_geometrycollection check
- criticality: error
  check:
    function: is_geometrycollection
    arguments:
      column: geometrycollection_geom

# is_ogc_valid check
- criticality: error
  check:
@@ -581,15 +619,15 @@ For brevity, the `name` field in the examples is omitted and it will be auto-gen
    function: is_non_empty_geometry
    arguments:
      column: point_geom

# has_dimension check
- criticality: error
  check:
    function: has_dimension
    arguments:
      column: polygon_geom
      dimension: 2

# has_x_coordinate_between check
- criticality: error
  check:
@@ -598,7 +636,7 @@ For brevity, the `name` field in the examples is omitted and it will be auto-gen
      column: polygon_geom
      min_value: 0.0
      max_value: 10.0

# has_y_coordinate_between check
- criticality: error
  check:
@@ -607,6 +645,7 @@ For brevity, the `name` field in the examples is omitted and it will be auto-gen
      column: polygon_geom
      min_value: 0.0
      max_value: 10.0

```
</details>
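Once defined in YAML, checks like the ones above are applied through the engine's metadata API. A minimal sketch, assuming the `apply_checks_by_metadata_and_split` entry point documented in the DQX README (worth double-checking against the version in this PR) and a `spark` session as in a Databricks notebook:

```python
import yaml
from databricks.labs.dqx.engine import DQEngine
from databricks.sdk import WorkspaceClient

# One of the new checks from the YAML examples above.
checks = yaml.safe_load("""
- criticality: error
  check:
    function: is_valid_json
    arguments:
      column: col_json_str
""")

dq_engine = DQEngine(WorkspaceClient())
df = spark.read.table("main.default.table")  # sample table name reused from later in this page
valid_df, quarantine_df = dq_engine.apply_checks_by_metadata_and_split(df, checks)
```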

@@ -879,6 +918,38 @@ checks = [
name="col6_is_not_valid_timestamp2"
),

# is_valid_json check
Contributor: pls group examples for has_json_keys and is_valid_json together
    DQRowRule(
        criticality="error",
        check_func=check_funcs.is_valid_json,
        column="col_json_str"
    ),

    # has_json_keys check
    DQRowRule(
        criticality="error",
        check_func=check_funcs.has_json_keys,
        column="col_json_str",  # or as expr: F.col("col_json_str")
        check_func_kwargs={"keys": ["key1"]},
        name="col_json_str_has_json_keys"
    ),

    DQRowRule(
        criticality="error",
        check_func=check_funcs.has_json_keys,
        column="col_json_str",  # or as expr: F.col("col_json_str")
        check_func_kwargs={"keys": ["key1", "key2"], "require_all": False},
        name="col_json_str_has_json_keys2"
    ),

    DQRowRule(
        criticality="error",
        check_func=check_funcs.has_valid_json_schema,
        column="col_json_str2",  # or as expr: F.col("col_json_str2")
        check_func_kwargs={"schema": "STRUCT<a: BIGINT, b: BIGINT>"},
        name="col_json_str2_has_valid_json_schema"
    ),

    # is_not_in_future check
    DQRowRule(
        criticality="error",
@@ -1016,7 +1087,7 @@ checks = [
        check_func=geo_check_funcs.is_multilinestring,
        column="multilinestring_geom"
    ),

    # is_multipolygon check
    DQRowRule(
        criticality="error",
@@ -3022,7 +3093,7 @@ The PII detection extras include a built-in `does_not_contain_pii` check that us
    function: does_not_contain_pii
    arguments:
      column: description

# PII detection check with custom threshold and named entities
- criticality: error
  check:
@@ -3039,7 +3110,7 @@ The PII detection extras include a built-in `does_not_contain_pii` check that us
```python
from databricks.labs.dqx.rule import DQRowRule
from databricks.labs.dqx.pii.pii_detection_funcs import does_not_contain_pii

checks = [
    # Basic PII detection check
    DQRowRule(
@@ -3057,7 +3128,7 @@ The PII detection extras include a built-in `does_not_contain_pii` check that us
check_func_kwargs={"threshold": 0.8, "entities": ["PERSON", "EMAIL_ADDRESS"]}
),
]
```
</TabItem>
</Tabs>

@@ -3094,7 +3165,7 @@ These can be loaded using `NLPEngineConfig`:
from databricks.labs.dqx.rule import DQRowRule
from databricks.labs.dqx.pii.pii_detection_funcs import does_not_contain_pii
from databricks.labs.dqx.pii.nlp_engine_config import NLPEngineConfig

checks = [
    # PII detection check using spacy as a named entity recognizer
    DQRowRule(
@@ -3103,7 +3174,7 @@ These can be loaded using `NLPEngineConfig`:
column="description",
check_func=does_not_contain_pii,
check_func_kwargs={"nlp_engine_config": NLPEngineConfig.SPACY_MEDIUM}
),
),
]
```
</TabItem>
@@ -3123,7 +3194,7 @@ Using custom models for named-entity recognition may require you to install thes
from databricks.labs.dqx.rule import DQRowRule
from databricks.labs.dqx.engine import DQEngine
from databricks.sdk import WorkspaceClient

nlp_engine_config = {
    'nlp_engine_name': 'transformers_stanford_deidentifier_base',
    'models': [
@@ -3166,9 +3237,9 @@ Using custom models for named-entity recognition may require you to install thes
column="description",
check_func=does_not_contain_pii,
check_func_kwargs={"nlp_engine_config": nlp_engine_config},
),
),
]

dq_engine = DQEngine(WorkspaceClient())
df = spark.read.table("main.default.table")
valid_df, quarantine_df = dq_engine.apply_checks_and_split(df, checks)