docs/get-started/index.qmd
+49-19
@@ -4,7 +4,11 @@ jupyter: python3
4
4
html-table-processing: none
5
5
---
6
6
7
-
The pointblank library is all about assessing the state of data quality in a table. You provide the validation rules and the library will dutifully interrogate the data and provide useful reporting. We can use different types of tables like Polars and Pandas DataFrames, Parquet files, or various database tables (thanks to Ibis support). Let's walk through what table validation looks like in Pointblank!
7
+
The Pointblank library is all about assessing the state of data quality in a table. You provide the
8
+
validation rules and the library will dutifully interrogate the data and provide useful reporting.
9
+
We can use different types of tables like Polars and Pandas DataFrames, Parquet files, or various
10
+
database tables (thanks to Ibis support). Let's walk through what table validation looks like in
11
+
Pointblank.
8
12
9
13
## A Simple Validation Table
10
14
@@ -27,19 +31,24 @@ validation_1 = (
27
31
validation_1
28
32
```
29
33
30
-
Each row in this reporting table constitutes a single validation step. Roughly, the left-hand side outlines the validation rules and the right-hand side provides the results of each validation step. While simple in principle, there's a lot of useful information packed into this validation table.
34
+
Each row in this reporting table constitutes a single validation step. Roughly, the left-hand side
35
+
outlines the validation rules and the right-hand side provides the results of each validation step.
36
+
While simple in principle, there's a lot of useful information packed into this validation table.
31
37
32
38
Here's a diagram that describes a few of the important parts of the validation table:
- validation steps: each step is a distinct test on the table focused on a certain aspect of the table
44
+
- validation steps: each step is a separate test on the table, focused on a certain aspect of the
45
+
table
39
46
- validation rules: the validation type is provided here along with key constraints
40
-
- validation results: interrogation results are provided here, with a breakdown of test units (*total*, *passing*, and *failing*), threshold states, and more
47
+
- validation results: interrogation results are provided here, with a breakdown of test units
48
+
(*total*, *passing*, and *failing*), threshold states, and more
41
49
42
-
The intent is to provide the key information in one place, and have it be interpretable by a broad audience.
50
+
The intent is to provide the key information in one place, and have it be interpretable by data
51
+
stakeholders.
43
52
44
53
## Example Code, Step-by-Step
45
54
@@ -62,21 +71,30 @@ validation_2
62
71
63
72
Note these three key pieces in the code:
64
73
65
-
- the `Validate(data=...)` argument takes a DataFrame that you want to validate
74
+
- the `Validate(data=...)` argument takes a DataFrame or database table that you want to validate
66
75
- the methods starting with `col_*` specify validation steps that run on specific columns
67
76
- the `interrogate()` method executes the validation plan on the table
68
77
69
-
That's data validation with pointblank in a nutshell! In the next section we'll go a bit further by introducing a means to gauge data quality with failure thresholds.
78
+
This common pattern is used in a validation workflow, where `Validate()` and `interrogate()` bookend
79
+
a validation plan generated through calling validation methods. And that's data validation with
80
+
Pointblank in a nutshell! In the next section we'll go a bit further by introducing a means to gauge
81
+
data quality with failure thresholds.
70
82
71
83
## Understanding Test Units
72
84
73
-
Each validation step will execute a validation test. For example, `col_vals_lt()` tests that each value in a column is less than a specified number. One important piece that's reported is the number of test units that pass or fail.
85
+
Each validation step will execute a separate validation test on the target table. For example, the
86
+
`col_vals_lt()` validation tests that each value in a column is less than a specified number. A key
87
+
thing that's reported is the number of test units that pass or fail a validation step.
74
88
75
-
Test units are dependent on the test being run. The `col_vals_*` tests each value in a column, so each value will be a test unit.
89
+
Test units depend on the test being run. The `col_vals_*` methods test each value in a column, so
90
+
each value will be a test unit (and the number of test units is the number of rows in the target
91
+
table).
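To make the idea of test units concrete, here is a minimal plain-Python sketch of how a "less than" check over a column could be tallied. It does not use Pointblank and is not the library's implementation; the helper `count_test_units()` and the sample values are hypothetical and exist only for illustration.

```python
def count_test_units(values, limit):
    """Treat every value as one test unit for a 'less than' check.

    Hypothetical helper for illustration; not part of Pointblank.
    """
    # One boolean per value: True if that test unit passes
    results = [value < limit for value in values]
    passing = sum(results)
    return {
        "total": len(results),
        "passing": passing,
        "failing": len(results) - passing,
    }


# Made-up column values: six rows, so six test units
column_a = [2, 3, 9, 12, 4, 15]
summary = count_test_units(column_a, limit=10)
print(summary)
```

With six rows there are six test units; the two values that are not below the limit count as failing test units.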
76
92
77
-
This matters because you can set thresholds that signal `WARN`, `STOP`, and `NOTIFY` states based the proportion or number of failing test units.
93
+
This matters because you can set thresholds that signal `warn`, `stop`, and `notify` states based
94
+
on the proportion or number of failing test units.
78
95
79
-
Here's a simple example that uses a single `col_vals_lt()` step along with thresholds.
96
+
Here's a simple example that uses a single `col_vals_lt()` step along with thresholds set in the
97
+
`thresholds=` argument of the validation method.
80
98
81
99
```{python}
82
100
validation_3 = (
@@ -88,19 +106,29 @@ validation_3 = (
88
106
validation_3
89
107
```
90
108
91
-
The code uses `thresholds=(2, 4)` to set a `WARN` threshold of `2` and a `STOP` threshold of `4`. Notice these pieces in the validation table:
109
+
The code uses `thresholds=(2, 4)` to set a `warn` threshold of `2` and a `stop` threshold of `4`.
110
+
Looking at the validation report table, we can see:
92
111
93
-
- The `FAIL` column shows that 2 tests units have failed
94
-
- The `W` column (short for `WARN`) shows a filled yellow circle indicating it's reached threshold
95
-
- The `S` column (short for `STOP`) shows an open red circle indicating it's below threshold
112
+
- the `FAIL` column shows that 2 test units have failed
113
+
- the `W` column (short for `warn`) shows a filled yellow circle indicating those failing test units
114
+
reached that threshold value
115
+
- the `S` column (short for `stop`) shows an open red circle indicating that the number of failing
116
+
test units is below that threshold
96
117
97
-
The one final threshold, `N` (`NOTIFY`), wasn't set so appears on the validation table as a dash.
118
+
The one final threshold, `N` (for `notify`), wasn't set so it appears on the validation table as a
119
+
long dash.
98
120
99
-
Thresholds let you take action at different levels of severity. The next section discusses setting and acting on thresholds in detail.
121
+
Thresholds let you take action at different levels of severity. The next section discusses setting
122
+
and acting on thresholds in detail.
100
123
101
124
## Using Threshold Levels
102
125
103
-
Thresholds enable you to signal failure at different severity levels. In the near future, thresholds will be able to trigger custom actions. For example, when testing a column for NULLs with `col_vals_not_null()` you might want to warn on any NULLs and stop where there are 20% NULLs in the column.
126
+
Thresholds enable you to signal failure at different severity levels. In the near future, thresholds
127
+
will be able to trigger custom actions (should those actions be defined).
128
+
129
+
Here's an example where we test if a certain column has Null/missing values with
130
+
`col_vals_not_null()`. This is a case where we want to *warn* on any Null values and *stop* where
131
+
there are 20% Nulls in the column.
104
132
105
133
```{python}
106
134
validation_4 = (
@@ -112,6 +140,8 @@ validation_4 = (
112
140
validation_4
113
141
```
114
142
115
-
In this case, the `thresholds=` argument in the `cols_vals_not_null()` step was used, but we can also set thresholds globally by using `Validate(thresholds=...)`.
143
+
In this case, the `thresholds=` argument in the `col_vals_not_null()` step was set to `(1, 0.2)`,
144
+
indicating that 1 failing test unit is set for `warn` and a failing fraction of `0.2` across all test units is
145
+
set to `stop`.
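The threshold behavior described here can be sketched in plain Python. This is an illustration under stated assumptions, not Pointblank's own code: the function `threshold_states()` is hypothetical, it assumes the tuple positions map to `warn`, `stop`, and `notify` in that order, that an integer is an absolute count of failing test units, that a float is a fraction of all test units, and that a threshold counts as reached when the failing amount is greater than or equal to it.

```python
def threshold_states(n_failing, n_total, thresholds):
    """Return which threshold levels are reached (hypothetical helper).

    True means reached, False means not reached, None means not set.
    """
    names = ("warn", "stop", "notify")
    # Pad with None so that unset trailing levels are reported as None
    padded = tuple(thresholds) + (None,) * (len(names) - len(thresholds))
    states = {}
    for name, level in zip(names, padded):
        if level is None:
            states[name] = None
        elif isinstance(level, float):
            # Assumption: a float is a fraction of all test units
            states[name] = n_total > 0 and (n_failing / n_total) >= level
        else:
            # Assumption: an integer is an absolute count of failing units
            states[name] = n_failing >= level
    return states


# Counts only, as in thresholds=(2, 4); the total of 13 is made up
counts_example = threshold_states(2, 13, (2, 4))

# Count for warn, fraction for stop, as in thresholds=(1, 0.2)
mixed_example = threshold_states(3, 10, (1, 0.2))

print(counts_example)
print(mixed_example)
```

In the first call, 2 failing test units reach `warn` but not `stop`, and `notify` is unset. In the second, 3 failures out of 10 reach both levels because the failing fraction of 0.3 is at least 0.2.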
116
146
117
147
For more on thresholds, see the [Thresholds article](./thresholds.qmd).