Commit dacba6b

committed
Update index.qmd
1 parent 66d584a commit dacba6b

1 file changed

+49
-19
lines changed

docs/get-started/index.qmd

@@ -4,7 +4,11 @@ jupyter: python3
 html-table-processing: none
 ---
 
-The pointblank library is all about assessing the state of data quality in a table. You provide the validation rules and the library will dutifully interrogate the data and provide useful reporting. We can use different types of tables like Polars and Pandas DataFrames, Parquet files, or various database tables (thanks to Ibis support). Let's walk through what table validation looks like in Pointblank!
+The Pointblank library is all about assessing the state of data quality in a table. You provide the
+validation rules and the library will dutifully interrogate the data and provide useful reporting.
+We can use different types of tables like Polars and Pandas DataFrames, Parquet files, or various
+database tables (thanks to Ibis support). Let's walk through what table validation looks like in
+Pointblank.
 
 ## A Simple Validation Table
 
@@ -27,19 +31,24 @@ validation_1 = (
 validation_1
 ```
 
-Each row in this reporting table constitutes a single validation step. Roughly, the left-hand side outlines the validation rules and the right-hand side provides the results of each validation step. While simple in principle, there's a lot of useful information packed into this validation table.
+Each row in this reporting table constitutes a single validation step. Roughly, the left-hand side
+outlines the validation rules and the right-hand side provides the results of each validation step.
+While simple in principle, there's a lot of useful information packed into this validation table.
 
 Here's a diagram that describes a few of the important parts of the validation table:
 
 ![](/assets/pointblank-validation-table.png){width=100%}
 
 There are three things that should be noted here:
 
-- validation steps: each step is a distinct test on the table focused on a certain aspect of the table
+- validation steps: each step is a separate test on the table, focused on a certain aspect of the
+table
 - validation rules: the validation type is provided here along with key constraints
-- validation results: interrogation results are provided here, with a breakdown of test units (*total*, *passing*, and *failing*), threshold states, and more
+- validation results: interrogation results are provided here, with a breakdown of test units
+(*total*, *passing*, and *failing*), threshold states, and more
 
-The intent is to provide the key information in one place, and have it be interpretable by a broad audience.
+The intent is to provide the key information in one place, and have it be interpretable by data
+stakeholders.
 
 ## Example Code, Step-by-Step
 
@@ -62,21 +71,30 @@ validation_2
 
 Note these three key pieces in the code:
 
-- the `Validate(data=...)` argument takes a DataFrame that you want to validate
+- the `Validate(data=...)` argument takes a DataFrame or database table that you want to validate
 - the methods starting with `col_*` specify validation steps that run on specific columns
 - the `interrogate()` method executes the validation plan on the table
 
-That's data validation with pointblank in a nutshell! In the next section we'll go a bit further by introducing a means to gauge data quality with failure thresholds.
+This common pattern is used in a validation workflow, where `Validate()` and `interrogate()` bookend
+a validation plan generated through calling validation methods. And that's data validation with
+Pointblank in a nutshell! In the next section we'll go a bit further by introducing a means to gauge
+data quality with failure thresholds.
 
 ## Understanding Test Units
 
-Each validation step will execute a validation test. For example, `col_vals_lt()` tests that each value in a column is less than a specified number. One important piece that's reported is the number of test units that pass or fail.
+Each validation step will execute a separate validation test on the target table. For example, the
+`col_vals_lt()` validation tests that each value in a column is less than a specified number. A key
+thing that's reported is the number of test units that pass or fail a validation step.
 
-Test units are dependent on the test being run. The `col_vals_*` tests each value in a column, so each value will be a test unit.
+Test units are dependent on the test being run. The `col_vals_*` tests each value in a column, so
+each value will be a test unit (and the number of test units is the number of rows in the target
+table).
 
-This matters because you can set thresholds that signal `WARN`, `STOP`, and `NOTIFY` states based on the proportion or number of failing test units.
+This matters because you can set thresholds that signal `warn`, `stop`, and `notify` states based on
+the proportion or number of failing test units.
 
-Here's a simple example that uses a single `col_vals_lt()` step along with thresholds.
+Here's a simple example that uses a single `col_vals_lt()` step along with thresholds set in the
+`thresholds=` argument of the validation method.
 
 ```{python}
 validation_3 = (
@@ -88,19 +106,29 @@ validation_3 = (
 validation_3
 ```
 
-The code uses `thresholds=(2, 4)` to set a `WARN` threshold of `2` and a `STOP` threshold of `4`. Notice these pieces in the validation table:
+The code uses `thresholds=(2, 4)` to set a `warn` threshold of `2` and a `stop` threshold of `4`.
+Looking at the validation report table, we can see:
 
-- The `FAIL` column shows that 2 test units have failed
-- The `W` column (short for `WARN`) shows a filled yellow circle indicating it's reached threshold
-- The `S` column (short for `STOP`) shows an open red circle indicating it's below threshold
+- the `FAIL` column shows that 2 test units have failed
+- the `W` column (short for `warn`) shows a filled yellow circle indicating those failing test units
+reached that threshold value
+- the `S` column (short for `stop`) shows an open red circle indicating that the number of failing
+test units is below that threshold
 
-The one final threshold, `N` (`NOTIFY`), wasn't set so appears on the validation table as a dash.
+The one final threshold, `N` (for `notify`), wasn't set so it appears on the validation table as a
+long dash.
 
-Thresholds let you take action at different levels of severity. The next section discusses setting and acting on thresholds in detail.
+Thresholds let you take action at different levels of severity. The next section discusses setting
+and acting on thresholds in detail.
 
 ## Using Threshold Levels
 
-Thresholds enable you to signal failure at different severity levels. In the near future, thresholds will be able to trigger custom actions. For example, when testing a column for NULLs with `col_vals_not_null()` you might want to warn on any NULLs and stop where there are 20% NULLs in the column.
+Thresholds enable you to signal failure at different severity levels. In the near future, thresholds
+will be able to trigger custom actions (should those actions be defined).
+
+Here's an example where we test if a certain column has Null/missing values with
+`col_vals_not_null()`. This is a case where we want to *warn* on any Null values and *stop* where
+there are 20% Nulls in the column.
 
 ```{python}
 validation_4 = (
@@ -112,6 +140,8 @@ validation_4 = (
 validation_4
 ```
 
-In this case, the `thresholds=` argument in the `col_vals_not_null()` step was used, but we can also set thresholds globally by using `Validate(thresholds=...)`.
+In this case, the `thresholds=` argument in the `col_vals_not_null()` step was set to `(1, 0.2)`,
+meaning `warn` is reached at 1 failing test unit and `stop` is reached when a `0.2` fraction of all
+test units fail.
 
 For more on thresholds, see the [Thresholds article](./thresholds.qmd).
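
The threshold semantics described in the changed text (an integer threshold read as a count of failing test units, a fractional threshold read as a proportion of all test units, and an unset threshold shown as a dash) can be sketched in plain Python. This is only an illustration under those stated assumptions, not Pointblank's implementation: the helper names `count_failing_lt`, `threshold_reached`, and `evaluate_thresholds` and the sample values are hypothetical, and the comparison is assumed to be "greater than or equal to".

```python
from typing import Dict, Optional, Sequence, Tuple, Union

# A threshold is an absolute count (int), a fraction of all units (float), or unset (None).
Threshold = Union[int, float, None]


def count_failing_lt(values: Sequence[float], limit: float) -> int:
    """Count failing test units for a 'less than' check (one test unit per value)."""
    return sum(1 for v in values if not v < limit)


def threshold_reached(n_failed: int, n_total: int, threshold: Threshold) -> Optional[bool]:
    """Return None for an unset threshold, else whether the failures reach it.

    Assumption: ints are absolute counts of failing test units and floats are
    fractions of all test units; reaching the value exactly counts as reached.
    """
    if threshold is None:
        return None
    if isinstance(threshold, float):
        return n_total > 0 and n_failed / n_total >= threshold
    return n_failed >= threshold


def evaluate_thresholds(
    n_failed: int, n_total: int, thresholds: Tuple[Threshold, ...]
) -> Dict[str, Optional[bool]]:
    """Map a (warn, stop, notify) tuple, padded with None, to threshold states."""
    padded = tuple(thresholds) + (None,) * (3 - len(thresholds))
    names = ("warn", "stop", "notify")
    return {
        name: threshold_reached(n_failed, n_total, level)
        for name, level in zip(names, padded)
    }


# Hypothetical column of 13 values; two of them (12 and 15) are not less than 10.
values = [1, 5, 3, 8, 2, 9, 4, 6, 12, 3, 7, 15, 2]
n_failed = count_failing_lt(values, limit=10)
result = evaluate_thresholds(n_failed, len(values), thresholds=(2, 4))
print(n_failed, result)
```

With two failing test units and `thresholds=(2, 4)`, the sketch marks `warn` as reached, `stop` as not reached, and `notify` as unset, which mirrors the filled circle, open circle, and dash described for the report table.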
