Commit 74e4c13

Update api-docs.txt
1 parent a5d7c1a commit 74e4c13

1 file changed

+201
-0
lines changed

pointblank/data/api-docs.txt

+201
@@ -245,6 +245,42 @@ Thresholds(warning: 'int | float | bool | None' = None, error: 'int | float | bo
245245
(in the [`Validate`](`pointblank.Validate`) class).
246246

247247

248+
Actions(warning: 'str | Callable | list[str | Callable] | None' = None, error: 'str | Callable | list[str | Callable] | None' = None, critical: 'str | Callable | list[str | Callable] | None' = None) -> None
249+
250+
Definition of action values.
251+
252+
Actions complement threshold values by defining what action should be taken when a threshold
253+
level is reached. The action can be a string or a `Callable`. When a string is used, it is
254+
interpreted as a message to be displayed. When a `Callable` is used, it will be invoked at
255+
interrogation time if the threshold level is met or exceeded.
256+
257+
There are three threshold levels: 'warning', 'error', and 'critical'. These levels correspond
258+
to different levels of severity when a threshold is reached. Those thresholds can be defined
259+
using the [`Thresholds`](`pointblank.Thresholds`) class or various shorthand forms. Actions
260+
don't have to be defined for all threshold levels; if an action is not defined for a level in
261+
exceedance, no action will be taken.
262+
263+
Parameters
264+
----------
265+
warning
266+
A string, `Callable`, or list of `Callable`/string values for the 'warning' level. Using
267+
`None` means no action should be performed at the 'warning' level.
268+
error
269+
A string, `Callable`, or list of `Callable`/string values for the 'error' level. Using
270+
`None` means no action should be performed at the 'error' level.
271+
critical
272+
A string, `Callable`, or list of `Callable`/string values for the 'critical' level. Using
273+
`None` means no action should be performed at the 'critical' level.
274+
275+
Returns
276+
-------
277+
Actions
278+
An `Actions` object. This can be used with the [`Validate`](`pointblank.Validate`)
279+
class (to set actions for meeting different threshold levels globally) or when defining
280+
validation steps like [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`) (so that actions
281+
are scoped to individual validation steps, overriding any globally set actions).
282+
283+
248284
Schema(columns: 'str | list[str] | list[tuple[str, str]] | list[tuple[str]] | dict[str, str] | None' = None, tbl: 'any | None' = None, **kwargs)
249285
Definition of a schema object.
250286

@@ -491,6 +527,171 @@ Definition of a schema object.
491527
`Schema` object is used in a validation workflow.
492528

493529

530+
DraftValidation(data: 'FrameT | Any', model: 'str', api_key: 'str | None' = None) -> None
531+
532+
Draft a validation plan for a given table using an LLM.
533+
534+
By using a large language model (LLM) to draft a validation plan, you can quickly generate a
535+
starting point for validating a table. This can be useful when you have a new table and you
536+
want to get a sense of how to validate it (and adjustments could always be made later). The
537+
`DraftValidation` class uses the `chatlas` package to draft a validation plan for a given table
538+
using an LLM from either the `"anthropic"`, `"openai"`, or `"bedrock"` provider. You can install
539+
all requirements for the class by using an optional install of Pointblank via `pip install
540+
pointblank[generate]`.
541+
542+
:::{.callout-warning}
543+
The `DraftValidation()` class is still experimental. Please report any issues you encounter in
544+
the [Pointblank issue tracker](https://github.com/posit-dev/pointblank/issues).
545+
:::
546+
547+
Parameters
548+
----------
549+
data
550+
The data to be used for drafting a validation plan.
551+
model
552+
The model to be used. This should be in the form of `provider:model` (e.g.,
553+
`"anthropic:claude-3-5-sonnet-latest"`). Supported providers are `"anthropic"`, `"openai"`,
554+
and `"bedrock"` (Amazon Bedrock).
555+
api_key
556+
The API key to be used for the model.
557+
558+
Returns
559+
-------
560+
str
561+
The drafted validation plan.
562+
563+
Constructing the `model` Argument
564+
---------------------------------
565+
The `model=` argument should be constructed using the provider and model name separated by a
566+
colon. The provider can be `"anthropic"`, `"openai"`, or `"bedrock"`. The model name should be the
567+
specific model to be used (e.g., `"claude-3-5-sonnet-latest"`). Model names are subject to change, so consult the
568+
provider's documentation for the most up-to-date model names.
569+
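As a rough illustration of the `provider:model` convention, the string can be split on its first colon. The `split_model()` helper below is hypothetical and is not how `DraftValidation` necessarily parses the argument; the `"bedrock:some.model-v1:0"` value is a made-up name used only to show that later colons stay in the model name.

```python
# Providers listed in the parameter description above
SUPPORTED_PROVIDERS = {"anthropic", "openai", "bedrock"}


def split_model(model: str) -> tuple[str, str]:
    """Split a 'provider:model' string into (provider, model_name)."""
    # partition() splits on the first colon only
    provider, sep, name = model.partition(":")
    if not sep or not provider or not name:
        raise ValueError(f"Expected 'provider:model', got {model!r}")
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"Unsupported provider: {provider!r}")
    return provider, name


provider, name = split_model("anthropic:claude-3-5-sonnet-latest")
```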
570+
Notes on Authentication
571+
-----------------------
572+
Providing a valid API key as a string in the `api_key` argument is adequate for getting started
573+
but you should consider using a more secure method for handling API keys.
574+
575+
One way to do this is to load the API key from an environment variable and retrieve it using the
576+
`os` module (specifically the `os.getenv()` function). Places to store the API key might
577+
include `.bashrc`, `.bash_profile`, `.zshrc`, or `.zprofile`.
578+
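A minimal sketch of the environment-variable approach follows. The `read_api_key()` helper is a hypothetical convenience function, and the commented-out call assumes the variable is named `ANTHROPIC_API_KEY` and that `data` holds the table to validate.

```python
import os


def read_api_key(var_name: str = "ANTHROPIC_API_KEY") -> str:
    """Return the API key stored in an environment variable, or fail loudly."""
    api_key = os.getenv(var_name)
    if not api_key:
        raise RuntimeError(f"Environment variable {var_name} is not set.")
    return api_key


# Possible usage (requires pointblank and a valid key, so it is left commented out):
# import pointblank as pb
# pb.DraftValidation(
#     data=data,
#     model="anthropic:claude-3-5-sonnet-latest",
#     api_key=read_api_key(),
# )
```

Failing early with a clear error keeps a missing key from surfacing later as an opaque authentication failure.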
579+
Another solution is to store one or more model provider API keys in an `.env` file (in the root
580+
of your project). If the API keys have correct names (e.g., `ANTHROPIC_API_KEY` or
581+
`OPENAI_API_KEY`) then `DraftValidation` will automatically load the API key from the `.env` file
582+
and there's no need to provide the `api_key` argument. An `.env` file might look like this:
583+
584+
```plaintext
585+
ANTHROPIC_API_KEY="your_anthropic_api_key_here"
586+
OPENAI_API_KEY="your_openai_api_key_here"
587+
```
588+
589+
There's no need to have the `python-dotenv` package installed when using `.env` files in this
590+
way.
591+
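To make the `.env` format concrete, here is a small sketch that reads `KEY="value"` lines into a dictionary. It is an illustration of the file layout only: `parse_env_text()` is a hypothetical helper, and this is not a description of how `DraftValidation` loads the file internally.

```python
def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY="value" lines, skipping blanks and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


env_text = (
    'ANTHROPIC_API_KEY="your_anthropic_api_key_here"\n'
    'OPENAI_API_KEY="your_openai_api_key_here"\n'
)
keys = parse_env_text(env_text)
```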
592+
Notes on Data Sent to the Model Provider
593+
----------------------------------------
594+
The data sent to the model provider is a JSON summary of the table. This data summary is
595+
generated internally by `DraftValidation` using the `DataScan` class. The summary includes the
596+
following information:
597+
598+
- the number of rows and columns in the table
599+
- the type of dataset (e.g., Polars, DuckDB, Pandas, etc.)
600+
- the column names and their types
601+
- column level statistics such as the number of missing values, min, max, mean, and median, etc.
602+
- a short list of data values in each column
603+
604+
The JSON summary is used to provide the model with the necessary information to draft a
605+
validation plan. As such, even very large tables can be used with the `DraftValidation` class
606+
since only this summary (with a short sample of values per column), and not the full table, is sent to the model provider.
607+
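The kind of summary listed above can be approximated with the standard library. This sketch is not the `DataScan` implementation and its field names are assumptions; it only shows how row and column counts, per-column statistics, and a short sample of values fit into a compact JSON document whose size does not grow with the number of rows.

```python
import json
import statistics


def summarize(rows: list[dict], n_samples: int = 3) -> dict:
    """Build a small, JSON-friendly summary of a table given as a list of dicts."""
    columns = list(rows[0].keys()) if rows else []
    summary = {"n_rows": len(rows), "n_columns": len(columns), "columns": []}
    for name in columns:
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None]
        info = {
            "name": name,
            "type": type(present[0]).__name__ if present else "unknown",
            "n_missing": len(values) - len(present),
            "sample_values": present[:n_samples],  # only a short sample is kept
        }
        # Numeric statistics are added only for all-numeric columns
        if present and all(isinstance(v, (int, float)) for v in present):
            info.update(
                min=min(present),
                max=max(present),
                mean=statistics.mean(present),
                median=statistics.median(present),
            )
        summary["columns"].append(info)
    return summary


rows = [
    {"origin": "EWR", "dep_delay": 2},
    {"origin": "LGA", "dep_delay": None},
    {"origin": "JFK", "dep_delay": -4},
    {"origin": "EWR", "dep_delay": 11},
]
summary = summarize(rows)
summary_json = json.dumps(summary, indent=2)
```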
608+
Examples
609+
--------
610+
Let's look at how the `DraftValidation` class can be used to draft a validation plan for a
611+
table. The table to be used is `"nycflights"`, which is available via the
612+
[`load_dataset()`](`pointblank.load_dataset`) function. The model to be used is
613+
`"anthropic:claude-3-5-sonnet-latest"`. The example assumes that the API key is stored in an
614+
`.env` file as `ANTHROPIC_API_KEY`.
615+
616+
```python
617+
import pointblank as pb
618+
619+
# Load the "nycflights" dataset as a DuckDB table
620+
data = pb.load_dataset(dataset="nycflights", tbl_type="duckdb")
621+
622+
# Draft a validation plan for the "nycflights" table
623+
pb.DraftValidation(data=data, model="anthropic:claude-3-5-sonnet-latest")
624+
```
625+
626+
The output will be a drafted validation plan for the `"nycflights"` table and this will appear
627+
in the console.
628+
629+
````plaintext
630+
```python
631+
import pointblank as pb
632+
633+
# Define schema based on column names and dtypes
634+
schema = pb.Schema(columns=[
635+
    ("year", "int64"),
636+
    ("month", "int64"),
637+
    ("day", "int64"),
638+
    ("dep_time", "int64"),
639+
    ("sched_dep_time", "int64"),
640+
    ("dep_delay", "int64"),
641+
    ("arr_time", "int64"),
642+
    ("sched_arr_time", "int64"),
643+
    ("arr_delay", "int64"),
644+
    ("carrier", "string"),
645+
    ("flight", "int64"),
646+
    ("tailnum", "string"),
647+
    ("origin", "string"),
648+
    ("dest", "string"),
649+
    ("air_time", "int64"),
650+
    ("distance", "int64"),
651+
    ("hour", "int64"),
652+
    ("minute", "int64")
653+
])
654+
655+
# The validation plan
656+
validation = (
657+
    pb.Validate(
658+
        data=your_data,
659+
        label="Draft Validation",
660+
        thresholds=pb.Thresholds(warning=0.10, error=0.25, critical=0.35)
661+
    )
662+
    .col_schema_match(schema=schema)
663+
    .col_vals_not_null(columns=[
664+
        "year", "month", "day", "sched_dep_time", "carrier", "flight",
665+
        "origin", "dest", "distance", "hour", "minute"
666+
    ])
667+
    .col_vals_between(columns="month", left=1, right=12)
668+
    .col_vals_between(columns="day", left=1, right=31)
669+
    .col_vals_between(columns="sched_dep_time", left=106, right=2359)
670+
    .col_vals_between(columns="dep_delay", left=-43, right=1301, na_pass=True)
671+
    .col_vals_between(columns="air_time", left=20, right=695, na_pass=True)
672+
    .col_vals_between(columns="distance", left=17, right=4983)
673+
    .col_vals_between(columns="hour", left=1, right=23)
674+
    .col_vals_between(columns="minute", left=0, right=59)
675+
    .col_vals_in_set(columns="origin", set=["EWR", "LGA", "JFK"])
676+
    .col_count_match(count=18)
677+
    .row_count_match(count=336776)
678+
    .rows_distinct()
679+
    .interrogate()
680+
)
681+
682+
validation
683+
```
684+
````
685+
686+
The drafted validation plan can be copied and pasted into a Python script or notebook for
687+
further use. The generated plan can then be adjusted as needed to suit the specific
688+
requirements of the table being validated.
689+
690+
Note that the model does not know how the data was obtained, so the output uses the placeholder
691+
`your_data` in the `data=` argument of the `Validate` class. This should be replaced with the
692+
actual data variable.
693+
694+
494695

495696
## The Validation Steps family
496697
