@@ -245,6 +245,42 @@ Thresholds(warning: 'int | float | bool | None' = None, error: 'int | float | bo
(in the [`Validate`](`pointblank.Validate`) class).


+ Actions(warning: 'str | Callable | list[str | Callable] | None' = None, error: 'str | Callable | list[str | Callable] | None' = None, critical: 'str | Callable | list[str | Callable] | None' = None) -> None
+
+ Definition of action values.
+
+ Actions complement threshold values by defining what action should be taken when a threshold
+ level is reached. The action can be a string or a `Callable`. When a string is used, it is
+ interpreted as a message to be displayed. When a `Callable` is used, it will be invoked at
+ interrogation time if the threshold level is met or exceeded.
+
+ There are three threshold levels: 'warning', 'error', and 'critical'. These levels correspond
+ to different levels of severity when a threshold is reached. Those thresholds can be defined
+ using the [`Thresholds`](`pointblank.Thresholds`) class or various shorthand forms. Actions
+ don't have to be defined for all threshold levels; if an action is not defined for a level in
+ exceedance, no action will be taken.
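
The dispatch rule described above can be sketched in plain Python. This is only an illustration, not Pointblank's implementation: `run_actions()` and the `actions` dictionary are hypothetical names, and the sketch assumes that a string action is printed while a `Callable` is invoked with no arguments.

```python
from typing import Callable, Dict, List, Union

Action = Union[str, Callable[[], None]]


def run_actions(
    actions: Dict[str, Union[Action, List[Action], None]], level: str
) -> List[str]:
    """Run the action(s) registered for `level`; return the messages displayed."""
    action = actions.get(level)
    if action is None:
        return []  # no action defined for this level, so nothing happens
    # Normalize a single action into a one-element list
    items = action if isinstance(action, list) else [action]
    shown = []
    for item in items:
        if callable(item):
            item()  # a Callable is invoked
        else:
            print(item)  # a string is displayed as a message
            shown.append(item)
    return shown


calls = []
actions = {
    "warning": "Some test units failed.",
    "error": ["Too many failures.", lambda: calls.append("error hook")],
    # 'critical' is left undefined on purpose
}

warning_out = run_actions(actions, "warning")
error_out = run_actions(actions, "error")
critical_out = run_actions(actions, "critical")
```

Leaving 'critical' undefined mirrors the rule that a level without an action triggers nothing.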
+
+ Parameters
+ ----------
+ warning
+     A string, `Callable`, or list of `Callable`/string values for the 'warning' level. Using
+     `None` means no action should be performed at the 'warning' level.
+ error
+     A string, `Callable`, or list of `Callable`/string values for the 'error' level. Using
+     `None` means no action should be performed at the 'error' level.
+ critical
+     A string, `Callable`, or list of `Callable`/string values for the 'critical' level. Using
+     `None` means no action should be performed at the 'critical' level.
+
+ Returns
+ -------
+ Actions
+     An `Actions` object. This can be passed to the [`Validate`](`pointblank.Validate`)
+     class (to set actions for meeting different threshold levels globally) or to individual
+     validation steps like [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`) (so that actions
+     are scoped to individual validation steps, overriding any globally set actions).
+
+
Schema(columns: 'str | list[str] | list[tuple[str, str]] | list[tuple[str]] | dict[str, str] | None' = None, tbl: 'any | None' = None, **kwargs)
Definition of a schema object.
@@ -491,6 +527,171 @@ Definition of a schema object.
`Schema` object is used in a validation workflow.


+ DraftValidation(data: 'FrameT | Any', model: 'str', api_key: 'str | None' = None) -> None
+
+ Draft a validation plan for a given table using an LLM.
+
+ By using a large language model (LLM) to draft a validation plan, you can quickly generate a
+ starting point for validating a table. This can be useful when you have a new table and you
+ want to get a sense of how to validate it (and adjustments could always be made later). The
+ `DraftValidation` class uses the `chatlas` package to draft a validation plan for a given table
+ using an LLM from either the `"anthropic"`, `"openai"`, or `"bedrock"` provider. You can install
+ all requirements for the class by using an optional install of Pointblank via `pip install
+ pointblank[generate]`.
+
+ :::{.callout-warning}
+ The `DraftValidation()` class is still experimental. Please report any issues you encounter in
+ the [Pointblank issue tracker](https://github.com/posit-dev/pointblank/issues).
+ :::
+
+ Parameters
+ ----------
+ data
+     The data to be used for drafting a validation plan.
+ model
+     The model to be used. This should be in the form of `provider:model` (e.g.,
+     `"anthropic:claude-3-5-sonnet-latest"`). Supported providers are `"anthropic"`, `"openai"`,
+     and `"bedrock"` (Amazon Bedrock).
+ api_key
+     The API key to be used for the model.
+
+ Returns
+ -------
+ str
+     The drafted validation plan.
+
+ Constructing the `model` Argument
+ ---------------------------------
+ The `model=` argument should be constructed using the provider and model name separated by a
+ colon. The provider can be `"anthropic"`, `"openai"`, or `"bedrock"`. The model name should be
+ the specific model to be used. Model names are subject to change, so consult the provider's
+ documentation for the most up-to-date model names.
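
As a rough illustration of the `provider:model` form, the following sketch splits and checks such a string. The helper `split_model()` is hypothetical and is not part of Pointblank; the provider set is taken from the parameter description above.

```python
SUPPORTED_PROVIDERS = {"anthropic", "openai", "bedrock"}


def split_model(model: str) -> tuple:
    """Split a 'provider:model' string into (provider, model_name)."""
    # Split on the first colon only, so any later colons stay in the model name
    provider, sep, model_name = model.partition(":")
    if not sep or not model_name:
        raise ValueError("`model` should be in the form 'provider:model'")
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"unsupported provider: {provider!r}")
    return provider, model_name


provider, model_name = split_model("anthropic:claude-3-5-sonnet-latest")
print(provider, model_name)
```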
+
+ Notes on Authentication
+ -----------------------
+ Providing a valid API key as a string in the `api_key` argument is adequate for getting started
+ but you should consider using a more secure method for handling API keys.
+
+ One way to do this is to load the API key from an environment variable and retrieve it using the
+ `os` module (specifically the `os.getenv()` function). Places to store the API key might
+ include `.bashrc`, `.bash_profile`, `.zshrc`, or `.zprofile`.
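
A minimal sketch of the environment-variable approach, assuming the key was exported as `ANTHROPIC_API_KEY`. The helper name `read_api_key()` is hypothetical and only serves the illustration.

```python
import os


def read_api_key(var_name: str = "ANTHROPIC_API_KEY") -> str:
    """Return the API key stored in the environment variable `var_name`."""
    api_key = os.getenv(var_name)  # os.getenv() returns None when the variable is unset
    if not api_key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return api_key
```

The returned string could then be supplied as `api_key=read_api_key()` so that the key never appears in the script itself.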
+
+ Another solution is to store one or more model provider API keys in an `.env` file (in the root
+ of your project). If the API keys have correct names (e.g., `ANTHROPIC_API_KEY` or
+ `OPENAI_API_KEY`) then `DraftValidation` will automatically load the API key from the `.env` file
+ and there's no need to provide the `api_key` argument. An `.env` file might look like this:
+
+ ```plaintext
+ ANTHROPIC_API_KEY="your_anthropic_api_key_here"
+ OPENAI_API_KEY="your_openai_api_key_here"
+ ```
+
+ There's no need to have the `python-dotenv` package installed when using `.env` files in this
+ way.
+
+ Notes on Data Sent to the Model Provider
+ ----------------------------------------
+ The data sent to the model provider is a JSON summary of the table. This data summary is
+ generated internally by `DraftValidation` using the `DataScan` class. The summary includes the
+ following information:
+
+ - the number of rows and columns in the table
+ - the type of dataset (e.g., Polars, DuckDB, Pandas, etc.)
+ - the column names and their types
+ - column-level statistics such as the number of missing values, min, max, mean, and median
+ - a short list of data values in each column
+
+ The JSON summary is used to provide the model with the necessary information to draft a
+ validation plan. As such, even very large tables can be used with the `DraftValidation` class
+ since the full contents of the table are not sent to the model provider.
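
To make the idea of a table summary concrete, here is a small standard-library sketch that condenses a column-oriented table into JSON. It does not reproduce the output of `DataScan`: the function `summarize_table()`, the field names, and the `"dict"` table type are all assumptions chosen for illustration.

```python
import json
import statistics


def summarize_table(columns: dict, n_samples: int = 3) -> str:
    """Build a JSON summary of a table given as a dict of equal-length lists."""
    n_rows = len(next(iter(columns.values()), []))
    summary = {
        "n_rows": n_rows,
        "n_columns": len(columns),
        "tbl_type": "dict",  # hypothetical label for this toy table format
        "columns": [],
    }
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        info = {
            "name": name,
            "type": type(present[0]).__name__ if present else "unknown",
            "n_missing": len(values) - len(present),
            "sample_values": present[:n_samples],  # only a short list of values
        }
        # Numeric statistics are added only for numeric columns
        if present and all(isinstance(v, (int, float)) for v in present):
            info.update(
                min=min(present),
                max=max(present),
                mean=statistics.mean(present),
                median=statistics.median(present),
            )
        summary["columns"].append(info)
    return json.dumps(summary)


table = {
    "month": [1, 2, None, 12],
    "origin": ["EWR", "LGA", "JFK", "EWR"],
}
summary = json.loads(summarize_table(table))
```

The size of such a summary grows with the number of columns rather than the number of rows, which is why a large table stays cheap to describe.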
+
+ Examples
+ --------
+ Let's look at how the `DraftValidation` class can be used to draft a validation plan for a
+ table. The table to be used is `"nycflights"`, which is available via the
+ [`load_dataset()`](`pointblank.load_dataset`) function. The model to be used is
+ `"anthropic:claude-3-5-sonnet-latest"`. The example assumes that the API key is stored in an
+ `.env` file as `ANTHROPIC_API_KEY`.
+
+ ```python
+ import pointblank as pb
+
+ # Load the "nycflights" dataset as a DuckDB table
+ data = pb.load_dataset(dataset="nycflights", tbl_type="duckdb")
+
+ # Draft a validation plan for the "nycflights" table
+ pb.DraftValidation(data=data, model="anthropic:claude-3-5-sonnet-latest")
+ ```
+
+ The output will be a drafted validation plan for the `"nycflights"` table and this will appear
+ in the console.
+
+ ````plaintext
+ ```python
+ import pointblank as pb
+
+ # Define schema based on column names and dtypes
+ schema = pb.Schema(columns=[
+     ("year", "int64"),
+     ("month", "int64"),
+     ("day", "int64"),
+     ("dep_time", "int64"),
+     ("sched_dep_time", "int64"),
+     ("dep_delay", "int64"),
+     ("arr_time", "int64"),
+     ("sched_arr_time", "int64"),
+     ("arr_delay", "int64"),
+     ("carrier", "string"),
+     ("flight", "int64"),
+     ("tailnum", "string"),
+     ("origin", "string"),
+     ("dest", "string"),
+     ("air_time", "int64"),
+     ("distance", "int64"),
+     ("hour", "int64"),
+     ("minute", "int64")
+ ])
+
+ # The validation plan
+ validation = (
+     pb.Validate(
+         data=your_data,
+         label="Draft Validation",
+         thresholds=pb.Thresholds(warning=0.10, error=0.25, critical=0.35)
+     )
+     .col_schema_match(schema=schema)
+     .col_vals_not_null(columns=[
+         "year", "month", "day", "sched_dep_time", "carrier", "flight",
+         "origin", "dest", "distance", "hour", "minute"
+     ])
+     .col_vals_between(columns="month", left=1, right=12)
+     .col_vals_between(columns="day", left=1, right=31)
+     .col_vals_between(columns="sched_dep_time", left=106, right=2359)
+     .col_vals_between(columns="dep_delay", left=-43, right=1301, na_pass=True)
+     .col_vals_between(columns="air_time", left=20, right=695, na_pass=True)
+     .col_vals_between(columns="distance", left=17, right=4983)
+     .col_vals_between(columns="hour", left=1, right=23)
+     .col_vals_between(columns="minute", left=0, right=59)
+     .col_vals_in_set(columns="origin", set=["EWR", "LGA", "JFK"])
+     .col_count_match(count=18)
+     .row_count_match(count=336776)
+     .rows_distinct()
+     .interrogate()
+ )
+
+ validation
+ ```
+ ````
+
+ The drafted validation plan can be copied and pasted into a Python script or notebook for
+ further use. The generated plan can then be adjusted as needed to suit the specific
+ requirements of the table being validated.
+
+ Note that the model does not know how the data was obtained, so the output uses the placeholder
+ `your_data` in the `data=` argument of the `Validate` class. This should be replaced with the
+ actual data variable.
+
+


## The Validation Steps family