Update api-docs.txt

rich-iannone · rich-iannone · commit 74e4c13ffa9b · 2025-02-19T23:25:21.000-05:00
diff --git a/pointblank/data/api-docs.txt b/pointblank/data/api-docs.txt
@@ -245,6 +245,42 @@ Thresholds(warning: 'int | float | bool | None' = None, error: 'int | float | bo
     (in the [`Validate`](`pointblank.Validate`) class).
 
 
+Actions(warning: 'str | Callable | list[str | Callable] | None' = None, error: 'str | Callable | list[str | Callable] | None' = None, critical: 'str | Callable | list[str | Callable] | None' = None) -> None
+
+    Definition of action values.
+
+    Actions complement threshold values by defining what action should be taken when a threshold
+    level is reached. The action can be a string or a `Callable`. When a string is used, it is
+    interpreted as a message to be displayed. When a `Callable` is used, it will be invoked at
+    interrogation time if the threshold level is met or exceeded.
+
+    There are three threshold levels: 'warning', 'error', and 'critical'. These levels correspond
+    to different levels of severity when a threshold is reached. Those thresholds can be defined
+    using the [`Thresholds`](`pointblank.Thresholds`) class or various shorthand forms. Actions
+    don't have to be defined for all threshold levels; if no action is defined for a level that
+    has been reached, no action will be taken at that level.
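+
+    The rules above can be pictured with a small sketch that does not depend on Pointblank. The
+    `run_actions()` helper and the `actions` dictionary are hypothetical and only illustrate the
+    described behavior: a string is treated as a message, a `Callable` is invoked (here with no
+    arguments, which is an assumption), and a level without an action does nothing.
+
+    ```python
+    def run_actions(action):
+        """Hypothetical helper (not Pointblank's API): collect messages, invoke callables."""
+        if action is None:
+            return []  # no action defined for this level, so nothing happens
+        items = action if isinstance(action, list) else [action]
+        messages = []
+        for item in items:
+            if callable(item):
+                item()  # assumption: callables are invoked without arguments
+            else:
+                messages.append(item)  # strings are messages to be displayed
+        return messages
+
+    log = []
+    actions = {
+        "warning": "Some test units failed.",
+        "error": [lambda: log.append("error reached"), "Many test units failed."],
+        "critical": None,
+    }
+
+    # Pretend every level was reached and show what each action would do
+    for level in ("warning", "error", "critical"):
+        for message in run_actions(actions[level]):
+            print(f"[{level}] {message}")
+    ```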
+
+    Parameters
+    ----------
+    warning
+        A string, `Callable`, or list of `Callable`/string values for the 'warning' level. Using
+        `None` means no action should be performed at the 'warning' level.
+    error
+        A string, `Callable`, or list of `Callable`/string values for the 'error' level. Using
+        `None` means no action should be performed at the 'error' level.
+    critical
+        A string, `Callable`, or list of `Callable`/string values for the 'critical' level. Using
+        `None` means no action should be performed at the 'critical' level.
+
+    Returns
+    -------
+    Actions
+        An `Actions` object. This can be used when using the [`Validate`](`pointblank.Validate`)
+        class (to set actions for meeting different threshold levels globally) or when defining
+        validation steps like [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`) (so that actions
+        are scoped to individual validation steps, overriding any globally set actions).
+
+
 Schema(columns: 'str | list[str] | list[tuple[str, str]] | list[tuple[str]] | dict[str, str] | None' = None, tbl: 'any | None' = None, **kwargs)
 Definition of a schema object.
 
@@ -491,6 +527,171 @@ Definition of a schema object.
     `Schema` object is used in a validation workflow.
 
 
+DraftValidation(data: 'FrameT | Any', model: 'str', api_key: 'str | None' = None) -> None
+
+    Draft a validation plan for a given table using an LLM.
+
+    By using a large language model (LLM) to draft a validation plan, you can quickly generate a
+    starting point for validating a table. This can be useful when you have a new table and you
+    want to get a sense of how to validate it (and adjustments could always be made later). The
+    `DraftValidation` class uses the `chatlas` package to draft a validation plan for a given table
+    using an LLM from the `"anthropic"`, `"openai"`, or `"bedrock"` provider. You can install
+    all requirements for the class by using an optional install of Pointblank via `pip install
+    pointblank[generate]`.
+
+    :::{.callout-warning}
+    The `DraftValidation()` class is still experimental. Please report any issues you encounter in
+    the [Pointblank issue tracker](https://github.com/posit-dev/pointblank/issues).
+    :::
+
+    Parameters
+    ----------
+    data
+        The data to be used for drafting a validation plan.
+    model
+        The model to be used. This should be in the form of `provider:model` (e.g.,
+        `"anthropic:claude-3-5-sonnet-latest"`). Supported providers are `"anthropic"`, `"openai"`,
+        and `"bedrock"` (Amazon Bedrock).
+    api_key
+        The API key to be used for the model.
+
+    Returns
+    -------
+    str
+        The drafted validation plan.
+
+    Constructing the `model` Argument
+    ---------------------------------
+    The `model=` argument should be constructed using the provider and model name separated by a
+    colon. The provider can be `"anthropic"`, `"openai"`, or `"bedrock"`. The model name should be
+    the specific model to be used (e.g., `"claude-3-5-sonnet-latest"`). Model names are subject to
+    change, so consult the provider's documentation for the most up-to-date model names.
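+
+    As a rough illustration of the `provider:model` form, the sketch below splits such a string
+    into its two parts. The `split_model_spec()` helper and the `SUPPORTED_PROVIDERS` tuple are
+    hypothetical names and do not reflect how `DraftValidation` parses the argument internally.
+    The string is split only at the first colon so that a model name containing a colon is kept
+    whole.
+
+    ```python
+    # Providers named in the parameter description above
+    SUPPORTED_PROVIDERS = ("anthropic", "openai", "bedrock")
+
+    def split_model_spec(model):
+        """Hypothetical helper: split a "provider:model" string into (provider, model_name)."""
+        provider, sep, model_name = model.partition(":")
+        if not sep or not provider or not model_name:
+            raise ValueError("`model` must have the form 'provider:model'")
+        if provider not in SUPPORTED_PROVIDERS:
+            raise ValueError(f"Unsupported provider: {provider!r}")
+        return provider, model_name
+
+    print(split_model_spec("anthropic:claude-3-5-sonnet-latest"))
+    ```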
+
+    Notes on Authentication
+    -----------------------
+    Providing a valid API key as a string in the `api_key` argument is adequate for getting started
+    but you should consider using a more secure method for handling API keys.
+
+    One way to do this is to load the API key from an environment variable and retrieve it using the
+    `os` module (specifically the `os.getenv()` function). Places to store the API key might
+    include `.bashrc`, `.bash_profile`, `.zshrc`, or `.zsh_profile`.
+
+    Another solution is to store one or more model provider API keys in an `.env` file (in the root
+    of your project). If the API keys have correct names (e.g., `ANTHROPIC_API_KEY` or
+    `OPENAI_API_KEY`) then `DraftValidation` will automatically load the API key from the `.env` file
+    and there's no need to provide the `api_key` argument. An `.env` file might look like this:
+
+    ```plaintext
+    ANTHROPIC_API_KEY="your_anthropic_api_key_here"
+    OPENAI_API_KEY="your_openai_api_key_here"
+    ```
+
+    There's no need to have the `python-dotenv` package installed when using `.env` files in this
+    way.
+
+    Notes on Data Sent to the Model Provider
+    ----------------------------------------
+    The data sent to the model provider is a JSON summary of the table. This data summary is
+    generated internally by `DraftValidation` using the `DataScan` class. The summary includes the
+    following information:
+
+    - the number of rows and columns in the table
+    - the type of dataset (e.g., Polars, DuckDB, Pandas, etc.)
+    - the column names and their types
+    - column level statistics such as the number of missing values, min, max, mean, and median, etc.
+    - a short list of data values in each column
+
+    The JSON summary is used to provide the model with the necessary information to draft a
+    validation plan. As such, even very large tables can be used with the `DraftValidation` class
+    since only this summary, and not the full contents of the table, is sent to the model provider.
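+
+    To give a feel for why such a summary stays small regardless of table size, here is a sketch
+    that builds a comparable summary from a few rows in plain Python. The `summarize()` function
+    and all of its key names are invented for illustration; this is not the `DataScan` output
+    format.
+
+    ```python
+    import json
+    from statistics import mean, median
+
+    def summarize(rows, n_samples=3):
+        """Hypothetical summary builder; the key names are invented for illustration."""
+        columns = list(rows[0].keys())
+        summary = {"n_rows": len(rows), "n_columns": len(columns), "columns": {}}
+        for name in columns:
+            values = [row[name] for row in rows]
+            present = [v for v in values if v is not None]
+            info = {
+                "type": type(present[0]).__name__ if present else "unknown",
+                "n_missing": len(values) - len(present),
+                "sample_values": present[:n_samples],  # only a short list of values is kept
+            }
+            # Numeric statistics are added only when every non-missing value is a number
+            if present and all(isinstance(v, (int, float)) for v in present):
+                info.update(
+                    min=min(present), max=max(present), mean=mean(present), median=median(present)
+                )
+            summary["columns"][name] = info
+        return summary
+
+    rows = [
+        {"origin": "EWR", "dep_delay": 2},
+        {"origin": "LGA", "dep_delay": None},
+        {"origin": "JFK", "dep_delay": -6},
+    ]
+    print(json.dumps(summarize(rows), indent=2))
+    ```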
+
+    Examples
+    --------
+    Let's look at how the `DraftValidation` class can be used to draft a validation plan for a
+    table. The table to be used is `"nycflights"`, which is available here via the
+    [`load_dataset()`](`pointblank.load_dataset`) function. The model to be used is
+    `"anthropic:claude-3-5-sonnet-latest"`. The example assumes that the API key is stored in an
+    `.env` file as `ANTHROPIC_API_KEY`.
+
+    ```python
+    import pointblank as pb
+
+    # Load the "nycflights" dataset as a DuckDB table
+    data = pb.load_dataset(dataset="nycflights", tbl_type="duckdb")
+
+    # Draft a validation plan for the "nycflights" table
+    pb.DraftValidation(data=data, model="anthropic:claude-3-5-sonnet-latest")
+    ```
+
+    The output will be a drafted validation plan for the `"nycflights"` table and this will appear
+    in the console.
+
+    ````plaintext
+    ```python
+    import pointblank as pb
+
+    # Define schema based on column names and dtypes
+    schema = pb.Schema(columns=[
+        ("year", "int64"),
+        ("month", "int64"),
+        ("day", "int64"),
+        ("dep_time", "int64"),
+        ("sched_dep_time", "int64"),
+        ("dep_delay", "int64"),
+        ("arr_time", "int64"),
+        ("sched_arr_time", "int64"),
+        ("arr_delay", "int64"),
+        ("carrier", "string"),
+        ("flight", "int64"),
+        ("tailnum", "string"),
+        ("origin", "string"),
+        ("dest", "string"),
+        ("air_time", "int64"),
+        ("distance", "int64"),
+        ("hour", "int64"),
+        ("minute", "int64")
+    ])
+
+    # The validation plan
+    validation = (
+        pb.Validate(
+            data=your_data,
+            label="Draft Validation",
+            thresholds=pb.Thresholds(warning=0.10, error=0.25, critical=0.35)
+        )
+        .col_schema_match(schema=schema)
+        .col_vals_not_null(columns=[
+            "year", "month", "day", "sched_dep_time", "carrier", "flight",
+            "origin", "dest", "distance", "hour", "minute"
+        ])
+        .col_vals_between(columns="month", left=1, right=12)
+        .col_vals_between(columns="day", left=1, right=31)
+        .col_vals_between(columns="sched_dep_time", left=106, right=2359)
+        .col_vals_between(columns="dep_delay", left=-43, right=1301, na_pass=True)
+        .col_vals_between(columns="air_time", left=20, right=695, na_pass=True)
+        .col_vals_between(columns="distance", left=17, right=4983)
+        .col_vals_between(columns="hour", left=1, right=23)
+        .col_vals_between(columns="minute", left=0, right=59)
+        .col_vals_in_set(columns="origin", set=["EWR", "LGA", "JFK"])
+        .col_count_match(count=18)
+        .row_count_match(count=336776)
+        .rows_distinct()
+        .interrogate()
+    )
+
+    validation
+    ```
+    ````
+
+    The drafted validation plan can be copied and pasted into a Python script or notebook for
+    further use. The generated plan is only a starting point and can be adjusted as needed to suit
+    the specific requirements of the table being validated.
+
+    Note that the output does not know how the data was obtained, so it uses the placeholder
+    `your_data` in the `data=` argument of the `Validate` class. This should be replaced with the
+    actual data variable.
+
+
 
 ## The Validation Steps family