In this developer guide we explain how to build this project and how you can add new transformer tasks and tests.
There are two ways to install the Transformers Extension Package:
This project needs Python 3.10 or above installed on the development machine.
In addition, in order to build Python packages you need to have the Poetry
(>= 2.1.0) package manager. Then you can install and build the transformers-extension as follows:
poetry install
poetry build
Instead of building yourself, the latest version of the Python package of this extension can be downloaded
from the Releases in the GitHub Repository (see the latest release).
Please download the built archive
exasol_transformers_extension-<version-number>-py3-none-any.whl (transformers_extension.whl in older versions)
and install it as follows:
pip install <path/wheel-filename.whl> --extra-index-url https://download.pytorch.org/whl/cpu
The wheel should be located in transformers-extension/dist. After updating and building a new release,
there may be multiple wheels in this directory. This leads to problems, so check for old wheels and delete them if necessary.
You may also need to check
transformers-extension/language_container/exasol_transformers_extension_container/flavor_base/release/dist for the same reason.
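The check for leftover wheels can be scripted. The following is a minimal sketch, not part of the project: the helper name `find_stale_wheels` is hypothetical, and it assumes that the most recently modified wheel in a directory is the one you want to keep.

```python
from pathlib import Path


def find_stale_wheels(dist_dir: Path) -> list[Path]:
    """Return every wheel in dist_dir except the most recently modified one.

    Hypothetical helper: assumes the newest file (by modification time)
    is the wheel that should be kept.
    """
    wheels = sorted(dist_dir.glob("*.whl"), key=lambda path: path.stat().st_mtime)
    return wheels[:-1]


if __name__ == "__main__":
    # Only reports the candidates; deleting them is left to the developer.
    for wheel in find_stale_wheels(Path("dist")):
        print(f"stale wheel: {wheel}")
```

The same function can be pointed at the flavor_base/release/dist directory mentioned above.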
All unit and integration tests can be run within the Poetry environment created for the project using nox. See the nox file for all tasks run by nox. There are three tasks for tests.
Run unit tests:
poetry run -- nox -s test:unit
Start a test database and run all integration tests:
poetry run -- nox -s start_database
poetry run -- nox -s test:integration
Run parts of the integration tests:
poetry run -- nox -s onprem_integration_tests
poetry run -- nox -s saas_integration_tests
poetry run -- nox -s without_db_integration_tests
You can find more information regarding the tests in the Tests section below.
We use the Python toolbox; however, some things are modified for this project, mainly because we run our integration tests differently here.
- We don't use the "slow-checks" workflow for running integration tests; instead, we run them in AWS.
- This means code coverage is currently not run on integration tests.
- We also don't use the "build-and-publish" workflow, because we currently need to build and upload the SLC. Instead, we use release droid with the release_droid_upload_github_release_assets workflow for now.
In the transformers-extension library, the 8 most popular NLP tasks provided by Transformers API have already been defined. We created separate UDF scripts for each NLP task. You can find these tasks and UDF script usage details in the User Guide. This section shows you step by step how to add a new NLP task to this library.
The new task's UDF template should be added to the exasol_transformers_extension/resources/templates/
directory. Please note that the UDF script uses "SET UDF" and that the inputs
are received ordered by pre-determined columns. In addition, the first 4 input
arguments of the UDF script should be:
- device_id: To run on GPU, specify the valid cuda device ID. Otherwise, you can provide NULL for this parameter.
- bucketfs_conn: The BucketFS connection name.
- sub_dir: The directory where the model is stored in the BucketFS.
- model_name: The name of the model to use for prediction. You can find the details of the models in huggingface models page.
Please note that the output emitted by the UDF is created by adding the model inference output to the inputs.
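To illustrate how an output row relates to its input row, here is a small sketch. The first four column names come from the list above; the `text_data` column, the `label` and `score` result columns, the example values, and the helper `build_output_row` are assumptions made for illustration only.

```python
def build_output_row(input_row: dict, prediction: dict) -> dict:
    """Append the inference result columns to the unchanged input columns."""
    return {**input_row, **prediction}


# Hypothetical input row: the four mandatory arguments followed by task input.
input_row = {
    "device_id": None,
    "bucketfs_conn": "my_bucketfs_conn",
    "sub_dir": "my_models",
    "model_name": "some-org/some-model",
    "text_data": "hello",
}

# Hypothetical inference output for this row.
output_row = build_output_row(input_row, {"label": "greeting", "score": 0.9})
print(list(output_row))
```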
Before implementing the UDF logic (examined in item 4 in this section), the
run function responsible for calling the newly created UDF script should be
defined in exasol_transformers_extension/udfs/callers/.
Also add the new UDF to the lists in tests/utils/db_queries.py.
The added UDF template and defined UDF caller should be added to the dictionary
in the exasol_transformers_extension/deployment/constants.py script. Thus,
we know which template belongs to which script during deployment.
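The idea behind such a dictionary can be sketched as follows. This is not the content of constants.py: the dictionary name, the file names, and the lookup helper are all hypothetical and only show how a caller script could be tied to its template.

```python
# Hypothetical mapping: UDF caller file -> UDF template file.
SKETCH_UDF_CALL_TEMPLATES = {
    "ai_my_task_udf_call.py": "ai_my_task_udf.jinja.sql",
}


def template_for(caller_file: str) -> str:
    """Look up the template that belongs to a caller script."""
    try:
        return SKETCH_UDF_CALL_TEMPLATES[caller_file]
    except KeyError:
        raise KeyError(f"no template registered for {caller_file!r}") from None


print(template_for("ai_my_task_udf_call.py"))
```

A failed lookup during deployment usually means the new task was not registered in the dictionary.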
The UDF script should be a subclass of BaseModelUDF.
The UDF class must be defined under the exasol_transformers_extension/udfs/models/
directory.
It should be named following the scheme "Ai<name>UDF".
Here, you can define settings for transformers, as well as a TransformationPipeline.
The TransformationPipeline holds a list of Transformations. A transformation is an
implementation of the Transformation-Protocol.
They get a DataFrame as input, and transform it in some way. You may also define
expected_input_columns, new_columns and removed_columns for each Transformation.
If none of the existing Transformation implementations (see Section 4.1) suit your task, you can write
your own. More information can be found in Section 4.2.
This TransformationPipeline will then be given to the BaseModelUDF, which manages
the data input and output. Here, the
TransformationPipeline will be executed in order.
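The "list of transformations executed in order" idea can be sketched in a few lines. This is an illustration under simplifying assumptions, not the project's classes: rows are plain lists of dicts standing in for a DataFrame, and `SketchPipeline`, its `execute` method, and the two example transformations are hypothetical.

```python
from typing import Protocol

Rows = list[dict]  # stand-in for a pandas DataFrame


class Transformation(Protocol):
    """Anything with a transform method qualifies (structural typing)."""

    def transform(self, rows: Rows) -> Rows: ...


class UppercaseText:
    def transform(self, rows: Rows) -> Rows:
        return [{**row, "text": row["text"].upper()} for row in rows]


class AddLength:
    def transform(self, rows: Rows) -> Rows:
        return [{**row, "length": len(row["text"])} for row in rows]


class SketchPipeline:
    def __init__(self, transformations: list[Transformation]):
        self.transformations = transformations

    def execute(self, rows: Rows) -> Rows:
        # Each transformation receives the output of the previous one.
        for transformation in self.transformations:
            rows = transformation.transform(rows)
        return rows


result = SketchPipeline([UppercaseText(), AddLength()]).execute([{"text": "abc"}])
print(result)
```

Because the order matters, a transformation that adds a column must come before any transformation that expects it.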
For a prediction UDF, you will want one of the Transformations to be a
WithModelTransformation(PredictionTaskTransformation(PredictionTask)).
The PredictionTask is an implementation of the PredictionTask-Protocol.
It holds the
logic for a specific NLP-Task. Multiple UDF classes can use the same PredictionTask.
If none of the existing PredictionTask implementations (see Section 4.3) suit your task, you can write
your own. More information can be found in Section 4.4.
The PredictionTask then gets wrapped in the PredictionTaskTransformation, which
handles calling the PredictionTask's functions. Then, the PredictionTaskTransformation
gets wrapped in the WithModelTransformation which handles loading the required
transformers models.
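The nesting described above can be pictured with stand-in classes. Everything in this sketch is hypothetical (class names, constructor arguments, the `load_model` callable, and the column names); in particular, it does not reproduce the real constructors, which for example receive the exasol context. It only shows the division of labour: the outer wrapper loads the model, the middle wrapper calls the task, the task holds the NLP logic.

```python
class SketchReverseTask:
    """Hypothetical task: the 'prediction' is whatever the model returns."""

    def execute_prediction(self, model, texts):
        return [model(text) for text in texts]


class SketchPredictionTaskTransformation:
    """Calls the task's functions and attaches the results to the rows."""

    def __init__(self, prediction_task):
        self.prediction_task = prediction_task
        self.model = None  # provided by the outer wrapper

    def transform(self, rows):
        texts = [row["text_data"] for row in rows]
        predictions = self.prediction_task.execute_prediction(self.model, texts)
        return [{**row, "prediction": p} for row, p in zip(rows, predictions)]


class SketchWithModelTransformation:
    """Loads the model, then delegates to the wrapped transformation."""

    def __init__(self, load_model, inner):
        self.load_model = load_model
        self.inner = inner

    def transform(self, rows):
        self.inner.model = self.load_model(rows[0]["model_name"])
        return self.inner.transform(rows)


def fake_load_model(name):
    # Stand-in for loading a transformers model from BucketFS.
    return lambda text: f"{name}:{text[::-1]}"


wrapped = SketchWithModelTransformation(
    fake_load_model,
    SketchPredictionTaskTransformation(SketchReverseTask()),
)
result = wrapped.transform([{"model_name": "m", "text_data": "abc"}])
print(result)
```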
A typical UDF script will look like this:
class Ai<name>UDF(BaseModelUDF):
    def __init__(
        self,
        exa,  # the exasol context, comes from the udf caller
        batch_size=100,
        pipeline=transformers.pipeline,
        # defines which types of model the udf will load correctly,
        # depends on the task you want to solve
        base_model=transformers.AutoModelFor<ModelType>,
        tokenizer=transformers.AutoTokenizer,
        # defines the PredictionTask implementation you want to use.
        # depends on the transformers task-type you want to use.
        prediction_task=<TaskType>PredictionTask(...)
    ):
        # Define your pipeline here. You might also want to use a
        # span Transformation or a Default values transformation
        transformations = TransformationPipeline(
            [
                UniqueModelDataframeTransformation(),
                UniqueModelParamsDataframeTransformation(...),
                WithModelTransformation(
                    exa,
                    PredictionTaskTransformation(
                        prediction_task=prediction_task,
                        ...
                    ),
                ),
            ]
        )
        # initialize the super class
        super().__init__(
            batch_size,
            pipeline,
            base_model,
            tokenizer,
            prediction_task=prediction_task,
            transformations=transformations,
        )
We have already implemented the following Transformations,
which you might want to use in your UDF:
- UniqueModelDataframeTransformation: Splits the input DataFrame into multiple DataFrames, based on which model is found in the model_name, bucketfs_conn and sub_dir columns.
- UniqueModelParamsDataframeTransformation: Splits the input DataFrame into multiple DataFrames, based on which model-parameters are found. Calls PredictionTask.extract_unique_param_based_dataframes, since the model-parameters are tied to the transformers task-type.
- PredictionTaskTransformation: Calls prediction_task.execute_prediction, prediction_task.create_dataframes_from_predictions, prediction_task.append_predictions_to_input_dataframe and returns a DataFrame containing input and prediction results.
- SpanColumnsTokenClassificationTransformation: Transformation for adding result span columns to the output of the token-classification prediction task.
- SpanColumnsZeroShotTransformation: Transformation for adding result span columns to the output of the zero-shot-classification prediction task.
- RemoveColumnsTransformation: Transformation for removing all columns in removed_columns from the DataFrame.
- AddDefaultColumnsTransformation: Transformation for adding columns in new_columns and filling them with default values from default_udf_parameters.py.
You can add your own implementation of the Transformation-Protocol if needed. Please do so in the
exasol_transformers_extension/udfs/models/transformation
directory.
Typical Transformations have lists of columns as input, which can be used to ensure
the output format is correct even in case of an error.
This is important, since the udf can only emit DataFrames with the correct format.
These lists are typically named expected_input_columns, new_columns
and removed_columns.
Your new <YourTransformation>Transformation class should
implement the following functions:
- transform: Holds the logic of the transformation itself.
- check_input_format: Checks if all columns needed for transform are present.
- ensure_output_format: Ensures all promised output columns are present.
You might want to use exasol_transformers_extension/udfs/models/transformation/utils for some common tasks.
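A custom transformation with these three functions might look like the sketch below. The method names come from the list above; their exact signatures, the `text_data` and `text_lower` columns, the `apply` driver, and the use of lists of dicts in place of a DataFrame are assumptions made to keep the example self-contained.

```python
class SketchLowercaseTransformation:
    """Hypothetical transformation that adds a lower-cased copy of a text column."""

    def __init__(self, expected_input_columns, new_columns):
        self.expected_input_columns = expected_input_columns
        self.new_columns = new_columns

    def check_input_format(self, rows) -> bool:
        # True only if every row has every column that transform needs.
        return all(
            column in row for row in rows for column in self.expected_input_columns
        )

    def transform(self, rows):
        return [{**row, "text_lower": row["text_data"].lower()} for row in rows]

    def ensure_output_format(self, rows):
        # Add any promised column that is still missing, filled with None,
        # so the emitted rows always have the announced shape.
        defaults = {column: None for column in self.new_columns}
        return [{**defaults, **row} for row in rows]


def apply(transformation, rows):
    """Hypothetical driver: skip transform on bad input, but keep the output shape."""
    if transformation.check_input_format(rows):
        rows = transformation.transform(rows)
    return transformation.ensure_output_format(rows)


t = SketchLowercaseTransformation(["text_data"], ["text_lower"])
good = apply(t, [{"text_data": "ABC"}])
bad = apply(t, [{"other": 1}])
print(good, bad)
```

The second call shows why the column lists matter: even when the input is unusable, the output still contains the promised column.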
We have already implemented the following PredictionTasks,
which you might want to use in your UDF:
- FillMaskPredictionTask: Task logic for using the "fill-mask" transformers task.
- AnswerPredictionTask: Task logic for using the "question-answering" transformers task.
- EntailmentPredictionTask: Task logic for using the "text-classification" transformers task. Expects two text inputs per row.
- TextClassifyPredictionTask: Task logic for using the "text-classification" transformers task in a prediction UDF. Expects one text input per row.
- TextGenPredictionTask: Task logic for using the "text-generation" transformers task.
- TokenClassifyPredictionTask: Task logic for using the "token-classification" transformers task.
- TranslatePredictionTask: Task logic for using the "translation" transformers task.
- ZeroShotPredictionTask: Task logic for using the "zero-shot-classification" transformers task.
The <YourTask>PredictionTask class, in which we implement the logic of the desired task,
must be defined under the exasol_transformers_extension/udfs/models/prediction_tasks
directory. This class should extend the PredictionTask-Protocol.
The PredictionTask is a Protocol for ensuring the following methods are implemented
and have correct input and output types:
- extract_unique_param_based_dataframes: Even if the data in a given dataframe all have the same model, there might be differences within the given dataframe with different model parameters (e.g. top_k parameter in AIFillMaskExtendedUDF). This method is responsible for extracting unique dataframes which share both the same model and model parameters.
- execute_prediction: Performs prediction on a given text list using recently loaded models.
- create_dataframes_from_predictions: Converts list of predictions to pandas dataframe.
- append_predictions_to_input_dataframe: Reformats the dataframe used in prediction, such that each input row has a row for each prediction result.
Moreover, some parameters can be set to manage the model output:
- We use desired_fields_in_prediction to filter the output of the model.
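The four methods and the output filter can be sketched together. The method names and desired_fields_in_prediction come from the text above; the signatures, the `SketchTopKTask` class, the `fake_model` callable, the column names, and the use of lists of dicts in place of pandas DataFrames are assumptions for illustration only.

```python
from typing import Any, Protocol

Rows = list[dict[str, Any]]  # stand-in for a pandas DataFrame


class SketchPredictionTask(Protocol):
    """Assumed shape of the protocol; the real signatures may differ."""

    def extract_unique_param_based_dataframes(self, rows: Rows) -> list[Rows]: ...
    def execute_prediction(self, model: Any, rows: Rows) -> list[Any]: ...
    def create_dataframes_from_predictions(self, predictions: list[Any]) -> list[Rows]: ...
    def append_predictions_to_input_dataframe(self, rows: Rows, frames: list[Rows]) -> Rows: ...


class SketchTopKTask:
    desired_fields_in_prediction = ["answer", "score"]

    def extract_unique_param_based_dataframes(self, rows):
        # Group rows that share the same model parameter (here: top_k).
        groups: dict[int, Rows] = {}
        for row in rows:
            groups.setdefault(row["top_k"], []).append(row)
        return list(groups.values())

    def execute_prediction(self, model, rows):
        return [model(row["text_data"], row["top_k"]) for row in rows]

    def create_dataframes_from_predictions(self, predictions):
        # Keep only the desired fields of every prediction.
        fields = self.desired_fields_in_prediction
        return [[{f: p[f] for f in fields} for p in preds] for preds in predictions]

    def append_predictions_to_input_dataframe(self, rows, frames):
        # One output row per (input row, prediction) pair.
        return [{**row, **p} for row, preds in zip(rows, frames) for p in preds]


def fake_model(text, top_k):
    # Stand-in for a transformers pipeline call.
    return [
        {"answer": f"{text}-{i}", "score": 1.0 / (i + 1), "debug": "x"}
        for i in range(top_k)
    ]


task: SketchPredictionTask = SketchTopKTask()
rows = [{"text_data": "a", "top_k": 1}, {"text_data": "b", "top_k": 2}]
output: Rows = []
for group in task.extract_unique_param_based_dataframes(rows):
    predictions = task.execute_prediction(fake_model, group)
    frames = task.create_dataframes_from_predictions(predictions)
    output.extend(task.append_predictions_to_input_dataframe(group, frames))
print(output)
```

In this sketch the row with top_k=2 produces two output rows, and the `debug` field is dropped because it is not listed in desired_fields_in_prediction.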
- Unit tests use the udf-mock-python library that tests UDFs locally without a database.
- Different scenarios with different UDF inputs and different model parameters are defined under the test/unit/udf_wrapper_params/ directory.
- These different scenarios are parameterized in the UDF tests.
These tests are grouped into two groups and there are separate tests for each UDF script in each group:
- without db: tests the UDF class and functionality that includes the UDF logic.
- with_db: performs end-to-end tests by running the UDF query statements in the database.
The automatic runs of the integration tests on GitHub push have been moved to AWS for this repository. They are
only run if you add [CodeBuild] to the commit message.
Currently, the CodeBuild project is managed manually and is triggered with a webhook on branch push.
For this purpose, our aws-ci user has been added to this repository. The webhook can be configured in the AWS CodeBuild
project directly.
The CodeBuild project also uses our DockerHub user for the build. For this purpose, it has access to the AWS SecretsManager.
- Hugging Face models consist of two parts: the model and the tokenizer. Most of our functions deal with both parts.
We have multiple scripts and udfs for installing and downloading models:
InstallDefaultModelsUDF reads the DEFAULT_MODEL_SPECS and installs the default models by calling InUDFModelDownloader.
ModelDownloaderUDF gets a model as input, installs it by calling InUDFModelDownloader.
InUDFModelDownloader creates a bucketfs_location from exa and installs a model by calling HuggingFaceHubBucketFSModelTransferSP.
HuggingFaceHubBucketFSModelTransferSP downloads a model to a temporary directory and then installs the model to the given bucketfs_model_path using BucketFSModelUploader.
BucketFSModelUploader uploads model files to bucketfs using upload_model_files_to_bucketfs.
upload_model_command calls the upload_model function.
upload_model function creates a bucketfs_location from params and calls install_huggingface_model.
install_huggingface_model downloads and uploads a model to bucketfs using HuggingFaceHubBucketFSModelTransfer.
