diff --git a/README.md b/README.md
index c7543569..66ef23a7 100644
--- a/README.md
+++ b/README.md
@@ -1,23 +1,22 @@
-

+
Hugging Face Inference Toolkit
-
Hugging Face Inference Toolkit is for serving 🤗 Transformers models in containers. This library provides default pre-processing, prediction, and post-processing for Transformers and Sentence Transformers models. It is also possible to define a custom `handler.py` for customization. The Toolkit is built to work with the [Hugging Face Hub](https://huggingface.co/models).
---
## 💻 Getting Started with Hugging Face Inference Toolkit
-* Clone the repository `git clone https://github.com/huggingface/huggingface-inference-toolkit``
-* Install the dependencies in dev mode `pip install -e ".[torch, st, diffusers, test,quality]"`
- * If you develop on AWS inferentia2 install with `pip install -e ".[test,quality]" optimum-neuron[neuronx] --upgrade`
+* Clone the repository: `git clone https://github.com/huggingface/huggingface-inference-toolkit`
+* Install the dependencies in dev mode `pip install -e ".[torch,st,diffusers,test,quality]"`
+ * If you develop on AWS inferentia2 install with `pip install -e ".[test,quality]" optimum-neuron[neuronx] --upgrade`
+ * If you develop on Google Cloud install with `pip install -e ".[torch,st,diffusers,google,test,quality]"`
* Unit Testing: `make unit-test`
* Integration testing: `make integ-test`
-
### Local run
```bash
@@ -27,22 +26,22 @@ HF_MODEL_ID=hf-internal-testing/tiny-random-distilbert HF_MODEL_DIR=tmp2 HF_TASK
### Container
-
1. Build the preferred container for either CPU or GPU for PyTorch.
-_cpu images_
+_CPU Images_
+
```bash
make inference-pytorch-cpu
```
-_gpu images_
+_GPU Images_
+
```bash
make inference-pytorch-gpu
```
2. Run the container and either provide environment variables pointing to the Hub model you want to use, or mount a volume containing your model into the container.
-
```bash
docker run -ti -p 5000:5000 -e HF_MODEL_ID=distilbert-base-uncased-distilled-squad -e HF_TASK=question-answering integration-test-pytorch:cpu
docker run -ti -p 5000:5000 --gpus all -e HF_MODEL_ID=nlpconnect/vit-gpt2-image-captioning -e HF_TASK=image-to-text integration-test-pytorch:gpu
@@ -51,7 +50,6 @@ docker run -ti -p 5000:5000 --gpus all -e HF_MODEL_ID=stabilityai/stable-diffusi
docker run -ti -p 5000:5000 -e HF_MODEL_DIR=/repository -v $(pwd)/distilbert-base-uncased-emotion:/repository integration-test-pytorch:cpu
```
-
3. Send a request. The API schema is the same as that of the [Inference API](https://huggingface.co/docs/api-inference/detailed_parameters).
```bash
@@ -59,17 +57,19 @@ curl --request POST \
--url http://localhost:5000 \
--header 'Content-Type: application/json' \
--data '{
- "inputs": {
- "question": "What is used for inference?",
- "context": "My Name is Philipp and I live in Nuremberg. This model is used with sagemaker for inference."
- }
+ "inputs": {
+ "question": "What is used for inference?",
+ "context": "My Name is Philipp and I live in Nuremberg. This model is used with sagemaker for inference."
+ }
}'
```
### Custom Handler and dependency support
-The Hugging Face Inference Toolkit allows user to provide a custom inference through a `handler.py` file which is located in the repository.
-For an example check [https://huggingface.co/philschmid/custom-pipeline-text-classification](https://huggingface.co/philschmid/custom-pipeline-text-classification):
+The Hugging Face Inference Toolkit allows users to provide custom inference code through a `handler.py` file located in the repository.
+
+For an example check [philschmid/custom-pipeline-text-classification](https://huggingface.co/philschmid/custom-pipeline-text-classification):
+
```bash
model.tar.gz/
|- pytorch_model.bin
@@ -77,17 +77,17 @@ model.tar.gz/
|- handler.py
|- requirements.txt
```
+
In this example, `pytorch_model.bin` is the model file saved from training, `handler.py` is the custom inference handler, and `requirements.txt` is a requirements file for adding additional dependencies.
The custom module can override the following methods:
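+
+The toolkit loads the handler class configured via `HF_MODULE_NAME` (by default `handler.EndpointHandler`, see [const.py](src/huggingface_inference_toolkit/const.py)). Below is a minimal sketch of such a handler, assuming a text-classification model; the exact arguments the toolkit passes to the constructor and the pipeline used are illustrative assumptions, not the reference implementation:
+
+```python
+# handler.py -- illustrative sketch of a custom handler (assumptions noted above)
+from typing import Any, Dict, List
+
+from transformers import pipeline
+
+
+class EndpointHandler:
+    def __init__(self, model_dir: str = "", **kwargs) -> None:
+        # Load the model once at startup; model_dir points to the unpacked repository.
+        self.pipeline = pipeline("text-classification", model=model_dir)
+
+    def __call__(self, data: Dict[str, Any]) -> List[Dict[str, Any]]:
+        # Requests follow the Inference API schema: {"inputs": ..., "parameters": {...}}
+        inputs = data["inputs"]
+        parameters = data.get("parameters", {})
+        return self.pipeline(inputs, **parameters)
+```
+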
-
### Vertex AI Support
-The Hugging Face Inference Toolkit is also supported on Vertex AI, based on [Custom container requirements for prediction](https://cloud.google.com/vertex-ai/docs/predictions/custom-container-requirements). [Environment variables set by Vertex AI](https://cloud.google.com/vertex-ai/docs/predictions/custom-container-requirements#aip-variables) are automatically detected and used by the toolkit.
+The Hugging Face Inference Toolkit is also supported on Vertex AI, based on [Custom container requirements for prediction](https://cloud.google.com/vertex-ai/docs/predictions/custom-container-requirements). [Environment variables set by Vertex AI](https://cloud.google.com/vertex-ai/docs/predictions/custom-container-requirements#aip-variables) are automatically detected and used by the toolkit.
#### Local run with HF_MODEL_ID and HF_TASK
-Start Hugging Face Inference Toolkit with the following environment variables.
+Start the Hugging Face Inference Toolkit with the following environment variables.
```bash
mkdir tmp2/
@@ -101,8 +101,8 @@ curl --request POST \
--url http://localhost:8080/pred \
--header 'Content-Type: application/json' \
--data '{
- "instances": ["I love this product", "I hate this product"],
- "parameters": { "top_k": 2 }
+ "instances": ["I love this product", "I hate this product"],
+ "parameters": { "top_k": 2 }
}'
```
@@ -124,18 +124,19 @@ docker run -ti -p 8080:8080 -e AIP_MODE=PREDICTION -e AIP_HTTP_PORT=8080 -e AIP_
```bash
curl --request POST \
- --url http://localhost:8080/pred \
- --header 'Content-Type: application/json' \
- --data '{
- "instances": ["I love this product", "I hate this product"],
- "parameters": { "top_k": 2 }
+ --url http://localhost:8080/pred \
+ --header 'Content-Type: application/json' \
+ --data '{
+ "instances": ["I love this product", "I hate this product"],
+ "parameters": { "top_k": 2 }
}'
```
-### AWS Inferentia2 Support
+### AWS Inferentia2 Support
The Hugging Face Inference Toolkit provides support for deploying Hugging Face models on AWS Inferentia2. To deploy a model on Inferentia2 you have three options:
-* Provide `HF_MODEL_ID`, the model repo id on huggingface.co which contains the compiled model under `.neuron` format. e.g. `optimum/bge-base-en-v1.5-neuronx`
+
+* Provide `HF_MODEL_ID`, the model repo ID on huggingface.co of a repository that contains the compiled model in `.neuron` format, e.g. `optimum/bge-base-en-v1.5-neuronx`
* Provide the `HF_OPTIMUM_BATCH_SIZE` and `HF_OPTIMUM_SEQUENCE_LENGTH` environment variables to compile the model on the fly, e.g. `HF_OPTIMUM_BATCH_SIZE=1 HF_OPTIMUM_SEQUENCE_LENGTH=128`
* Include `neuron` dictionary in the [config.json](https://huggingface.co/optimum/tiny_random_bert_neuron/blob/main/config.json) file in the model archive, e.g. `neuron: {"static_batch_size": 1, "static_sequence_length": 128}`
@@ -143,16 +144,19 @@ The currently supported tasks can be found [here](https://huggingface.co/docs/op
#### Local run with HF_MODEL_ID and HF_TASK
-Start Hugging Face Inference Toolkit with the following environment variables.
+Start the Hugging Face Inference Toolkit with the following environment variables.
_Note: You need to run this on an Inferentia2 instance._
-- transformers `text-classification` with `HF_OPTIMUM_BATCH_SIZE` and `HF_OPTIMUM_SEQUENCE_LENGTH`
+* transformers `text-classification` with `HF_OPTIMUM_BATCH_SIZE` and `HF_OPTIMUM_SEQUENCE_LENGTH`
+
```bash
mkdir tmp2/
HF_MODEL_ID="distilbert/distilbert-base-uncased-finetuned-sst-2-english" HF_TASK="text-classification" HF_OPTIMUM_BATCH_SIZE=1 HF_OPTIMUM_SEQUENCE_LENGTH=128 HF_MODEL_DIR=tmp2 uvicorn src.huggingface_inference_toolkit.webservice_starlette:app --port 5000
```
-- sentence transformers `feature-extration` with `HF_OPTIMUM_BATCH_SIZE` and `HF_OPTIMUM_SEQUENCE_LENGTH`
+
+* sentence transformers `feature-extraction` with `HF_OPTIMUM_BATCH_SIZE` and `HF_OPTIMUM_SEQUENCE_LENGTH`
+
```bash
HF_MODEL_ID="sentence-transformers/all-MiniLM-L6-v2" HF_TASK="feature-extraction" HF_OPTIMUM_BATCH_SIZE=1 HF_OPTIMUM_SEQUENCE_LENGTH=128 HF_MODEL_DIR=tmp2 uvicorn src.huggingface_inference_toolkit.webservice_starlette:app --port 5000
```
@@ -161,16 +165,15 @@ Send request
```bash
curl --request POST \
- --url http://localhost:5000 \
- --header 'Content-Type: application/json' \
- --data '{
- "inputs": "Wow, this is such a great product. I love it!"
+ --url http://localhost:5000 \
+ --header 'Content-Type: application/json' \
+ --data '{
+ "inputs": "Wow, this is such a great product. I love it!"
}'
```
#### Container run with HF_MODEL_ID and HF_TASK
-
1. Build the preferred container for either CPU or GPU for PyTorch.
```bash
@@ -187,26 +190,25 @@ docker run -ti -p 5000:5000 -e HF_MODEL_ID="distilbert/distilbert-base-uncased-f
```bash
curl --request POST \
- --url http://localhost:5000 \
- --header 'Content-Type: application/json' \
- --data '{
- "inputs": "Wow, this is such a great product. I love it!",
- "parameters": { "top_k": 2 }
+ --url http://localhost:5000 \
+ --header 'Content-Type: application/json' \
+ --data '{
+ "inputs": "Wow, this is such a great product. I love it!",
+ "parameters": { "top_k": 2 }
}'
```
-
---
## 🛠️ Environment variables
-The Hugging Face Inference Toolkit implements various additional environment variables to simplify your deployment experience. A full list of environment variables is given below. All potential environment varialbes can be found in [const.py](src/huggingface_inference_toolkit/const.py)
+The Hugging Face Inference Toolkit implements various additional environment variables to simplify your deployment experience. A full list of environment variables is given below. All potential environment variables can be found in [const.py](src/huggingface_inference_toolkit/const.py)
### `HF_MODEL_DIR`
-The `HF_MODEL_DIR` environment variable defines the directory where your model is stored or will be stored.
-If `HF_MODEL_ID` is not set the toolkit expects a the model artifact at this directory. This value should be set to the value where you mount your model artifacts.
-If `HF_MODEL_ID` is set the toolkit and the directory where `HF_MODEL_DIR` is pointing to is empty. The toolkit will download the model from the Hub to this directory.
+The `HF_MODEL_DIR` environment variable defines the directory where your model is stored or will be stored.
+If `HF_MODEL_ID` is not set, the toolkit expects the model artifact in this directory. Set this value to the directory where you mount your model artifacts.
+If `HF_MODEL_ID` is set and the directory `HF_MODEL_DIR` points to is empty, the toolkit will download the model from the Hub into this directory.
The default value is `/opt/huggingface/model`
@@ -246,6 +248,14 @@ The `HF_HUB_TOKEN` environment variable defines the your Hugging Face authorizat
HF_HUB_TOKEN="api_XXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
```
+### `HF_TRUST_REMOTE_CODE`
+
+The `HF_TRUST_REMOTE_CODE` environment variable defines whether to trust remote code. It controls whether community-defined inference code shipped with a model repository may be executed, so it reflects the level of confidence you place in the model provider when loading models from the Hugging Face Hub. The default value is `"0"`; set it to `"1"` to trust remote code.
+
+```bash
+HF_TRUST_REMOTE_CODE="0"
+```
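+
+Internally the toolkit parses this variable with `strtobool` and forwards the resulting boolean to the pipeline. A simplified sketch of that flow, condensed from `const.py` and `handler.py` in this repository:
+
+```python
+import os
+
+from huggingface_inference_toolkit.env_utils import strtobool
+
+# "1", "true", "yes", ... -> True; "0", "false", "no", ... -> False
+HF_TRUST_REMOTE_CODE = strtobool(os.environ.get("HF_TRUST_REMOTE_CODE", "0"))
+
+# The handler then passes the flag through when building the pipeline, e.g.:
+# get_pipeline(model_dir=..., task=..., framework="pt", trust_remote_code=HF_TRUST_REMOTE_CODE)
+```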
+
### `HF_FRAMEWORK`
The `HF_FRAMEWORK` environment variable defines the base deep learning framework used in the container. This is important when loading large models from the Hugging Face Hub to avoid extra file downloads.
@@ -256,7 +266,7 @@ HF_FRAMEWORK="pytorch"
#### `HF_OPTIMUM_BATCH_SIZE`
-The `HF_OPTIMUM_BATCH_SIZE` environment variable defines the batch size, which is used when compiling the model to Neuron. The default value is `1`. Not required when model is already converted.
+The `HF_OPTIMUM_BATCH_SIZE` environment variable defines the batch size used when compiling the model for Neuron. The default value is `1`. It is not required when the model is already converted.
```bash
HF_OPTIMUM_BATCH_SIZE="1"
@@ -264,7 +274,7 @@ HF_OPTIMUM_BATCH_SIZE="1"
#### `HF_OPTIMUM_SEQUENCE_LENGTH`
-The `HF_OPTIMUM_SEQUENCE_LENGTH` environment variable defines the sequence length, which is used when compiling the model to Neuron. There is no default value. Not required when model is already converted.
+The `HF_OPTIMUM_SEQUENCE_LENGTH` environment variable defines the sequence length used when compiling the model for Neuron. There is no default value. It is not required when the model is already converted.
```bash
HF_OPTIMUM_SEQUENCE_LENGTH="128"
@@ -272,12 +282,12 @@ HF_OPTIMUM_SEQUENCE_LENGTH="128"
---
-## ⚙ Supported Frontend
+## ⚙ Supported Front-Ends
-- [x] Starlette (HF Endpoints)
-- [x] Starlette (Vertex AI)
-- [ ] Starlette (Azure ML)
-- [ ] Starlette (SageMaker)
+* [x] Starlette (HF Endpoints)
+* [x] Starlette (Vertex AI)
+* [ ] Starlette (Azure ML)
+* [ ] Starlette (SageMaker)
---
@@ -287,6 +297,6 @@ HF_OPTIMUM_SEQUENCE_LENGTH="128"
## 📜 License
-TBD.
+TBD.
---
diff --git a/setup.py b/setup.py
index 7068f7a6..374ee85b 100644
--- a/setup.py
+++ b/setup.py
@@ -5,12 +5,12 @@
# We don't declare our dependency on transformers here because we build with
# different packages for different variants
-VERSION = "0.4.1.dev0"
+VERSION = "0.4.2"
# Ubuntu packages
# libsndfile1-dev: torchaudio requires the development version of the libsndfile package which can be installed via a system package manager. On Ubuntu it can be installed as follows: apt install libsndfile1-dev
# ffmpeg: ffmpeg is required for audio processing. On Ubuntu it can be installed as follows: apt install ffmpeg
-# libavcodec-extra : libavcodec-extra inculdes additional codecs for ffmpeg
+# libavcodec-extra : libavcodec-extra includes additional codecs for ffmpeg
install_requires = [
"transformers[sklearn,sentencepiece,audio,vision]==4.41.1",
diff --git a/src/huggingface_inference_toolkit/const.py b/src/huggingface_inference_toolkit/const.py
index 993fea26..eb6ddcb8 100644
--- a/src/huggingface_inference_toolkit/const.py
+++ b/src/huggingface_inference_toolkit/const.py
@@ -1,13 +1,18 @@
import os
from pathlib import Path
+from huggingface_inference_toolkit.env_utils import strtobool
+
HF_MODEL_DIR = os.environ.get("HF_MODEL_DIR", "/opt/huggingface/model")
HF_MODEL_ID = os.environ.get("HF_MODEL_ID", None)
HF_TASK = os.environ.get("HF_TASK", None)
HF_FRAMEWORK = os.environ.get("HF_FRAMEWORK", None)
HF_REVISION = os.environ.get("HF_REVISION", None)
HF_HUB_TOKEN = os.environ.get("HF_HUB_TOKEN", None)
+HF_TRUST_REMOTE_CODE = strtobool(os.environ.get("HF_TRUST_REMOTE_CODE", "0"))
# custom handler consts
HF_DEFAULT_PIPELINE_NAME = os.environ.get("HF_DEFAULT_PIPELINE_NAME", "handler.py")
# default is pipeline.PreTrainedPipeline
-HF_MODULE_NAME = os.environ.get("HF_MODULE_NAME", f"{Path(HF_DEFAULT_PIPELINE_NAME).stem}.EndpointHandler")
+HF_MODULE_NAME = os.environ.get(
+ "HF_MODULE_NAME", f"{Path(HF_DEFAULT_PIPELINE_NAME).stem}.EndpointHandler"
+)
diff --git a/src/huggingface_inference_toolkit/diffusers_utils.py b/src/huggingface_inference_toolkit/diffusers_utils.py
index f6241032..afe96676 100644
--- a/src/huggingface_inference_toolkit/diffusers_utils.py
+++ b/src/huggingface_inference_toolkit/diffusers_utils.py
@@ -1,4 +1,5 @@
import importlib.util
+from typing import Union
from transformers.utils.import_utils import is_torch_bf16_gpu_available
@@ -21,14 +22,16 @@ def is_diffusers_available():
class IEAutoPipelineForText2Image:
- def __init__(self, model_dir: str, device: str = None): # needs "cuda" for GPU
+ def __init__(
+ self, model_dir: str, device: Union[str, None] = None, **kwargs
+ ): # needs "cuda" for GPU
dtype = torch.float32
if device == "cuda":
dtype = torch.bfloat16 if is_torch_bf16_gpu_available() else torch.float16
device_map = "auto" if device == "cuda" else None
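+        # Any extra kwargs (e.g. trust_remote_code) are forwarded to from_pretrained unchanged.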
self.pipeline = AutoPipelineForText2Image.from_pretrained(
- model_dir, torch_dtype=dtype, device_map=device_map
+ model_dir, torch_dtype=dtype, device_map=device_map, **kwargs
)
# try to use DPMSolverMultistepScheduler
if isinstance(self.pipeline, StableDiffusionPipeline):
@@ -66,5 +69,5 @@ def __call__(
def get_diffusers_pipeline(task=None, model_dir=None, device=-1, **kwargs):
"""Get a pipeline for Diffusers models."""
device = "cuda" if device == 0 else "cpu"
- pipeline = DIFFUSERS_TASKS[task](model_dir=model_dir, device=device)
+ pipeline = DIFFUSERS_TASKS[task](model_dir=model_dir, device=device, **kwargs)
return pipeline
diff --git a/src/huggingface_inference_toolkit/env_utils.py b/src/huggingface_inference_toolkit/env_utils.py
new file mode 100644
index 00000000..e582ec98
--- /dev/null
+++ b/src/huggingface_inference_toolkit/env_utils.py
@@ -0,0 +1,22 @@
+def strtobool(val: str) -> bool:
+ """Convert a string representation of truth to True or False booleans.
+ True values are 'y', 'yes', 't', 'true', 'on', and '1'; false values
+ are 'n', 'no', 'f', 'false', 'off', and '0'.
+
+ Raises:
+ ValueError: if 'val' is anything else.
+
+ Note:
+ Function `strtobool` copied and adapted from `distutils`, as it's deprecated from Python 3.10 onwards.
+
+ References:
+ - https://github.com/python/cpython/blob/48f9d3e3faec5faaa4f7c9849fecd27eae4da213/Lib/distutils/util.py#L308-L321
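+
+    Example:
+        >>> strtobool("yes")
+        True
+        >>> strtobool("0")
+        False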
+ """
+ val = val.lower()
+ if val in ("y", "yes", "t", "true", "on", "1"):
+ return True
+ if val in ("n", "no", "f", "false", "off", "0"):
+ return False
+ raise ValueError(
+ f"Invalid truth value, it should be a string but {val} was provided instead."
+ )
diff --git a/src/huggingface_inference_toolkit/handler.py b/src/huggingface_inference_toolkit/handler.py
index 5b164af8..636f185b 100644
--- a/src/huggingface_inference_toolkit/handler.py
+++ b/src/huggingface_inference_toolkit/handler.py
@@ -2,6 +2,7 @@
from pathlib import Path
from typing import Optional, Union
+from huggingface_inference_toolkit.const import HF_TRUST_REMOTE_CODE
from huggingface_inference_toolkit.utils import (
check_and_register_custom_pipeline_from_directory,
get_pipeline,
@@ -16,7 +17,10 @@ class HuggingFaceHandler:
def __init__(self, model_dir: Union[str, Path], task=None, framework="pt"):
self.pipeline = get_pipeline(
- model_dir=model_dir, task=task, framework=framework
+ model_dir=model_dir,
+ task=task,
+ framework=framework,
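+            # parsed from the HF_TRUST_REMOTE_CODE environment variable (defaults to "0", i.e. False)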
+ trust_remote_code=HF_TRUST_REMOTE_CODE,
)
def __call__(self, data):
@@ -66,7 +70,7 @@ def __call__(self, data):
payload = {"inputs": inputs, "parameters": parameters}
predictions.append(super().__call__(payload))
- # reutrn predictions
+ # return predictions
return {"predictions": predictions}
diff --git a/src/huggingface_inference_toolkit/sentence_transformers_utils.py b/src/huggingface_inference_toolkit/sentence_transformers_utils.py
index 72bb2ee2..a4df0955 100644
--- a/src/huggingface_inference_toolkit/sentence_transformers_utils.py
+++ b/src/huggingface_inference_toolkit/sentence_transformers_utils.py
@@ -54,5 +54,5 @@ def get_sentence_transformers_pipeline(
**kwargs
):
device = "cuda" if device == 0 else "cpu"
- pipeline = SENTENCE_TRANSFORMERS_TASKS[task](model_dir=model_dir, device=device)
+ pipeline = SENTENCE_TRANSFORMERS_TASKS[task](model_dir=model_dir, device=device, **kwargs)
return pipeline
diff --git a/src/huggingface_inference_toolkit/utils.py b/src/huggingface_inference_toolkit/utils.py
index a0519d92..89261d71 100644
--- a/src/huggingface_inference_toolkit/utils.py
+++ b/src/huggingface_inference_toolkit/utils.py
@@ -8,7 +8,11 @@
from transformers.file_utils import is_tf_available, is_torch_available
from transformers.pipelines import Pipeline
-from huggingface_inference_toolkit.const import HF_DEFAULT_PIPELINE_NAME, HF_MODULE_NAME
+from huggingface_inference_toolkit.const import (
+ HF_DEFAULT_PIPELINE_NAME,
+ HF_MODULE_NAME,
+ HF_TRUST_REMOTE_CODE,
+)
from huggingface_inference_toolkit.diffusers_utils import (
get_diffusers_pipeline,
is_diffusers_available,
diff --git a/src/huggingface_inference_toolkit/webservice_starlette.py b/src/huggingface_inference_toolkit/webservice_starlette.py
index f8da6dcb..52cf161b 100644
--- a/src/huggingface_inference_toolkit/webservice_starlette.py
+++ b/src/huggingface_inference_toolkit/webservice_starlette.py
@@ -80,7 +80,7 @@ async def predict(request):
# checks if input schema is correct
if "inputs" not in deserialized_body and "instances" not in deserialized_body:
raise ValueError(
- f"Body needs to provide a inputs key, recieved: {orjson.dumps(deserialized_body)}"
+ f"Body needs to provide a inputs key, received: {orjson.dumps(deserialized_body)}"
)
# check for query parameter and add them to the body