
Add HF_TRUST_REMOTE_CODE environment variable #78

Merged
merged 10 commits on Aug 9, 2024
README.md: 120 changes (65 additions, 55 deletions)
@@ -1,23 +1,22 @@

<div style="display:flex; text-align:center; justify-content:center;">
<img src="https://huggingface.co/front/assets/huggingface_logo.svg" width="100"/>
<h1 style="margin-top:auto;"> Hugging Face Inference Toolkit </h1>
</div>


Hugging Face Inference Toolkit is for serving 🤗 Transformers models in containers. This library provides default pre-processing, prediction and post-processing for Transformers and Sentence Transformers models. It is also possible to define a custom `handler.py` for customization. The Toolkit is built to work with the [Hugging Face Hub](https://huggingface.co/models).

---

## 💻 Getting Started with Hugging Face Inference Toolkit

* Clone the repository `git clone https://github.com/huggingface/huggingface-inference-toolkit`
* Install the dependencies in dev mode `pip install -e ".[torch,st,diffusers,test,quality]"`
* If you develop on AWS Inferentia2, install with `pip install -e ".[test,quality]" optimum-neuron[neuronx] --upgrade`
* If you develop on Google Cloud, install with `pip install -e ".[torch,st,diffusers,google,test,quality]"`
* Unit Testing: `make unit-test`
* Integration testing: `make integ-test`


### Local run

```bash
@@ -27,22 +26,22 @@ HF_MODEL_ID=hf-internal-testing/tiny-random-distilbert HF_MODEL_DIR=tmp2 HF_TASK

### Container


1. Build the preferred container for either CPU or GPU for PyTorch.

_CPU Images_

```bash
make inference-pytorch-cpu
```

_GPU Images_

```bash
make inference-pytorch-gpu
```

2. Run the container, and either provide environment variables pointing to the Hub model you want to use or mount a volume with your model into the container.


```bash
docker run -ti -p 5000:5000 -e HF_MODEL_ID=distilbert-base-uncased-distilled-squad -e HF_TASK=question-answering integration-test-pytorch:cpu
docker run -ti -p 5000:5000 --gpus all -e HF_MODEL_ID=nlpconnect/vit-gpt2-image-captioning -e HF_TASK=image-to-text integration-test-pytorch:gpu
@@ -51,43 +50,44 @@ docker run -ti -p 5000:5000 --gpus all -e HF_MODEL_ID=stabilityai/stable-diffusi
docker run -ti -p 5000:5000 -e HF_MODEL_DIR=/repository -v $(pwd)/distilbert-base-uncased-emotion:/repository integration-test-pytorch:cpu
```


3. Send a request. The API schema is the same as the [Inference API](https://huggingface.co/docs/api-inference/detailed_parameters)

```bash
curl --request POST \
--url http://localhost:5000 \
--header 'Content-Type: application/json' \
--data '{
"inputs": {
"question": "What is used for inference?",
"context": "My Name is Philipp and I live in Nuremberg. This model is used with sagemaker for inference."
}
"inputs": {
"question": "What is used for inference?",
"context": "My Name is Philipp and I live in Nuremberg. This model is used with sagemaker for inference."
}
}'
```

### Custom Handler and dependency support

The Hugging Face Inference Toolkit allows users to provide custom inference code through a `handler.py` file located in the model repository.

For an example, check [philschmid/custom-pipeline-text-classification](https://huggingface.co/philschmid/custom-pipeline-text-classification):

```bash
model.tar.gz/
|- pytorch_model.bin
|- ....
|- handler.py
|- requirements.txt
```

In this example, `pytorch_model.bin` is the model file saved from training, `handler.py` is the custom inference handler, and `requirements.txt` is a requirements file to add additional dependencies.
The custom module can override the following methods:
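As an illustration (not part of this PR), a minimal `handler.py` could look like the sketch below. The `EndpointHandler` class name matches the default `HF_MODULE_NAME` in `const.py`, while the `__init__`/`__call__` signatures and the `text-classification` task are assumptions for the example.

```python
from typing import Any, Dict, List

from transformers import pipeline


class EndpointHandler:
    def __init__(self, path: str = "") -> None:
        # Called once at startup; `path` is the directory containing the model weights (HF_MODEL_DIR).
        self.pipeline = pipeline("text-classification", model=path)

    def __call__(self, data: Dict[str, Any]) -> List[Dict[str, Any]]:
        # `data` is the deserialized request body; "inputs" mirrors the Inference API schema.
        inputs = data.get("inputs", data)
        return self.pipeline(inputs)
```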


### Vertex AI Support

The Hugging Face Inference Toolkit is also supported on Vertex AI, based on [Custom container requirements for prediction](https://cloud.google.com/vertex-ai/docs/predictions/custom-container-requirements). [Environment variables set by Vertex AI](https://cloud.google.com/vertex-ai/docs/predictions/custom-container-requirements#aip-variables) are automatically detected and used by the toolkit.

#### Local run with HF_MODEL_ID and HF_TASK

Start Hugging Face Inference Toolkit with the following environment variables.

```bash
mkdir tmp2/
@@ -101,8 +101,8 @@ curl --request POST \
--url http://localhost:8080/pred \
--header 'Content-Type: application/json' \
--data '{
"instances": ["I love this product", "I hate this product"],
"parameters": { "top_k": 2 }
"instances": ["I love this product", "I hate this product"],
"parameters": { "top_k": 2 }
}'
```

@@ -124,35 +124,39 @@ docker run -ti -p 8080:8080 -e AIP_MODE=PREDICTION -e AIP_HTTP_PORT=8080 -e AIP_

```bash
curl --request POST \
--url http://localhost:8080/pred \
--header 'Content-Type: application/json' \
--data '{
"instances": ["I love this product", "I hate this product"],
"parameters": { "top_k": 2 }
}'
```

### AWS Inferentia2 Support

The Hugging Face Inference Toolkit provides support for deploying Hugging Face models on AWS Inferentia2. To deploy a model on Inferentia2 you have three options:

* Provide `HF_MODEL_ID`, the repo id of a model on huggingface.co that contains the compiled model in `.neuron` format, e.g. `optimum/bge-base-en-v1.5-neuronx`
* Provide the `HF_OPTIMUM_BATCH_SIZE` and `HF_OPTIMUM_SEQUENCE_LENGTH` environment variables to compile the model on the fly, e.g. `HF_OPTIMUM_BATCH_SIZE=1 HF_OPTIMUM_SEQUENCE_LENGTH=128`
* Include a `neuron` dictionary in the [config.json](https://huggingface.co/optimum/tiny_random_bert_neuron/blob/main/config.json) file of the model archive, e.g. `neuron: {"static_batch_size": 1, "static_sequence_length": 128}`

The currently supported tasks can be found [here](https://huggingface.co/docs/optimum-neuron/en/package_reference/supported_models). If you plan to deploy an LLM, we recommend taking a look at [Neuronx TGI](https://huggingface.co/blog/text-generation-inference-on-inferentia2), which is purpose-built for LLMs.

#### Local run with HF_MODEL_ID and HF_TASK

Start Hugging Face Inference Toolkit with the following environment variables.

_Note: You need to run this on an Inferentia2 instance._

* transformers `text-classification` with `HF_OPTIMUM_BATCH_SIZE` and `HF_OPTIMUM_SEQUENCE_LENGTH`

```bash
mkdir tmp2/
HF_MODEL_ID="distilbert/distilbert-base-uncased-finetuned-sst-2-english" HF_TASK="text-classification" HF_OPTIMUM_BATCH_SIZE=1 HF_OPTIMUM_SEQUENCE_LENGTH=128 HF_MODEL_DIR=tmp2 uvicorn src.huggingface_inference_toolkit.webservice_starlette:app --port 5000
```

* sentence transformers `feature-extraction` with `HF_OPTIMUM_BATCH_SIZE` and `HF_OPTIMUM_SEQUENCE_LENGTH`

```bash
HF_MODEL_ID="sentence-transformers/all-MiniLM-L6-v2" HF_TASK="feature-extraction" HF_OPTIMUM_BATCH_SIZE=1 HF_OPTIMUM_SEQUENCE_LENGTH=128 HF_MODEL_DIR=tmp2 uvicorn src.huggingface_inference_toolkit.webservice_starlette:app --port 5000
```
@@ -161,16 +165,15 @@ Send request

```bash
curl --request POST \
--url http://localhost:5000 \
--header 'Content-Type: application/json' \
--data '{
"inputs": "Wow, this is such a great product. I love it!"
}'
```

#### Container run with HF_MODEL_ID and HF_TASK


1. Build the preferred container for either CPU or GPU for PyTorch.

```bash
@@ -187,26 +190,25 @@ docker run -ti -p 5000:5000 -e HF_MODEL_ID="distilbert/distilbert-base-uncased-f

```bash
curl --request POST \
--url http://localhost:5000 \
--header 'Content-Type: application/json' \
--data '{
"inputs": "Wow, this is such a great product. I love it!",
"parameters": { "top_k": 2 }
}'
```


---

## 🛠️ Environment variables

The Hugging Face Inference Toolkit implements various additional environment variables to simplify your deployment experience. A full list of environment variables is given below. All potential environment variables can be found in [const.py](src/huggingface_inference_toolkit/const.py)

### `HF_MODEL_DIR`

The `HF_MODEL_DIR` environment variable defines the directory where your model is stored or will be stored.
If `HF_MODEL_ID` is not set, the toolkit expects the model artifact in this directory. Set this value to the path where you mount your model artifacts.
If `HF_MODEL_ID` is set and the directory `HF_MODEL_DIR` points to is empty, the toolkit will download the model from the Hub into this directory.

The default value is `/opt/huggingface/model`

@@ -246,6 +248,14 @@ The `HF_HUB_TOKEN` environment variable defines your Hugging Face authorizat
HF_HUB_TOKEN="api_XXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
```

### `HF_TRUST_REMOTE_CODE`

The `HF_TRUST_REMOTE_CODE` environment variable defines whether to trust remote code. This flag is used for community-defined inference code, so it reflects the level of confidence you place in the model providers when loading models from the Hugging Face Hub. The default value is `"0"`; set it to `"1"` to trust remote code.

```bash
HF_TRUST_REMOTE_CODE="0"
```
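For orientation only: the flag is parsed with the `strtobool` helper added in `const.py` (see the diff further down). Below is a rough sketch of how the resulting boolean could be forwarded to a Transformers pipeline; the exact call site inside the toolkit may differ.

```python
import os

from transformers import pipeline

# Mirrors const.py: parse the flag once from the environment (default: do not trust remote code).
TRUST_REMOTE_CODE = os.environ.get("HF_TRUST_REMOTE_CODE", "0").lower() in ("1", "true", "yes", "y", "t", "on")

clf = pipeline(
    task="text-classification",
    model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    trust_remote_code=TRUST_REMOTE_CODE,  # only enable this for repositories you trust
)
print(clf("Wow, this is such a great product. I love it!"))
```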

### `HF_FRAMEWORK`

The `HF_FRAMEWORK` environment variable defines the base deep learning framework used in the container. This is important when loading large models from the Hugging Face Hub to avoid extra file downloads.

@@ -256,28 +266,28 @@
Expand All @@ -256,28 +266,28 @@ HF_FRAMEWORK="pytorch"

#### `HF_OPTIMUM_BATCH_SIZE`

The `HF_OPTIMUM_BATCH_SIZE` environment variable defines the batch size used when compiling the model for Neuron. The default value is `1`. It is not required when the model is already converted.

```bash
HF_OPTIMUM_BATCH_SIZE="1"
```

#### `HF_OPTIMUM_SEQUENCE_LENGTH`

The `HF_OPTIMUM_SEQUENCE_LENGTH` environment variable defines the sequence length used when compiling the model for Neuron. There is no default value. It is not required when the model is already converted.

```bash
HF_OPTIMUM_SEQUENCE_LENGTH="128"
```

---

## ⚙ Supported Front-Ends

* [x] Starlette (HF Endpoints)
* [x] Starlette (Vertex AI)
* [ ] Starlette (Azure ML)
* [ ] Starlette (SageMaker)

---

@@ -287,6 +297,6 @@ HF_OPTIMUM_SEQUENCE_LENGTH="128"

## 📜 License

TBD.

---
setup.py: 4 changes (2 additions, 2 deletions)
@@ -5,12 +5,12 @@
# We don't declare our dependency on transformers here because we build with
# different packages for different variants

VERSION = "0.4.1.dev0"
VERSION = "0.4.2"

# Ubuntu packages
# libsndfile1-dev: torchaudio requires the development version of the libsndfile package which can be installed via a system package manager. On Ubuntu it can be installed as follows: apt install libsndfile1-dev
# ffmpeg: ffmpeg is required for audio processing. On Ubuntu it can be installed as follows: apt install ffmpeg
# libavcodec-extra : libavcodec-extra includes additional codecs for ffmpeg

install_requires = [
"transformers[sklearn,sentencepiece,audio,vision]==4.41.1",
src/huggingface_inference_toolkit/const.py: 7 changes (6 additions, 1 deletion)
@@ -1,13 +1,18 @@
import os
from pathlib import Path

from huggingface_inference_toolkit.env_utils import strtobool

HF_MODEL_DIR = os.environ.get("HF_MODEL_DIR", "/opt/huggingface/model")
HF_MODEL_ID = os.environ.get("HF_MODEL_ID", None)
HF_TASK = os.environ.get("HF_TASK", None)
HF_FRAMEWORK = os.environ.get("HF_FRAMEWORK", None)
HF_REVISION = os.environ.get("HF_REVISION", None)
HF_HUB_TOKEN = os.environ.get("HF_HUB_TOKEN", None)
HF_TRUST_REMOTE_CODE = strtobool(os.environ.get("HF_TRUST_REMOTE_CODE", "0"))
# custom handler consts
HF_DEFAULT_PIPELINE_NAME = os.environ.get("HF_DEFAULT_PIPELINE_NAME", "handler.py")
# default is pipeline.PreTrainedPipeline
HF_MODULE_NAME = os.environ.get("HF_MODULE_NAME", f"{Path(HF_DEFAULT_PIPELINE_NAME).stem}.EndpointHandler")
HF_MODULE_NAME = os.environ.get(
"HF_MODULE_NAME", f"{Path(HF_DEFAULT_PIPELINE_NAME).stem}.EndpointHandler"
)
src/huggingface_inference_toolkit/diffusers_utils.py: 9 changes (6 additions, 3 deletions)
@@ -1,4 +1,5 @@
import importlib.util
from typing import Union

from transformers.utils.import_utils import is_torch_bf16_gpu_available

@@ -21,14 +22,16 @@ def is_diffusers_available():


class IEAutoPipelineForText2Image:
def __init__(self, model_dir: str, device: str = None): # needs "cuda" for GPU
def __init__(
self, model_dir: str, device: Union[str, None] = None, **kwargs
): # needs "cuda" for GPU
dtype = torch.float32
if device == "cuda":
dtype = torch.bfloat16 if is_torch_bf16_gpu_available() else torch.float16
device_map = "auto" if device == "cuda" else None

self.pipeline = AutoPipelineForText2Image.from_pretrained(
model_dir, torch_dtype=dtype, device_map=device_map
model_dir, torch_dtype=dtype, device_map=device_map, **kwargs
)
# try to use DPMSolverMultistepScheduler
if isinstance(self.pipeline, StableDiffusionPipeline):
@@ -66,5 +69,5 @@ def __call__(
def get_diffusers_pipeline(task=None, model_dir=None, device=-1, **kwargs):
"""Get a pipeline for Diffusers models."""
device = "cuda" if device == 0 else "cpu"
pipeline = DIFFUSERS_TASKS[task](model_dir=model_dir, device=device)
pipeline = DIFFUSERS_TASKS[task](model_dir=model_dir, device=device, **kwargs)
return pipeline
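With the new `**kwargs` forwarding, loading options such as `trust_remote_code` can flow from `get_diffusers_pipeline` down to `AutoPipelineForText2Image.from_pretrained`. A rough usage sketch; the task key, model id, and device choice are assumptions for illustration:

```python
from huggingface_inference_toolkit.diffusers_utils import get_diffusers_pipeline

# Assumes a GPU container, where the toolkit maps device index 0 to "cuda".
pipe = get_diffusers_pipeline(
    task="text-to-image",
    model_dir="stabilityai/stable-diffusion-xl-base-1.0",
    device=0,
    trust_remote_code=False,  # forwarded via **kwargs to from_pretrained
)
```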
src/huggingface_inference_toolkit/env_utils.py: 22 changes (22 additions, 0 deletions)
@@ -0,0 +1,22 @@
def strtobool(val: str) -> bool:
"""Convert a string representation of truth to True or False booleans.
True values are 'y', 'yes', 't', 'true', 'on', and '1'; false values
are 'n', 'no', 'f', 'false', 'off', and '0'.

Raises:
ValueError: if 'val' is anything else.

Note:
Function `strtobool` copied and adapted from `distutils`, as it's deprecated from Python 3.10 onwards.

References:
- https://github.com/python/cpython/blob/48f9d3e3faec5faaa4f7c9849fecd27eae4da213/Lib/distutils/util.py#L308-L321
"""
val = val.lower()
if val in ("y", "yes", "t", "true", "on", "1"):
return True
if val in ("n", "no", "f", "false", "off", "0"):
return False
raise ValueError(
f"Invalid truth value, it should be a string but {val} was provided instead."
)
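
A quick usage sketch (not part of the diff) showing how the helper behaves:

```python
from huggingface_inference_toolkit.env_utils import strtobool

assert strtobool("1") is True      # truthy spellings: y, yes, t, true, on, 1
assert strtobool("off") is False   # falsy spellings: n, no, f, false, off, 0

try:
    strtobool("maybe")
except ValueError as exc:
    print(exc)  # Invalid truth value, it should be a string but maybe was provided instead.
```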