
Add HF_TRUST_REMOTE_CODE environment variable #78

Merged
merged 10 commits on Aug 9, 2024
README.md: 120 changes (65 additions, 55 deletions)
@@ -1,23 +1,22 @@

<div style="display:flex; text-align:center; justify-content:center;">
<img src="https://huggingface.co/front/assets/huggingface_logo.svg" width="100"/>
<h1 style="margin-top:auto;"> Hugging Face Inference Toolkit </h1>
</div>


Hugging Face Inference Toolkit is for serving 🤗 Transformers models in containers. This library provides default pre-processing, prediction and post-processing for Transformers and Sentence Transformers models. It is also possible to define a custom `handler.py` for customization. The Toolkit is built to work with the [Hugging Face Hub](https://huggingface.co/models).

---

## 💻 Getting Started with Hugging Face Inference Toolkit

* Clone the repository `git clone https://github.com/huggingface/huggingface-inference-toolkit`
* Install the dependencies in dev mode `pip install -e ".[torch,st,diffusers,test,quality]"`
* If you develop on AWS Inferentia2, install with `pip install -e ".[test,quality]" optimum-neuron[neuronx] --upgrade`
* If you develop on Google Cloud, install with `pip install -e ".[torch,st,diffusers,google,test,quality]"`
* Unit Testing: `make unit-test`
* Integration testing: `make integ-test`


### Local run

```bash
@@ -27,22 +26,22 @@ HF_MODEL_ID=hf-internal-testing/tiny-random-distilbert HF_MODEL_DIR=tmp2 HF_TASK

### Container


1. Build the preferred container for either CPU or GPU for PyTorch.

_CPU Images_

```bash
make inference-pytorch-cpu
```

_GPU Images_

```bash
make inference-pytorch-gpu
```

2. Run the container, and either provide environment variables pointing to the Hub model you want to use or mount a volume with your model into the container.


```bash
docker run -ti -p 5000:5000 -e HF_MODEL_ID=distilbert-base-uncased-distilled-squad -e HF_TASK=question-answering integration-test-pytorch:cpu
docker run -ti -p 5000:5000 --gpus all -e HF_MODEL_ID=nlpconnect/vit-gpt2-image-captioning -e HF_TASK=image-to-text integration-test-pytorch:gpu
@@ -51,43 +50,44 @@ docker run -ti -p 5000:5000 --gpus all -e HF_MODEL_ID=stabilityai/stable-diffusi
docker run -ti -p 5000:5000 -e HF_MODEL_DIR=/repository -v $(pwd)/distilbert-base-uncased-emotion:/repository integration-test-pytorch:cpu
```


3. Send a request. The API schema is the same as the [Inference API](https://huggingface.co/docs/api-inference/detailed_parameters)

```bash
curl --request POST \
--url http://localhost:5000 \
--header 'Content-Type: application/json' \
--data '{
"inputs": {
"question": "What is used for inference?",
"context": "My Name is Philipp and I live in Nuremberg. This model is used with sagemaker for inference."
}
"inputs": {
"question": "What is used for inference?",
"context": "My Name is Philipp and I live in Nuremberg. This model is used with sagemaker for inference."
}
}'
```

### Custom Handler and dependency support

The Hugging Face Inference Toolkit allows users to provide custom inference code through a `handler.py` file located in the model repository.

For an example, check [philschmid/custom-pipeline-text-classification](https://huggingface.co/philschmid/custom-pipeline-text-classification):

```bash
model.tar.gz/
|- pytorch_model.bin
|- ....
|- handler.py
|- requirements.txt
```

In this example, `pytorch_model.bin` is the model file saved from training, `handler.py` is the custom inference handler, and `requirements.txt` is a requirements file to add additional dependencies.
The custom module can override the following methods:
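As an illustration (not part of this PR), a minimal `handler.py` could look like the sketch below. The `EndpointHandler` class name matches the default `HF_MODULE_NAME` in `const.py`, while the `__init__`/`__call__` signatures and the `text-classification` task are assumptions for the example.

```python
from typing import Any, Dict, List

from transformers import pipeline


class EndpointHandler:
    def __init__(self, path: str = "") -> None:
        # Called once at startup; `path` is the directory containing the model weights (HF_MODEL_DIR).
        self.pipeline = pipeline("text-classification", model=path)

    def __call__(self, data: Dict[str, Any]) -> List[Dict[str, Any]]:
        # `data` is the deserialized request body; "inputs" mirrors the Inference API schema.
        inputs = data.get("inputs", data)
        return self.pipeline(inputs)
```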


### Vertex AI Support

The Hugging Face Inference Toolkit is also supported on Vertex AI, based on [Custom container requirements for prediction](https://cloud.google.com/vertex-ai/docs/predictions/custom-container-requirements). [Environment variables set by Vertex AI](https://cloud.google.com/vertex-ai/docs/predictions/custom-container-requirements#aip-variables) are automatically detected and used by the toolkit.

#### Local run with HF_MODEL_ID and HF_TASK

Start Hugging Face Inference Toolkit with the following environment variables.

```bash
mkdir tmp2/
@@ -101,8 +101,8 @@ curl --request POST \
--url http://localhost:8080/pred \
--header 'Content-Type: application/json' \
--data '{
"instances": ["I love this product", "I hate this product"],
"parameters": { "top_k": 2 }
"instances": ["I love this product", "I hate this product"],
"parameters": { "top_k": 2 }
}'
```

@@ -124,35 +124,39 @@ docker run -ti -p 8080:8080 -e AIP_MODE=PREDICTION -e AIP_HTTP_PORT=8080 -e AIP_

```bash
curl --request POST \
--url http://localhost:8080/pred \
--header 'Content-Type: application/json' \
--data '{
"instances": ["I love this product", "I hate this product"],
"parameters": { "top_k": 2 }
}'
```

### AWS Inferentia2 Support

The Hugging Face Inference Toolkit provides support for deploying Hugging Face models on AWS Inferentia2. To deploy a model on Inferentia2 you have three options:

* Provide `HF_MODEL_ID`, the repo id of a model on huggingface.co that contains the compiled model in `.neuron` format, e.g. `optimum/bge-base-en-v1.5-neuronx`
* Provide the `HF_OPTIMUM_BATCH_SIZE` and `HF_OPTIMUM_SEQUENCE_LENGTH` environment variables to compile the model on the fly, e.g. `HF_OPTIMUM_BATCH_SIZE=1 HF_OPTIMUM_SEQUENCE_LENGTH=128`
* Include a `neuron` dictionary in the [config.json](https://huggingface.co/optimum/tiny_random_bert_neuron/blob/main/config.json) file of the model archive, e.g. `neuron: {"static_batch_size": 1, "static_sequence_length": 128}`

The currently supported tasks can be found [here](https://huggingface.co/docs/optimum-neuron/en/package_reference/supported_models). If you plan to deploy an LLM, we recommend taking a look at [Neuronx TGI](https://huggingface.co/blog/text-generation-inference-on-inferentia2), which is purpose-built for LLMs.

#### Local run with HF_MODEL_ID and HF_TASK

Start Hugging Face Inference Toolkit with the following environment variables.

_Note: You need to run this on an Inferentia2 instance._

* transformers `text-classification` with `HF_OPTIMUM_BATCH_SIZE` and `HF_OPTIMUM_SEQUENCE_LENGTH`

```bash
mkdir tmp2/
HF_MODEL_ID="distilbert/distilbert-base-uncased-finetuned-sst-2-english" HF_TASK="text-classification" HF_OPTIMUM_BATCH_SIZE=1 HF_OPTIMUM_SEQUENCE_LENGTH=128 HF_MODEL_DIR=tmp2 uvicorn src.huggingface_inference_toolkit.webservice_starlette:app --port 5000
```

* sentence transformers `feature-extraction` with `HF_OPTIMUM_BATCH_SIZE` and `HF_OPTIMUM_SEQUENCE_LENGTH`

```bash
HF_MODEL_ID="sentence-transformers/all-MiniLM-L6-v2" HF_TASK="feature-extraction" HF_OPTIMUM_BATCH_SIZE=1 HF_OPTIMUM_SEQUENCE_LENGTH=128 HF_MODEL_DIR=tmp2 uvicorn src.huggingface_inference_toolkit.webservice_starlette:app --port 5000
```
@@ -161,16 +165,15 @@ Send request

```bash
curl --request POST \
--url http://localhost:5000 \
--header 'Content-Type: application/json' \
--data '{
"inputs": "Wow, this is such a great product. I love it!"
}'
```

#### Container run with HF_MODEL_ID and HF_TASK


1. Build the preferred container for either CPU or GPU for PyTorch.

```bash
@@ -187,26 +190,25 @@ docker run -ti -p 5000:5000 -e HF_MODEL_ID="distilbert/distilbert-base-uncased-f

```bash
curl --request POST \
--url http://localhost:5000 \
--header 'Content-Type: application/json' \
--data '{
"inputs": "Wow, this is such a great product. I love it!",
"parameters": { "top_k": 2 }
}'
```


---

## 🛠️ Environment variables

The Hugging Face Inference Toolkit implements various additional environment variables to simplify your deployment experience. A full list of environment variables is given below. All potential environment variables can be found in [const.py](src/huggingface_inference_toolkit/const.py)

### `HF_MODEL_DIR`

The `HF_MODEL_DIR` environment variable defines the directory where your model is stored or will be stored.
If `HF_MODEL_ID` is not set, the toolkit expects the model artifact in this directory. Set this value to the path where you mount your model artifacts.
If `HF_MODEL_ID` is set and the directory `HF_MODEL_DIR` points to is empty, the toolkit will download the model from the Hub into this directory.

The default value is `/opt/huggingface/model`

@@ -246,6 +248,14 @@ The `HF_HUB_TOKEN` environment variable defines your Hugging Face authorizat
HF_HUB_TOKEN="api_XXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
```

### `HF_TRUST_REMOTE_CODE`

The `HF_TRUST_REMOTE_CODE` environment variable defines whether to trust remote code. This flag is used for community-defined inference code, so it reflects the level of confidence you place in the model providers when loading models from the Hugging Face Hub. The default value is `"0"`; set it to `"1"` to trust remote code.

```bash
HF_TRUST_REMOTE_CODE="0"
```
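For orientation only: the flag is parsed with the `strtobool` helper added in `const.py` (see the diff further down). Below is a rough sketch of how the resulting boolean could be forwarded to a Transformers pipeline; the exact call site inside the toolkit may differ.

```python
import os

from transformers import pipeline

# Mirrors const.py: parse the flag once from the environment (default: do not trust remote code).
TRUST_REMOTE_CODE = os.environ.get("HF_TRUST_REMOTE_CODE", "0").lower() in ("1", "true", "yes", "y", "t", "on")

clf = pipeline(
    task="text-classification",
    model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    trust_remote_code=TRUST_REMOTE_CODE,  # only enable this for repositories you trust
)
print(clf("Wow, this is such a great product. I love it!"))
```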

### `HF_FRAMEWORK`

The `HF_FRAMEWORK` environment variable defines the base deep learning framework used in the container. This is important when loading large models from the Hugging Face Hub to avoid extra file downloads.

@@ -256,28 +266,28 @@
Expand All @@ -256,28 +266,28 @@ HF_FRAMEWORK="pytorch"

#### `HF_OPTIMUM_BATCH_SIZE`

The `HF_OPTIMUM_BATCH_SIZE` environment variable defines the batch size used when compiling the model for Neuron. The default value is `1`. It is not required when the model is already converted.

```bash
HF_OPTIMUM_BATCH_SIZE="1"
```

#### `HF_OPTIMUM_SEQUENCE_LENGTH`

The `HF_OPTIMUM_SEQUENCE_LENGTH` environment variable defines the sequence length used when compiling the model for Neuron. There is no default value. It is not required when the model is already converted.

```bash
HF_OPTIMUM_SEQUENCE_LENGTH="128"
```

---

## ⚙ Supported Front-Ends

* [x] Starlette (HF Endpoints)
* [x] Starlette (Vertex AI)
* [ ] Starlette (Azure ML)
* [ ] Starlette (SageMaker)

---

@@ -287,6 +297,6 @@ HF_OPTIMUM_SEQUENCE_LENGTH="128"

## 📜 License

TBD.

---
setup.py: 4 changes (2 additions, 2 deletions)
@@ -5,12 +5,12 @@
# We don't declare our dependency on transformers here because we build with
# different packages for different variants

VERSION = "0.4.1.dev0"
VERSION = "0.4.2"

# Ubuntu packages
# libsndfile1-dev: torchaudio requires the development version of the libsndfile package which can be installed via a system package manager. On Ubuntu it can be installed as follows: apt install libsndfile1-dev
# ffmpeg: ffmpeg is required for audio processing. On Ubuntu it can be installed as follows: apt install ffmpeg
# libavcodec-extra : libavcodec-extra includes additional codecs for ffmpeg

install_requires = [
"transformers[sklearn,sentencepiece,audio,vision]==4.41.1",
src/huggingface_inference_toolkit/const.py: 7 changes (6 additions, 1 deletion)
@@ -1,13 +1,18 @@
import os
from pathlib import Path

from huggingface_inference_toolkit.env_utils import strtobool

HF_MODEL_DIR = os.environ.get("HF_MODEL_DIR", "/opt/huggingface/model")
HF_MODEL_ID = os.environ.get("HF_MODEL_ID", None)
HF_TASK = os.environ.get("HF_TASK", None)
HF_FRAMEWORK = os.environ.get("HF_FRAMEWORK", None)
HF_REVISION = os.environ.get("HF_REVISION", None)
HF_HUB_TOKEN = os.environ.get("HF_HUB_TOKEN", None)
HF_TRUST_REMOTE_CODE = strtobool(os.environ.get("HF_TRUST_REMOTE_CODE", "0"))
# custom handler consts
HF_DEFAULT_PIPELINE_NAME = os.environ.get("HF_DEFAULT_PIPELINE_NAME", "handler.py")
# default is pipeline.PreTrainedPipeline
HF_MODULE_NAME = os.environ.get("HF_MODULE_NAME", f"{Path(HF_DEFAULT_PIPELINE_NAME).stem}.EndpointHandler")
HF_MODULE_NAME = os.environ.get(
"HF_MODULE_NAME", f"{Path(HF_DEFAULT_PIPELINE_NAME).stem}.EndpointHandler"
)
src/huggingface_inference_toolkit/diffusers_utils.py: 9 changes (6 additions, 3 deletions)
@@ -1,4 +1,5 @@
import importlib.util
from typing import Union

from transformers.utils.import_utils import is_torch_bf16_gpu_available

@@ -21,14 +22,16 @@ def is_diffusers_available():


class IEAutoPipelineForText2Image:
def __init__(self, model_dir: str, device: str = None): # needs "cuda" for GPU
def __init__(
self, model_dir: str, device: Union[str, None] = None, **kwargs
): # needs "cuda" for GPU
dtype = torch.float32
if device == "cuda":
dtype = torch.bfloat16 if is_torch_bf16_gpu_available() else torch.float16
device_map = "auto" if device == "cuda" else None

self.pipeline = AutoPipelineForText2Image.from_pretrained(
model_dir, torch_dtype=dtype, device_map=device_map
model_dir, torch_dtype=dtype, device_map=device_map, **kwargs
)
# try to use DPMSolverMultistepScheduler
if isinstance(self.pipeline, StableDiffusionPipeline):
@@ -66,5 +69,5 @@ def __call__(
def get_diffusers_pipeline(task=None, model_dir=None, device=-1, **kwargs):
"""Get a pipeline for Diffusers models."""
device = "cuda" if device == 0 else "cpu"
pipeline = DIFFUSERS_TASKS[task](model_dir=model_dir, device=device)
pipeline = DIFFUSERS_TASKS[task](model_dir=model_dir, device=device, **kwargs)
return pipeline
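With the new `**kwargs` forwarding, loading options such as `trust_remote_code` can flow from `get_diffusers_pipeline` down to `AutoPipelineForText2Image.from_pretrained`. A rough usage sketch; the task key, model id, and device choice are assumptions for illustration:

```python
from huggingface_inference_toolkit.diffusers_utils import get_diffusers_pipeline

# Assumes a GPU container, where the toolkit maps device index 0 to "cuda".
pipe = get_diffusers_pipeline(
    task="text-to-image",
    model_dir="stabilityai/stable-diffusion-xl-base-1.0",
    device=0,
    trust_remote_code=False,  # forwarded via **kwargs to from_pretrained
)
```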
src/huggingface_inference_toolkit/env_utils.py: 22 changes (22 additions, 0 deletions)
@@ -0,0 +1,22 @@
def strtobool(val: str) -> bool:
"""Convert a string representation of truth to True or False booleans.
True values are 'y', 'yes', 't', 'true', 'on', and '1'; false values
are 'n', 'no', 'f', 'false', 'off', and '0'.

Raises:
ValueError: if 'val' is anything else.

Note:
Function `strtobool` copied and adapted from `distutils`, as it's deprecated from Python 3.10 onwards.

References:
- https://github.com/python/cpython/blob/48f9d3e3faec5faaa4f7c9849fecd27eae4da213/Lib/distutils/util.py#L308-L321
"""
val = val.lower()
if val in ("y", "yes", "t", "true", "on", "1"):
return True
if val in ("n", "no", "f", "false", "off", "0"):
return False
raise ValueError(
f"Invalid truth value, it should be a string but {val} was provided instead."
)
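
A quick usage sketch (not part of the diff) showing how the helper behaves:

```python
from huggingface_inference_toolkit.env_utils import strtobool

assert strtobool("1") is True      # truthy spellings: y, yes, t, true, on, 1
assert strtobool("off") is False   # falsy spellings: n, no, f, false, off, 0

try:
    strtobool("maybe")
except ValueError as exc:
    print(exc)  # Invalid truth value, it should be a string but maybe was provided instead.
```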