[Inf2] Add Optimum Neuron support for Encoder models #73

Merged: 20 commits, Jul 4, 2024
17 changes: 4 additions & 13 deletions .github/workflows/build-container.yaml
@@ -34,21 +34,12 @@ jobs:
TAILSCALE_AUTHKEY: ${{ secrets.TAILSCALE_AUTHKEY }}
REGISTRY_USERNAME: ${{ secrets.REGISTRY_USERNAME }}
REGISTRY_PASSWORD: ${{ secrets.REGISTRY_PASSWORD }}
starlette-tensorflow-cpu:
starlette-pytorch-inf2:
uses: ./.github/workflows/docker-build-action.yaml
with:
image: inference-tensorflow-cpu
dockerfile: dockerfiles/tensorflow/cpu/Dockerfile
image: inference-pytorch-inf2
dockerfile: dockerfiles/pytorch/Dockerfile.inf2
secrets:
TAILSCALE_AUTHKEY: ${{ secrets.TAILSCALE_AUTHKEY }}
REGISTRY_USERNAME: ${{ secrets.REGISTRY_USERNAME }}
REGISTRY_PASSWORD: ${{ secrets.REGISTRY_PASSWORD }}
starlette-tensorflow-gpu:
uses: ./.github/workflows/docker-build-action.yaml
with:
image: inference-tensorflow-gpu
dockerfile: dockerfiles/tensorflow/gpu/Dockerfile
secrets:
TAILSCALE_AUTHKEY: ${{ secrets.TAILSCALE_AUTHKEY }}
REGISTRY_USERNAME: ${{ secrets.REGISTRY_USERNAME }}
REGISTRY_PASSWORD: ${{ secrets.REGISTRY_PASSWORD }}
REGISTRY_PASSWORD: ${{ secrets.REGISTRY_PASSWORD }}
2 changes: 2 additions & 0 deletions .github/workflows/integration-test.yaml
@@ -22,6 +22,7 @@ jobs:
with:
test_path: "tests/integ/test_pytorch_local_gpu.py"
build_img_cmd: "make inference-pytorch-gpu"
test_parallelism: "1"
pytorch-integration-remote-gpu:
name: Remote Integration Tests - GPU
uses: ./.github/workflows/integration-test-action.yaml
@@ -41,4 +42,5 @@ jobs:
with:
test_path: "tests/integ/test_pytorch_local_cpu.py"
build_img_cmd: "make inference-pytorch-cpu"
test_parallelism: "1"
runs_on: "['ci']"
4 changes: 3 additions & 1 deletion .gitignore
@@ -179,4 +179,6 @@ model
tests/tmp
tmp/
act.sh
.act
.act
tmp*
log-*
152 changes: 95 additions & 57 deletions README.md
@@ -8,8 +8,16 @@
Hugging Face Inference Toolkit is for serving 🤗 Transformers models in containers. This library provides default pre-processing, prediction, and post-processing for Transformers and Sentence Transformers models. It is also possible to define a custom `handler.py` for customization. The Toolkit is built to work with the [Hugging Face Hub](https://huggingface.co/models).

---

## 💻 Getting Started with Hugging Face Inference Toolkit

* Clone the repository: `git clone https://github.com/huggingface/huggingface-inference-toolkit`
* Install the dependencies in dev mode: `pip install -e ".[torch,st,diffusers,test,quality]"`
* If you develop on AWS Inferentia2, install with `pip install -e ".[test,quality]" optimum-neuron[neuronx] --upgrade`
* Unit Testing: `make unit-test`
* Integration testing: `make integ-test`


### Local run

```bash
@@ -58,6 +66,21 @@ curl --request POST \
}'
```

### Custom Handler and dependency support

The Hugging Face Inference Toolkit allows users to provide a custom inference handler through a `handler.py` file which is located in the repository.
For an example, check [https://huggingface.co/philschmid/custom-pipeline-text-classification](https://huggingface.co/philschmid/custom-pipeline-text-classification):
```bash
model.tar.gz/
|- pytorch_model.bin
|- ....
|- handler.py
|- requirements.txt
```
In this example, `pytorch_model.bin` is the model file saved from training, `handler.py` is the custom inference handler, and `requirements.txt` is a requirements file to add additional dependencies.
The custom handler implements the model loading and inference logic; a minimal sketch is shown below.
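As a rough illustration, a minimal `handler.py` might look like the sketch below. It assumes the `EndpointHandler` convention used in the linked example (a class that loads the model once in `__init__` and runs inference in `__call__`); the class name, task, and signatures are illustrative rather than prescriptive.

```python
# handler.py -- minimal sketch of a custom inference handler (illustrative only)
from typing import Any, Dict, List

from transformers import pipeline


class EndpointHandler:
    def __init__(self, path: str = "") -> None:
        # `path` points to the directory containing the model weights (the unpacked repository).
        self.pipeline = pipeline("text-classification", model=path)

    def __call__(self, data: Dict[str, Any]) -> List[Dict[str, Any]]:
        # `data` is the parsed request payload, e.g. {"inputs": "...", "parameters": {...}}.
        inputs = data["inputs"]
        parameters = data.get("parameters", {})
        return self.pipeline(inputs, **parameters)
```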


### Vertex AI Support

The Hugging Face Inference Toolkit is also supported on Vertex AI, based on [Custom container requirements for prediction](https://cloud.google.com/vertex-ai/docs/predictions/custom-container-requirements). [Environment variables set by Vertex AI](https://cloud.google.com/vertex-ai/docs/predictions/custom-container-requirements#aip-variables) are automatically detected and used by the toolkit.
@@ -109,6 +132,69 @@ curl --request POST \
}'
```

### AWS Inferentia2 Support

The Hugging Face Inference Toolkit provides support for deploying Hugging Face models on AWS Inferentia2. To deploy a model on Inferentia2 you have 3 options:
* Provide `HF_MODEL_ID`, the model repository id on huggingface.co that contains the compiled model in `.neuron` format, e.g. `optimum/bge-base-en-v1.5-neuronx`
* Provide the `HF_OPTIMUM_BATCH_SIZE` and `HF_OPTIMUM_SEQUENCE_LENGTH` environment variables to compile the model on the fly, e.g. `HF_OPTIMUM_BATCH_SIZE=1 HF_OPTIMUM_SEQUENCE_LENGTH=128` (see the compilation sketch below)
* Include a `neuron` dictionary in the [config.json](https://huggingface.co/optimum/tiny_random_bert_neuron/blob/main/config.json) file in the model archive, e.g. `neuron: {"static_batch_size": 1, "static_sequence_length": 128}`

The currently supported tasks can be found [here](https://huggingface.co/docs/optimum-neuron/en/package_reference/supported_models). If you plan to deploy an LLM, we recommend taking a look at [Neuronx TGI](https://huggingface.co/blog/text-generation-inference-on-inferentia2), which is purpose-built for LLMs.
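For options 2 and 3 above, the compilation can also be done ahead of time with `optimum-neuron` and the resulting artifact uploaded to the Hub, which effectively turns it into option 1. The snippet below is a sketch under that assumption, run on an Inferentia2 instance with `optimum-neuron[neuronx]` installed; the model id, shapes, and output path are only examples.

```python
# Sketch: pre-compile an encoder model to Neuron format so it can later be served
# directly via HF_MODEL_ID, without on-the-fly compilation (illustrative only).
from optimum.neuron import NeuronModelForSequenceClassification

neuron_model = NeuronModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    export=True,          # compile the model for Inferentia2
    batch_size=1,         # corresponds to HF_OPTIMUM_BATCH_SIZE
    sequence_length=128,  # corresponds to HF_OPTIMUM_SEQUENCE_LENGTH
)
neuron_model.save_pretrained("my-neuron-model")  # upload or push_to_hub, then point HF_MODEL_ID at it
```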

#### Local run with HF_MODEL_ID and HF_TASK

Start the Hugging Face Inference Toolkit with the following environment variables.

_Note: You need to run this on an Inferentia2 instance._

- transformers `text-classification` with `HF_OPTIMUM_BATCH_SIZE` and `HF_OPTIMUM_SEQUENCE_LENGTH`
```bash
mkdir tmp2/
HF_MODEL_ID="distilbert/distilbert-base-uncased-finetuned-sst-2-english" HF_TASK="text-classification" HF_OPTIMUM_BATCH_SIZE=1 HF_OPTIMUM_SEQUENCE_LENGTH=128 HF_MODEL_DIR=tmp2 uvicorn src.huggingface_inference_toolkit.webservice_starlette:app --port 5000
```
- sentence transformers `feature-extraction` with `HF_OPTIMUM_BATCH_SIZE` and `HF_OPTIMUM_SEQUENCE_LENGTH`
```bash
HF_MODEL_ID="sentence-transformers/all-MiniLM-L6-v2" HF_TASK="feature-extraction" HF_OPTIMUM_BATCH_SIZE=1 HF_OPTIMUM_SEQUENCE_LENGTH=128 HF_MODEL_DIR=tmp2 uvicorn src.huggingface_inference_toolkit.webservice_starlette:app --port 5000
```

Send a request:

```bash
curl --request POST \
--url http://localhost:5000 \
--header 'Content-Type: application/json' \
--data '{
"inputs": "Wow, this is such a great product. I love it!"
}'
```

#### Container run with HF_MODEL_ID and HF_TASK


1. Build the Docker container for PyTorch on AWS Inferentia2.

```bash
make inference-pytorch-inf2
```

2. Run the container and provide either the environment variables for the Hub model you want to use or mount a volume into the container where your model is stored.

```bash
docker run -ti -p 5000:5000 -e HF_MODEL_ID="distilbert/distilbert-base-uncased-finetuned-sst-2-english" -e HF_TASK="text-classification" -e HF_OPTIMUM_BATCH_SIZE=1 -e HF_OPTIMUM_SEQUENCE_LENGTH=128 --device=/dev/neuron0 integration-test-pytorch:inf2
```

3. Send a request:

```bash
curl --request POST \
--url http://localhost:5000 \
--header 'Content-Type: application/json' \
--data '{
"inputs": "Wow, this is such a great product. I love it!",
"parameters": { "top_k": 2 }
}'
```


---

@@ -168,61 +254,23 @@ The `HF_FRAMEWORK` environment variable defines the base deep learning framework
HF_FRAMEWORK="pytorch"
```

### `HF_ENDPOINT`
#### `HF_OPTIMUM_BATCH_SIZE`

The `HF_ENDPOINT` environment variable indicates whether the service is run inside the HF Inference endpoint service to adjust the `logging` config.
The `HF_OPTIMUM_BATCH_SIZE` environment variable defines the batch size used when compiling the model to Neuron. The default value is `1`. It is not required when the model is already converted.

```bash
HF_ENDPOINT="True"
HF_OPTIMUM_BATCH_SIZE="1"
```

#### `HF_OPTIMUM_SEQUENCE_LENGTH`

---
The `HF_OPTIMUM_SEQUENCE_LENGTH` environment variable defines the sequence length used when compiling the model to Neuron. There is no default value. It is not required when the model is already converted.

## 🧑🏻‍💻 Custom Handler and dependency support

The Hugging Face Inference Toolkit allows users to provide a custom inference handler through a `handler.py` file which is located in the repository.
For an example, check [https://huggingface.co/philschmid/custom-pipeline-text-classification](https://huggingface.co/philschmid/custom-pipeline-text-classification):
```bash
model.tar.gz/
|- pytorch_model.bin
|- ....
|- handler.py
|- requirements.txt
HF_OPTIMUM_SEQUENCE_LENGTH="128"
```
In this example, `pytorch_model.bin` is the model file saved from training, `handler.py` is the custom inference handler, and `requirements.txt` is a requirements file to add additional dependencies.
The custom module can override the following methods:


## ☑️ Supported & Tested Tasks

Below you'll find a list of supported and tested transformers and sentence transformers tasks. Each of these is always tested through integration tests. In addition to those tasks, you can always provide `custom`, which expects a `handler.py` file to be provided.

```bash
"text-classification",
"zero-shot-classification",
"ner",
"question-answering",
"fill-mask",
"summarization",
"translation_xx_to_yy",
"text2text-generation",
"text-generation",
"feature-extraction",
"image-classification",
"automatic-speech-recognition",
"audio-classification",
"object-detection",
"image-segmentation",
"table-question-answering",
"conversational"
"sentence-similarity",
"sentence-embeddings",
"sentence-ranking",
# TODO currently not supported due to multimodality input
# "visual-question-answering",
# "zero-shot-image-classification",
```
---

## ⚙ Supported Frontend

@@ -232,21 +280,11 @@ Below you ll find a list of supported and tested transformers and sentence trans
- [ ] Starlette (SageMaker)

---
## 🤝 Contributing

### Development

* Recommended Python version: 3.11
* We recommend `pyenv` for easily switching between different Python versions
* There are two options for unit and integration tests:
* `Make` - see `makefile`

#### Testing with Make

* Unit Testing: `make unit-test`
* Integration testing: `make integ-test`
## 🤝 Contributing

---

## 📜 License

TBD.
2 changes: 1 addition & 1 deletion dockerfiles/pytorch/Dockerfile
@@ -51,4 +51,4 @@ ENTRYPOINT ["bash", "-c", "./entrypoint.sh"]
from base as vertex

# Install Vertex AI requiremented packages
RUN pip install --no-cache-dir google-cloud-storage
RUN pip install --no-cache-dir google-cloud-storage
122 changes: 122 additions & 0 deletions dockerfiles/pytorch/Dockerfile.inf2
@@ -0,0 +1,122 @@
# Build based on https://github.com/aws/deep-learning-containers/blob/master/huggingface/pytorch/inference/docker/2.1/py3/sdk2.18.0/Dockerfile.neuronx
FROM ubuntu:20.04
Review comment (Contributor): `FROM ubuntu:22.04` or `24.04`?

Reply (Author): It's what AWS uses; we should stick to the image they use. Using 22.04 led to errors in the past with Neuron.

LABEL maintainer="Hugging Face"

ARG PYTHON=python3.10
ARG PYTHON_VERSION=3.10.12
ARG MAMBA_VERSION=23.1.0-4

# Neuron SDK components version numbers
ARG NEURONX_FRAMEWORK_VERSION=2.1.2.2.1.0
ARG NEURONX_DISTRIBUTED_VERSION=0.7.0
ARG NEURONX_CC_VERSION=2.13.66.0
ARG NEURONX_TRANSFORMERS_VERSION=0.10.0.21
ARG NEURONX_COLLECTIVES_LIB_VERSION=2.20.22.0-c101c322e
ARG NEURONX_RUNTIME_LIB_VERSION=2.20.22.0-1b3ca6425
ARG NEURONX_TOOLS_VERSION=2.17.1.0

# HF ARGS
ARG OPTIMUM_NEURON_VERSION=0.0.23

# See http://bugs.python.org/issue19846
ENV LANG C.UTF-8
ENV LD_LIBRARY_PATH /opt/aws/neuron/lib:/lib/x86_64-linux-gnu:/opt/conda/lib/:$LD_LIBRARY_PATH
ENV PATH /opt/conda/bin:/opt/aws/neuron/bin:$PATH

RUN apt-get update \
&& apt-get upgrade -y \
&& apt-get install -y --no-install-recommends software-properties-common \
&& add-apt-repository ppa:openjdk-r/ppa \
&& apt-get update \
&& apt-get install -y --no-install-recommends \
build-essential \
apt-transport-https \
ca-certificates \
cmake \
curl \
emacs \
git \
jq \
libgl1-mesa-glx \
libsm6 \
libxext6 \
libxrender-dev \
openjdk-11-jdk \
vim \
wget \
unzip \
zlib1g-dev \
libcap-dev \
gpg-agent \
&& rm -rf /var/lib/apt/lists/* \
&& rm -rf /tmp/tmp* \
&& apt-get clean

RUN echo "deb https://apt.repos.neuron.amazonaws.com focal main" > /etc/apt/sources.list.d/neuron.list
RUN wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | apt-key add -

# Install Neuronx tools
RUN apt-get update \
&& apt-get install -y \
aws-neuronx-tools=$NEURONX_TOOLS_VERSION \
aws-neuronx-collectives=$NEURONX_COLLECTIVES_LIB_VERSION \
aws-neuronx-runtime-lib=$NEURONX_RUNTIME_LIB_VERSION \
&& rm -rf /var/lib/apt/lists/* \
&& rm -rf /tmp/tmp* \
&& apt-get clean

# https://github.com/docker-library/openjdk/issues/261 https://github.com/docker-library/openjdk/pull/263/files
RUN keytool -importkeystore -srckeystore /etc/ssl/certs/java/cacerts -destkeystore /etc/ssl/certs/java/cacerts.jks -deststoretype JKS -srcstorepass changeit -deststorepass changeit -noprompt; \
mv /etc/ssl/certs/java/cacerts.jks /etc/ssl/certs/java/cacerts; \
/var/lib/dpkg/info/ca-certificates-java.postinst configure;

RUN curl -L -o ~/mambaforge.sh https://github.com/conda-forge/miniforge/releases/download/${MAMBA_VERSION}/Mambaforge-${MAMBA_VERSION}-Linux-x86_64.sh \
&& chmod +x ~/mambaforge.sh \
&& ~/mambaforge.sh -b -p /opt/conda \
&& rm ~/mambaforge.sh \
&& /opt/conda/bin/conda update -y conda \
&& /opt/conda/bin/conda install -c conda-forge -y \
python=$PYTHON_VERSION \
pyopenssl \
cython \
mkl-include \
mkl \
botocore \
parso \
scipy \
typing \
# Below 2 are included in miniconda base, but not mamba so need to install
conda-content-trust \
charset-normalizer \
&& /opt/conda/bin/conda update -y conda \
&& /opt/conda/bin/conda clean -ya

RUN conda install -c conda-forge \
scikit-learn \
h5py \
requests \
&& conda clean -ya \
&& pip install --upgrade pip --trusted-host pypi.org --trusted-host files.pythonhosted.org \
&& ln -s /opt/conda/bin/pip /usr/local/bin/pip3 \
&& pip install --no-cache-dir "protobuf>=3.18.3,<4" setuptools==69.5.1 packaging

WORKDIR /

# install Hugging Face libraries and its dependencies
RUN pip install --extra-index-url https://pip.repos.neuron.amazonaws.com --no-cache-dir optimum-neuron[neuronx]==${OPTIMUM_NEURON_VERSION} \
&& pip install --no-deps --no-cache-dir -U torchvision==0.16.*


COPY . .
# upgrade pip and install the toolkit with the sentence-transformers extra
RUN pip install --no-cache-dir -U pip ".[st]"

# copy application
COPY src/huggingface_inference_toolkit huggingface_inference_toolkit
COPY src/huggingface_inference_toolkit/webservice_starlette.py webservice_starlette.py

# copy entrypoint and change permissions
COPY --chmod=0755 scripts/entrypoint.sh entrypoint.sh

ENTRYPOINT ["bash", "-c", "./entrypoint.sh"]