Hugging Face Inference Toolkit is for serving 🤗 Transformers models in containers. This library provides default pre-processing, prediction, and post-processing for Transformers and Sentence Transformers models. It is also possible to define a custom `handler.py` for customization. The Toolkit is built to work with the [Hugging Face Hub](https://huggingface.co/models).
---
## 💻 Getting Started with Hugging Face Inference Toolkit
* Clone the repository: `git clone https://github.com/huggingface/huggingface-inference-toolkit`
* Install the dependencies in dev mode: `pip install -e ".[torch, st, diffusers, test,quality]"`
* If you develop on AWS Inferentia2, install with `pip install -e ".[test,quality]" optimum-neuron[neuronx] --upgrade`
* Unit testing: `make unit-test`
* Integration testing: `make integ-test`
### Local run
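A minimal sketch of a local run; the uvicorn entry point, the port, and the example model id are assumptions rather than values confirmed by this README, so check the repository for the exact command:

```bash
# Assumptions: entry point module path, port, and example model id.
# HF_MODEL_ID selects the Hub model, HF_TASK the pipeline task.
HF_MODEL_ID="distilbert-base-uncased-finetuned-sst-2-english" \
HF_TASK="text-classification" \
uvicorn src.huggingface_inference_toolkit.webservice_starlette:app --port 5000
```

Once the server is up, you can send it a request from a second terminal using the `curl` pattern shown later in this README.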
### Custom Handler and dependency support
The Hugging Face Inference Toolkit allows users to provide custom inference logic through a `handler.py` file located in the model repository. For an example, check [https://huggingface.co/philschmid/custom-pipeline-text-classification](https://huggingface.co/philschmid/custom-pipeline-text-classification):
```bash
model.tar.gz/
|- pytorch_model.bin
|- ....
|- handler.py
|- requirements.txt
```
In this example, `pytorch_model.bin` is the model file saved from training, `handler.py` is the custom inference handler, and `requirements.txt` is a requirements file to add additional dependencies.
The custom module can override the following methods:
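As a rough illustration, a custom handler typically follows the `EndpointHandler` convention used for custom handlers on the Hugging Face Hub: an `__init__` that loads the model from the model directory and a `__call__` that runs inference. The class and method names below are assumptions, so check the linked example repository for the exact interface.

```python
# handler.py: minimal sketch of a custom handler. The class/method names follow
# the Hub custom-handler convention and are assumptions, not confirmed by this README.
from typing import Any, Dict, List

from transformers import pipeline


class EndpointHandler:
    def __init__(self, model_dir: str = "") -> None:
        # Load the model once at startup; `model_dir` points to the unpacked model files.
        self.pipeline = pipeline("text-classification", model=model_dir)

    def __call__(self, data: Dict[str, Any]) -> List[Dict[str, Any]]:
        # `data` is the deserialized request payload, e.g. {"inputs": "I love it!"}.
        inputs = data.get("inputs", data)
        return self.pipeline(inputs)
```

Any extra packages the handler needs go into the `requirements.txt` shown above.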
### Vertex AI Support
The Hugging Face Inference Toolkit is also supported on Vertex AI, based on [Custom container requirements for prediction](https://cloud.google.com/vertex-ai/docs/predictions/custom-container-requirements). [Environment variables set by Vertex AI](https://cloud.google.com/vertex-ai/docs/predictions/custom-container-requirements#aip-variables) are automatically detected and used by the toolkit.
### AWS Inferentia2 Support
The Hugging Face Inference Toolkit provides support for deploying Hugging Face models on AWS Inferentia2. To deploy a model on Inferentia2, you have three options:

* Provide `HF_MODEL_ID`, the repo id of a model on huggingface.co that already contains a compiled model in `.neuron` format, e.g. `optimum/bge-base-en-v1.5-neuronx`
* Provide the `HF_OPTIMUM_BATCH_SIZE` and `HF_OPTIMUM_SEQUENCE_LENGTH` environment variables to compile the model on the fly, e.g. `HF_OPTIMUM_BATCH_SIZE=1 HF_OPTIMUM_SEQUENCE_LENGTH=128`
* Include a `neuron` dictionary in the [config.json](https://huggingface.co/optimum/tiny_random_bert_neuron/blob/main/config.json) file in the model archive, e.g. `neuron: {"static_batch_size": 1, "static_sequence_length": 128}`

The currently supported tasks can be found [here](https://huggingface.co/docs/optimum-neuron/en/package_reference/supported_models). If you plan to deploy an LLM, we recommend taking a look at [Neuronx TGI](https://huggingface.co/blog/text-generation-inference-on-inferentia2), which is purpose-built for LLMs.
#### Local run with HF_MODEL_ID and HF_TASK
Start the Hugging Face Inference Toolkit with the following environment variables.

_Note: You need to run this on an Inferentia2 instance._

- transformers `text-classification` with `HF_OPTIMUM_BATCH_SIZE` and `HF_OPTIMUM_SEQUENCE_LENGTH`
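A sketch of such a run; as above, the uvicorn entry point and port are assumptions, and the model id is only an example:

```bash
# Compile on the fly by fixing the static shapes, then start the server.
HF_MODEL_ID="distilbert-base-uncased-finetuned-sst-2-english" \
HF_TASK="text-classification" \
HF_OPTIMUM_BATCH_SIZE=1 \
HF_OPTIMUM_SEQUENCE_LENGTH=128 \
uvicorn src.huggingface_inference_toolkit.webservice_starlette:app --port 5000
```

The request payload is the same as elsewhere in this README, e.g. `{"inputs": "Wow, this is such a great product. I love it!"}`.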
"inputs": "Wow, this is such a great product. I love it!"
168
+
}'
169
+
```
170
+
171
+
#### Container run with HF_MODEL_ID and HF_TASK
1. Build the PyTorch container for AWS Inferentia2:

```bash
make inference-pytorch-inf2
```
2. Run the container, either providing environment variables that point to the Hub model you want to use or mounting a volume into the container where your model is stored.
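A sketch of such a run; the image tag, port mapping, and Neuron device flag are assumptions that depend on the Makefile and your host setup:

```bash
# Replace the tag with whatever `make inference-pytorch-inf2` actually produces (assumption).
IMAGE="integration-test-pytorch:inf2"

docker run -ti \
  -p 5000:5000 \
  --device=/dev/neuron0 \
  -e HF_MODEL_ID="distilbert-base-uncased-finetuned-sst-2-english" \
  -e HF_TASK="text-classification" \
  -e HF_OPTIMUM_BATCH_SIZE=1 \
  -e HF_OPTIMUM_SEQUENCE_LENGTH=128 \
  "$IMAGE"

# In a second terminal, query the endpoint; "top_k" limits how many labels are returned.
curl --request POST \
  --url http://localhost:5000 \
  --header 'Content-Type: application/json' \
  --data '{ "inputs": "Wow, this is such a great product. I love it!", "parameters": { "top_k": 2 } }'
```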
"inputs": "Wow, this is such a great product. I love it!",
194
+
"parameters": { "top_k": 2 }
195
+
}'
196
+
```
197
+
112
198
113
199
---
The `HF_FRAMEWORK` environment variable defines the base deep learning framework.

```bash
HF_FRAMEWORK="pytorch"
```
#### `HF_ENDPOINT`

The `HF_ENDPOINT` environment variable indicates whether the service is running inside the Hugging Face Inference Endpoints service, so that the `logging` config can be adjusted accordingly.

```bash
HF_ENDPOINT="True"
```

#### `HF_OPTIMUM_BATCH_SIZE`

The `HF_OPTIMUM_BATCH_SIZE` environment variable defines the batch size used when compiling the model to Neuron. The default value is `1`. It is not required when the model is already converted.

```bash
HF_OPTIMUM_BATCH_SIZE="1"
```

#### `HF_OPTIMUM_SEQUENCE_LENGTH`

The `HF_OPTIMUM_SEQUENCE_LENGTH` environment variable defines the sequence length used when compiling the model to Neuron. There is no default value. It is not required when the model is already converted.

```bash
HF_OPTIMUM_SEQUENCE_LENGTH="128"
```

---
## ☑️ Supported & Tested Tasks
Below you'll find a list of supported and tested Transformers and Sentence Transformers tasks. Each of these is tested through integration tests. In addition to those tasks, you can always provide `custom`, which expects a `handler.py` file to be provided.

```bash
"text-classification",
"zero-shot-classification",
"ner",
"question-answering",
"fill-mask",
"summarization",
"translation_xx_to_yy",
"text2text-generation",
"text-generation",
"feature-extraction",
"image-classification",
"automatic-speech-recognition",
"audio-classification",
"object-detection",
"image-segmentation",
"table-question-answering",
"conversational",
"sentence-similarity",
"sentence-embeddings",
"sentence-ranking",
# TODO currently not supported due to multimodality input
# "visual-question-answering",
# "zero-shot-image-classification",
```
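These task names correspond to the `HF_TASK` values used throughout this README (an inference from the examples above, not an explicit guarantee), for example:

```bash
# Use one of the built-in tasks...
HF_TASK="text-classification"
# ...or fall back to a custom handler.py
HF_TASK="custom"
```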
---
## ⚙ Supported Frontend
- [ ] Starlette (SageMaker)
---
## 🤝 Contributing

### Development

* Recommended Python version: 3.11
* We recommend `pyenv` for easily switching between different Python versions
* There are two options for unit and integration tests:
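For local runs, the make targets from the Getting Started section cover both suites:

```bash
# Run the unit tests
make unit-test

# Run the integration tests
make integ-test
```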