update #85

Merged: 4 commits merged on Aug 20, 2024
54 changes: 22 additions & 32 deletions README.md
@@ -1,21 +1,18 @@

<div style="display:flex; text-align:center; justify-content:center;">
<img src="https://huggingface.co/front/assets/huggingface_logo.svg" width="100"/>
<h1 style="margin-top:auto;"> Hugging Face Inference Toolkit </h1>
</div>

Hugging Face Inference Toolkit is for serving 🤗 Transformers models in containers. This library provides default pre-processing, prediction and post-processing for Transformers and Sentence Transformers models. It is also possible to define a custom `handler.py` for customization. The Toolkit is built to work with the [Hugging Face Hub](https://huggingface.co/models).

---
Hugging Face Inference Toolkit is for serving 🤗 Transformers models in containers. This library provides default pre-processing, prediction and post-processing for Transformers and Sentence Transformers models. It is also possible to define a custom `handler.py` for customization. The Toolkit is built to work with the [Hugging Face Hub](https://huggingface.co/models) and is used as the default serving option in [Inference Endpoints](https://ui.endpoints.huggingface.co/).

## 💻 Getting Started with Hugging Face Inference Toolkit

* Clone the repository `git clone <https://github.com/huggingface/huggingface-inference-toolkit``>
* Install the dependencies in dev mode `pip install -e ".[torch,st,diffusers,test,quality]"`
* If you develop on AWS inferentia2 install with `pip install -e ".[test,quality]" optimum-neuron[neuronx] --upgrade`
* If you develop on Google Cloud install with `pip install -e ".[torch,st,diffusers,google,test,quality]"`
* Unit Testing: `make unit-test`
* Integration testing: `make integ-test`
- Clone the repository `git clone https://github.com/huggingface/huggingface-inference-toolkit`
- Install the dependencies in dev mode `pip install -e ".[torch,st,diffusers,test,quality]"`
- If you develop on AWS Inferentia2 install with `pip install -e ".[inf2,test,quality]" --upgrade`
- If you develop on Google Cloud install with `pip install -e ".[torch,st,diffusers,google,test,quality]"`
- Unit Testing: `make unit-test`
- Integration testing: `make integ-test`

### Local run

@@ -68,18 +65,18 @@ curl --request POST \

The Hugging Face Inference Toolkit allows users to provide custom inference logic through a `handler.py` file located in the model repository.

For an example check [philschmid/custom-pipeline-text-classification](https://huggingface.co/philschmid/custom-pipeline-text-classification):

```bash
model.tar.gz/
|- pytorch_model.bin
|- ....
|- handler.py
|- requirements.txt
```

In this example, `pytorch_model.bin` is the model file saved from training, `handler.py` is the custom inference handler, and `requirements.txt` is a requirements file to add additional dependencies.
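
For illustration, a minimal `handler.py` for this layout could look like the sketch below; the `EndpointHandler` class name and method signatures follow the common custom-handler convention and are assumptions here, not code taken from this repository.

```python
# Illustrative sketch of a custom handler.py; class name and signatures are assumptions.
from typing import Any, Dict, List

from transformers import pipeline


class EndpointHandler:
    def __init__(self, path: str = "") -> None:
        # Load the weights shipped in the archive (e.g. pytorch_model.bin) as a pipeline.
        self.pipeline = pipeline("text-classification", model=path)

    def __call__(self, data: Dict[str, Any]) -> List[Dict[str, Any]]:
        # "inputs" mirrors the request payload shape used by the default handlers.
        inputs = data.get("inputs", data)
        return self.pipeline(inputs)
```
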
The custom module can override the following methods:

### Vertex AI Support

@@ -136,9 +133,9 @@ curl --request POST \

The Hugging Face Inference Toolkit provides support for deploying Hugging Face models on AWS Inferentia2. To deploy a model on Inferentia2, you have three options:

- Provide `HF_MODEL_ID`, the model repo id on huggingface.co which contains the compiled model in `.neuron` format, e.g. `optimum/bge-base-en-v1.5-neuronx`
- Provide the `HF_OPTIMUM_BATCH_SIZE` and `HF_OPTIMUM_SEQUENCE_LENGTH` environment variables to compile the model on the fly, e.g. `HF_OPTIMUM_BATCH_SIZE=1 HF_OPTIMUM_SEQUENCE_LENGTH=128`
- Include a `neuron` dictionary in the [config.json](https://huggingface.co/optimum/tiny_random_bert_neuron/blob/main/config.json) file in the model archive, e.g. `neuron: {"static_batch_size": 1, "static_sequence_length": 128}` (see the sketch after this list)
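
For the third option, a small sketch of how the `neuron` entry could be added to a local `config.json` before packaging the archive; the helper below is illustrative only and not part of the toolkit.

```python
# Illustrative only: write the `neuron` compilation settings into config.json.
import json
from pathlib import Path

config_path = Path("model/config.json")  # assumed local copy of the model files
config = json.loads(config_path.read_text())
config["neuron"] = {"static_batch_size": 1, "static_sequence_length": 128}
config_path.write_text(json.dumps(config, indent=2) + "\n")
```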

The currently supported tasks can be found [here](https://huggingface.co/docs/optimum-neuron/en/package_reference/supported_models). If you plan to deploy an LLM, we recommend taking a look at [Neuronx TGI](https://huggingface.co/blog/text-generation-inference-on-inferentia2), which is purpose-built for LLMs.

@@ -148,14 +145,14 @@ Start Hugging Face Inference Toolkit with the following environment variables.

_Note: You need to run this on an Inferentia2 instance._

- transformers `text-classification` with `HF_OPTIMUM_BATCH_SIZE` and `HF_OPTIMUM_SEQUENCE_LENGTH`

```bash
mkdir tmp2/
HF_MODEL_ID="distilbert/distilbert-base-uncased-finetuned-sst-2-english" HF_TASK="text-classification" HF_OPTIMUM_BATCH_SIZE=1 HF_OPTIMUM_SEQUENCE_LENGTH=128 HF_MODEL_DIR=tmp2 uvicorn src.huggingface_inference_toolkit.webservice_starlette:app --port 5000
```
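
Once the server is running, you can send it a request. Below is a minimal client sketch using `requests`; the URL, route and payload shape are assumptions based on the curl examples elsewhere in this README.

```python
# Illustrative client; adjust URL and payload to your deployment.
import requests

response = requests.post(
    "http://localhost:5000",
    json={"inputs": "Wow, this inference toolkit makes serving models easy!"},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```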

- sentence transformers `feature-extraction` with `HF_OPTIMUM_BATCH_SIZE` and `HF_OPTIMUM_SEQUENCE_LENGTH`

```bash
HF_MODEL_ID="sentence-transformers/all-MiniLM-L6-v2" HF_TASK="feature-extraction" HF_OPTIMUM_BATCH_SIZE=1 HF_OPTIMUM_SEQUENCE_LENGTH=128 HF_MODEL_DIR=tmp2 uvicorn src.huggingface_inference_toolkit.webservice_starlette:app --port 5000
```

@@ -284,19 +281,12 @@ HF_OPTIMUM_SEQUENCE_LENGTH="128"

## ⚙ Supported Front-Ends

- [x] Starlette (HF Endpoints)
- [x] Starlette (Vertex AI)
- [ ] Starlette (Azure ML)
- [ ] Starlette (SageMaker)

---
## 📜 License

This project is licensed under the Apache-2.0 License.

## 🤝 Contributing

---

## 📜 License

TBD.

---
10 changes: 5 additions & 5 deletions pyproject.toml
@@ -4,15 +4,15 @@ no_implicit_optional = true
scripts_are_modules = true

[tool.ruff]
lint.select = [
select = [
"E", # pycodestyle errors
"W", # pycodestyle warnings
"F", # pyflakes
"I", # isort
"C", # flake8-comprehensions
"B", # flake8-bugbear
]
lint.ignore = [
ignore = [
"E501", # Line length (handled by ruff-format)
"B008", # do not perform function calls in argument defaults
"C901", # too complex
@@ -21,13 +21,13 @@ lint.ignore = [
line-length = 119

# Allow unused variables when underscore-prefixed.
lint.dummy-variable-rgx = "^(_+|(_+[a-zA-Z0-9_]*[a-zA-Z0-9]+?))$"
dummy-variable-rgx = "^(_+|(_+[a-zA-Z0-9_]*[a-zA-Z0-9]+?))$"

# Assume Python 3.11.
target-version = "py311"

lint.per-file-ignores = {"__init__.py" = ["F401"]}
per-file-ignores = { "__init__.py" = ["F401"] }

[tool.isort]
profile = "black"
known_third_party = ["transformers", "starlette", "huggingface_hub"]
1 change: 0 additions & 1 deletion setup.cfg
@@ -10,7 +10,6 @@ known_third_party =
datasets
tensorflow
torch
robyn

line_length = 119
lines_after_imports = 2
2 changes: 1 addition & 1 deletion setup.py
@@ -5,7 +5,7 @@
# We don't declare our dependency on transformers here because we build with
# different packages for different variants

VERSION = "0.4.3"
VERSION = "0.5.0"

# Ubuntu packages
# libsndfile1-dev: torchaudio requires the development version of the libsndfile package which can be installed via a system package manager. On Ubuntu it can be installed as follows: apt install libsndfile1-dev
2 changes: 1 addition & 1 deletion tests/integ/conftest.py
@@ -7,9 +7,9 @@
import docker
import pytest
import tenacity
from huggingface_inference_toolkit.utils import _load_repository_from_hf
from transformers.testing_utils import _run_slow_tests

from huggingface_inference_toolkit.utils import _load_repository_from_hf
from tests.integ.config import task2model

HF_HUB_CACHE = os.environ.get("HF_HUB_CACHE", "/home/ubuntu/.cache/huggingface/hub")
2 changes: 1 addition & 1 deletion tests/integ/helpers.py
@@ -8,9 +8,9 @@
import pytest
import requests
from docker import DockerClient
from huggingface_inference_toolkit.utils import _load_repository_from_hf
from transformers.testing_utils import _run_slow_tests, require_tf, require_torch

from huggingface_inference_toolkit.utils import _load_repository_from_hf
from tests.integ.config import task2input, task2model, task2output, task2validation

IS_GPU = _run_slow_tests
2 changes: 1 addition & 1 deletion tests/integ/test_pytorch_local_inf2.py
@@ -1,7 +1,7 @@
import pytest
from huggingface_inference_toolkit.optimum_utils import is_optimum_neuron_available
from transformers.testing_utils import require_torch

from huggingface_inference_toolkit.optimum_utils import is_optimum_neuron_available
from tests.integ.helpers import verify_task

require_inferentia = pytest.mark.skipif(
5 changes: 3 additions & 2 deletions tests/unit/test_diffusers.py
@@ -1,11 +1,12 @@
import logging
import tempfile

from huggingface_inference_toolkit.diffusers_utils import IEAutoPipelineForText2Image
from huggingface_inference_toolkit.utils import _load_repository_from_hf, get_pipeline
from PIL import Image
from transformers.testing_utils import require_torch, slow

from huggingface_inference_toolkit.diffusers_utils import IEAutoPipelineForText2Image
from huggingface_inference_toolkit.utils import _load_repository_from_hf, get_pipeline

logging.basicConfig(level="DEBUG")

@require_torch
3 changes: 2 additions & 1 deletion tests/unit/test_handler.py
@@ -1,6 +1,8 @@
import tempfile

import pytest
from transformers.testing_utils import require_tf, require_torch

from huggingface_inference_toolkit.handler import (
HuggingFaceHandler,
get_inference_handler_either_custom_or_default_handler,
@@ -9,7 +11,6 @@
_is_gpu_available,
_load_repository_from_hf,
)
from transformers.testing_utils import require_tf, require_torch

TASK = "text-classification"
MODEL = "hf-internal-testing/tiny-random-distilbert"
3 changes: 2 additions & 1 deletion tests/unit/test_optimum_utils.py
@@ -2,13 +2,14 @@
import tempfile

import pytest
from transformers.testing_utils import require_torch

from huggingface_inference_toolkit.optimum_utils import (
get_input_shapes,
get_optimum_neuron_pipeline,
is_optimum_neuron_available,
)
from huggingface_inference_toolkit.utils import _load_repository_from_hf
from transformers.testing_utils import require_torch

require_inferentia = pytest.mark.skipif(
not is_optimum_neuron_available(),
3 changes: 2 additions & 1 deletion tests/unit/test_sentence_transformers.py
@@ -1,5 +1,7 @@
import tempfile

from transformers.testing_utils import require_torch

from huggingface_inference_toolkit.sentence_transformers_utils import (
SentenceEmbeddingPipeline,
get_sentence_transformers_pipeline,
@@ -8,7 +10,6 @@
_load_repository_from_hf,
get_pipeline,
)
from transformers.testing_utils import require_torch


@require_torch
3 changes: 2 additions & 1 deletion tests/unit/test_serializer.py
@@ -2,9 +2,10 @@

import numpy as np
import pytest
from huggingface_inference_toolkit.serialization import Audioer, Imager, Jsoner
from PIL import Image

from huggingface_inference_toolkit.serialization import Audioer, Imager, Jsoner


def test_json_serialization():
t = {"res": np.array([2.0]), "text": "I like you.", "float": 1.2}
5 changes: 3 additions & 2 deletions tests/unit/test_utils.py
@@ -3,6 +3,9 @@
import tempfile
from pathlib import Path

from transformers.file_utils import is_torch_available
from transformers.testing_utils import require_tf, require_torch, slow

from huggingface_inference_toolkit.handler import get_inference_handler_either_custom_or_default_handler
from huggingface_inference_toolkit.utils import (
_get_framework,
@@ -11,8 +14,6 @@
check_and_register_custom_pipeline_from_directory,
get_pipeline,
)
from transformers.file_utils import is_torch_available
from transformers.testing_utils import require_tf, require_torch, slow

TASK_MODEL = "sshleifer/tiny-dbmdz-bert-large-cased-finetuned-conll03-english"

1 change: 1 addition & 0 deletions tests/unit/test_vertex_ai_utils.py
@@ -20,6 +20,7 @@ def _load_repository_from_gcs(artifact_uri: str, target_dir: Path) -> str:
import re

from google.cloud import storage

from huggingface_inference_toolkit.vertex_ai_utils import GCS_URI_PREFIX

if isinstance(target_dir, str):