Hugging Face Inference Toolkit is for serving 🤗 Transformers models in containers. This library provides default pre-processing, prediction, and post-processing for Transformers and Sentence Transformers models. It is also possible to define a custom `handler.py` for customization. The Toolkit is built to work with the [Hugging Face Hub](https://huggingface.co/models).
---
## 💻 Getting Started with Hugging Face Inference Toolkit
* Clone the repository: `git clone https://github.com/huggingface/huggingface-inference-toolkit`
* Install the dependencies in dev mode: `pip install -e ".[torch, st, diffusers, test,quality]"`
* If you develop on AWS Inferentia2, install with `pip install -e ".[test,quality]" optimum-neuron[neuronx] --upgrade`
* Unit testing: `make unit-test`
* Integration testing: `make integ-test`
### Local run
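A minimal sketch of a local run; the uvicorn entry point, the port, and the example model id are assumptions rather than values confirmed by this README, so check the repository for the exact command:

```bash
# Assumptions: entry point module path, port, and example model id.
# HF_MODEL_ID selects the Hub model, HF_TASK the pipeline task.
HF_MODEL_ID="distilbert-base-uncased-finetuned-sst-2-english" \
HF_TASK="text-classification" \
uvicorn src.huggingface_inference_toolkit.webservice_starlette:app --port 5000
```

Once the server is up, you can send it a request from a second terminal using the `curl` pattern shown later in this README.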
### Custom Handler and dependency support
The Hugging Face Inference Toolkit allows users to provide custom inference logic through a `handler.py` file located in the model repository. For an example, check [https://huggingface.co/philschmid/custom-pipeline-text-classification](https://huggingface.co/philschmid/custom-pipeline-text-classification):
```bash
model.tar.gz/
|- pytorch_model.bin
|- ....
|- handler.py
|- requirements.txt
```
In this example, `pytorch_model.bin` is the model file saved from training, `handler.py` is the custom inference handler, and `requirements.txt` is a requirements file to add additional dependencies.
The custom module can override the following methods:
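As a rough illustration, a custom handler typically follows the `EndpointHandler` convention used for custom handlers on the Hugging Face Hub: an `__init__` that loads the model from the model directory and a `__call__` that runs inference. The class and method names below are assumptions, so check the linked example repository for the exact interface.

```python
# handler.py: minimal sketch of a custom handler. The class/method names follow
# the Hub custom-handler convention and are assumptions, not confirmed by this README.
from typing import Any, Dict, List

from transformers import pipeline


class EndpointHandler:
    def __init__(self, model_dir: str = "") -> None:
        # Load the model once at startup; `model_dir` points to the unpacked model files.
        self.pipeline = pipeline("text-classification", model=model_dir)

    def __call__(self, data: Dict[str, Any]) -> List[Dict[str, Any]]:
        # `data` is the deserialized request payload, e.g. {"inputs": "I love it!"}.
        inputs = data.get("inputs", data)
        return self.pipeline(inputs)
```

Any extra packages the handler needs go into the `requirements.txt` shown above.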
### Vertex AI Support
The Hugging Face Inference Toolkit is also supported on Vertex AI, based on [Custom container requirements for prediction](https://cloud.google.com/vertex-ai/docs/predictions/custom-container-requirements). [Environment variables set by Vertex AI](https://cloud.google.com/vertex-ai/docs/predictions/custom-container-requirements#aip-variables) are automatically detected and used by the toolkit.
### AWS Inferentia2 Support
The Hugging Face Inference Toolkit provides support for deploying Hugging Face models on AWS Inferentia2. To deploy a model on Inferentia2, you have three options:

* Provide `HF_MODEL_ID`, the repo id of a model on huggingface.co that already contains a compiled model in `.neuron` format, e.g. `optimum/bge-base-en-v1.5-neuronx`
* Provide the `HF_OPTIMUM_BATCH_SIZE` and `HF_OPTIMUM_SEQUENCE_LENGTH` environment variables to compile the model on the fly, e.g. `HF_OPTIMUM_BATCH_SIZE=1 HF_OPTIMUM_SEQUENCE_LENGTH=128`
* Include a `neuron` dictionary in the [config.json](https://huggingface.co/optimum/tiny_random_bert_neuron/blob/main/config.json) file in the model archive, e.g. `neuron: {"static_batch_size": 1, "static_sequence_length": 128}`

The currently supported tasks can be found [here](https://huggingface.co/docs/optimum-neuron/en/package_reference/supported_models). If you plan to deploy an LLM, we recommend taking a look at [Neuronx TGI](https://huggingface.co/blog/text-generation-inference-on-inferentia2), which is purpose-built for LLMs.
#### Local run with HF_MODEL_ID and HF_TASK
Start the Hugging Face Inference Toolkit with the following environment variables.

_Note: You need to run this on an Inferentia2 instance._

- transformers `text-classification` with `HF_OPTIMUM_BATCH_SIZE` and `HF_OPTIMUM_SEQUENCE_LENGTH`
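A sketch of such a run; as above, the uvicorn entry point and port are assumptions, and the model id is only an example:

```bash
# Compile on the fly by fixing the static shapes, then start the server.
HF_MODEL_ID="distilbert-base-uncased-finetuned-sst-2-english" \
HF_TASK="text-classification" \
HF_OPTIMUM_BATCH_SIZE=1 \
HF_OPTIMUM_SEQUENCE_LENGTH=128 \
uvicorn src.huggingface_inference_toolkit.webservice_starlette:app --port 5000
```

The request payload is the same as elsewhere in this README, e.g. `{"inputs": "Wow, this is such a great product. I love it!"}`.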
"inputs": "Wow, this is such a great product. I love it!"
168
+
}'
169
+
```
170
+
171
+
#### Container run with HF_MODEL_ID and HF_TASK
1. Build the PyTorch container for AWS Inferentia2:

```bash
make inference-pytorch-inf2
```
2. Run the container, either providing environment variables that point to the Hub model you want to use or mounting a volume into the container where your model is stored.
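A sketch of such a run; the image tag, port mapping, and Neuron device flag are assumptions that depend on the Makefile and your host setup:

```bash
# Replace the tag with whatever `make inference-pytorch-inf2` actually produces (assumption).
IMAGE="integration-test-pytorch:inf2"

docker run -ti \
  -p 5000:5000 \
  --device=/dev/neuron0 \
  -e HF_MODEL_ID="distilbert-base-uncased-finetuned-sst-2-english" \
  -e HF_TASK="text-classification" \
  -e HF_OPTIMUM_BATCH_SIZE=1 \
  -e HF_OPTIMUM_SEQUENCE_LENGTH=128 \
  "$IMAGE"

# In a second terminal, query the endpoint; "top_k" limits how many labels are returned.
curl --request POST \
  --url http://localhost:5000 \
  --header 'Content-Type: application/json' \
  --data '{ "inputs": "Wow, this is such a great product. I love it!", "parameters": { "top_k": 2 } }'
```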
"inputs": "Wow, this is such a great product. I love it!",
194
+
"parameters": { "top_k": 2 }
195
+
}'
196
+
```
197
+
112
198
113
199
---
The `HF_FRAMEWORK` environment variable defines the base deep learning framework.

```bash
HF_FRAMEWORK="pytorch"
```
#### `HF_ENDPOINT`

The `HF_ENDPOINT` environment variable indicates whether the service is running inside the Hugging Face Inference Endpoints service, so that the `logging` config can be adjusted accordingly.

```bash
HF_ENDPOINT="True"
```

#### `HF_OPTIMUM_BATCH_SIZE`

The `HF_OPTIMUM_BATCH_SIZE` environment variable defines the batch size used when compiling the model to Neuron. The default value is `1`. It is not required when the model is already converted.

```bash
HF_OPTIMUM_BATCH_SIZE="1"
```

#### `HF_OPTIMUM_SEQUENCE_LENGTH`

The `HF_OPTIMUM_SEQUENCE_LENGTH` environment variable defines the sequence length used when compiling the model to Neuron. There is no default value. It is not required when the model is already converted.

```bash
HF_OPTIMUM_SEQUENCE_LENGTH="128"
```

---
## ☑️ Supported & Tested Tasks
Below you'll find a list of supported and tested Transformers and Sentence Transformers tasks. Each of these is tested through integration tests. In addition to those tasks, you can always provide `custom`, which expects a `handler.py` file to be provided.

```bash
"text-classification",
"zero-shot-classification",
"ner",
"question-answering",
"fill-mask",
"summarization",
"translation_xx_to_yy",
"text2text-generation",
"text-generation",
"feature-extraction",
"image-classification",
"automatic-speech-recognition",
"audio-classification",
"object-detection",
"image-segmentation",
"table-question-answering",
"conversational",
"sentence-similarity",
"sentence-embeddings",
"sentence-ranking",
# TODO currently not supported due to multimodality input
# "visual-question-answering",
# "zero-shot-image-classification",
```
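These task names correspond to the `HF_TASK` values used throughout this README (an inference from the examples above, not an explicit guarantee), for example:

```bash
# Use one of the built-in tasks...
HF_TASK="text-classification"
# ...or fall back to a custom handler.py
HF_TASK="custom"
```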
---
## ⚙ Supported Frontend
- [ ] Starlette (SageMaker)
---
## 🤝 Contributing

### Development

* Recommended Python version: 3.11
* We recommend `pyenv` for easily switching between different Python versions
* There are two options for unit and integration tests:
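For local runs, the make targets from the Getting Started section cover both suites:

```bash
# Run the unit tests
make unit-test

# Run the integration tests
make integ-test
```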