# FasterTransformer Backend

This repository hosts the Triton backend for [FasterTransformer](https://github.com/NVIDIA/FasterTransformer). It provides a script and recipe to run the highly optimized transformer-based encoder and decoder components, and it is tested and maintained by NVIDIA. Since FasterTransformer v4.0, the library supports multi-GPU inference on the GPT-3 model, and this backend integrates FasterTransformer into Triton so that giant models such as GPT-3 can be served by Triton. In the example below, we show how to use the FasterTransformer backend in Triton to run inference on a GPT-3 model with 345M parameters trained with [Megatron-LM](https://github.com/NVIDIA/Megatron-LM). The latest release also supports multi-node multi-GPU inference on the Hugging Face T5 model.

Note that this is a research and prototyping tool, not a formal product or maintained framework. Users can learn more about Triton backends in the [backend repo](https://github.com/triton-inference-server/backend). Ask questions or report problems on the [issues page](https://github.com/triton-inference-server/fastertransformer_backend/issues) of this fastertransformer_backend repo.

The FasterTransformer backend integrates FasterTransformer into Triton, leveraging the efficiency of FasterTransformer and the serving capabilities of Triton. To run the GPT-3 model, we need to solve two issues: 1. How do we run the auto-regressive model? 2. How do we run the model with multiple GPUs and multiple nodes?

```bash
git clone https://github.com/triton-inference-server/server.git # We need some tools when we test this backend
git clone https://github.com/NVIDIA/FasterTransformer.git       # Used to convert the checkpoint and the Triton output
ln -s server/qa/common .
cd fastertransformer_backend

docker build --rm \
    --build-arg TRITON_VERSION=${CONTAINER_VERSION} \
    -t ${TRITON_DOCKER_IMAGE} \
```
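
For reference, here is a minimal sketch of how the two build variables above might be set before invoking `docker build`; the container version and image tag are placeholders, not values prescribed by this repository:

```bash
# Placeholder values: pick the Triton container release that matches your environment.
export CONTAINER_VERSION=22.07
export TRITON_DOCKER_IMAGE=triton_with_ft:${CONTAINER_VERSION}

# Once the image is built, it can be started with GPU access, for example:
docker run -it --rm --gpus=all ${TRITON_DOCKER_IMAGE} bash
```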

If you encounter timeouts or hangs, please first check the topology and try to use a DGX V100 or DGX A100 with NVLink connections.
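
To inspect the GPU interconnect topology, one common check (an illustration, not a step from this README) is:

```bash
# Prints the GPU-to-GPU link matrix (NVLink, PCIe, NUMA affinity).
nvidia-smi topo -m
```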

## Model-Parallelism and Triton-Multiple-Model-Instances

We use MPI to start single-node/multi-node servers, where:

- N: number of MPI processes / number of nodes
- T: tensor parallel size. Default is 1
- P: pipeline parallel size. Default is 1

`total number of GPUs = num_gpus_per_node x N = T x P`

Multiple model instances on the same GPUs share the weights, so no redundant weight memory is allocated.

### Run inter-node (T x P > GPUs per node) models

- `total number of GPUs = num_gpus_per_node x N = T x P`.
- Only a single model instance is supported; see the launch sketch below.
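
As a concrete illustration (not a command taken from this README), an inter-node launch with T = 8 and P = 2 on two 8-GPU nodes could look roughly like this; the hostfile and model repository path are assumptions:

```bash
# N = 2 MPI processes (one per node); total GPUs = 8 GPUs/node x 2 nodes = 16 = T x P = 8 x 2.
mpirun -n 2 --hostfile hostfile --allow-run-as-root \
    /opt/tritonserver/bin/tritonserver \
    --model-repository=/workspace/triton-model-store
```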

### Run intra-node (T x P <= GPUs per node) models

- `total number of visible GPUs must be evenly divisible by T x P`. Note that you can control this by setting `CUDA_VISIBLE_DEVICES` (see the launch sketch at the end of this subsection).
- `total number of visible GPUs must be <= T x P x instance count`. This avoids unnecessary CUDA memory allocation on unused GPUs.
- Multiple model instances can run on the same GPU groups or on different GPU groups.

The backend first tries to assign different GPU groups to different model instances. If no empty GPU group is left, multiple model instances are assigned to the same GPU groups.

For example, with 8 GPUs and 8 model instances (T = 2, P = 1), the model instances are distributed to GPU groups [0, 1], [2, 3], [4, 5], [6, 7], [0, 1], [2, 3], [4, 5], [6, 7].

- Weights are shared among model instances in the same GPU group. In the example above, instance 0 and instance 4 share the same weights, and the other pairs behave the same way.
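
The sketch below shows one way to combine `CUDA_VISIBLE_DEVICES` with this distribution rule; the server binary location and model repository path are assumptions:

```bash
# Expose 4 GPUs; with T = 2, P = 1 and count = 2 in config.pbtxt, the backend places
# instance 0 on GPUs [0, 1] and instance 1 on GPUs [2, 3] (4 is evenly divisible by T x P = 2).
CUDA_VISIBLE_DEVICES=0,1,2,3 mpirun -n 1 --allow-run-as-root \
    /opt/tritonserver/bin/tritonserver \
    --model-repository=/workspace/triton-model-store
```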

### Specify Multiple Model Instances

Set `count` here to start multiple model instances. Note that `KIND_CPU` is the only valid choice here, because the backend needs to take full control of how the model instances are distributed across all visible GPUs.

```json
instance_group [
  {
    count: 8
    kind: KIND_CPU
  }
]
```

### Multi-Node Inference

We currently do not support the case where different nodes have different numbers of GPUs.

We start one MPI process per node. If you need to run on three nodes, launch three MPI processes, one per node.

Remember to change `tensor_para_size` and `pipeline_para_size` if you run on multiple nodes.

We suggest setting `tensor_para_size` to the number of GPUs in one node (e.g. 8 for DGX A100) and `pipeline_para_size` to the number of nodes (e.g. 2 for two nodes). Other model configuration in `config.pbtxt` should be modified as usual.
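
For example, a two-node DGX A100 setup (T = 8, P = 2) could carry parameters along the following lines in `config.pbtxt`; this is a minimal sketch, assuming the backend reads these as string-valued Triton model parameters, with illustrative values that should be checked against the sample configs shipped with the backend:

```pbtxt
# Illustrative values for 2 nodes with 8 GPUs each; adjust to your cluster.
parameters {
  key: "tensor_para_size"
  value: { string_value: "8" }
}
parameters {
  key: "pipeline_para_size"
  value: { string_value: "2" }
}
```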
## Request examples
## Changelog

Aug 2022
- Support for interactive generation

July 2022
- Support shared context optimization in GPT model
- Support UL2

June 2022
- Support decoupled (streaming) mode
- Add demo of gRPC protocol
- Support BERT

May 2022
- Support GPT-NeoX
- Support optional input (Triton version must be after 22.05)

April 2022
- Support bfloat16 inference in GPT model
- Support Nemo Megatron T5 and Megatron-LM T5 model