
Commit 05120bb

Update MII Inference Examples (#837)
1 parent ff9a023 commit 05120bb

15 files changed: +137 lines, -21 lines

inference/mii/README.md

Lines changed: 1 addition & 1 deletion
@@ -2,4 +2,4 @@
 
 Install the requirements by running `pip install -r requirements.txt`.
 
-Once [DeepSpeed-MII](https://github.com/microsoft/deepspeed-mii) is installed you have two options for deployment: an interactive non-persistent pipeline or a persistent serving deployment. For details on these files please refer to the [Getting Started guide for MII](https://github.com/microsoft/deepspeed-mii#getting-started-with-mii).
+Once [DeepSpeed-MII](https://github.com/microsoft/deepspeed-mii) is installed you have two options for deployment: an interactive non-persistent pipeline or a persistent serving deployment. See the scripts in [non-persistent](./non-persistent/) and [persistent](./persistent/) for examples. Details on the code implemented in these scripts can be found on our [Getting Started guide for MII](https://github.com/microsoft/deepspeed-mii#getting-started-with-mii).
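
The two deployment styles differ only in how the model is hosted. The following is a minimal sketch condensed from the `pipeline.py`, `serve.py`, and `client.py` scripts added in this commit (in practice you would use one style or the other; the model name and generation settings are just the defaults those scripts use):

```python
import mii

# Option 1: non-persistent pipeline -- loads the model in this process,
# generates, and releases it when the script exits.
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
responses = pipe(["DeepSpeed is"], max_new_tokens=128)
if pipe.is_rank_0:
    print(responses[0])

# Option 2: persistent deployment -- start a server once, then any process can
# connect with a client and send requests until the server is terminated.
mii.serve("mistralai/Mistral-7B-v0.1", tensor_parallel=1)
client = mii.client("mistralai/Mistral-7B-v0.1")
print(client(["Seattle is"], max_new_tokens=128)[0])
```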

inference/mii/client.py

Lines changed: 0 additions & 6 deletions
This file was deleted.

inference/mii/non-persistent/README.md

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
# Non-Persistent Pipeline Examples

The `pipeline.py` script can be used to run any of the [supported
models](https://github.com/microsoft/DeepSpeed-mii#supported-models). Provide
the HuggingFace model name, maximum generated tokens, and prompt(s). The
generated responses will be printed in the terminal:

```shell
$ python pipeline.py --model "mistralai/Mistral-7B-v0.1" --max-new-tokens 128 --prompts "DeepSpeed is" "Seattle is"
```

Tensor-parallelism can be controlled using the `deepspeed` launcher and setting
`--num_gpus`:

```shell
$ deepspeed --num_gpus 2 pipeline.py
```

## Model-Specific Examples

For convenience, we also provide a set of scripts to quickly test the MII
Pipeline with some popular text-generation models:

| Model | Launch command |
|-------|----------------|
| [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b) | `$ python llama2.py` |
| [tiiuae/falcon-7b](https://huggingface.co/tiiuae/falcon-7b) | `$ python falcon.py` |
| [mistralai/Mixtral-8x7B-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1) | `$ deepspeed --num_gpus 2 mixtral.py` |
inference/mii/non-persistent/falcon.py

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
import mii

pipe = mii.pipeline("tiiuae/falcon-7b")
responses = pipe("DeepSpeed is", max_new_tokens=128, return_full_text=True)
if pipe.is_rank_0:
    print(responses[0])
inference/mii/non-persistent/llama2.py

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
import mii

pipe = mii.pipeline("meta-llama/Llama-2-7b-hf")
responses = pipe("DeepSpeed is", max_new_tokens=128, return_full_text=True)
if pipe.is_rank_0:
    print(responses[0])
inference/mii/non-persistent/mixtral.py

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
import mii

pipe = mii.pipeline("mistralai/Mixtral-8x7B-v0.1")
responses = pipe("DeepSpeed is", max_new_tokens=128, return_full_text=True)
if pipe.is_rank_0:
    print(responses[0])
inference/mii/non-persistent/pipeline.py

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
import argparse
import mii

parser = argparse.ArgumentParser()
parser.add_argument("--model", type=str, default="mistralai/Mistral-7B-v0.1")
parser.add_argument(
    "--prompts", type=str, nargs="+", default=["DeepSpeed is", "Seattle is"]
)
parser.add_argument("--max-new-tokens", type=int, default=128)
args = parser.parse_args()

pipe = mii.pipeline(args.model)
responses = pipe(
    args.prompts, max_new_tokens=args.max_new_tokens, return_full_text=True
)

if pipe.is_rank_0:
    for r in responses:
        print(r, "\n", "-" * 80, "\n")

inference/mii/persistent/README.md

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
# Persistent Deployment Examples

The `serve.py` script can be used to create an inference server for any of the
[supported models](https://github.com/microsoft/DeepSpeed-mii#supported-models).
Provide the HuggingFace model name and tensor-parallelism (use the default
values and run `$ python serve.py` for a single-GPU
[mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
deployment):

```shell
$ python serve.py --model "mistralai/Mistral-7B-v0.1" --tensor-parallel 1
```

Connect to the persistent deployment and generate text with `client.py`. Provide
the HuggingFace model name, maximum generated tokens, and prompt(s) (or, if you
are using the default values, run `$ python client.py`):

```shell
$ python client.py --model "mistralai/Mistral-7B-v0.1" --max-new-tokens 128 --prompts "DeepSpeed is" "Seattle is"
```

Shut down the persistent deployment with `terminate.py`. Provide the HuggingFace
model name (or, if you are using the default values, run `$ python terminate.py`):

```shell
$ python terminate.py --model "mistralai/Mistral-7B-v0.1"
```
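
The `terminate.py` script referenced above is part of this commit but not shown in this diff. A minimal sketch of what it plausibly contains, following the same argparse pattern as `client.py` and the `terminate_server()` call from the MII getting-started guide:

```python
import argparse
import mii

parser = argparse.ArgumentParser()
parser.add_argument("--model", type=str, default="mistralai/Mistral-7B-v0.1")
args = parser.parse_args()

# Connect to the running deployment for this model and ask it to shut down.
client = mii.client(args.model)
client.terminate_server()
```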

inference/mii/persistent/client.py

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
import argparse
import mii

parser = argparse.ArgumentParser()
parser.add_argument("--model", type=str, default="mistralai/Mistral-7B-v0.1")
parser.add_argument(
    "--prompts", type=str, nargs="+", default=["DeepSpeed is", "Seattle is"]
)
parser.add_argument("--max-new-tokens", type=int, default=128)
args = parser.parse_args()

client = mii.client(args.model)
responses = client(
    args.prompts, max_new_tokens=args.max_new_tokens, return_full_text=True
)

for r in responses:
    print(r, "\n", "-" * 80, "\n")

inference/mii/persistent/serve.py

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
import argparse
import mii

parser = argparse.ArgumentParser()
parser.add_argument("--model", type=str, default="mistralai/Mistral-7B-v0.1")
parser.add_argument("--tensor-parallel", type=int, default=1)
args = parser.parse_args()

mii.serve(args.model, tensor_parallel=args.tensor_parallel)

print(f"Serving model {args.model} on {args.tensor_parallel} GPU(s).")
print(f"Run `python client.py --model {args.model}` to connect.")
print(f"Run `python terminate.py --model {args.model}` to terminate.")
