Update MII Inference Examples (#837)
mrwyattii authored Jan 10, 2024
1 parent ff9a023 commit 05120bb
Showing 15 changed files with 137 additions and 21 deletions.
2 changes: 1 addition & 1 deletion inference/mii/README.md
@@ -2,4 +2,4 @@

Install the requirements by running `pip install -r requirements.txt`.

Once [DeepSpeed-MII](https://github.com/microsoft/deepspeed-mii) is installed you have two options for deployment: an interactive non-persistent pipeline or a persistent serving deployment. For details on these files please refer to the [Getting Started guide for MII](https://github.com/microsoft/deepspeed-mii#getting-started-with-mii).
Once [DeepSpeed-MII](https://github.com/microsoft/deepspeed-mii) is installed you have two options for deployment: an interactive non-persistent pipeline or a persistent serving deployment. See the scripts in [non-persistent](./non-persistent/) and [persistent](./persistent/) for examples. Details on the code implemented in these scripts can be found on our [Getting Started guide for MII](https://github.com/microsoft/deepspeed-mii#getting-started-with-mii).
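
A rough sketch of how the two options differ in code, based on the scripts in these directories (the model name and prompts are placeholders; the two options are shown back to back purely for illustration, and in practice you would pick one):

```python
import mii

# Non-persistent pipeline: the model lives inside this process and is
# released when the script exits.
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
responses = pipe("DeepSpeed is", max_new_tokens=128)
print(responses[0])

# Persistent deployment: start a server once, connect to it by model name
# (possibly from another process), and shut it down explicitly when done.
mii.serve("mistralai/Mistral-7B-v0.1")
client = mii.client("mistralai/Mistral-7B-v0.1")
responses = client(["Seattle is"], max_new_tokens=128)
print(responses[0])
client.terminate_server()
```
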
6 changes: 0 additions & 6 deletions inference/mii/client.py

This file was deleted.

28 changes: 28 additions & 0 deletions inference/mii/non-persistent/README.md
@@ -0,0 +1,28 @@
# Non-Persistent Pipeline Examples

The `pipeline.py` script can be used to run any of the [supported
models](https://github.com/microsoft/DeepSpeed-mii#supported-models). Provide
the HuggingFace model name, maximum generated tokens, and prompt(s). The
generated responses will be printed in the terminal:

```shell
$ python pipeline.py --model "mistralai/Mistral-7B-v0.1" --max-new-tokens 128 --prompts "DeepSpeed is" "Seattle is"
```

Tensor-parallelism can be controlled using the `deepspeed` launcher and setting
`--num_gpus`:

```shell
$ deepspeed --num_gpus 2 pipeline.py
```
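
When launched this way, each GPU runs its own copy of the script, which is why
the example scripts in this directory guard their output with `pipe.is_rank_0`.
A minimal sketch of that pattern (model name and prompts are illustrative):

```python
import mii

pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
responses = pipe(["DeepSpeed is", "Seattle is"], max_new_tokens=128)

# Every rank launched by `deepspeed --num_gpus 2 ...` executes this file,
# so print from a single rank to avoid duplicated output.
if pipe.is_rank_0:
    for r in responses:
        print(r)
```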

## Model-Specific Examples

For convenience, we also provide a set of scripts to quickly test the MII
Pipeline with some popular text-generation models:

| Model | Launch command |
|-------|----------------|
| [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) | `$ python llama2.py` |
| [tiiuae/falcon-7b](https://huggingface.co/tiiuae/falcon-7b) | `$ python falcon.py` |
| [mistralai/Mixtral-8x7B-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1) | `$ deepspeed --num_gpus 2 mixtral.py` |
6 changes: 6 additions & 0 deletions inference/mii/non-persistent/falcon.py
@@ -0,0 +1,6 @@
import mii

pipe = mii.pipeline("tiiuae/falcon-7b")
responses = pipe("DeepSpeed is", max_new_tokens=128, return_full_text=True)
if pipe.is_rank_0:
print(responses[0])
6 changes: 6 additions & 0 deletions inference/mii/non-persistent/llama2.py
@@ -0,0 +1,6 @@
import mii

pipe = mii.pipeline("meta-llama/Llama-2-7b-hf")
responses = pipe("DeepSpeed is", max_new_tokens=128, return_full_text=True)
if pipe.is_rank_0:
print(responses[0])
6 changes: 6 additions & 0 deletions inference/mii/non-persistent/mixtral.py
@@ -0,0 +1,6 @@
import mii

pipe = mii.pipeline("mistralai/Mixtral-8x7B-v0.1")
responses = pipe("DeepSpeed is", max_new_tokens=128, return_full_text=True)
if pipe.is_rank_0:
print(responses[0])
19 changes: 19 additions & 0 deletions inference/mii/non-persistent/pipeline.py
@@ -0,0 +1,19 @@
import argparse
import mii

parser = argparse.ArgumentParser()
parser.add_argument("--model", type=str, default="mistralai/Mistral-7B-v0.1")
parser.add_argument(
"--prompts", type=str, nargs="+", default=["DeepSpeed is", "Seattle is"]
)
parser.add_argument("--max-new-tokens", type=int, default=128)
args = parser.parse_args()

pipe = mii.pipeline(args.model)
responses = pipe(
args.prompts, max_new_tokens=args.max_new_tokens, return_full_text=True
)

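# Print from a single rank so output is not duplicated when launched on multiple GPUs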
if pipe.is_rank_0:
for r in responses:
print(r, "\n", "-" * 80, "\n")
28 changes: 28 additions & 0 deletions inference/mii/persistent/README.md
@@ -0,0 +1,28 @@
# Persistent Deployment Examples

The `serve.py` script can be used to create an inference server for any of the
[supported models](https://github.com/microsoft/DeepSpeed-mii#supported-models).
Provide the HuggingFace model name and tensor-parallel size (or use the default
values and run `$ python serve.py` for a single-GPU
[mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
deployment):

```shell
$ python serve.py --model "mistralai/Mistral-7B-v0.1" --tensor-parallel 1
```

Connect to the persistent deployment and generate text with `client.py`. Provide
the HuggingFace model name, maximum generated tokens, and prompt(s) (or if you
are using the default values, run `$ python client.py`):

```shell
$ python client.py --model "mistralai/Mistral-7B-v0.1" --max-new-tokens 128 --prompts "DeepSpeed is" "Seattle is"
```

Shut down the persistent deployment with `terminate.py`. Provide the HuggingFace
model name (or if you are using the default values, run `$ python
terminate.py`):

```shell
$ python terminate.py --model "mistralai/Mistral-7B-v0.1"
```
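
The three scripts wrap a handful of MII calls, so the same lifecycle can also
be driven from a single program if that is more convenient. A sketch using the
same defaults as the scripts (error handling omitted):

```python
import mii

model = "mistralai/Mistral-7B-v0.1"

# Start the persistent deployment (what serve.py does).
mii.serve(model, tensor_parallel=1)

# Connect by model name and generate text (what client.py does).
client = mii.client(model)
responses = client(["DeepSpeed is", "Seattle is"], max_new_tokens=128)
for r in responses:
    print(r)

# Shut the deployment down when finished (what terminate.py does).
client.terminate_server()
```
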
18 changes: 18 additions & 0 deletions inference/mii/persistent/client.py
@@ -0,0 +1,18 @@
import argparse
import mii

parser = argparse.ArgumentParser()
parser.add_argument("--model", type=str, default="mistralai/Mistral-7B-v0.1")
parser.add_argument(
"--prompts", type=str, nargs="+", default=["DeepSpeed is", "Seattle is"]
)
parser.add_argument("--max-new-tokens", type=int, default=128)
args = parser.parse_args()

client = mii.client(args.model)
responses = client(
args.prompts, max_new_tokens=args.max_new_tokens, return_full_text=True
)

for r in responses:
print(r, "\n", "-" * 80, "\n")
13 changes: 13 additions & 0 deletions inference/mii/persistent/serve.py
@@ -0,0 +1,13 @@
import argparse
import mii

parser = argparse.ArgumentParser()
parser.add_argument("--model", type=str, default="mistralai/Mistral-7B-v0.1")
parser.add_argument("--tensor-parallel", type=int, default=1)
args = parser.parse_args()

mii.serve(args.model, tensor_parallel=args.tensor_parallel)

print(f"Serving model {args.model} on {args.tensor_parallel} GPU(s).")
print(f"Run `python client.py --model {args.model}` to connect.")
print(f"Run `python terminate.py --model {args.model}` to terminate.")
11 changes: 11 additions & 0 deletions inference/mii/persistent/terminate.py
@@ -0,0 +1,11 @@
import argparse
import mii

parser = argparse.ArgumentParser()
parser.add_argument("--model", type=str, default="mistralai/Mistral-7B-v0.1")
args = parser.parse_args()

client = mii.client(args.model)
client.terminate_server()

print(f"Terminated server for model {args.model}.")
6 changes: 0 additions & 6 deletions inference/mii/pipeline.py

This file was deleted.

2 changes: 1 addition & 1 deletion inference/mii/requirements.txt
@@ -1 +1 @@
mii>=0.1.0
deepspeed-mii>=0.1.3
3 changes: 0 additions & 3 deletions inference/mii/serve.py

This file was deleted.

4 changes: 0 additions & 4 deletions inference/mii/terminate.py

This file was deleted.
