
Commit

add READMEs
mrwyattii committed Dec 22, 2023
1 parent 7306a43 commit 5932fb9
Showing 10 changed files with 106 additions and 16 deletions.
2 changes: 1 addition & 1 deletion inference/mii/README.md
@@ -2,4 +2,4 @@

Install the requirements by running `pip install -r requirements.txt`.

-Once [DeepSpeed-MII](https://github.com/microsoft/deepspeed-mii) is installed you have two options for deployment: an interactive non-persistent pipeline or a persistent serving deployment. For details on these files please refer to the [Getting Started guide for MII](https://github.com/microsoft/deepspeed-mii#getting-started-with-mii).
+Once [DeepSpeed-MII](https://github.com/microsoft/deepspeed-mii) is installed, you have two options for deployment: an interactive non-persistent pipeline or a persistent serving deployment. See the scripts in [non-persistent](./non-persistent/) and [persistent](./persistent/) for examples. Details on the code implemented in these scripts can be found in our [Getting Started guide for MII](https://github.com/microsoft/deepspeed-mii#getting-started-with-mii).
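For orientation, the two options boil down to a handful of `mii` calls. Here is a minimal sketch contrasting them (the model name is illustrative; the exact scripts appear in the diffs below):

```python
import mii

# Option 1: non-persistent pipeline, which lives only as long as the script.
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
responses = pipe(["DeepSpeed is"], max_new_tokens=128)

# Option 2: persistent deployment, a server that outlives this process.
mii.serve("mistralai/Mistral-7B-v0.1")            # start the server (serve.py)
client = mii.client("mistralai/Mistral-7B-v0.1")  # connect from any process (client.py)
responses = client(["DeepSpeed is"], max_new_tokens=128)
client.terminate_server()                         # shut it down (terminate.py)
```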
28 changes: 28 additions & 0 deletions inference/mii/non-persistent/README.md
@@ -0,0 +1,28 @@
# Non-Persistent Pipeline Examples

The `pipeline.py` script can be used to run any of the [supported
models](https://github.com/microsoft/DeepSpeed-mii#supported-models). Provide
the HuggingFace model name, the maximum number of new tokens to generate, and
the prompt(s). The generated responses will be printed in the terminal:

```shell
$ python pipeline.py --model "mistralai/Mistral-7B-v0.1" --max-new-tokens 128 --prompts "DeepSpeed is" "Seattle is"
```

Tensor-parallelism can be controlled using the `deepspeed` launcher and setting
`--num_gpus`:

```shell
$ deepspeed --num_gpus 2 pipeline.py
```

## Model-Specific Examples

For convenience, we also provide a set of scripts to quickly test the MII
Pipeline with some popular text-generation models:

| Model | Launch command |
|-------|----------------|
| [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) | `$ python llama2.py` |
| [tiiuae/falcon-7b](https://huggingface.co/tiiuae/falcon-7b) | `$ python falcon.py` |
| [mistralai/Mixtral-8x7B-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1) | `$ deepspeed --num_gpus 2 mixtral.py` |
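A note on the multi-GPU launch: `deepspeed --num_gpus N` starts one process per GPU, and every rank runs the script, so the examples below guard their output with `pipe.is_rank_0`. A minimal sketch of the pattern (model name illustrative):

```python
import mii

# Launch with: deepspeed --num_gpus 2 example.py
pipe = mii.pipeline("mistralai/Mixtral-8x7B-v0.1")
responses = pipe("DeepSpeed is", max_new_tokens=128, return_full_text=True)

# Every rank reaches this point; guard the print so the
# response is shown once rather than once per GPU.
if pipe.is_rank_0:
    print(responses[0])
```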
4 changes: 2 additions & 2 deletions inference/mii/non-persistent/falcon.py
@@ -1,6 +1,6 @@
-from mii import pipeline
+import mii

-pipe = pipeline("tiiuae/falcon-7b")
+pipe = mii.pipeline("tiiuae/falcon-7b")
responses = pipe("DeepSpeed is", max_new_tokens=128, return_full_text=True)
if pipe.is_rank_0:
print(responses[0])
4 changes: 2 additions & 2 deletions inference/mii/non-persistent/llama2.py
@@ -1,6 +1,6 @@
-from mii import pipeline
+import mii

-pipe = pipeline("meta-llama/Llama-2-7b-hf")
+pipe = mii.pipeline("meta-llama/Llama-2-7b-hf")
responses = pipe("DeepSpeed is", max_new_tokens=128, return_full_text=True)
if pipe.is_rank_0:
print(responses[0])
4 changes: 2 additions & 2 deletions inference/mii/non-persistent/mixtral.py
@@ -1,6 +1,6 @@
-from mii import pipeline
+import mii

-pipe = pipeline("mistralai/Mixtral-8x7B-v0.1")
+pipe = mii.pipeline("mistralai/Mixtral-8x7B-v0.1")
responses = pipe("DeepSpeed is", max_new_tokens=128, return_full_text=True)
if pipe.is_rank_0:
print(responses[0])
13 changes: 9 additions & 4 deletions inference/mii/non-persistent/pipeline.py
@@ -1,13 +1,18 @@
import argparse
-from mii import pipeline
+import mii

parser = argparse.ArgumentParser()
parser.add_argument("--model", type=str, default="mistralai/Mistral-7B-v0.1")
-parser.add_argument("--prompts", type=str, nargs="+", default=["DeepSpeed is"])
+parser.add_argument(
+    "--prompts", type=str, nargs="+", default=["DeepSpeed is", "Seattle is"]
+)
parser.add_argument("--max-new-tokens", type=int, default=128)
args = parser.parse_args()

-pipe = pipeline(parser.model)
-responses = pipe(args.prompts, max_new_tokens=128, return_full_text=True)
+pipe = mii.pipeline(args.model)
+responses = pipe(
+    args.prompts, max_new_tokens=args.max_new_tokens, return_full_text=True
+)

if pipe.is_rank_0:
for r in responses:
28 changes: 28 additions & 0 deletions inference/mii/persistent/README.md
@@ -0,0 +1,28 @@
# Persistent Deployment Examples

The `serve.py` script can be used to create an inference server for any of the
[supported models](https://github.com/microsoft/DeepSpeed-mii#supported-models).
Provide the HuggingFace model name and tensor-parallel degree (use the default
values and run `$ python serve.py` for a single-GPU
[mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
deployment):

```shell
$ python serve.py --model "mistralai/Mistral-7B-v0.1" --tensor-parallel 1
```

Connect to the persistent deployment and generate text with `client.py`. Provide
the HuggingFace model name, the maximum number of new tokens to generate, and
the prompt(s) (or, if you are using the default values, run `$ python client.py`):

```shell
$ python client.py --model "mistralai/Mistral-7B-v0.1" --max-new-tokens 128 --prompts "DeepSpeed is" "Seattle is"
```

Shut down the persistent deployment with `terminate.py`. Provide the HuggingFace
model name (or, if you are using the default values, run `$ python terminate.py`):

```shell
$ python terminate.py --model "mistralai/Mistral-7B-v0.1"
```
18 changes: 15 additions & 3 deletions inference/mii/persistent/client.py
@@ -1,6 +1,18 @@
+import argparse
import mii

-client = mii.client("mistralai/Mistral-7B-v0.1")
-output = client.generate("Deepspeed is", max_new_tokens=128)
+parser = argparse.ArgumentParser()
+parser.add_argument("--model", type=str, default="mistralai/Mistral-7B-v0.1")
+parser.add_argument(
+    "--prompts", type=str, nargs="+", default=["DeepSpeed is", "Seattle is"]
+)
+parser.add_argument("--max-new-tokens", type=int, default=128)
+args = parser.parse_args()

-print(output)
+client = mii.client(args.model)
+responses = client(
+    args.prompts, max_new_tokens=args.max_new_tokens, return_full_text=True
+)
+
+for r in responses:
+    print(r, "\n", "-" * 80, "\n")
12 changes: 11 additions & 1 deletion inference/mii/persistent/serve.py
@@ -1,3 +1,13 @@
+import argparse
import mii

-mii.serve("mistralai/Mistral-7B-v0.1")
+parser = argparse.ArgumentParser()
+parser.add_argument("--model", type=str, default="mistralai/Mistral-7B-v0.1")
+parser.add_argument("--tensor-parallel", type=int, default=1)
+args = parser.parse_args()
+
+mii.serve(args.model, tensor_parallel=args.tensor_parallel)
+
+print(f"Serving model {args.model} on {args.tensor_parallel} GPU(s).")
+print(f"Run `python client.py --model {args.model}` to connect.")
+print(f"Run `python terminate.py --model {args.model}` to terminate.")
9 changes: 8 additions & 1 deletion inference/mii/persistent/terminate.py
@@ -1,4 +1,11 @@
+import argparse
import mii

-client = mii.client("mistralai/Mistral-7B-v0.1")
+parser = argparse.ArgumentParser()
+parser.add_argument("--model", type=str, default="mistralai/Mistral-7B-v0.1")
+args = parser.parse_args()
+
+client = mii.client(args.model)
client.terminate_server()

+print(f"Terminated server for model {args.model}.")

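Taken together, `serve.py`, `client.py`, and `terminate.py` cover the whole deployment lifecycle. As a rough single-process sketch of the same workflow (this assumes `mii.serve` also returns a client handle, as shown in the MII Getting Started guide linked above):

```python
import mii

# Start the persistent deployment; assume mii.serve returns a client handle.
client = mii.serve("mistralai/Mistral-7B-v0.1", tensor_parallel=1)

# Generate text, as client.py does against a running deployment.
responses = client(["DeepSpeed is", "Seattle is"], max_new_tokens=128)
for r in responses:
    print(r)

# Tear the deployment down, as terminate.py does.
client.terminate_server()
```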