The serve.py script can be used to create an inference server for any of the
supported models.
Provide the HuggingFace model name and the tensor-parallelism degree (or run
$ python serve.py with the default values for a single-GPU
mistralai/Mistral-7B-v0.1 deployment):

$ python serve.py --model "mistralai/Mistral-7B-v0.1" --tensor-parallel 1

Connect to the persistent deployment and generate text with client.py. Provide
the HuggingFace model name, maximum generated tokens, and prompt(s) (or if you
are using the default values, run $ python client.py):
$ python client.py --model "mistralai/Mistral-7B-v0.1" --max-new-tokens 128 --prompts "DeepSpeed is" "Seattle is"

Shut down the persistent deployment with terminate.py. Provide the HuggingFace
model name (or if you are using the default values, run $ python terminate.py):
$ python terminate.py --model "mistralai/Mistral-7B-v0.1"
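The default values referenced above suggest how the scripts' command-line interfaces are wired up. The following is a minimal sketch (not the actual client.py source) of argument parsing consistent with the client.py command shown, using Python's standard argparse module; the defaults mirror the example invocation:

```python
import argparse

def build_parser():
    # Hypothetical parser mirroring the documented client.py flags;
    # defaults match the example command above.
    parser = argparse.ArgumentParser(
        description="Query a persistent inference deployment."
    )
    parser.add_argument(
        "--model",
        default="mistralai/Mistral-7B-v0.1",
        help="HuggingFace model name of the running deployment",
    )
    parser.add_argument(
        "--max-new-tokens",
        type=int,
        default=128,
        help="maximum number of tokens to generate per prompt",
    )
    parser.add_argument(
        "--prompts",
        nargs="+",
        default=["DeepSpeed is", "Seattle is"],
        help="one or more prompts to send to the server",
    )
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.model, args.max_new_tokens, args.prompts)
```

Because every flag has a default, running the script with no arguments reproduces the "default values" behavior described above, while any flag can still be overridden on the command line.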