This project is no longer actively maintained. While existing releases remain available, there are no planned updates, bug fixes, new features, or security patches. Users should be aware that vulnerabilities may not be addressed.
This is an example showing how to integrate vLLM with TorchServe and run inference on the base model `meta-llama/Meta-Llama-3.1-8B` plus the LoRA adapter `llama-duo/llama3.1-8b-summarize-gpt4o-128k` with continuous batching.
This example also supports distributed inference by following this instruction.
To leverage the power of vLLM we first need to install it using pip in our development environment:

```bash
python -m pip install -r ../requirements.txt
```

For later deployments we can make vLLM part of the deployment environment by adding the requirements.txt while building the model archive in step 2 (see here for details), or we can make it part of a docker image like here.
Login with a HuggingFace account:

```bash
huggingface-cli login
# or using an environment variable
huggingface-cli login --token $HUGGINGFACE_TOKEN
```
Download the base model and the LoRA adapter:

```bash
python ../../utils/Download_model.py --model_path model --model_name meta-llama/Meta-Llama-3.1-8B --use_auth_token True
mkdir adapters && cd adapters
python ../../../utils/Download_model.py --model_path model --model_name llama-duo/llama3.1-8b-summarize-gpt4o-128k --use_auth_token True
cd ..
```

Add the downloaded paths to `model_path:` and `adapter_1:` in model-config.yaml.
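For orientation, the two fields mentioned above sit in model-config.yaml roughly as sketched below. This is an illustrative fragment only: the real file in this example contains additional TorchServe and vLLM engine options, the nesting shown here is an assumption, and the snapshot paths depend on your download.

```yaml
# Illustrative sketch only -- not the complete model-config.yaml.
# <snapshot> stands for the hash directory created by the download step.
handler:
    model_path: "model/models--meta-llama--Meta-Llama-3.1-8B/snapshots/<snapshot>"
    adapters:
        adapter_1: "adapters/model/models--llama-duo--llama3.1-8b-summarize-gpt4o-128k/snapshots/<snapshot>"
```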
Then run the following to create the model archive and start TorchServe:

```bash
torch-model-archiver --model-name llama-8b-lora --version 1.0 --handler vllm_handler --config-file model-config.yaml --archive-format no-archive
mv model llama-8b-lora
mv adapters llama-8b-lora
mkdir model_store
mv llama-8b-lora model_store
torchserve --start --ncs --ts-config ../config.properties --model-store model_store --models llama-8b-lora --disable-token-auth --enable-model-api
```

The vLLM integration uses an OpenAI-compatible interface which lets you perform inference with curl or the openai library client, and supports streaming.
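The inference calls below read the request body from prompt.json. The repo ships its own prompt.json; purely as an illustration, an OpenAI-style completion payload has this shape (the field values here are assumptions, not the file's actual contents):

```json
{
    "model": "llama-8b-lora",
    "prompt": "Hello world",
    "max_tokens": 50,
    "temperature": 0.0
}
```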
Curl:

```bash
curl --header "Content-Type: application/json" --request POST --data @prompt.json http://localhost:8080/predictions/llama-8b-lora/1.0/v1/completions
```

Python + Requests:

```bash
python ../../utils/test_llm_streaming_response.py -m llama-8b-lora -o 50 -t 2 -n 4 --prompt-text "@prompt.json" --prompt-json --openai-api --demo-streaming
```

OpenAI client:
```python
from openai import OpenAI

model_name = "llama-8b-lora"
stream = True
openai_api_key = "EMPTY"  # TorchServe does not check the API key
openai_api_base = f"http://localhost:8080/predictions/{model_name}/1.0/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

response = client.completions.create(
    model=model_name, prompt="Hello world", temperature=0.0, stream=stream
)
for chunk in response:
    print(f"{chunk=}")
```
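If you consume the stream without the openai client (e.g. with plain `requests`), the response arrives as Server-Sent-Event style lines, each prefixed with `data: ` and carrying a JSON chunk. A minimal sketch of decoding such lines follows; the chunk payloads shown are illustrative assumptions following the OpenAI convention, not captured from this server:

```python
import json


def parse_sse_lines(lines):
    """Yield the JSON payload of each 'data: {...}' line.

    Stops at the 'data: [DONE]' sentinel (OpenAI convention; assumed here).
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        yield json.loads(payload)


# Illustrative sample chunks (shapes assumed, not real server output)
sample = [
    'data: {"choices": [{"text": "Hello"}]}',
    'data: {"choices": [{"text": " world"}]}',
    "data: [DONE]",
]
text = "".join(c["choices"][0]["text"] for c in parse_sse_lines(sample))
print(text)  # Hello world
```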