
Commit 9b9a066

Add vllm-ipex service for llms and lvms (#1926)
* Add vllm-ipex service for llms and lvms
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* Fix typos
* Update lvms Supported Implementations
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* Update llms Supported Implementations
* fix vllm-ipex scripts
* fix CI issues
* Add test scripts for vllm-ipex on b60 graphics
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* Update action runner for B60 vllm-ipex tests
* fix ENV to support vllm-ipex multi-arc deployment
* update vllm-ipex docker image tag

Signed-off-by: Lin, Jiaojiao <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent ab63467 commit 9b9a066

File tree: 10 files changed, +593 / -25 lines changed

comps/llms/deployment/docker_compose/compose_text-generation.yaml

Lines changed: 33 additions & 0 deletions
```diff
@@ -156,6 +156,39 @@ services:
       ovms-llm-serving:
         condition: service_healthy
 
+  textgen-vllm-ipex-service:
+    container_name: textgen-vllm-ipex-service
+    image: ${REGISTRY:-intel}/llm-scaler-vllm:${TAG:-latest}
+    privileged: true
+    restart: always
+    ports:
+      - ${VLLM_PORT:-41090}:8000
+    group_add:
+      - ${VIDEO_GROUP_ID:-44} # Use the environment variable for the video group ID
+      - ${RENDER_GROUP_ID:-992} # Use the environment variable for the render group ID
+    volumes:
+      - ${HF_HOME:-${HOME}/.cache/huggingface}:/llm/.cache/huggingface
+      - ../../src/text-generation/vllm_ipex_entrypoint.sh:/llm/vllm_ipex_entrypoint.sh
+    devices:
+      - /dev/dri
+    environment:
+      no_proxy: localhost,127.0.0.1
+      http_proxy: ${http_proxy}
+      https_proxy: ${https_proxy}
+      HF_ENDPOINT: https://hf-mirror.com
+      HF_HOME: /llm/.cache/huggingface
+      MODEL_PATH: ${LLM_MODEL_ID}
+      SERVED_MODEL_NAME: ${LLM_MODEL_ID}
+      TENSOR_PARALLEL_SIZE: ${TENSOR_PARALLEL_SIZE:-1}
+      MAX_MODEL_LEN: ${MAX_MODEL_LEN:-20000}
+      ONEAPI_DEVICE_SELECTOR: ${ONEAPI_DEVICE_SELECTOR:-level_zero:0}
+      LOAD_QUANTIZATION: ${LOAD_QUANTIZATION:-fp8}
+      ZE_AFFINITY_MASK: ${ZE_AFFINITY_MASK}
+    shm_size: 128g
+    entrypoint: /bin/bash -c "\
+      chmod +x /llm/vllm_ipex_entrypoint.sh && \
+      bash /llm/vllm_ipex_entrypoint.sh"
+
 networks:
   default:
     driver: bridge
```
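
Once this service is up, a quick smoke test from the host looks like the following (a minimal sketch, assuming the default host port `41090` and that `LLM_MODEL_ID` was exported before `docker compose up`):

```bash
# The OpenAI-compatible server should list exactly the model that was passed
# in through SERVED_MODEL_NAME (i.e. the value of LLM_MODEL_ID).
curl -s http://localhost:${VLLM_PORT:-41090}/v1/models
```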

comps/llms/src/text-generation/README.md

Lines changed: 21 additions & 17 deletions
```diff
@@ -18,23 +18,26 @@ Overall, this microservice offers a streamlined way to integrate large language
 
 ## Validated LLM Models
 
-| Model | TGI-Gaudi | vLLM-CPU | vLLM-Gaudi | OVMS | Optimum-Habana | SGLANG-CPU |
-| ----- | --------- | -------- | ---------- | ---- | -------------- | ---------- |
-| [Intel/neural-chat-7b-v3-3] | ✓ | ✓ | ✓ | ✓ | ✓ | - |
-| [meta-llama/Llama-2-7b-chat-hf] | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
-| [meta-llama/Llama-2-70b-chat-hf] | ✓ | - | ✓ | - | ✓ | ✓ |
-| [meta-llama/Meta-Llama-3-8B-Instruct] | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
-| [meta-llama/Meta-Llama-3-70B-Instruct] | ✓ | - | ✓ | - | ✓ | ✓ |
-| [Phi-3] | ✓ | Limit 4K | Limit 4K | Limit 4K | ✓ | - |
-| [Phi-4] | ✓ | ✓ | ✓ | ✓ | ✓ | - |
-| [deepseek-ai/DeepSeek-R1-Distill-Llama-8B] | ✓ | - | ✓ | - | ✓ | - |
-| [deepseek-ai/DeepSeek-R1-Distill-Llama-70B] | ✓ | - | ✓ | - | ✓ | - |
-| [deepseek-ai/DeepSeek-R1-Distill-Qwen-14B] | ✓ | - | ✓ | - | ✓ | - |
-| [deepseek-ai/DeepSeek-R1-Distill-Qwen-32B] | ✓ | - | ✓ | - | ✓ | - |
-| [mistralai/Mistral-Small-24B-Instruct-2501] | ✓ | - | ✓ | - | ✓ | - |
-| [mistralai/Mistral-Large-Instruct-2411] | ✓ | - | ✓ | - | ✓ | - |
-| [meta-llama/Llama-4-Scout-17B-16E-Instruct] | - | - | - | - | - | ✓ |
-| [meta-llama/Llama-4-Maverick-17B-128E-Instruct] | - | - | - | - | - | ✓ |
+| Model | TGI-Gaudi | vLLM-CPU | vLLM-Gaudi | vLLM-IPEX-XPU | OVMS | Optimum-Habana | SGLANG-CPU |
+| ----- | --------- | -------- | ---------- | ------------- | ---- | -------------- | ---------- |
+| [Intel/neural-chat-7b-v3-3] | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | - |
+| [meta-llama/Llama-2-7b-chat-hf] | ✓ | ✓ | ✓ | - | ✓ | ✓ | ✓ |
+| [meta-llama/Llama-2-70b-chat-hf] | ✓ | - | ✓ | - | - | ✓ | ✓ |
+| [meta-llama/Meta-Llama-3-8B-Instruct] | ✓ | ✓ | ✓ | - | ✓ | ✓ | ✓ |
+| [meta-llama/Meta-Llama-3-70B-Instruct] | ✓ | - | ✓ | - | - | ✓ | ✓ |
+| [Phi-3] | ✓ | Limit 4K | Limit 4K | ✓ | Limit 4K | ✓ | - |
+| [Phi-4] | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | - |
+| [deepseek-ai/DeepSeek-R1-Distill-Llama-8B] | ✓ | - | ✓ | ✓ | - | ✓ | - |
+| [deepseek-ai/DeepSeek-R1-Distill-Llama-70B] | ✓ | - | ✓ | ✓ | - | ✓ | - |
+| [deepseek-ai/DeepSeek-R1-Distill-Qwen-14B] | ✓ | - | ✓ | ✓ | - | ✓ | - |
+| [deepseek-ai/DeepSeek-R1-Distill-Qwen-32B] | ✓ | - | ✓ | ✓ | - | ✓ | - |
+| [mistralai/Mistral-Small-24B-Instruct-2501] | ✓ | - | ✓ | - | - | ✓ | - |
+| [mistralai/Mistral-Large-Instruct-2411] | ✓ | - | ✓ | - | - | ✓ | - |
+| [meta-llama/Llama-4-Scout-17B-16E-Instruct] | - | - | - | - | - | - | ✓ |
+| [meta-llama/Llama-4-Maverick-17B-128E-Instruct] | - | - | - | - | - | - | ✓ |
+| [Qwen3-8B/14B/32B] | - | - | - | ✓ | - | - | - |
+
+> **Note:** More details about supported models for vLLM-IPEX-XPU can be found at [supported-models](https://github.com/intel/llm-scaler/tree/main/vllm#3-supported-models).
 
 ### System Requirements for LLM Models
 
@@ -70,6 +73,7 @@ In this microservices, we have supported following backend LLM service as integr
 - [Bedrock](./README_bedrock.md)
 - [Native](./README_native.md), based on optimum habana
 - [Predictionguard](./README_predictionguard.md)
+- [VLLM-IPEX](./README_vllm_ipex.md), based on B60 Graphics
 
 ### Clone OPEA GenAIComps
 
```
comps/llms/src/text-generation/README_vllm_ipex.md

Lines changed: 86 additions & 0 deletions

@@ -0,0 +1,86 @@

# LLM Microservice with vLLM on Intel XPU

This service provides high-throughput, low-latency LLM serving accelerated by vLLM-IPEX, optimized for Intel® Arc™ Pro B60 Graphics.

---

## Table of Contents

1. [Prerequisites](#prerequisites)
2. [Start Microservice](#start-microservice)
3. [Consume LLM Service](#consume-llm-service)

---

## Prerequisites

### Download the vLLM-IPEX Docker Image

Download the official Docker image from [Docker Hub](https://hub.docker.com/r/intel/llm-scaler-vllm) first.

```bash
docker pull intel/llm-scaler-vllm:1.0
```

## Start Microservice

### Run with Docker Compose

Deploy the vLLM-IPEX model server using Docker Compose.

1. Export the required environment variables:

   ```bash
   # Use image: intel/llm-scaler-vllm:1.0
   export REGISTRY=intel
   export TAG=1.0

   export VIDEO_GROUP_ID=$(getent group video | awk -F: '{printf "%s\n", $3}')
   export RENDER_GROUP_ID=$(getent group render | awk -F: '{printf "%s\n", $3}')

   HF_HOME=${HF_HOME:=~/.cache/huggingface}
   export HF_HOME

   export MAX_MODEL_LEN=20000
   export LLM_MODEL_ID=Qwen/Qwen3-8B-AWQ
   export LOAD_QUANTIZATION=awq
   export VLLM_PORT=41090

   # Single Arc GPU; select the GPU index as needed
   export ONEAPI_DEVICE_SELECTOR="level_zero:0"
   export TENSOR_PARALLEL_SIZE=1
   # Multiple Arc GPUs; select the GPU indices as needed
   # export ONEAPI_DEVICE_SELECTOR="level_zero:0;level_zero:1"
   # export TENSOR_PARALLEL_SIZE=2
   ```

2. Navigate to the Docker Compose directory and start the service:

   ```bash
   cd comps/llms/deployment/docker_compose/
   docker compose -f compose_text-generation.yaml up textgen-vllm-ipex-service -d
   ```

> **Note:** More details about supported models can be found at [supported-models](https://github.com/intel/llm-scaler/tree/main/vllm#3-supported-models).

---

## Consume LLM Service

Once the service is running, you can send requests to the API.

### Use the LLM Service API

Send a POST request with a prompt.

```bash
curl http://localhost:41090/v1/chat/completions -XPOST -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-8B-AWQ",
  "messages": [
    {
      "role": "user",
      "content": "What is Deep Learning?"
    }
  ],
  "max_tokens": 512
}'
```
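
Beyond the single completion above, a streaming request is often the quickest way to confirm end-to-end behavior (a minimal sketch, assuming the same defaults exported earlier; the server exposes the standard OpenAI-compatible streaming interface):

```bash
# Stream tokens as they are generated (server-sent events), using the same
# model name that was exported as LLM_MODEL_ID.
curl -N http://localhost:${VLLM_PORT:-41090}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"${LLM_MODEL_ID:-Qwen/Qwen3-8B-AWQ}"'",
    "messages": [{"role": "user", "content": "What is Deep Learning?"}],
    "max_tokens": 128,
    "stream": true
  }'
```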
comps/llms/src/text-generation/vllm_ipex_entrypoint.sh

Lines changed: 57 additions & 0 deletions

@@ -0,0 +1,57 @@

```bash
#!/bin/sh

# Copyright (C) 2025 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

# Launch the OpenAI-compatible vLLM server on the Intel XPU. MODEL_PATH,
# SERVED_MODEL_NAME, MAX_MODEL_LEN, TENSOR_PARALLEL_SIZE and LOAD_QUANTIZATION
# are injected by the docker compose service definition.
if [ "$LOAD_QUANTIZATION" = "None" ]; then
  echo "LOAD_QUANTIZATION is None, will load the model without online quantization."

  TORCH_LLM_ALLREDUCE=1 \
  VLLM_USE_V1=1 \
  CCL_ZE_IPC_EXCHANGE=pidfd \
  VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
  VLLM_WORKER_MULTIPROC_METHOD=spawn \
  python3 -m vllm.entrypoints.openai.api_server \
    --model ${MODEL_PATH} \
    --served-model-name ${SERVED_MODEL_NAME} \
    --dtype=float16 \
    --enforce-eager \
    --port 8000 \
    --host 0.0.0.0 \
    --trust-remote-code \
    --gpu-memory-util=0.9 \
    --disable-sliding-window \
    --max-num-batched-tokens=${MAX_MODEL_LEN} \
    --disable-log-requests \
    --max-model-len=${MAX_MODEL_LEN} \
    --block-size 64 \
    -tp=${TENSOR_PARALLEL_SIZE} \
    --enable_prefix_caching

else
  echo "LOAD_QUANTIZATION is $LOAD_QUANTIZATION"

  # Same invocation, but pass the requested quantization method to vLLM.
  TORCH_LLM_ALLREDUCE=1 \
  VLLM_USE_V1=1 \
  CCL_ZE_IPC_EXCHANGE=pidfd \
  VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
  VLLM_WORKER_MULTIPROC_METHOD=spawn \
  python3 -m vllm.entrypoints.openai.api_server \
    --model ${MODEL_PATH} \
    --served-model-name ${SERVED_MODEL_NAME} \
    --dtype=float16 \
    --enforce-eager \
    --port 8000 \
    --host 0.0.0.0 \
    --trust-remote-code \
    --gpu-memory-util=0.9 \
    --disable-sliding-window \
    --max-num-batched-tokens=${MAX_MODEL_LEN} \
    --disable-log-requests \
    --max-model-len=${MAX_MODEL_LEN} \
    --block-size 64 \
    --quantization ${LOAD_QUANTIZATION} \
    -tp=${TENSOR_PARALLEL_SIZE} \
    --enable_prefix_caching

fi
```
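
Because the script only skips quantization when `LOAD_QUANTIZATION` is the literal string `None`, switching a deployment to plain float16 is just an environment override before bringing the service up (a sketch using the compose file and service name added above):

```bash
# Override the compose default (fp8) so the entrypoint takes the
# no-online-quantization branch of the script above.
export LOAD_QUANTIZATION=None
cd comps/llms/deployment/docker_compose/
docker compose -f compose_text-generation.yaml up textgen-vllm-ipex-service -d
```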

comps/lvms/README.md

Lines changed: 9 additions & 8 deletions
```diff
@@ -38,11 +38,12 @@ Users can configure and deploy LVM-related services based on their specific requ
 
 The LVM Microservice supports multiple implementation options. Select the one that best fits your use case and follow the linked documentation for detailed setup instructions.
 
-| Implementation | Description | Documentation |
-| -------------- | ----------- | ------------- |
-| **With LLaVA** | A general-purpose VQA service using the LLaVA model. | [README_llava](src/README_llava.md) |
-| **With TGI LLaVA** | LLaVA service accelerated by TGI, optimized for Intel Gaudi HPUs. | [README_llava_tgi](src/README_llava_tgi.md) |
-| **With LLaMA-Vision** | VQA service leveraging the LLaMA-Vision model. | [README_llama_vision](src/README_llama_vision.md) |
-| **With Video-LLaMA** | A specialized service for performing VQA on video inputs. | [README_video_llama](src/README_video_llama.md) |
-| **With vLLM** | High-throughput LVM serving accelerated by vLLM on Intel Gaudi HPUs. | [README_vllm](src/README_vllm.md) |
-| **With PredictionGuard** | LVM service using Prediction Guard with built-in safety features. | [README_predictionguard](src/README_predictionguard.md) |
+| Implementation | Description | Documentation |
+| -------------- | ----------- | ------------- |
+| **With LLaVA** | A general-purpose VQA service using the LLaVA model. | [README_llava](src/README_llava.md) |
+| **With TGI LLaVA** | LLaVA service accelerated by TGI, optimized for Intel Gaudi HPUs. | [README_llava_tgi](src/README_llava_tgi.md) |
+| **With LLaMA-Vision** | VQA service leveraging the LLaMA-Vision model. | [README_llama_vision](src/README_llama_vision.md) |
+| **With Video-LLaMA** | A specialized service for performing VQA on video inputs. | [README_video_llama](src/README_video_llama.md) |
+| **With vLLM** | High-throughput LVM serving accelerated by vLLM on Intel Gaudi HPUs. | [README_vllm](src/README_vllm.md) |
+| **With vLLM-IPEX** | High-throughput LVM serving accelerated by vLLM-IPEX on Intel Arc GPUs. | [README_vllm_ipex](src/README_vllm_ipex.md) |
+| **With PredictionGuard** | LVM service using Prediction Guard with built-in safety features. | [README_predictionguard](src/README_predictionguard.md) |
```

comps/lvms/deployment/docker_compose/compose.yaml

Lines changed: 32 additions & 0 deletions
```diff
@@ -158,6 +158,38 @@ services:
     depends_on:
       vllm-gaudi-service:
         condition: service_healthy
+  lvm-vllm-ipex-service:
+    container_name: lvm-vllm-ipex-service
+    image: ${REGISTRY:-intel}/llm-scaler-vllm:${TAG:-latest}
+    privileged: true
+    restart: always
+    ports:
+      - ${VLLM_PORT:-41091}:8000
+    group_add:
+      - ${VIDEO_GROUP_ID:-44} # Use the environment variable for the video group ID
+      - ${RENDER_GROUP_ID:-992} # Use the environment variable for the render group ID
+    volumes:
+      - ${HF_HOME:-${HOME}/.cache/huggingface}:/llm/.cache/huggingface
+      - ../../src/vllm_ipex_entrypoint.sh:/llm/vllm_ipex_entrypoint.sh
+    devices:
+      - /dev/dri
+    environment:
+      no_proxy: localhost,127.0.0.1
+      http_proxy: ${http_proxy}
+      https_proxy: ${https_proxy}
+      HF_ENDPOINT: https://hf-mirror.com
+      HF_HOME: /llm/.cache/huggingface
+      MODEL_PATH: ${LLM_MODEL_ID}
+      SERVED_MODEL_NAME: ${LLM_MODEL_ID}
+      TENSOR_PARALLEL_SIZE: ${TENSOR_PARALLEL_SIZE:-1}
+      MAX_MODEL_LEN: ${MAX_MODEL_LEN:-20000}
+      ONEAPI_DEVICE_SELECTOR: ${ONEAPI_DEVICE_SELECTOR:-level_zero:0}
+      LOAD_QUANTIZATION: ${LOAD_QUANTIZATION:-fp8}
+      ZE_AFFINITY_MASK: ${ZE_AFFINITY_MASK}
+    shm_size: 128g
+    entrypoint: /bin/bash -c "\
+      chmod +x /llm/vllm_ipex_entrypoint.sh && \
+      bash /llm/vllm_ipex_entrypoint.sh"
 
 networks:
   default:
```
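
After bringing this service up, it can be useful to confirm which GPUs the container actually received (a hedged sketch; the exact startup log wording depends on the vLLM version inside the image):

```bash
# The devices visible under /dev/dri should match the selection made through
# ONEAPI_DEVICE_SELECTOR / ZE_AFFINITY_MASK.
docker exec lvm-vllm-ipex-service ls -l /dev/dri

# The startup log normally reports the resolved model and tensor-parallel size.
docker logs lvm-vllm-ipex-service 2>&1 | grep -iE "model|tensor" | head
```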

comps/lvms/src/README_vllm_ipex.md

Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@

# LVM Microservice with vLLM on Intel XPU

This service provides high-throughput, low-latency LVM serving accelerated by vLLM-IPEX, optimized for Intel® Arc™ Pro B60 Graphics.

---

## Table of Contents

1. [Prerequisites](#prerequisites)
2. [Start Microservice](#start-microservice)
3. [Consume LVM Service](#consume-lvm-service)

---

## Prerequisites

### Download the vLLM-IPEX Docker Image

Download the official Docker image from [Docker Hub](https://hub.docker.com/r/intel/llm-scaler-vllm) first.

```bash
docker pull intel/llm-scaler-vllm:1.0
```

## Start Microservice

### Run with Docker Compose

Deploy the vLLM-IPEX model server using Docker Compose.

1. Export the required environment variables:

   ```bash
   # Use image: intel/llm-scaler-vllm:1.0
   export REGISTRY=intel
   export TAG=1.0

   export ip_address=$(hostname -I | awk '{print $1}')
   export VIDEO_GROUP_ID=$(getent group video | awk -F: '{printf "%s\n", $3}')
   export RENDER_GROUP_ID=$(getent group render | awk -F: '{printf "%s\n", $3}')

   HF_HOME=${HF_HOME:=~/.cache/huggingface}
   export HF_HOME

   export MAX_MODEL_LEN=20000
   export LLM_MODEL_ID=Qwen/Qwen2.5-VL-7B-Instruct
   export LOAD_QUANTIZATION=fp8
   export VLLM_PORT=41091
   export LVM_ENDPOINT="http://$ip_address:$VLLM_PORT"

   # Single Arc GPU; select the GPU index as needed
   export ONEAPI_DEVICE_SELECTOR="level_zero:0"
   export TENSOR_PARALLEL_SIZE=1
   # Multiple Arc GPUs; select the GPU indices as needed
   # export ONEAPI_DEVICE_SELECTOR="level_zero:0;level_zero:1"
   # export TENSOR_PARALLEL_SIZE=2
   ```

2. Navigate to the Docker Compose directory and start the service:

   ```bash
   cd comps/lvms/deployment/docker_compose/
   docker compose up lvm-vllm-ipex-service -d
   ```

> **Note:** More details about supported models can be found at [supported-models](https://github.com/intel/llm-scaler/tree/main/vllm#3-supported-models).

---

## Consume LVM Service

Once the service is running, you can send requests to the API.

### Use the LVM Service API

Send a POST request with an image URL and a prompt.

```bash
curl http://localhost:41091/v1/chat/completions -XPOST -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen2.5-VL-7B-Instruct",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Describe the image."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "http://farm6.staticflickr.com/5268/5602445367_3504763978_z.jpg"
          }
        }
      ]
    }
  ],
  "max_tokens": 512
}'
```
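
If the image is a local file rather than a public URL, the same endpoint also accepts an inline base64 data URL in the `image_url` field (a minimal sketch; `./example.jpg` is a placeholder path, and the payload follows the OpenAI-style vision format shown above):

```bash
# Encode a local image and embed it as a data URL in the request body.
IMAGE_B64=$(base64 -w 0 ./example.jpg)

curl http://localhost:${VLLM_PORT:-41091}/v1/chat/completions -XPOST -H "Content-Type: application/json" -d '{
  "model": "'"${LLM_MODEL_ID:-Qwen/Qwen2.5-VL-7B-Instruct}"'",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Describe the image." },
        { "type": "image_url", "image_url": { "url": "data:image/jpeg;base64,'"$IMAGE_B64"'" } }
      ]
    }
  ],
  "max_tokens": 512
}'
```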
