
Commit 9b9a066

Add vllm-ipex service for llms and lvms (#1926)
* Add vllm-ipex service for llms and lvms
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* Fix typos
* Update lvms Supported Implementations
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* Update llms Supported Implementations
* fix vllm-ipex scripts
* fix CI issues
* Add test scripts for vllm-ipex on b60 graphics
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* Update action runner for B60 vllm-ipex tests
* fix ENV to support vllm-ipex multi-arc deployment
* update vllm-ipex docker image tag

Signed-off-by: Lin, Jiaojiao <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent ab63467 commit 9b9a066

File tree: 10 files changed, +593 / -25 lines changed

comps/llms/deployment/docker_compose/compose_text-generation.yaml

Lines changed: 33 additions & 0 deletions
```diff
@@ -156,6 +156,39 @@ services:
       ovms-llm-serving:
         condition: service_healthy
 
+  textgen-vllm-ipex-service:
+    container_name: textgen-vllm-ipex-service
+    image: ${REGISTRY:-intel}/llm-scaler-vllm:${TAG:-latest}
+    privileged: true
+    restart: always
+    ports:
+      - ${VLLM_PORT:-41090}:8000
+    group_add:
+      - ${VIDEO_GROUP_ID:-44} # Use the environment variable for the video group ID
+      - ${RENDER_GROUP_ID:-992} # Use the environment variable for the render group ID
+    volumes:
+      - ${HF_HOME:-${HOME}/.cache/huggingface}:/llm/.cache/huggingface
+      - ../../src/text-generation/vllm_ipex_entrypoint.sh:/llm/vllm_ipex_entrypoint.sh
+    devices:
+      - /dev/dri
+    environment:
+      no_proxy: localhost,127.0.0.1
+      http_proxy: ${http_proxy}
+      https_proxy: ${https_proxy}
+      HF_ENDPOINT: https://hf-mirror.com
+      HF_HOME: /llm/.cache/huggingface
+      MODEL_PATH: ${LLM_MODEL_ID}
+      SERVED_MODEL_NAME: ${LLM_MODEL_ID}
+      TENSOR_PARALLEL_SIZE: ${TENSOR_PARALLEL_SIZE:-1}
+      MAX_MODEL_LEN: ${MAX_MODEL_LEN:-20000}
+      ONEAPI_DEVICE_SELECTOR: ${ONEAPI_DEVICE_SELECTOR:-level_zero:0}
+      LOAD_QUANTIZATION: ${LOAD_QUANTIZATION:-fp8}
+      ZE_AFFINITY_MASK: ${ZE_AFFINITY_MASK}
+    shm_size: 128g
+    entrypoint: /bin/bash -c "\
+      chmod +x /llm/vllm_ipex_entrypoint.sh && \
+      bash /llm/vllm_ipex_entrypoint.sh"
+
 networks:
   default:
     driver: bridge
```
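
Once this service is up, a quick smoke test from the host looks like the following (a minimal sketch, assuming the default host port `41090` and that `LLM_MODEL_ID` was exported before `docker compose up`):

```bash
# The OpenAI-compatible server should list exactly the model that was passed
# in through SERVED_MODEL_NAME (i.e. the value of LLM_MODEL_ID).
curl -s http://localhost:${VLLM_PORT:-41090}/v1/models
```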

comps/llms/src/text-generation/README.md

Lines changed: 21 additions & 17 deletions
```diff
@@ -18,23 +18,26 @@ Overall, this microservice offers a streamlined way to integrate large language
 
 ## Validated LLM Models
 
-| Model | TGI-Gaudi | vLLM-CPU | vLLM-Gaudi | OVMS | Optimum-Habana | SGLANG-CPU |
-| ----- | --------- | -------- | ---------- | ---- | -------------- | ---------- |
-| [Intel/neural-chat-7b-v3-3] | ✓ | ✓ | ✓ | ✓ | ✓ | - |
-| [meta-llama/Llama-2-7b-chat-hf] | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
-| [meta-llama/Llama-2-70b-chat-hf] | ✓ | - | ✓ | - | ✓ | ✓ |
-| [meta-llama/Meta-Llama-3-8B-Instruct] | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
-| [meta-llama/Meta-Llama-3-70B-Instruct] | ✓ | - | ✓ | - | ✓ | ✓ |
-| [Phi-3] | ✓ | Limit 4K | Limit 4K | Limit 4K | ✓ | - |
-| [Phi-4] | ✓ | ✓ | ✓ | ✓ | ✓ | - |
-| [deepseek-ai/DeepSeek-R1-Distill-Llama-8B] | ✓ | - | ✓ | - | ✓ | - |
-| [deepseek-ai/DeepSeek-R1-Distill-Llama-70B] | ✓ | - | ✓ | - | ✓ | - |
-| [deepseek-ai/DeepSeek-R1-Distill-Qwen-14B] | ✓ | - | ✓ | - | ✓ | - |
-| [deepseek-ai/DeepSeek-R1-Distill-Qwen-32B] | ✓ | - | ✓ | - | ✓ | - |
-| [mistralai/Mistral-Small-24B-Instruct-2501] | ✓ | - | ✓ | - | ✓ | - |
-| [mistralai/Mistral-Large-Instruct-2411] | ✓ | - | ✓ | - | ✓ | - |
-| [meta-llama/Llama-4-Scout-17B-16E-Instruct] | - | - | - | - | - | ✓ |
-| [meta-llama/Llama-4-Maverick-17B-128E-Instruct] | - | - | - | - | - | ✓ |
+| Model | TGI-Gaudi | vLLM-CPU | vLLM-Gaudi | vLLM-IPEX-XPU | OVMS | Optimum-Habana | SGLANG-CPU |
+| ----- | --------- | -------- | ---------- | ------------- | ---- | -------------- | ---------- |
+| [Intel/neural-chat-7b-v3-3] | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | - |
+| [meta-llama/Llama-2-7b-chat-hf] | ✓ | ✓ | ✓ | - | ✓ | ✓ | ✓ |
+| [meta-llama/Llama-2-70b-chat-hf] | ✓ | - | ✓ | - | - | ✓ | ✓ |
+| [meta-llama/Meta-Llama-3-8B-Instruct] | ✓ | ✓ | ✓ | - | ✓ | ✓ | ✓ |
+| [meta-llama/Meta-Llama-3-70B-Instruct] | ✓ | - | ✓ | - | - | ✓ | ✓ |
+| [Phi-3] | ✓ | Limit 4K | Limit 4K | ✓ | Limit 4K | ✓ | - |
+| [Phi-4] | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | - |
+| [deepseek-ai/DeepSeek-R1-Distill-Llama-8B] | ✓ | - | ✓ | ✓ | - | ✓ | - |
+| [deepseek-ai/DeepSeek-R1-Distill-Llama-70B] | ✓ | - | ✓ | ✓ | - | ✓ | - |
+| [deepseek-ai/DeepSeek-R1-Distill-Qwen-14B] | ✓ | - | ✓ | ✓ | - | ✓ | - |
+| [deepseek-ai/DeepSeek-R1-Distill-Qwen-32B] | ✓ | - | ✓ | ✓ | - | ✓ | - |
+| [mistralai/Mistral-Small-24B-Instruct-2501] | ✓ | - | ✓ | - | - | ✓ | - |
+| [mistralai/Mistral-Large-Instruct-2411] | ✓ | - | ✓ | - | - | ✓ | - |
+| [meta-llama/Llama-4-Scout-17B-16E-Instruct] | - | - | - | - | - | - | ✓ |
+| [meta-llama/Llama-4-Maverick-17B-128E-Instruct] | - | - | - | - | - | - | ✓ |
+| [Qwen3-8B/14B/32B] | - | - | - | ✓ | - | - | - |
+
+> **Note:** More details about supported models for vLLM-IPEX-XPU can be found at [supported-models](https://github.com/intel/llm-scaler/tree/main/vllm#3-supported-models).
 
 ### System Requirements for LLM Models
 
@@ -70,6 +73,7 @@ In this microservices, we have supported following backend LLM service as integr
 - [Bedrock](./README_bedrock.md)
 - [Native](./README_native.md), based on optimum habana
 - [Predictionguard](./README_predictionguard.md)
+- [VLLM-IPEX](./README_vllm_ipex.md), based on B60 Graphics
 
 ### Clone OPEA GenAIComps
 
```
comps/llms/src/text-generation/README_vllm_ipex.md

Lines changed: 86 additions & 0 deletions

@@ -0,0 +1,86 @@

# LLM Microservice with vLLM on Intel XPU

This service provides high-throughput, low-latency LLM serving accelerated by vLLM-IPEX, optimized for Intel® Arc™ Pro B60 Graphics.

---

## Table of Contents

1. [Prerequisites](#prerequisites)
2. [Start Microservice](#start-microservice)
3. [Consume LLM Service](#consume-llm-service)

---

## Prerequisites

### Download the vLLM-IPEX Docker Image

Download the official Docker image from [Docker Hub](https://hub.docker.com/r/intel/llm-scaler-vllm) first.

```bash
docker pull intel/llm-scaler-vllm:1.0
```

## Start Microservice

### Run with Docker Compose

Deploy the vLLM-IPEX model server using Docker Compose.

1. Export the required environment variables:

   ```bash
   # Use image: intel/llm-scaler-vllm:1.0
   export REGISTRY=intel
   export TAG=1.0

   export VIDEO_GROUP_ID=$(getent group video | awk -F: '{printf "%s\n", $3}')
   export RENDER_GROUP_ID=$(getent group render | awk -F: '{printf "%s\n", $3}')

   HF_HOME=${HF_HOME:=~/.cache/huggingface}
   export HF_HOME

   export MAX_MODEL_LEN=20000
   export LLM_MODEL_ID=Qwen/Qwen3-8B-AWQ
   export LOAD_QUANTIZATION=awq
   export VLLM_PORT=41090

   # Single Arc GPU; select the GPU index as needed
   export ONEAPI_DEVICE_SELECTOR="level_zero:0"
   export TENSOR_PARALLEL_SIZE=1
   # Multiple Arc GPUs; select the GPU indices as needed
   # export ONEAPI_DEVICE_SELECTOR="level_zero:0;level_zero:1"
   # export TENSOR_PARALLEL_SIZE=2
   ```

2. Navigate to the Docker Compose directory and start the service:

   ```bash
   cd comps/llms/deployment/docker_compose/
   docker compose -f compose_text-generation.yaml up textgen-vllm-ipex-service -d
   ```

> **Note:** More details about supported models can be found at [supported-models](https://github.com/intel/llm-scaler/tree/main/vllm#3-supported-models).

---

## Consume LLM Service

Once the service is running, you can send requests to the API.

### Use the LLM Service API

Send a POST request with a prompt.

```bash
curl http://localhost:41090/v1/chat/completions -XPOST -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-8B-AWQ",
  "messages": [
    {
      "role": "user",
      "content": "What is Deep Learning?"
    }
  ],
  "max_tokens": 512
}'
```
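
Beyond the single completion above, a streaming request is often the quickest way to confirm end-to-end behavior (a minimal sketch, assuming the same defaults exported earlier; the server exposes the standard OpenAI-compatible streaming interface):

```bash
# Stream tokens as they are generated (server-sent events), using the same
# model name that was exported as LLM_MODEL_ID.
curl -N http://localhost:${VLLM_PORT:-41090}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"${LLM_MODEL_ID:-Qwen/Qwen3-8B-AWQ}"'",
    "messages": [{"role": "user", "content": "What is Deep Learning?"}],
    "max_tokens": 128,
    "stream": true
  }'
```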
comps/llms/src/text-generation/vllm_ipex_entrypoint.sh

Lines changed: 57 additions & 0 deletions

@@ -0,0 +1,57 @@

```bash
#!/bin/sh

# Copyright (C) 2025 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

# Launch the OpenAI-compatible vLLM server on the Intel XPU. MODEL_PATH,
# SERVED_MODEL_NAME, MAX_MODEL_LEN, TENSOR_PARALLEL_SIZE and LOAD_QUANTIZATION
# are injected by the docker compose service definition.
if [ "$LOAD_QUANTIZATION" = "None" ]; then
  echo "LOAD_QUANTIZATION is None, will load the model without online quantization."

  TORCH_LLM_ALLREDUCE=1 \
  VLLM_USE_V1=1 \
  CCL_ZE_IPC_EXCHANGE=pidfd \
  VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
  VLLM_WORKER_MULTIPROC_METHOD=spawn \
  python3 -m vllm.entrypoints.openai.api_server \
    --model ${MODEL_PATH} \
    --served-model-name ${SERVED_MODEL_NAME} \
    --dtype=float16 \
    --enforce-eager \
    --port 8000 \
    --host 0.0.0.0 \
    --trust-remote-code \
    --gpu-memory-util=0.9 \
    --disable-sliding-window \
    --max-num-batched-tokens=${MAX_MODEL_LEN} \
    --disable-log-requests \
    --max-model-len=${MAX_MODEL_LEN} \
    --block-size 64 \
    -tp=${TENSOR_PARALLEL_SIZE} \
    --enable_prefix_caching

else
  echo "LOAD_QUANTIZATION is $LOAD_QUANTIZATION"

  # Same invocation, but pass the requested quantization method to vLLM.
  TORCH_LLM_ALLREDUCE=1 \
  VLLM_USE_V1=1 \
  CCL_ZE_IPC_EXCHANGE=pidfd \
  VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
  VLLM_WORKER_MULTIPROC_METHOD=spawn \
  python3 -m vllm.entrypoints.openai.api_server \
    --model ${MODEL_PATH} \
    --served-model-name ${SERVED_MODEL_NAME} \
    --dtype=float16 \
    --enforce-eager \
    --port 8000 \
    --host 0.0.0.0 \
    --trust-remote-code \
    --gpu-memory-util=0.9 \
    --disable-sliding-window \
    --max-num-batched-tokens=${MAX_MODEL_LEN} \
    --disable-log-requests \
    --max-model-len=${MAX_MODEL_LEN} \
    --block-size 64 \
    --quantization ${LOAD_QUANTIZATION} \
    -tp=${TENSOR_PARALLEL_SIZE} \
    --enable_prefix_caching

fi
```
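
Because the script only skips quantization when `LOAD_QUANTIZATION` is the literal string `None`, switching a deployment to plain float16 is just an environment override before bringing the service up (a sketch using the compose file and service name added above):

```bash
# Override the compose default (fp8) so the entrypoint takes the
# no-online-quantization branch of the script above.
export LOAD_QUANTIZATION=None
cd comps/llms/deployment/docker_compose/
docker compose -f compose_text-generation.yaml up textgen-vllm-ipex-service -d
```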

comps/lvms/README.md

Lines changed: 9 additions & 8 deletions
```diff
@@ -38,11 +38,12 @@ Users can configure and deploy LVM-related services based on their specific requ
 
 The LVM Microservice supports multiple implementation options. Select the one that best fits your use case and follow the linked documentation for detailed setup instructions.
 
-| Implementation | Description | Documentation |
-| -------------- | ----------- | ------------- |
-| **With LLaVA** | A general-purpose VQA service using the LLaVA model. | [README_llava](src/README_llava.md) |
-| **With TGI LLaVA** | LLaVA service accelerated by TGI, optimized for Intel Gaudi HPUs. | [README_llava_tgi](src/README_llava_tgi.md) |
-| **With LLaMA-Vision** | VQA service leveraging the LLaMA-Vision model. | [README_llama_vision](src/README_llama_vision.md) |
-| **With Video-LLaMA** | A specialized service for performing VQA on video inputs. | [README_video_llama](src/README_video_llama.md) |
-| **With vLLM** | High-throughput LVM serving accelerated by vLLM on Intel Gaudi HPUs. | [README_vllm](src/README_vllm.md) |
-| **With PredictionGuard** | LVM service using Prediction Guard with built-in safety features. | [README_predictionguard](src/README_predictionguard.md) |
+| Implementation | Description | Documentation |
+| -------------- | ----------- | ------------- |
+| **With LLaVA** | A general-purpose VQA service using the LLaVA model. | [README_llava](src/README_llava.md) |
+| **With TGI LLaVA** | LLaVA service accelerated by TGI, optimized for Intel Gaudi HPUs. | [README_llava_tgi](src/README_llava_tgi.md) |
+| **With LLaMA-Vision** | VQA service leveraging the LLaMA-Vision model. | [README_llama_vision](src/README_llama_vision.md) |
+| **With Video-LLaMA** | A specialized service for performing VQA on video inputs. | [README_video_llama](src/README_video_llama.md) |
+| **With vLLM** | High-throughput LVM serving accelerated by vLLM on Intel Gaudi HPUs. | [README_vllm](src/README_vllm.md) |
+| **With vLLM-IPEX** | High-throughput LVM serving accelerated by vLLM-IPEX on Intel Arc GPUs. | [README_vllm_ipex](src/README_vllm_ipex.md) |
+| **With PredictionGuard** | LVM service using Prediction Guard with built-in safety features. | [README_predictionguard](src/README_predictionguard.md) |
```

comps/lvms/deployment/docker_compose/compose.yaml

Lines changed: 32 additions & 0 deletions
```diff
@@ -158,6 +158,38 @@ services:
     depends_on:
       vllm-gaudi-service:
         condition: service_healthy
+  lvm-vllm-ipex-service:
+    container_name: lvm-vllm-ipex-service
+    image: ${REGISTRY:-intel}/llm-scaler-vllm:${TAG:-latest}
+    privileged: true
+    restart: always
+    ports:
+      - ${VLLM_PORT:-41091}:8000
+    group_add:
+      - ${VIDEO_GROUP_ID:-44} # Use the environment variable for the video group ID
+      - ${RENDER_GROUP_ID:-992} # Use the environment variable for the render group ID
+    volumes:
+      - ${HF_HOME:-${HOME}/.cache/huggingface}:/llm/.cache/huggingface
+      - ../../src/vllm_ipex_entrypoint.sh:/llm/vllm_ipex_entrypoint.sh
+    devices:
+      - /dev/dri
+    environment:
+      no_proxy: localhost,127.0.0.1
+      http_proxy: ${http_proxy}
+      https_proxy: ${https_proxy}
+      HF_ENDPOINT: https://hf-mirror.com
+      HF_HOME: /llm/.cache/huggingface
+      MODEL_PATH: ${LLM_MODEL_ID}
+      SERVED_MODEL_NAME: ${LLM_MODEL_ID}
+      TENSOR_PARALLEL_SIZE: ${TENSOR_PARALLEL_SIZE:-1}
+      MAX_MODEL_LEN: ${MAX_MODEL_LEN:-20000}
+      ONEAPI_DEVICE_SELECTOR: ${ONEAPI_DEVICE_SELECTOR:-level_zero:0}
+      LOAD_QUANTIZATION: ${LOAD_QUANTIZATION:-fp8}
+      ZE_AFFINITY_MASK: ${ZE_AFFINITY_MASK}
+    shm_size: 128g
+    entrypoint: /bin/bash -c "\
+      chmod +x /llm/vllm_ipex_entrypoint.sh && \
+      bash /llm/vllm_ipex_entrypoint.sh"
 
 networks:
   default:
```
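
After bringing this service up, it can be useful to confirm which GPUs the container actually received (a hedged sketch; the exact startup log wording depends on the vLLM version inside the image):

```bash
# The devices visible under /dev/dri should match the selection made through
# ONEAPI_DEVICE_SELECTOR / ZE_AFFINITY_MASK.
docker exec lvm-vllm-ipex-service ls -l /dev/dri

# The startup log normally reports the resolved model and tensor-parallel size.
docker logs lvm-vllm-ipex-service 2>&1 | grep -iE "model|tensor" | head
```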

comps/lvms/src/README_vllm_ipex.md

Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@

# LVM Microservice with vLLM on Intel XPU

This service provides high-throughput, low-latency LVM serving accelerated by vLLM-IPEX, optimized for Intel® Arc™ Pro B60 Graphics.

---

## Table of Contents

1. [Prerequisites](#prerequisites)
2. [Start Microservice](#start-microservice)
3. [Consume LVM Service](#consume-lvm-service)

---

## Prerequisites

### Download the vLLM-IPEX Docker Image

Download the official Docker image from [Docker Hub](https://hub.docker.com/r/intel/llm-scaler-vllm) first.

```bash
docker pull intel/llm-scaler-vllm:1.0
```

## Start Microservice

### Run with Docker Compose

Deploy the vLLM-IPEX model server using Docker Compose.

1. Export the required environment variables:

   ```bash
   # Use image: intel/llm-scaler-vllm:1.0
   export REGISTRY=intel
   export TAG=1.0

   export ip_address=$(hostname -I | awk '{print $1}')
   export VIDEO_GROUP_ID=$(getent group video | awk -F: '{printf "%s\n", $3}')
   export RENDER_GROUP_ID=$(getent group render | awk -F: '{printf "%s\n", $3}')

   HF_HOME=${HF_HOME:=~/.cache/huggingface}
   export HF_HOME

   export MAX_MODEL_LEN=20000
   export LLM_MODEL_ID=Qwen/Qwen2.5-VL-7B-Instruct
   export LOAD_QUANTIZATION=fp8
   export VLLM_PORT=41091
   export LVM_ENDPOINT="http://$ip_address:$VLLM_PORT"

   # Single Arc GPU; select the GPU index as needed
   export ONEAPI_DEVICE_SELECTOR="level_zero:0"
   export TENSOR_PARALLEL_SIZE=1
   # Multiple Arc GPUs; select the GPU indices as needed
   # export ONEAPI_DEVICE_SELECTOR="level_zero:0;level_zero:1"
   # export TENSOR_PARALLEL_SIZE=2
   ```

2. Navigate to the Docker Compose directory and start the service:

   ```bash
   cd comps/lvms/deployment/docker_compose/
   docker compose up lvm-vllm-ipex-service -d
   ```

> **Note:** More details about supported models can be found at [supported-models](https://github.com/intel/llm-scaler/tree/main/vllm#3-supported-models).

---

## Consume LVM Service

Once the service is running, you can send requests to the API.

### Use the LVM Service API

Send a POST request with an image URL and a prompt.

```bash
curl http://localhost:41091/v1/chat/completions -XPOST -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen2.5-VL-7B-Instruct",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Describe the image."
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "http://farm6.staticflickr.com/5268/5602445367_3504763978_z.jpg"
          }
        }
      ]
    }
  ],
  "max_tokens": 512
}'
```
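
If the image is a local file rather than a public URL, the same endpoint also accepts an inline base64 data URL in the `image_url` field (a minimal sketch; `./example.jpg` is a placeholder path, and the payload follows the OpenAI-style vision format shown above):

```bash
# Encode a local image and embed it as a data URL in the request body.
IMAGE_B64=$(base64 -w 0 ./example.jpg)

curl http://localhost:${VLLM_PORT:-41091}/v1/chat/completions -XPOST -H "Content-Type: application/json" -d '{
  "model": "'"${LLM_MODEL_ID:-Qwen/Qwen2.5-VL-7B-Instruct}"'",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Describe the image." },
        { "type": "image_url", "image_url": { "url": "data:image/jpeg;base64,'"$IMAGE_B64"'" } }
      ]
    }
  ],
  "max_tokens": 512
}'
```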
