Add vLLM CPU inference support for docker compose setup #1967
base: main
The PR adds the following overlay file (73 new lines):

```yaml
# Copyright (C) 2025 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

# Overlay file to enable vLLM (CPU) as the backend for both VLM captioning and LLM summarization.
services:
  vllm-cpu-service:
    profiles:
      - vllm
    image: ${VLLM_IMAGE:-public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.13.0}
    hostname: vllm-cpu-service
    ports:
      - "${VLLM_HOST_PORT:-8200}:8000"
    ipc: "host"
    environment:
      no_proxy: ${no_proxy},localhost
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
      HUGGING_FACE_HUB_TOKEN: ${HUGGINGFACE_TOKEN:-}
      HF_HOME: /cache
      VLLM_CPU_KVCACHE_SPACE: ${VLLM_CPU_KVCACHE_SPACE:-48}
      VLLM_RPC_TIMEOUT: ${VLLM_RPC_TIMEOUT:-100000}
      VLLM_ALLOW_LONG_MAX_MODEL_LEN: ${VLLM_ALLOW_LONG_MAX_MODEL_LEN:-1}
      VLLM_ENGINE_ITERATION_TIMEOUT_S: ${VLLM_ENGINE_ITERATION_TIMEOUT_S:-120}
      VLLM_CPU_NUM_OF_RESERVED_CPU: ${VLLM_CPU_NUM_OF_RESERVED_CPU:-0}
    command:
      - "--model"
      - "${VLM_MODEL_NAME}"
      - "--dtype"
      - "${VLLM_DTYPE:-bfloat16}"
      - "--distributed-executor-backend"
      - "mp"
      - "--trust-remote-code"
      - "--block-size"
      - "${VLLM_BLOCK_SIZE:-128}"
      - "--enable-chunked-prefill"
      - "--max-num-batched-tokens"
      - "${VLLM_MAX_NUM_BATCHED_TOKENS:-2048}"
      - "--max-num-seqs"
      - "${VLLM_MAX_NUM_SEQS:-256}"
      - "--disable-log-requests"
      - "--tensor-parallel-size"
      - "${VLLM_TENSOR_PARALLEL_SIZE:-1}"
    volumes:
      - vllm_model_cache:/cache
    shm_size: "32gb"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 40
      start_period: 60s
    restart: unless-stopped
    networks:
      - vs_network

  nginx:
    depends_on:
      pipeline-manager:
        condition: service_healthy

  pipeline-manager:
    depends_on:
      vllm-cpu-service:
        condition: service_healthy
    environment:
      no_proxy: ${no_proxy},${EVAM_HOST},${VLM_HOST},${AUDIO_HOST},${RABBITMQ_HOST},${MINIO_HOST},${POSTGRES_HOST},${OVMS_HOST},${VDMS_DATAPREP_HOST},${VS_HOST},${VLLM_HOST},localhost
      LLM_SUMMARIZATION_API: ${VLLM_ENDPOINT}
      VLM_CAPTIONING_API: ${VLLM_ENDPOINT}
      USE_VLLM: "CONFIG_ON"

volumes:
  vllm_model_cache:
    driver: local
```

Review comments from the diff (Contributor):

- On `no_proxy: ${no_proxy},localhost`: Missing minio from no_proxy
- On `VLLM_CPU_NUM_OF_RESERVED_CPU`: add
- On `- "--disable-log-requests"`: add override for
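Every tunable in the overlay uses Compose's `${VAR:-default}` interpolation, which falls back to the default when the variable is unset *or* set to the empty string. A minimal sketch of just that `:-` rule (an illustration, not the Compose implementation; it ignores the other `${VAR-…}` and `${VAR:?…}` forms):

```python
import re

def interpolate(value: str, env: dict) -> str:
    """Rough sketch of Compose's ${VAR:-default} rule:
    use the default when VAR is unset or empty."""
    def repl(m: re.Match) -> str:
        var, default = m.group(1), m.group(2)
        # env.get(var) is falsy for both a missing and an empty variable.
        return env.get(var) or default
    return re.sub(r"\$\{([A-Za-z_][A-Za-z0-9_]*):-([^}]*)\}", repl, value)

# The port mapping from the overlay, with and without an override set.
assert interpolate("${VLLM_HOST_PORT:-8200}:8000", {}) == "8200:8000"
assert interpolate("${VLLM_HOST_PORT:-8200}:8000", {"VLLM_HOST_PORT": "9000"}) == "9000:8000"
# An empty default, as used for HUGGING_FACE_HUB_TOKEN, yields "".
assert interpolate("${HUGGINGFACE_TOKEN:-}", {}) == ""
```

This is why `HUGGING_FACE_HUB_TOKEN: ${HUGGINGFACE_TOKEN:-}` is safe when no token is exported: the variable simply resolves to an empty string rather than failing.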
Follow-up review comments:

- Should we also add a note about the vLLM-related params that are open to be overridden, like
- On the image tag `v0.13.0`: this tag works on an Ice Lake device but does not work on an Arrow Lake device; the container stays in a restarting state. The latest tag `v0.17.1` works after removing `- "--disable-log-requests"`.
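The healthcheck in the overlay deliberately gives the CPU backend generous startup room: up to 40 probes at 30 s intervals after a 60 s start period, since downloading and loading the model on CPU can take many minutes. A simplified sketch of the equivalent wait-for-healthy loop, with the probe and sleep injected so it can be exercised without a live server (names here are illustrative, not part of the PR):

```python
import time
from typing import Callable

def wait_for_healthy(probe: Callable[[], bool],
                     retries: int = 40,
                     interval_s: float = 30.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `probe` (e.g. a GET on /health) until it succeeds or
    `retries` attempts are exhausted, sleeping between attempts.
    Approximates the compose healthcheck; Compose itself counts
    consecutive failures rather than total attempts."""
    for _ in range(retries):
        if probe():
            return True
        sleep(interval_s)
    return False

# Exercise with a stub probe that becomes healthy on the third attempt.
calls = {"n": 0}
def stub_probe() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3

assert wait_for_healthy(stub_probe, retries=5, interval_s=0, sleep=lambda s: None)
assert calls["n"] == 3
```

With the overlay's defaults this tolerates roughly 20 minutes of startup, which also explains why `pipeline-manager` gates on `condition: service_healthy` rather than plain `depends_on`.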