Add vLLM CPU inference support for docker compose setup#1967
zahidulhaque wants to merge 11 commits into open-edge-platform:main from
Conversation
Signed-off-by: Zahidul Haque <zahidul.haque@intel.com>
vllm-cpu-service:
  profiles:
    - vllm
  image: ${VLLM_IMAGE:-public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.13.0}
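Since the service sits behind a Compose profile, a user has to opt in explicitly. A minimal sketch of how that could look, assuming the standard Compose `.env` mechanism next to the compose file (the `COMPOSE_PROFILES` variable is standard Docker Compose; the image value simply restates the default above):

```
# .env — enable the optional vLLM CPU backend
COMPOSE_PROFILES=vllm
VLLM_IMAGE=public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.13.0
```

Alternatively, `docker compose --profile vllm up -d` activates the profile for a single invocation without touching `.env`.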
This tag works on an Ice Lake device but not on an Arrow Lake device: the container is stuck in a restarting state. I tried the latest tag, v0.17.1, which works after removing `- "--disable-log-requests"`.
Should we also add a note about the vLLM parameters that are open to being overridden, such as VLLM_MAX_NUM_BATCHED_TOKENS, VLLM_BLOCK_SIZE, etc., and refer the user to the vLLM docs for a description of each parameter?
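Such a note could point at a `.env` override along these lines (the values are purely illustrative; the variable names feed the compose defaults shown in this diff, and the semantics of the underlying vLLM options are documented upstream):

```
# .env — illustrative tuning overrides for the vLLM CPU backend
VLLM_MAX_NUM_BATCHED_TOKENS=4096   # scheduler token budget per step (example value)
VLLM_BLOCK_SIZE=16                 # KV-cache block size (example value)
```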
    - "${VLLM_HOST_PORT:-8200}:8000"
  ipc: "host"
  environment:
    no_proxy: ${no_proxy},localhost
minio is missing from no_proxy.
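A fix along the lines suggested, assuming the MinIO service is reachable under the hostname `minio` in this compose network, would be:

```yaml
environment:
  no_proxy: ${no_proxy},localhost,minio
```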
    VLLM_RPC_TIMEOUT: ${VLLM_RPC_TIMEOUT:-100000}
    VLLM_ALLOW_LONG_MAX_MODEL_LEN: ${VLLM_ALLOW_LONG_MAX_MODEL_LEN:-1}
    VLLM_ENGINE_ITERATION_TIMEOUT_S: ${VLLM_ENGINE_ITERATION_TIMEOUT_S:-120}
    VLLM_CPU_NUM_OF_RESERVED_CPU: ${VLLM_CPU_NUM_OF_RESERVED_CPU:-0}
Add a VLLM_LOGGING_LEVEL parameter that the user can override to get debug logs.
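A sketch of the suggested addition in the same defaulted-variable style as the other environment entries (VLLM_LOGGING_LEVEL is vLLM's logging-level environment variable; INFO is its usual default):

```yaml
environment:
  VLLM_LOGGING_LEVEL: ${VLLM_LOGGING_LEVEL:-INFO}  # set to DEBUG for verbose engine logs
```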
    - "${VLLM_MAX_NUM_BATCHED_TOKENS:-2048}"
    - "--max-num-seqs"
    - "${VLLM_MAX_NUM_SEQS:-256}"
    - "--disable-log-requests"
Add an override for --max-model-len: the default 4096-token context length will block summary of summaries.
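In the same command-list style as the existing flags, the suggested override could look like this (`--max-model-len` is a standard vLLM server flag; the 8192 default here is only an example value):

```yaml
command:
  - "--max-model-len"
  - "${VLLM_MAX_MODEL_LEN:-8192}"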
Description
Add support for vLLM as an alternative inference backend for the Video Search and Summarization application. This change allows users to run both VLM captioning and LLM summarization tasks using vLLM on CPU, without requiring GPU resources or OpenVINO Model Server (OVMS) microservices.
Fixes # (issue)
Key Changes:
Benefits:
Any Newly Introduced Dependencies
No new third-party dependencies are introduced. The vLLM service uses a pre-built Docker image (public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.13.0) that is already compiled and optimized. The solution leverages the existing environment configuration and adds no new library dependencies to the project.
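Since the image serves vLLM's OpenAI-compatible API (mapped to host port 8200 in this compose file), a quick smoke test could build a chat-completion request like the sketch below. The endpoint path follows the OpenAI API convention vLLM implements; the model name is a placeholder, and the actual POST is left commented out because it needs the stack running.

```python
import json

# Hypothetical endpoint from this compose setup: host port 8200 -> container port 8000.
BASE_URL = "http://localhost:8200/v1/chat/completions"

def build_payload(model: str, prompt: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-compatible chat-completion request body for vLLM."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = build_payload("placeholder-model", "Summarize the last video chunk.")
body = json.dumps(payload)

# With the stack up, the request would be sent roughly like this:
# import urllib.request
# req = urllib.request.Request(BASE_URL, data=body.encode(),
#                              headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read().decode())
print(body)
```

Swapping the placeholder model for whichever model the VLM captioning or LLM summarization service is configured to load is all that should be needed to exercise the backend end to end.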
How Has This Been Tested?
Checklist: