Add vLLM CPU inference support for docker compose setup#1967

Open
zahidulhaque wants to merge 11 commits into open-edge-platform:main from zahidulhaque:feat/vllm-compose-support

Conversation

@zahidulhaque
Contributor

Description

Add support for vLLM as an alternative inference backend for the Video Search and Summarization application. This change lets users run both VLM captioning and LLM summarization with vLLM on CPU, without requiring GPU resources or the OpenVINO Model Server (OVMS) microservices.

Fixes # (issue)

Key Changes:

  • Created a new Docker Compose overlay file (compose.vllm.yaml) that configures vLLM CPU service with optimized settings for video processing
  • Added profile-based service management for clean isolation between inference backends (vlm, ovms, vllm)
  • Added environment variables (ENABLE_VLLM, VLLM_HOST, VLLM_ENDPOINT, etc.) for vLLM configuration
  • Updated setup.sh to handle vLLM backend selection and disable conflicting configurations
  • Enhanced model validation to check for specific OpenVINO artifact files (.xml and .bin) rather than directory existence
  • Updated documentation with vLLM deployment option and usage instructions
  • Made the vlm-openvino-serving dependency optional in pipeline-manager to support profile-based service selection
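A minimal sketch of what the profile-based overlay might look like (the service name, image reference, and variable names are taken from the snippets reviewed in this PR; the exact contents of compose.vllm.yaml may differ):

```yaml
# compose.vllm.yaml (sketch) -- vLLM CPU backend behind a compose profile
services:
  vllm-cpu-service:
    profiles:
      - vllm                    # started only with `docker compose --profile vllm up`
    image: ${VLLM_IMAGE:-public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.13.0}
    ports:
      - "${VLLM_HOST_PORT:-8200}:8000"
    ipc: "host"
    environment:
      no_proxy: ${no_proxy},localhost
```

Because the service carries a profile, it is ignored by a plain `docker compose up`, which is what keeps the vlm/ovms/vllm backends cleanly isolated.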

Benefits:

  • Enables CPU-only deployments without GPU or microservices overhead
  • Provides users with more deployment flexibility and options
  • Maintains backward compatibility with existing OVMS and VLM configurations
  • Leverages vLLM's performance optimizations for efficient CPU inference

Any Newly Introduced Dependencies

No new 3rd-party dependencies are introduced. The vLLM service uses a pre-built Docker image (public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.13.0) that is already compiled and optimized. The solution leverages existing environment configurations and does not add any new library dependencies to the project.

How Has This Been Tested?

# Test vLLM deployment:
cd sample-applications/video-search-and-summarization
ENABLE_VLLM=true source setup.sh --summary
# Verify configuration:
ENABLE_VLLM=true source setup.sh --summary config

# Standard OVMS setup should still work
ENABLE_OVMS_LLM_SUMMARY=true source setup.sh --summary

# VLM-only setup should still work
source setup.sh --summary

# Cleanup: this calls stop_containers
ENABLE_VLLM=true source setup.sh --clean-data

Checklist:

  • I agree to use the APACHE-2.0 license for my code changes.
  • I have not introduced any 3rd party components incompatible with APACHE-2.0.
  • I have not included any company confidential information, trade secret, password or security token.
  • I have performed a self-review of my code.

Signed-off-by: Zahidul Haque <zahidul.haque@intel.com>
vllm-cpu-service:
  profiles:
    - vllm
  image: ${VLLM_IMAGE:-public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.13.0}
This tag works on an Ice Lake device but not on an Arrow Lake device; the container stays in a restarting state there. I tried the latest tag, v0.17.1, which works after removing `- "--disable-log-requests"`.
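One possible workaround, assuming the overlay keeps the `VLLM_IMAGE` variable shown in this PR, is to point the service at the newer tag before bringing the stack up (tag choice per the reviewer's report above):

```shell
# Override the default image tag via the VLLM_IMAGE variable the
# compose overlay already exposes (workaround for Arrow Lake devices).
export VLLM_IMAGE="public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.17.1"
echo "$VLLM_IMAGE"
```

Note that per the comment above, v0.17.1 also requires dropping the `--disable-log-requests` flag from the service command.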


Should we also add note about vllm related params open to be overridden like VLLM_MAX_NUM_BATCHED_TOKENS, VLLM_BLOCK_SIZE etc. -> refer user to vllm docs for description on params
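A sketch of how such overrides could be surfaced in the overlay (the `--max-num-batched-tokens` pattern appears in this PR's snippets; `--block-size` and the default values here are illustrative):

```yaml
# Illustrative user-overridable tuning knobs; see the vLLM engine
# arguments documentation for what each flag means.
command:
  - "--max-num-batched-tokens"
  - "${VLLM_MAX_NUM_BATCHED_TOKENS:-2048}"
  - "--block-size"
  - "${VLLM_BLOCK_SIZE:-16}"
```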

ports:
  - "${VLLM_HOST_PORT:-8200}:8000"
ipc: "host"
environment:
  no_proxy: ${no_proxy},localhost

`minio` is missing from `no_proxy`.
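The fix would presumably be a one-line change (the `minio` hostname is assumed from the reviewer's comment):

```yaml
environment:
  no_proxy: ${no_proxy},localhost,minio
```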

VLLM_RPC_TIMEOUT: ${VLLM_RPC_TIMEOUT:-100000}
VLLM_ALLOW_LONG_MAX_MODEL_LEN: ${VLLM_ALLOW_LONG_MAX_MODEL_LEN:-1}
VLLM_ENGINE_ITERATION_TIMEOUT_S: ${VLLM_ENGINE_ITERATION_TIMEOUT_S:-120}
VLLM_CPU_NUM_OF_RESERVED_CPU: ${VLLM_CPU_NUM_OF_RESERVED_CPU:-0}

Add a VLLM_LOGGING_LEVEL parameter that users can override to enable debug logs.
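A one-line sketch of the suggested override, following the default-with-override pattern the rest of this environment block already uses (the INFO default is illustrative):

```yaml
environment:
  VLLM_LOGGING_LEVEL: ${VLLM_LOGGING_LEVEL:-INFO}   # set to DEBUG for verbose logs
```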

- "${VLLM_MAX_NUM_BATCHED_TOKENS:-2048}"
- "--max-num-seqs"
- "${VLLM_MAX_NUM_SEQS:-256}"
- "--disable-log-requests"

Add an override for max_model_len: the default 4096-token context length will block summary-of-summaries.
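A sketch of the suggested override in the same style as the other command arguments (the variable name and the 8192 default are illustrative, not values from this PR):

```yaml
command:
  - "--max-model-len"
  - "${VLLM_MAX_MODEL_LEN:-8192}"
```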


@bhardwaj-nakul bhardwaj-nakul left a comment


export VLM_MODEL_NAME="Qwen/Qwen2.5-VL-3B-Instruct" is not working for vLLM on Xeon or Arrow Lake devices.
We also need to validate against export VLM_MODEL_NAME="microsoft/Phi-3.5-vision-instruct", which works on Arrow Lake but not on Ice Lake.

