Commit f5c60f1

vLLM support for FAQGen (opea-project#884)
* Add model parameter for FaqGenGateway in gateway.py file. Signed-off-by: sgurunat <[email protected]>
* Add langchain vllm support for FaqGen along with authentication support for vllm endpoints. Signed-off-by: sgurunat <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks. For more information, see https://pre-commit.ci
* Updated docker_compose_llm.yaml and README file with vLLM information. Signed-off-by: sgurunat <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks. For more information, see https://pre-commit.ci
* Updated faq-vllm Dockerfile into llm-compose-cd.yaml under github workflows. Signed-off-by: sgurunat <[email protected]>
* Updated llm-compose.yaml file to include vllm faqgen build. Signed-off-by: sgurunat <[email protected]>

Signed-off-by: sgurunat <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent baafa40 commit f5c60f1

10 files changed: +281 −0 lines changed

.github/workflows/docker/compose/llms-compose.yaml

Lines changed: 4 additions & 0 deletions
@@ -58,3 +58,7 @@ services:
     build:
       dockerfile: comps/llms/text-generation/predictionguard/Dockerfile
     image: ${REGISTRY:-opea}/llm-textgen-predictionguard:${TAG:-latest}
+  llm-faqgen-vllm:
+    build:
+      dockerfile: comps/llms/faq-generation/vllm/langchain/Dockerfile
+    image: ${REGISTRY:-opea}/llm-faqgen-vllm:${TAG:-latest}

comps/cores/mega/gateway.py

Lines changed: 1 addition & 0 deletions
@@ -581,6 +581,7 @@ async def handle_request(self, request: Request, files: List[UploadFile] = File(
             presence_penalty=chat_request.presence_penalty if chat_request.presence_penalty else 0.0,
             repetition_penalty=chat_request.repetition_penalty if chat_request.repetition_penalty else 1.03,
             streaming=stream_opt,
+            model=chat_request.model if chat_request.model else None,
         )
         result_dict, runtime_graph = await self.megaservice.schedule(
             initial_inputs={"query": prompt}, llm_parameters=parameters
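The new `model` field lets a FAQGen request select the serving model explicitly; when it is omitted, the downstream FAQGen microservice falls back to its `LLM_MODEL_ID` environment variable (see `llm.py` below). A minimal client sketch follows, assuming a FaqGen megaservice gateway listening at `http://localhost:8888/v1/faqgen` and a ChatCompletion-style request body; the host, port, field names other than `model`, and the model name are illustrative assumptions, not taken from this commit.

```python
# Hedged sketch: exercising the optional "model" field added for FaqGenGateway.
# Endpoint, port, and model name below are assumptions for illustration only.
import requests

GATEWAY_URL = "http://localhost:8888/v1/faqgen"  # assumed FaqGen gateway address

payload = {
    # ChatCompletion-style fields; "model" is optional and, when omitted,
    # the FAQGen microservice falls back to its LLM_MODEL_ID setting.
    "messages": "Text Embeddings Inference (TEI) is a toolkit for deploying and "
    "serving open source text embeddings and sequence classification models.",
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "stream": False,
}

response = requests.post(GATEWAY_URL, json=payload, timeout=120)
response.raise_for_status()
print(response.json())
```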
comps/llms/faq-generation/vllm/langchain/Dockerfile

Lines changed: 25 additions & 0 deletions

@@ -0,0 +1,25 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

FROM python:3.11-slim

RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \
    libgl1-mesa-glx \
    libjemalloc-dev

RUN useradd -m -s /bin/bash user && \
    mkdir -p /home/user && \
    chown -R user /home/user/

USER user

COPY comps /home/user/comps

RUN pip install --no-cache-dir --upgrade pip setuptools && \
    pip install --no-cache-dir -r /home/user/comps/llms/faq-generation/vllm/langchain/requirements.txt

ENV PYTHONPATH=$PYTHONPATH:/home/user

WORKDIR /home/user/comps/llms/faq-generation/vllm/langchain

ENTRYPOINT ["bash", "entrypoint.sh"]
comps/llms/faq-generation/vllm/langchain/README.md

Lines changed: 77 additions & 0 deletions

@@ -0,0 +1,77 @@
# vLLM FAQGen LLM Microservice

This microservice interacts with the vLLM server to generate FAQs from input text. [vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference and serving. It delivers state-of-the-art serving throughput with advanced features such as PagedAttention and continuous batching. Besides GPUs, vLLM already supports [Intel CPUs](https://www.intel.com/content/www/us/en/products/overview.html) and [Gaudi accelerators](https://habana.ai/products).

## 🚀1. Start Microservice with Docker

If you start the LLM microservice with Docker Compose, the `docker_compose_llm.yaml` file will automatically start a vLLM service alongside it.

To set up or build the vLLM image, follow the instructions provided in [vLLM Gaudi](https://github.com/opea-project/GenAIComps/tree/main/comps/llms/text-generation/vllm/langchain#22-vllm-on-gaudi).

### 1.1 Setup Environment Variables

To start the vLLM and LLM services, you need to set up the following environment variables first.

```bash
export HF_TOKEN=${your_hf_api_token}
export vLLM_ENDPOINT="http://${your_ip}:8008"
export LLM_MODEL_ID=${your_hf_llm_model}
```
### 1.2 Build Docker Image

```bash
cd ../../../../../
docker build -t opea/llm-faqgen-vllm:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/llms/faq-generation/vllm/langchain/Dockerfile .
```
To start a Docker container, you have two options:

- A. Run Docker with CLI
- B. Run Docker with Docker Compose

You can choose one as needed.

### 1.3 Run Docker with CLI (Option A)

```bash
docker run -d -p 8008:80 -v ./data:/data --name vllm-service --shm-size 1g opea/vllm:hpu --model-id ${LLM_MODEL_ID}
```

```bash
docker run -d --name="llm-faqgen-server" -p 9000:9000 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e vLLM_ENDPOINT=$vLLM_ENDPOINT -e HUGGINGFACEHUB_API_TOKEN=$HF_TOKEN opea/llm-faqgen-vllm:latest
```

### 1.4 Run Docker with Docker Compose (Option B)

```bash
docker compose -f docker_compose_llm.yaml up -d
```
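Before consuming the FAQGen LLM service, you can optionally confirm that the vLLM server itself responds. The sketch below is not part of this commit; it assumes vLLM's standard OpenAI-compatible REST API (`/v1/models`, `/v1/completions`) and reuses the environment variables from section 1.1, with placeholder fallback values.

```python
# Optional sanity check (not part of this commit): query the vLLM server's
# OpenAI-compatible API directly before wiring in the FAQGen microservice.
import os
import requests

vllm_endpoint = os.getenv("vLLM_ENDPOINT", "http://localhost:8008")

# List the models the vLLM server is serving.
print(requests.get(f"{vllm_endpoint}/v1/models", timeout=30).json())

# Ask for a short completion from the served model (fallback name is a placeholder).
completion = requests.post(
    f"{vllm_endpoint}/v1/completions",
    json={
        "model": os.getenv("LLM_MODEL_ID", "meta-llama/Meta-Llama-3-8B-Instruct"),
        "prompt": "Write one FAQ about vLLM.",
        "max_tokens": 64,
    },
    timeout=120,
).json()
print(completion)
```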
## 🚀2. Consume LLM Service

### 2.1 Check Service Status

```bash
curl http://${your_ip}:9000/v1/health_check \
  -X GET \
  -H 'Content-Type: application/json'
```
### 2.2 Consume FAQGen LLM Service

```bash
# Streaming Response
# Set streaming to True. Default will be True.
curl http://${your_ip}:9000/v1/faqgen \
  -X POST \
  -d '{"query":"Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5."}' \
  -H 'Content-Type: application/json'

# Non-Streaming Response
# Set streaming to False.
curl http://${your_ip}:9000/v1/faqgen \
  -X POST \
  -d '{"query":"Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5.", "streaming":false}' \
  -H 'Content-Type: application/json'
```
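For programmatic access, a small Python client can be used instead of curl. This is a hedged sketch, not part of this commit: it assumes the same `${your_ip}:9000` endpoint as above, and that each streamed SSE `data:` line carries a serialized langserve log patch ending with `[DONE]`, as produced by the microservice's `stream_generator`.

```python
# Minimal streaming client sketch (not part of this commit).
import json
import requests

url = "http://localhost:9000/v1/faqgen"  # replace localhost with ${your_ip}
payload = {
    "query": "Text Embeddings Inference (TEI) is a toolkit for deploying "
    "and serving open source text embeddings and sequence classification models."
}

with requests.post(url, json=payload, stream=True, timeout=300) as resp:
    for raw in resp.iter_lines(decode_unicode=True):
        if not raw or not raw.startswith("data: "):
            continue  # skip SSE keep-alive blank lines
        data = raw[len("data: "):]
        if data == "[DONE]":
            break
        patch = json.loads(data)  # {"ops": [...]} log patch from astream_log
        print(patch)
```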
Lines changed: 2 additions & 0 deletions

@@ -0,0 +1,2 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
comps/llms/faq-generation/vllm/langchain/docker_compose_llm.yaml

Lines changed: 46 additions & 0 deletions

@@ -0,0 +1,46 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

version: "3.8"

services:
  vllm-service:
    image: opea/vllm:hpu
    container_name: vllm-gaudi-server
    ports:
      - "8008:80"
    volumes:
      - "./data:/data"
    environment:
      no_proxy: ${no_proxy}
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
      HF_TOKEN: ${HF_TOKEN}
      HABANA_VISIBLE_DEVICES: all
      OMPI_MCA_btl_vader_single_copy_mechanism: none
      LLM_MODEL_ID: ${LLM_MODEL_ID}
    runtime: habana
    cap_add:
      - SYS_NICE
    ipc: host
    command: --enforce-eager --model $LLM_MODEL_ID --tensor-parallel-size 1 --host 0.0.0.0 --port 80
  llm:
    image: opea/llm-faqgen-vllm:latest
    container_name: llm-faqgen-server
    depends_on:
      - vllm-service
    ports:
      - "9000:9000"
    ipc: host
    environment:
      no_proxy: ${no_proxy}
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
      vLLM_ENDPOINT: ${vLLM_ENDPOINT}
      HUGGINGFACEHUB_API_TOKEN: ${HF_TOKEN}
      LLM_MODEL_ID: ${LLM_MODEL_ID}
    restart: unless-stopped

networks:
  default:
    driver: bridge
comps/llms/faq-generation/vllm/langchain/entrypoint.sh

Lines changed: 8 additions & 0 deletions

@@ -0,0 +1,8 @@
#!/usr/bin/env bash

# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

pip --no-cache-dir install -r requirements-runtime.txt

python llm.py
comps/llms/faq-generation/vllm/langchain/llm.py

Lines changed: 102 additions & 0 deletions

@@ -0,0 +1,102 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import os

from fastapi.responses import StreamingResponse
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document
from langchain.prompts import PromptTemplate
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.llms import VLLMOpenAI

from comps import CustomLogger, GeneratedDoc, LLMParamsDoc, ServiceType, opea_microservices, register_microservice
from comps.cores.mega.utils import get_access_token

logger = CustomLogger("llm_faqgen")
logflag = os.getenv("LOGFLAG", False)

# Environment variables
TOKEN_URL = os.getenv("TOKEN_URL")
CLIENTID = os.getenv("CLIENTID")
CLIENT_SECRET = os.getenv("CLIENT_SECRET")


def post_process_text(text: str):
    if text == " ":
        return "data: @#$\n\n"
    if text == "\n":
        return "data: <br/>\n\n"
    if text.isspace():
        return None
    new_text = text.replace(" ", "@#$")
    return f"data: {new_text}\n\n"


@register_microservice(
    name="opea_service@llm_faqgen",
    service_type=ServiceType.LLM,
    endpoint="/v1/faqgen",
    host="0.0.0.0",
    port=9000,
)
async def llm_generate(input: LLMParamsDoc):
    if logflag:
        logger.info(input)
    # Optionally obtain a bearer token when the vLLM endpoint requires authentication.
    access_token = (
        get_access_token(TOKEN_URL, CLIENTID, CLIENT_SECRET) if TOKEN_URL and CLIENTID and CLIENT_SECRET else None
    )
    headers = {}
    if access_token:
        headers = {"Authorization": f"Bearer {access_token}"}

    model = input.model if input.model else os.getenv("LLM_MODEL_ID")
    llm = VLLMOpenAI(
        openai_api_key="EMPTY",
        openai_api_base=llm_endpoint + "/v1",
        model_name=model,
        default_headers=headers,
        max_tokens=input.max_tokens,
        top_p=input.top_p,
        streaming=input.streaming,
        temperature=input.temperature,
    )

    templ = """Create a concise FAQs (frequently asked questions and answers) for following text:
        TEXT: {text}
        Do not use any prefix or suffix to the FAQ.
    """
    PROMPT = PromptTemplate.from_template(templ)
    llm_chain = load_summarize_chain(llm=llm, prompt=PROMPT)
    texts = text_splitter.split_text(input.query)

    # Create multiple documents
    docs = [Document(page_content=t) for t in texts]

    if input.streaming:

        async def stream_generator():
            from langserve.serialization import WellKnownLCSerializer

            _serializer = WellKnownLCSerializer()
            async for chunk in llm_chain.astream_log(docs):
                data = _serializer.dumps({"ops": chunk.ops}).decode("utf-8")
                if logflag:
                    logger.info(data)
                yield f"data: {data}\n\n"
            yield "data: [DONE]\n\n"

        return StreamingResponse(stream_generator(), media_type="text/event-stream")
    else:
        response = await llm_chain.ainvoke(docs)
        response = response["output_text"]
        if logflag:
            logger.info(response)
        return GeneratedDoc(text=response, prompt=input.query)


if __name__ == "__main__":
    # llm_endpoint and text_splitter are module-level globals read by llm_generate;
    # they are initialized here before the microservice starts serving requests.
    llm_endpoint = os.getenv("vLLM_ENDPOINT", "http://localhost:8080")
    # Split text
    text_splitter = CharacterTextSplitter()
    opea_microservices["opea_service@llm_faqgen"].start()
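To make the vLLM and authentication wiring in `llm.py` easier to follow, here is a standalone sketch (not part of this commit) that mirrors the same `VLLMOpenAI` construction. The endpoint, model, and `ACCESS_TOKEN` variable are placeholders, and the bearer token stands in for whatever `get_access_token` would return.

```python
# Standalone sketch mirroring llm.py's VLLMOpenAI wiring: an OpenAI-compatible
# vLLM endpoint plus an optional bearer token passed via default_headers.
import os

from langchain_community.llms import VLLMOpenAI

llm_endpoint = os.getenv("vLLM_ENDPOINT", "http://localhost:8008")
model = os.getenv("LLM_MODEL_ID", "meta-llama/Meta-Llama-3-8B-Instruct")  # placeholder fallback

# If the endpoint sits behind an auth proxy, a pre-fetched token would be
# forwarded like this (ACCESS_TOKEN is a hypothetical variable for this sketch).
headers = {}
token = os.getenv("ACCESS_TOKEN")
if token:
    headers = {"Authorization": f"Bearer {token}"}

llm = VLLMOpenAI(
    openai_api_key="EMPTY",
    openai_api_base=llm_endpoint + "/v1",
    model_name=model,
    default_headers=headers,
    max_tokens=128,
    temperature=0.8,
)

print(llm.invoke("Create a concise FAQ for the following text: vLLM is a fast LLM serving library."))
```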
comps/llms/faq-generation/vllm/langchain/requirements-runtime.txt

Lines changed: 1 addition & 0 deletions

@@ -0,0 +1 @@
langserve
comps/llms/faq-generation/vllm/langchain/requirements.txt

Lines changed: 15 additions & 0 deletions

@@ -0,0 +1,15 @@
docarray[full]
fastapi
huggingface_hub
langchain
langchain-huggingface
langchain-openai
langchain_community
langchainhub
opentelemetry-api
opentelemetry-exporter-otlp
opentelemetry-sdk
prometheus-fastapi-instrumentator
shortuuid
transformers
uvicorn
