Commit f5c60f1

vLLM support for FAQGen (opea-project#884)
* Add model parameter for FaqGenGateway in gateway.py file. Signed-off-by: sgurunat <[email protected]>
* Add langchain vllm support for FaqGen along with authentication support for vllm endpoints. Signed-off-by: sgurunat <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks. For more information, see https://pre-commit.ci
* Updated docker_compose_llm.yaml and README file with vLLM information. Signed-off-by: sgurunat <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks. For more information, see https://pre-commit.ci
* Updated faq-vllm Dockerfile into llm-compose-cd.yaml under github workflows. Signed-off-by: sgurunat <[email protected]>
* Updated llm-compose.yaml file to include vllm faqgen build. Signed-off-by: sgurunat <[email protected]>

Signed-off-by: sgurunat <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent baafa40 commit f5c60f1

10 files changed: +281 −0 lines changed

.github/workflows/docker/compose/llms-compose.yaml

Lines changed: 4 additions & 0 deletions
@@ -58,3 +58,7 @@ services:
     build:
       dockerfile: comps/llms/text-generation/predictionguard/Dockerfile
     image: ${REGISTRY:-opea}/llm-textgen-predictionguard:${TAG:-latest}
+  llm-faqgen-vllm:
+    build:
+      dockerfile: comps/llms/faq-generation/vllm/langchain/Dockerfile
+    image: ${REGISTRY:-opea}/llm-faqgen-vllm:${TAG:-latest}

comps/cores/mega/gateway.py

Lines changed: 1 addition & 0 deletions
@@ -581,6 +581,7 @@ async def handle_request(self, request: Request, files: List[UploadFile] = File(
             presence_penalty=chat_request.presence_penalty if chat_request.presence_penalty else 0.0,
             repetition_penalty=chat_request.repetition_penalty if chat_request.repetition_penalty else 1.03,
             streaming=stream_opt,
+            model=chat_request.model if chat_request.model else None,
         )
         result_dict, runtime_graph = await self.megaservice.schedule(
             initial_inputs={"query": prompt}, llm_parameters=parameters
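The new `model` field lets a FAQGen request select the serving model explicitly; when it is omitted, the downstream FAQGen microservice falls back to its `LLM_MODEL_ID` environment variable (see `llm.py` below). A minimal client sketch follows, assuming a FaqGen megaservice gateway listening at `http://localhost:8888/v1/faqgen` and a ChatCompletion-style request body; the host, port, field names other than `model`, and the model name are illustrative assumptions, not taken from this commit.

```python
# Hedged sketch: exercising the optional "model" field added for FaqGenGateway.
# Endpoint, port, and model name below are assumptions for illustration only.
import requests

GATEWAY_URL = "http://localhost:8888/v1/faqgen"  # assumed FaqGen gateway address

payload = {
    # ChatCompletion-style fields; "model" is optional and, when omitted,
    # the FAQGen microservice falls back to its LLM_MODEL_ID setting.
    "messages": "Text Embeddings Inference (TEI) is a toolkit for deploying and "
    "serving open source text embeddings and sequence classification models.",
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "stream": False,
}

response = requests.post(GATEWAY_URL, json=payload, timeout=120)
response.raise_for_status()
print(response.json())
```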
comps/llms/faq-generation/vllm/langchain/Dockerfile

Lines changed: 25 additions & 0 deletions

@@ -0,0 +1,25 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

FROM python:3.11-slim

RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \
    libgl1-mesa-glx \
    libjemalloc-dev

RUN useradd -m -s /bin/bash user && \
    mkdir -p /home/user && \
    chown -R user /home/user/

USER user

COPY comps /home/user/comps

RUN pip install --no-cache-dir --upgrade pip setuptools && \
    pip install --no-cache-dir -r /home/user/comps/llms/faq-generation/vllm/langchain/requirements.txt

ENV PYTHONPATH=$PYTHONPATH:/home/user

WORKDIR /home/user/comps/llms/faq-generation/vllm/langchain

ENTRYPOINT ["bash", "entrypoint.sh"]
comps/llms/faq-generation/vllm/langchain/README.md

Lines changed: 77 additions & 0 deletions

@@ -0,0 +1,77 @@
# vLLM FAQGen LLM Microservice

This microservice interacts with the vLLM server to generate FAQs from input text. [vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference and serving. It delivers state-of-the-art serving throughput with advanced features such as PagedAttention and continuous batching. Besides GPUs, vLLM already supports [Intel CPUs](https://www.intel.com/content/www/us/en/products/overview.html) and [Gaudi accelerators](https://habana.ai/products).

## 🚀1. Start Microservice with Docker

If you start the LLM microservice with Docker Compose, the `docker_compose_llm.yaml` file will automatically start a vLLM service alongside it.

To set up or build the vLLM image, follow the instructions provided in [vLLM Gaudi](https://github.com/opea-project/GenAIComps/tree/main/comps/llms/text-generation/vllm/langchain#22-vllm-on-gaudi).

### 1.1 Setup Environment Variables

To start the vLLM and LLM services, you need to set up the following environment variables first.

```bash
export HF_TOKEN=${your_hf_api_token}
export vLLM_ENDPOINT="http://${your_ip}:8008"
export LLM_MODEL_ID=${your_hf_llm_model}
```
### 1.2 Build Docker Image

```bash
cd ../../../../../
docker build -t opea/llm-faqgen-vllm:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/llms/faq-generation/vllm/langchain/Dockerfile .
```
To start a Docker container, you have two options:

- A. Run Docker with CLI
- B. Run Docker with Docker Compose

You can choose one as needed.

### 1.3 Run Docker with CLI (Option A)

```bash
docker run -d -p 8008:80 -v ./data:/data --name vllm-service --shm-size 1g opea/vllm:hpu --model-id ${LLM_MODEL_ID}
```

```bash
docker run -d --name="llm-faqgen-server" -p 9000:9000 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e vLLM_ENDPOINT=$vLLM_ENDPOINT -e HUGGINGFACEHUB_API_TOKEN=$HF_TOKEN opea/llm-faqgen-vllm:latest
```

### 1.4 Run Docker with Docker Compose (Option B)

```bash
docker compose -f docker_compose_llm.yaml up -d
```
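Before consuming the FAQGen LLM service, you can optionally confirm that the vLLM server itself responds. The sketch below is not part of this commit; it assumes vLLM's standard OpenAI-compatible REST API (`/v1/models`, `/v1/completions`) and reuses the environment variables from section 1.1, with placeholder fallback values.

```python
# Optional sanity check (not part of this commit): query the vLLM server's
# OpenAI-compatible API directly before wiring in the FAQGen microservice.
import os
import requests

vllm_endpoint = os.getenv("vLLM_ENDPOINT", "http://localhost:8008")

# List the models the vLLM server is serving.
print(requests.get(f"{vllm_endpoint}/v1/models", timeout=30).json())

# Ask for a short completion from the served model (fallback name is a placeholder).
completion = requests.post(
    f"{vllm_endpoint}/v1/completions",
    json={
        "model": os.getenv("LLM_MODEL_ID", "meta-llama/Meta-Llama-3-8B-Instruct"),
        "prompt": "Write one FAQ about vLLM.",
        "max_tokens": 64,
    },
    timeout=120,
).json()
print(completion)
```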
## 🚀2. Consume LLM Service

### 2.1 Check Service Status

```bash
curl http://${your_ip}:9000/v1/health_check \
  -X GET \
  -H 'Content-Type: application/json'
```
### 2.2 Consume FAQGen LLM Service

```bash
# Streaming Response
# Set streaming to True. Default will be True.
curl http://${your_ip}:9000/v1/faqgen \
  -X POST \
  -d '{"query":"Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5."}' \
  -H 'Content-Type: application/json'

# Non-Streaming Response
# Set streaming to False.
curl http://${your_ip}:9000/v1/faqgen \
  -X POST \
  -d '{"query":"Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5.", "streaming":false}' \
  -H 'Content-Type: application/json'
```
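For programmatic access, a small Python client can be used instead of curl. This is a hedged sketch, not part of this commit: it assumes the same `${your_ip}:9000` endpoint as above, and that each streamed SSE `data:` line carries a serialized langserve log patch ending with `[DONE]`, as produced by the microservice's `stream_generator`.

```python
# Minimal streaming client sketch (not part of this commit).
import json
import requests

url = "http://localhost:9000/v1/faqgen"  # replace localhost with ${your_ip}
payload = {
    "query": "Text Embeddings Inference (TEI) is a toolkit for deploying "
    "and serving open source text embeddings and sequence classification models."
}

with requests.post(url, json=payload, stream=True, timeout=300) as resp:
    for raw in resp.iter_lines(decode_unicode=True):
        if not raw or not raw.startswith("data: "):
            continue  # skip SSE keep-alive blank lines
        data = raw[len("data: "):]
        if data == "[DONE]":
            break
        patch = json.loads(data)  # {"ops": [...]} log patch from astream_log
        print(patch)
```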
Lines changed: 2 additions & 0 deletions

@@ -0,0 +1,2 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
comps/llms/faq-generation/vllm/langchain/docker_compose_llm.yaml

Lines changed: 46 additions & 0 deletions

@@ -0,0 +1,46 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

version: "3.8"

services:
  vllm-service:
    image: opea/vllm:hpu
    container_name: vllm-gaudi-server
    ports:
      - "8008:80"
    volumes:
      - "./data:/data"
    environment:
      no_proxy: ${no_proxy}
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
      HF_TOKEN: ${HF_TOKEN}
      HABANA_VISIBLE_DEVICES: all
      OMPI_MCA_btl_vader_single_copy_mechanism: none
      LLM_MODEL_ID: ${LLM_MODEL_ID}
    runtime: habana
    cap_add:
      - SYS_NICE
    ipc: host
    command: --enforce-eager --model $LLM_MODEL_ID --tensor-parallel-size 1 --host 0.0.0.0 --port 80
  llm:
    image: opea/llm-faqgen-vllm:latest
    container_name: llm-faqgen-server
    depends_on:
      - vllm-service
    ports:
      - "9000:9000"
    ipc: host
    environment:
      no_proxy: ${no_proxy}
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
      vLLM_ENDPOINT: ${vLLM_ENDPOINT}
      HUGGINGFACEHUB_API_TOKEN: ${HF_TOKEN}
      LLM_MODEL_ID: ${LLM_MODEL_ID}
    restart: unless-stopped

networks:
  default:
    driver: bridge
comps/llms/faq-generation/vllm/langchain/entrypoint.sh

Lines changed: 8 additions & 0 deletions

@@ -0,0 +1,8 @@
#!/usr/bin/env bash

# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

pip --no-cache-dir install -r requirements-runtime.txt

python llm.py
comps/llms/faq-generation/vllm/langchain/llm.py

Lines changed: 102 additions & 0 deletions

@@ -0,0 +1,102 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import os

from fastapi.responses import StreamingResponse
from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document
from langchain.prompts import PromptTemplate
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.llms import VLLMOpenAI

from comps import CustomLogger, GeneratedDoc, LLMParamsDoc, ServiceType, opea_microservices, register_microservice
from comps.cores.mega.utils import get_access_token

logger = CustomLogger("llm_faqgen")
logflag = os.getenv("LOGFLAG", False)

# Environment variables
TOKEN_URL = os.getenv("TOKEN_URL")
CLIENTID = os.getenv("CLIENTID")
CLIENT_SECRET = os.getenv("CLIENT_SECRET")


def post_process_text(text: str):
    if text == " ":
        return "data: @#$\n\n"
    if text == "\n":
        return "data: <br/>\n\n"
    if text.isspace():
        return None
    new_text = text.replace(" ", "@#$")
    return f"data: {new_text}\n\n"


@register_microservice(
    name="opea_service@llm_faqgen",
    service_type=ServiceType.LLM,
    endpoint="/v1/faqgen",
    host="0.0.0.0",
    port=9000,
)
async def llm_generate(input: LLMParamsDoc):
    if logflag:
        logger.info(input)
    # Optionally obtain a bearer token when the vLLM endpoint requires authentication.
    access_token = (
        get_access_token(TOKEN_URL, CLIENTID, CLIENT_SECRET) if TOKEN_URL and CLIENTID and CLIENT_SECRET else None
    )
    headers = {}
    if access_token:
        headers = {"Authorization": f"Bearer {access_token}"}

    model = input.model if input.model else os.getenv("LLM_MODEL_ID")
    llm = VLLMOpenAI(
        openai_api_key="EMPTY",
        openai_api_base=llm_endpoint + "/v1",
        model_name=model,
        default_headers=headers,
        max_tokens=input.max_tokens,
        top_p=input.top_p,
        streaming=input.streaming,
        temperature=input.temperature,
    )

    templ = """Create a concise FAQs (frequently asked questions and answers) for following text:
        TEXT: {text}
        Do not use any prefix or suffix to the FAQ.
    """
    PROMPT = PromptTemplate.from_template(templ)
    llm_chain = load_summarize_chain(llm=llm, prompt=PROMPT)
    texts = text_splitter.split_text(input.query)

    # Create multiple documents
    docs = [Document(page_content=t) for t in texts]

    if input.streaming:

        async def stream_generator():
            from langserve.serialization import WellKnownLCSerializer

            _serializer = WellKnownLCSerializer()
            async for chunk in llm_chain.astream_log(docs):
                data = _serializer.dumps({"ops": chunk.ops}).decode("utf-8")
                if logflag:
                    logger.info(data)
                yield f"data: {data}\n\n"
            yield "data: [DONE]\n\n"

        return StreamingResponse(stream_generator(), media_type="text/event-stream")
    else:
        response = await llm_chain.ainvoke(docs)
        response = response["output_text"]
        if logflag:
            logger.info(response)
        return GeneratedDoc(text=response, prompt=input.query)


if __name__ == "__main__":
    # llm_endpoint and text_splitter are module-level globals read by llm_generate;
    # they are initialized here before the microservice starts serving requests.
    llm_endpoint = os.getenv("vLLM_ENDPOINT", "http://localhost:8080")
    # Split text
    text_splitter = CharacterTextSplitter()
    opea_microservices["opea_service@llm_faqgen"].start()
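To make the vLLM and authentication wiring in `llm.py` easier to follow, here is a standalone sketch (not part of this commit) that mirrors the same `VLLMOpenAI` construction. The endpoint, model, and `ACCESS_TOKEN` variable are placeholders, and the bearer token stands in for whatever `get_access_token` would return.

```python
# Standalone sketch mirroring llm.py's VLLMOpenAI wiring: an OpenAI-compatible
# vLLM endpoint plus an optional bearer token passed via default_headers.
import os

from langchain_community.llms import VLLMOpenAI

llm_endpoint = os.getenv("vLLM_ENDPOINT", "http://localhost:8008")
model = os.getenv("LLM_MODEL_ID", "meta-llama/Meta-Llama-3-8B-Instruct")  # placeholder fallback

# If the endpoint sits behind an auth proxy, a pre-fetched token would be
# forwarded like this (ACCESS_TOKEN is a hypothetical variable for this sketch).
headers = {}
token = os.getenv("ACCESS_TOKEN")
if token:
    headers = {"Authorization": f"Bearer {token}"}

llm = VLLMOpenAI(
    openai_api_key="EMPTY",
    openai_api_base=llm_endpoint + "/v1",
    model_name=model,
    default_headers=headers,
    max_tokens=128,
    temperature=0.8,
)

print(llm.invoke("Create a concise FAQ for the following text: vLLM is a fast LLM serving library."))
```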
comps/llms/faq-generation/vllm/langchain/requirements-runtime.txt

Lines changed: 1 addition & 0 deletions

@@ -0,0 +1 @@
langserve
comps/llms/faq-generation/vllm/langchain/requirements.txt

Lines changed: 15 additions & 0 deletions

@@ -0,0 +1,15 @@
docarray[full]
fastapi
huggingface_hub
langchain
langchain-huggingface
langchain-openai
langchain_community
langchainhub
opentelemetry-api
opentelemetry-exporter-otlp
opentelemetry-sdk
prometheus-fastapi-instrumentator
shortuuid
transformers
uvicorn
