Skip to content

Commit 84374a5

Browse files
Refactor milvus dataprep and retriever (opea-project#728)
* milvus: Refactor embedding settings for mivlus dataprep and retriever Milvus dataprep and retriever leverage the same embedding enpoints, but the embedding-related code is somewhat messed up, unify the namings and logic to improve code readability and facilitate user-friendly configuration. Signed-off-by: Cathy Zhang <[email protected]> * MOSEC: Rename EMB_MODEL env as MOSEC_EMBEDDING_MODEL Signed-off-by: Cathy Zhang <[email protected]> * milvus/dataprep: Update README for milvus dataprep Signed-off-by: Cathy Zhang <[email protected]> * Add OCR package for Milvus dataprep Signed-off-by: Cathy Zhang <[email protected]> * Update Milvus dataprep test script This is to fix the CI issue for MILVUS environment variable name is update. Signed-off-by: Cathy Zhang <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Cathy Zhang <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent 2bbee5d commit 84374a5

File tree

10 files changed

+72
-100
lines changed

10 files changed

+72
-100
lines changed

comps/dataprep/milvus/langchain/Dockerfile

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -11,7 +11,8 @@ RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missin
1111
build-essential \
1212
default-jre \
1313
libgl1-mesa-glx \
14-
libjemalloc-dev
14+
libjemalloc-dev \
15+
tesseract-ocr
1516

1617
RUN useradd -m -s /bin/bash user && \
1718
mkdir -p /home/user && \

comps/dataprep/milvus/langchain/README.md

Lines changed: 10 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -21,7 +21,7 @@ Please refer to this [readme](../../../vectorstores/milvus/README.md).
2121
export no_proxy=${your_no_proxy}
2222
export http_proxy=${your_http_proxy}
2323
export https_proxy=${your_http_proxy}
24-
export MILVUS=${your_milvus_host_ip}
24+
export MILVUS_HOST=${your_milvus_host_ip}
2525
export MILVUS_PORT=19530
2626
export COLLECTION_NAME=${your_collection_name}
2727
export MOSEC_EMBEDDING_ENDPOINT=${your_embedding_endpoint}
@@ -47,7 +47,7 @@ Setup environment variables:
4747

4848
```bash
4949
export MOSEC_EMBEDDING_ENDPOINT="http://localhost:$your_port"
50-
export MILVUS=${your_host_ip}
50+
export MILVUS_HOST=${your_host_ip}
5151
```
5252

5353
### 1.5 Start Document Preparation Microservice for Milvus with Python Script
@@ -78,19 +78,24 @@ docker build -t opea/dataprep-milvus:latest --build-arg https_proxy=$https_proxy
7878

7979
```bash
8080
export MOSEC_EMBEDDING_ENDPOINT="http://localhost:$your_port"
81-
export MILVUS=${your_host_ip}
81+
export MILVUS_HOST=${your_host_ip}
8282
```
8383

8484
### 2.3 Run Docker with CLI (Option A)
8585

8686
```bash
87-
docker run -d --name="dataprep-milvus-server" -p 6010:6010 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e no_proxy=$no_proxy -e MOSEC_EMBEDDING_ENDPOINT=${MOSEC_EMBEDDING_ENDPOINT} -e MILVUS=${MILVUS} opea/dataprep-milvus:latest
87+
docker run -d --name="dataprep-milvus-server" -p 6010:6010 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e no_proxy=$no_proxy -e MOSEC_EMBEDDING_ENDPOINT=${MOSEC_EMBEDDING_ENDPOINT} -e MILVUS_HOST=${MILVUS_HOST} opea/dataprep-milvus:latest
8888
```
8989

9090
### 2.4 Run with Docker Compose (Option B)
9191

9292
```bash
93-
cd docker
93+
mkdir model
94+
cd model
95+
git clone https://huggingface.co/BAAI/bge-base-en-v1.5
96+
cd ../
97+
# Update `host_ip` and `HUGGINGFACEHUB_API_TOKEN` in set_env.sh
98+
. set_env.sh
9499
docker compose -f docker-compose-dataprep-milvus.yaml up -d
95100
```
96101

comps/dataprep/milvus/langchain/config.py

Lines changed: 6 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -3,16 +3,16 @@
33

44
import os
55

6-
# Embedding model
7-
TEI_EMBEDDING_MODEL = os.getenv("EMBED_MODEL", "maidalun1020/bce-embedding-base_v1")
8-
# Embedding endpoints
6+
# Local Embedding model
7+
LOCAL_EMBEDDING_MODEL = os.getenv("LOCAL_EMBEDDING_MODEL", "maidalun1020/bce-embedding-base_v1")
8+
# TEI Embedding endpoints
99
TEI_EMBEDDING_ENDPOINT = os.getenv("TEI_EMBEDDING_ENDPOINT", "")
1010
# MILVUS configuration
11-
MILVUS_HOST = os.getenv("MILVUS", "localhost")
11+
MILVUS_HOST = os.getenv("MILVUS_HOST", "localhost")
1212
MILVUS_PORT = int(os.getenv("MILVUS_PORT", 19530))
1313
COLLECTION_NAME = os.getenv("COLLECTION_NAME", "rag_milvus")
14-
15-
MOSEC_EMBEDDING_MODEL = os.environ.get("MOSEC_EMBEDDING_MODEL", "/home/user/bce-embedding-base_v1")
14+
# MOSEC configuration
15+
MOSEC_EMBEDDING_MODEL = os.environ.get("MOSEC_EMBEDDING_MODEL", "/home/user/bge-large-zh-v1.5")
1616
MOSEC_EMBEDDING_ENDPOINT = os.environ.get("MOSEC_EMBEDDING_ENDPOINT", "")
1717
os.environ["OPENAI_API_BASE"] = MOSEC_EMBEDDING_ENDPOINT
1818
os.environ["OPENAI_API_KEY"] = "Dummy key"

comps/dataprep/milvus/langchain/prepare_doc_milvus.py

Lines changed: 27 additions & 68 deletions
Original file line numberDiff line numberDiff line change
@@ -9,12 +9,12 @@
99

1010
from config import (
1111
COLLECTION_NAME,
12+
LOCAL_EMBEDDING_MODEL,
1213
MILVUS_HOST,
1314
MILVUS_PORT,
1415
MOSEC_EMBEDDING_ENDPOINT,
1516
MOSEC_EMBEDDING_MODEL,
1617
TEI_EMBEDDING_ENDPOINT,
17-
TEI_EMBEDDING_MODEL,
1818
)
1919
from fastapi import Body, File, Form, HTTPException, UploadFile
2020
from langchain.text_splitter import RecursiveCharacterTextSplitter
@@ -73,7 +73,7 @@ def empty_embedding() -> List[float]:
7373
return [e if e is not None else empty_embedding() for e in batched_embeddings]
7474

7575

76-
def ingest_chunks_to_milvus(file_name: str, chunks: List, embedder):
76+
def ingest_chunks_to_milvus(file_name: str, chunks: List):
7777
if logflag:
7878
logger.info(f"[ ingest chunks ] file name: {file_name}")
7979

@@ -94,7 +94,7 @@ def ingest_chunks_to_milvus(file_name: str, chunks: List, embedder):
9494
try:
9595
_ = Milvus.from_documents(
9696
batch_docs,
97-
embedder,
97+
embeddings,
9898
collection_name=COLLECTION_NAME,
9999
connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT},
100100
partition_key_field=partition_field_name,
@@ -110,7 +110,7 @@ def ingest_chunks_to_milvus(file_name: str, chunks: List, embedder):
110110
return True
111111

112112

113-
def ingest_data_to_milvus(doc_path: DocPath, embedder):
113+
def ingest_data_to_milvus(doc_path: DocPath):
114114
"""Ingest document to Milvus."""
115115
path = doc_path.path
116116
file_name = path.split("/")[-1]
@@ -151,7 +151,7 @@ def ingest_data_to_milvus(doc_path: DocPath, embedder):
151151
if logflag:
152152
logger.info(f"[ ingest data ] Done preprocessing. Created {len(chunks)} chunks of the original file.")
153153

154-
return ingest_chunks_to_milvus(file_name, chunks, embedder)
154+
return ingest_chunks_to_milvus(file_name, chunks)
155155

156156

157157
def search_by_file(collection, file_name):
@@ -210,28 +210,9 @@ async def ingest_documents(
210210
if files and link_list:
211211
raise HTTPException(status_code=400, detail="Provide either a file or a string list, not both.")
212212

213-
# Create vectorstore
214-
if MOSEC_EMBEDDING_ENDPOINT:
215-
# create embeddings using MOSEC endpoint service
216-
if logflag:
217-
logger.info(
218-
f"[ upload ] MOSEC_EMBEDDING_ENDPOINT:{MOSEC_EMBEDDING_ENDPOINT}, MOSEC_EMBEDDING_MODEL:{MOSEC_EMBEDDING_MODEL}"
219-
)
220-
embedder = MosecEmbeddings(model=MOSEC_EMBEDDING_MODEL)
221-
elif TEI_EMBEDDING_ENDPOINT:
222-
# create embeddings using TEI endpoint service
223-
if logflag:
224-
logger.info(f"[ upload ] TEI_EMBEDDING_ENDPOINT:{TEI_EMBEDDING_ENDPOINT}")
225-
embedder = HuggingFaceHubEmbeddings(model=TEI_EMBEDDING_ENDPOINT)
226-
else:
227-
# create embeddings using local embedding model
228-
if logflag:
229-
logger.info(f"[ upload ] Local TEI_EMBEDDING_MODEL:{TEI_EMBEDDING_MODEL}")
230-
embedder = HuggingFaceBgeEmbeddings(model_name=TEI_EMBEDDING_MODEL)
231-
232213
# define Milvus obj
233214
my_milvus = Milvus(
234-
embedding_function=embedder,
215+
embedding_function=embeddings,
235216
collection_name=COLLECTION_NAME,
236217
connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT},
237218
index_params=index_params,
@@ -274,7 +255,6 @@ async def ingest_documents(
274255
process_table=process_table,
275256
table_strategy=table_strategy,
276257
),
277-
embedder,
278258
)
279259
uploaded_files.append(save_path)
280260
if logflag:
@@ -294,7 +274,6 @@ async def ingest_documents(
294274
# process_table=process_table,
295275
# table_strategy=table_strategy,
296276
# ),
297-
# embedder
298277
# )
299278

300279
# try:
@@ -352,7 +331,6 @@ async def ingest_documents(
352331
process_table=process_table,
353332
table_strategy=table_strategy,
354333
),
355-
embedder,
356334
)
357335
if logflag:
358336
logger.info(f"[ upload ] Successfully saved link list {link_list}")
@@ -368,28 +346,9 @@ async def rag_get_file_structure():
368346
if logflag:
369347
logger.info("[ get ] start to get file structure")
370348

371-
# Create vectorstore
372-
if MOSEC_EMBEDDING_ENDPOINT:
373-
# create embeddings using MOSEC endpoint service
374-
if logflag:
375-
logger.info(
376-
f"[ get ] MOSEC_EMBEDDING_ENDPOINT:{MOSEC_EMBEDDING_ENDPOINT}, MOSEC_EMBEDDING_MODEL:{MOSEC_EMBEDDING_MODEL}"
377-
)
378-
embedder = MosecEmbeddings(model=MOSEC_EMBEDDING_MODEL)
379-
elif TEI_EMBEDDING_ENDPOINT:
380-
# create embeddings using TEI endpoint service
381-
if logflag:
382-
logger.info(f"[ get ] TEI_EMBEDDING_ENDPOINT:{TEI_EMBEDDING_ENDPOINT}")
383-
embedder = HuggingFaceHubEmbeddings(model=TEI_EMBEDDING_ENDPOINT)
384-
else:
385-
# create embeddings using local embedding model
386-
if logflag:
387-
logger.info(f"[ get ] Local TEI_EMBEDDING_MODEL:{TEI_EMBEDDING_MODEL}")
388-
embedder = HuggingFaceBgeEmbeddings(model_name=TEI_EMBEDDING_MODEL)
389-
390349
# define Milvus obj
391350
my_milvus = Milvus(
392-
embedding_function=embedder,
351+
embedding_function=embeddings,
393352
collection_name=COLLECTION_NAME,
394353
connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT},
395354
index_params=index_params,
@@ -445,28 +404,9 @@ async def delete_single_file(file_path: str = Body(..., embed=True)):
445404
if logflag:
446405
logger.info(file_path)
447406

448-
# Create vectorstore
449-
if MOSEC_EMBEDDING_ENDPOINT:
450-
# create embeddings using MOSEC endpoint service
451-
if logflag:
452-
logger.info(
453-
f"[ delete ] MOSEC_EMBEDDING_ENDPOINT:{MOSEC_EMBEDDING_ENDPOINT}, MOSEC_EMBEDDING_MODEL:{MOSEC_EMBEDDING_MODEL}"
454-
)
455-
embedder = MosecEmbeddings(model=MOSEC_EMBEDDING_MODEL)
456-
elif TEI_EMBEDDING_ENDPOINT:
457-
# create embeddings using TEI endpoint service
458-
if logflag:
459-
logger.info(f"[ delete ] TEI_EMBEDDING_ENDPOINT:{TEI_EMBEDDING_ENDPOINT}")
460-
embedder = HuggingFaceHubEmbeddings(model=TEI_EMBEDDING_ENDPOINT)
461-
else:
462-
# create embeddings using local embedding model
463-
if logflag:
464-
logger.info(f"[ delete ] Local TEI_EMBEDDING_MODEL:{TEI_EMBEDDING_MODEL}")
465-
embedder = HuggingFaceBgeEmbeddings(model_name=TEI_EMBEDDING_MODEL)
466-
467407
# define Milvus obj
468408
my_milvus = Milvus(
469-
embedding_function=embedder,
409+
embedding_function=embeddings,
470410
collection_name=COLLECTION_NAME,
471411
connection_args={"host": MILVUS_HOST, "port": MILVUS_PORT},
472412
index_params=index_params,
@@ -533,4 +473,23 @@ async def delete_single_file(file_path: str = Body(..., embed=True)):
533473
if __name__ == "__main__":
534474
create_upload_folder(upload_folder)
535475

476+
# Create vectorstore
477+
if MOSEC_EMBEDDING_ENDPOINT:
478+
# create embeddings using MOSEC endpoint service
479+
if logflag:
480+
logger.info(
481+
f"[ prepare_doc_milvus ] MOSEC_EMBEDDING_ENDPOINT:{MOSEC_EMBEDDING_ENDPOINT}, MOSEC_EMBEDDING_MODEL:{MOSEC_EMBEDDING_MODEL}"
482+
)
483+
embeddings = MosecEmbeddings(model=MOSEC_EMBEDDING_MODEL)
484+
elif TEI_EMBEDDING_ENDPOINT:
485+
# create embeddings using TEI endpoint service
486+
if logflag:
487+
logger.info(f"[ prepare_doc_milvus ] TEI_EMBEDDING_ENDPOINT:{TEI_EMBEDDING_ENDPOINT}")
488+
embeddings = HuggingFaceHubEmbeddings(model=TEI_EMBEDDING_ENDPOINT)
489+
else:
490+
# create embeddings using local embedding model
491+
if logflag:
492+
logger.info(f"[ prepare_doc_milvus ] LOCAL_EMBEDDING_MODEL:{LOCAL_EMBEDDING_MODEL}")
493+
embeddings = HuggingFaceBgeEmbeddings(model_name=LOCAL_EMBEDDING_MODEL)
494+
536495
opea_microservices["opea_service@prepare_doc_milvus"].start()

comps/embeddings/mosec/langchain/dependency/Dockerfile

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -19,7 +19,7 @@ RUN pip3 install llmspec mosec
1919

2020
RUN cd /home/user/ && export HF_ENDPOINT=https://hf-mirror.com && huggingface-cli download --resume-download BAAI/bge-large-zh-v1.5 --local-dir /home/user/bge-large-zh-v1.5
2121
USER user
22-
ENV EMB_MODEL="/home/user/bge-large-zh-v1.5/"
22+
ENV MOSEC_EMBEDDING_MODEL="/home/user/bge-large-zh-v1.5/"
2323

2424
WORKDIR /home/user/comps/embeddings/mosec/langchain/dependency
2525

comps/embeddings/mosec/langchain/dependency/server-ipex.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -18,7 +18,7 @@
1818

1919
class Embedding(Worker):
2020
def __init__(self):
21-
self.model_name = os.environ.get("EMB_MODEL", DEFAULT_MODEL)
21+
self.model_name = os.environ.get("MOSEC_EMBEDDING_MODEL", DEFAULT_MODEL)
2222
self.tokenizer = transformers.AutoTokenizer.from_pretrained(self.model_name)
2323
self.model = transformers.AutoModel.from_pretrained(self.model_name)
2424
self.device = torch.cuda.current_device() if torch.cuda.is_available() else "cpu"

comps/reranks/mosec/langchain/dependency/Dockerfile

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -20,7 +20,7 @@ RUN pip3 install llmspec mosec
2020

2121
RUN cd /home/user/ && export HF_ENDPOINT=https://hf-mirror.com && huggingface-cli download --resume-download BAAI/bge-reranker-base --local-dir /home/user/bge-reranker-large
2222
USER user
23-
ENV EMB_MODEL="/home/user/bge-reranker-large/"
23+
ENV MOSEC_EMBEDDING_MODEL="/home/user/bge-reranker-large/"
2424

2525
WORKDIR /home/user/comps/reranks/mosec/langchain/dependency
2626

comps/retrievers/milvus/langchain/config.py

Lines changed: 7 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -3,17 +3,16 @@
33

44
import os
55

6-
# Embedding model
7-
EMBED_MODEL = os.getenv("EMBED_MODEL", "maidalun1020/bce-embedding-base_v1")
8-
# Embedding endpoints
9-
EMBED_ENDPOINT = os.getenv("TEI_EMBEDDING_ENDPOINT", "")
6+
# Local Embedding model
7+
LOCAL_EMBEDDING_MODEL = os.getenv("LOCAL_EMBEDDING_MODEL", "maidalun1020/bce-embedding-base_v1")
8+
# TEI Embedding endpoints
9+
TEI_EMBEDDING_ENDPOINT = os.getenv("TEI_EMBEDDING_ENDPOINT", "")
1010
# MILVUS configuration
11-
MILVUS_HOST = os.getenv("MILVUS", "localhost")
11+
MILVUS_HOST = os.getenv("MILVUS_HOST", "localhost")
1212
MILVUS_PORT = int(os.getenv("MILVUS_PORT", 19530))
1313
COLLECTION_NAME = os.getenv("COLLECTION_NAME", "rag_milvus")
14-
15-
14+
# MOSEC configuration
15+
MOSEC_EMBEDDING_MODEL = os.environ.get("MOSEC_EMBEDDING_MODEL", "/home/user/bce-embedding-base_v1")
1616
MOSEC_EMBEDDING_ENDPOINT = os.environ.get("MOSEC_EMBEDDING_ENDPOINT", "")
1717
os.environ["OPENAI_API_BASE"] = MOSEC_EMBEDDING_ENDPOINT
1818
os.environ["OPENAI_API_KEY"] = "Dummy key"
19-
MODEL_ID = "/home/user/bce-embedding-base_v1"

comps/retrievers/milvus/langchain/retriever_milvus.py

Lines changed: 15 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -8,14 +8,14 @@
88

99
from config import (
1010
COLLECTION_NAME,
11-
EMBED_ENDPOINT,
12-
EMBED_MODEL,
11+
LOCAL_EMBEDDING_MODEL,
1312
MILVUS_HOST,
1413
MILVUS_PORT,
15-
MODEL_ID,
1614
MOSEC_EMBEDDING_ENDPOINT,
15+
MOSEC_EMBEDDING_MODEL,
16+
TEI_EMBEDDING_ENDPOINT,
1717
)
18-
from langchain_community.embeddings import HuggingFaceBgeEmbeddings, OpenAIEmbeddings
18+
from langchain_community.embeddings import HuggingFaceBgeEmbeddings, HuggingFaceHubEmbeddings, OpenAIEmbeddings
1919
from langchain_milvus.vectorstores import Milvus
2020

2121
from comps import (
@@ -106,11 +106,19 @@ async def retrieve(input: EmbedDoc) -> SearchedDoc:
106106
if __name__ == "__main__":
107107
# Create vectorstore
108108
if MOSEC_EMBEDDING_ENDPOINT:
109+
# create embeddings using Mosec endpoint service
110+
if logflag:
111+
logger.info(f"[ retriever_milvus ] MOSEC_EMBEDDING_ENDPOINT:{MOSEC_EMBEDDING_ENDPOINT}")
112+
embeddings = MosecEmbeddings(model=MOSEC_EMBEDDING_MODEL)
113+
elif TEI_EMBEDDING_ENDPOINT:
109114
# create embeddings using TEI endpoint service
110-
# embeddings = HuggingFaceHubEmbeddings(model=EMBED_ENDPOINT)
111-
embeddings = MosecEmbeddings(model=MODEL_ID)
115+
if logflag:
116+
logger.info(f"[ retriever_milvus ] TEI_EMBEDDING_ENDPOINT:{TEI_EMBEDDING_ENDPOINT}")
117+
embeddings = HuggingFaceHubEmbeddings(model=TEI_EMBEDDING_ENDPOINT)
112118
else:
113119
# create embeddings using local embedding model
114-
embeddings = HuggingFaceBgeEmbeddings(model_name=EMBED_MODEL)
120+
if logflag:
121+
logger.info(f"[ retriever_milvus ] LOCAL_EMBEDDING_MODEL:{LOCAL_EMBEDDING_MODEL}")
122+
embeddings = HuggingFaceBgeEmbeddings(model_name=LOCAL_EMBEDDING_MODEL)
115123

116124
opea_microservices["opea_service@retriever_milvus"].start()

tests/dataprep/test_dataprep_milvus_langchain.sh

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -46,8 +46,8 @@ function start_service() {
4646

4747
# start dataprep service
4848
MOSEC_EMBEDDING_ENDPOINT="http://${ip_address}:${mosec_embedding_port}"
49-
MILVUS=${ip_address}
50-
docker run -d --name="test-comps-dataprep-milvus-server" -p ${dataprep_service_port}:6010 -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e no_proxy=$no_proxy -e MOSEC_EMBEDDING_ENDPOINT=${MOSEC_EMBEDDING_ENDPOINT} -e MILVUS=${MILVUS} -e LOGFLAG=true --ipc=host opea/dataprep-milvus:comps
49+
MILVUS_HOST=${ip_address}
50+
docker run -d --name="test-comps-dataprep-milvus-server" -p ${dataprep_service_port}:6010 -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e no_proxy=$no_proxy -e MOSEC_EMBEDDING_ENDPOINT=${MOSEC_EMBEDDING_ENDPOINT} -e MILVUS_HOST=${MILVUS_HOST} -e LOGFLAG=true --ipc=host opea/dataprep-milvus:comps
5151
sleep 1m
5252
}
5353

0 commit comments

Comments
 (0)