
Commit 18057f8

clean, create two distinct pipeline folders, remove configs in kotaemon code, fix one kotaemon bug for retrieval
1 parent 0476ca3 commit 18057f8

File tree

71 files changed (+91314, −279 lines)


rag_system/Dockerfile

Lines changed: 4 additions & 3 deletions

```diff
@@ -30,9 +30,10 @@ RUN bash scripts/download_pdfjs.sh $PDFJS_PREBUILT_DIR
 # Copy project files
 COPY kotaemon/libs /app/libs
 COPY kotaemon/launch.sh /app/launch.sh
-COPY kotaemon/.env.example /app/.env
-COPY flowsettings.py /app/flowsettings.py
-COPY pipeline_scripts /app/pipeline_scripts
+COPY kotaemon/app.py /app/app.py
+COPY kotaemon_pipeline_scripts/.env /app/.env
+COPY kotaemon_pipeline_scripts/flowsettings.py /app/flowsettings.py
+COPY kotaemon_pipeline_scripts /app/pipeline_scripts
 COPY taxonomy /app/taxonomy
 
```

rag_system/README.md

Lines changed: 131 additions & 5 deletions
```diff
@@ -13,36 +13,162 @@ rag_system
 ├── flowsettings.py
 ├── kotaemon
 ├── kotaemon_install_guide
-├── pipeline_scripts
+├── kotaemon_pipeline_scripts
+├── new_pipeline_scripts
 ├── README.md
 └── taxonomy
 ```
 
-## Pipeline Scripts Instructions
+There are two pipeline ingestion projects here that share the same taxonomy.
+
+The first one (with Kotaemon) uses these folders:
+
+```bash
+├── docker-compose.yml
+├── Dockerfile
+├── kotaemon
+├── kotaemon_install_guide
+├── kotaemon_pipeline_scripts
+└── taxonomy
+```
+
+The second one (currently without Kotaemon?) uses these folders:
+
+```bash
+├── new_pipeline_scripts
+├── README.md
+└── taxonomy
+```
+
+## NEW Pipeline Scripts Instructions
 
 The pipeline scripts folder contains scripts for the extraction and analysis of documents.
 
 To set up the pipeline scripts, run the following commands:
 
 ```bash
-cd rag_system/pipeline_scripts
+cd rag_system/new_pipeline_scripts
 uv sync
 ```
 
 You can find a detailed guide here: [📄](../rag_system/pipeline_scripts/agentic_data_policies_extraction/policies_transformation_to_matrices/README.md)
 
-## Running the RAG System
+### Running the RAG System
 
 We recommend running as a Python module, or using the Docker Compose file:
 
 ```bash
-cd rag_system/pipeline_scripts
+cd rag_system/new_pipeline_scripts
 uv run python -m agentic_data_policies_extraction.main
 ```
 
+## KOTAEMON Pipeline Scripts Instructions
+
+This framework is built around Kotaemon to allow a new custom-built 'fast' ingestion script (multi-threaded ingestion of hundreds of documents at the same time), side by side with the standard drag-and-drop Kotaemon ingestion from the UI.
+
+Shell scripts call ...
```
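For illustration, here is a minimal sketch of the multi-threaded fan-out idea behind a 'fast' ingestion script. It is generic `concurrent.futures` code: `ingest_one` and the `docs/` glob are placeholders, not functions or paths from this repo.

```python
# Generic sketch of multi-threaded ingestion: fan many documents out to a
# worker pool. `ingest_one` and the docs/ path are placeholders.
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def ingest_one(pdf: Path) -> str:
    # Placeholder: parse, chunk, embed, and index a single document.
    return f"ingested {pdf.name}"


pdfs = sorted(Path("docs").glob("*.pdf"))
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = {pool.submit(ingest_one, pdf): pdf for pdf in pdfs}
    for future in as_completed(futures):
        print(future.result())
```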
```diff
+### DEV setup deployment
+
+You have two config files to check:
+
+#### 1. The official Kotaemon file 'flowsettings.py'
+
+This file is at the root of 'rag_system'. (It will overwrite the official 'flowsettings.py' during the Docker build.)
+
+It declares, among other things, the main components:
+
+- ```KH_OLLAMA_URL```: the URI used to connect to the Ollama inference service (the LLM model inference service)
+- ```KH_APP_DATA_DIR```: the root directory where Kotaemon stores all its internal data
+- ```KH_DOCSTORE```: the Kotaemon docstore to use and its path. Local LanceDB by default, but you could point to a remote LanceDB here.
+- ```KH_VECTORSTORE```: the Kotaemon vector store to use and its URL. Qdrant by default for the dev team.
+- ...
+
+You should not touch these settings for now (for a dev setup).
```
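As a reference point, here is a minimal sketch of how such settings are typically declared in a Kotaemon `flowsettings.py` (illustrative values and type strings; the actual `rag_system/flowsettings.py` in this commit is the source of truth):

```python
# Illustrative sketch in the style of a Kotaemon flowsettings.py; the real
# file in this repo is the source of truth for values and component types.
from pathlib import Path

from decouple import config  # kotaemon reads its settings via python-decouple

KH_OLLAMA_URL = config("KH_OLLAMA_URL", default="http://ollama:11434/v1/")
KH_APP_DATA_DIR = Path(config("KH_APP_DATA_DIR", default="./ktem_app_data"))

# Docstore: local LanceDB by default (a remote LanceDB could be set instead).
KH_DOCSTORE = {
    "__type__": "kotaemon.storages.LanceDBDocumentStore",
    "path": str(KH_APP_DATA_DIR / "user_data" / "docstore"),
}

# Vector store: Qdrant for the dev team (the env var names are placeholders).
KH_VECTORSTORE = {
    "__type__": "kotaemon.storages.QdrantVectorStore",
    "url": config("QDRANT_URL", default="http://qdrant:6333"),
    "api_key": config("QDRANT_API_KEY", default=""),
}
```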
```diff
+#### 2. An additional .env to set inside the 'kotaemon_pipeline_scripts' folder
+
+This file lives inside 'kotaemon_pipeline_scripts'.
+
+You have to generate your own .env from the .env.example template.
+
+All these config parameters are needed for the automatic fast ingestion pipeline:
+
+- ```PG_DATABASE_URL``` = the URL of the Data4Good database that maintains the OpenAlex article metadata
+- ```LLM_INFERENCE_URL``` = the URL of the LLM inference stack (Ollama for local dev)
+- ```LLM_INFERENCE_MODEL``` = the model used for chunk inference on metadata
+- ```LLM_INFERENCE_API_KEY``` = the API key for the LLM inference stack
+- ```EMBEDDING_MODEL_URL``` = the URL of the LLM embedding model stack (Ollama for local dev)
+- ```EMBEDDING_MODEL``` = the model used for embeddings
+- ```EMBEDDING_MODEL_API_KEY``` = the API key for the LLM embedding model
+- ```COLLECTION_ID``` = the ID of the collection within the Kotaemon app (BE CAREFUL TO CHOOSE THE RIGHT ID)
+- ```USER_ID``` = the user ID taken from the Kotaemon app (BE CAREFUL TO CHOOSE THE RIGHT ID)
+
+For now, do not set 'USER_ID' until you have launched the Kotaemon app for the first time (see below).
```
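For reference, a hedged sketch of how a pipeline script might load these parameters (assuming `python-dotenv`; the variable names are exactly the ones listed above):

```python
# Sketch: load the kotaemon_pipeline_scripts/.env parameters.
# Assumes python-dotenv; adapt to however the scripts actually load config.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

PG_DATABASE_URL = os.environ["PG_DATABASE_URL"]  # OpenAlex metadata DB
LLM_INFERENCE_URL = os.environ["LLM_INFERENCE_URL"]
LLM_INFERENCE_MODEL = os.environ["LLM_INFERENCE_MODEL"]
LLM_INFERENCE_API_KEY = os.getenv("LLM_INFERENCE_API_KEY", "")
EMBEDDING_MODEL_URL = os.environ["EMBEDDING_MODEL_URL"]
EMBEDDING_MODEL = os.environ["EMBEDDING_MODEL"]
EMBEDDING_MODEL_API_KEY = os.getenv("EMBEDDING_MODEL_API_KEY", "")
COLLECTION_ID = os.environ["COLLECTION_ID"]  # pick the right collection ID
USER_ID = os.environ["USER_ID"]              # from the Kotaemon app logs
```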
```diff
+### Running the RAG System
+
+1) The 'dev' deployment is used to launch, develop, and debug the Python packages in editable mode.
+Moreover, the whole 'kotaemon_pipeline_scripts' folder is mapped (as a volume) inside the container, so you can work on it during this dev stage.
+
+First, launch the different services with the Docker Compose file provided in this folder.
+
+Nothing else to do: the Docker Compose file already sets everything up for you.
+
+You only need to check the volume mappings, if necessary.
+
+If you don't have any GPU on your local machine and haven't set up CUDA with Docker, remove these lines from the Ollama service:
+
+```yaml
+deploy:
+  resources:
+    reservations:
+      devices:
+        - driver: nvidia
+          count: all
+          capabilities: [gpu]
+```
+
+```bash
+docker compose up
+```
+
+Additionally, the command that normally launches the Kotaemon app (./launch.sh) has been deliberately disabled so you can develop on the app (coding the different libraries: kotaemon, ktem, and our custom ones) without having to stop and restart the Kotaemon container.
+
+To run the Kotaemon app for testing, you need to enter the container.
+
+From the rag_system folder where the Docker Compose file is located:
+
+```bash
+docker compose exec -it kotaemon bash
+```
+
+IMPORTANT: after launching the Kotaemon app, open any page, then check the logs to retrieve the USER ID!
+Shut down the Kotaemon app from inside the container (or shut down all the containers if you want).
+Set the correct USER ID in your .env.
+Relaunch the Kotaemon app. Your 'fast' ingestion pipeline scripts should now be consistent.
+
+2) You also need to pull the different models with the Ollama service.
+Read and follow point 2 of the README inside 'kotaemon_install_guide' (FR).
+
+3) Now, for your first steps in the Kotaemon app, read and follow point 3 of the README inside 'kotaemon_install_guide' (FR).
+
+### Running the 'Fast' ingestion pipeline scripts
+
 ## Kotaemon Subtree Setup
 
 The Kotaemon folder is a shared Data4Good subtree, synchronized with the common project:
```

rag_system/docker-compose.yml

Lines changed: 6 additions & 5 deletions

```diff
@@ -1,10 +1,10 @@
 services:
   kotaemon:
     build:
-      context: ./kotaemon
-      target: full
+      context: .
+      target: dev-runtime
     pull_policy: if_not_present
-    entrypoint: ["/bin/sh", "-c", "pip install -e /app/taxonomy && tail -f /dev/null"]
+    entrypoint: ["/bin/sh", "-c", "tail -f /dev/null"]
    environment:
      GRADIO_SERVER_NAME: 0.0.0.0
      GRADIO_SERVER_PORT: 7860
@@ -14,10 +14,11 @@ services:
     ports:
       - '7860:7860'
     volumes:
-      - './kotaemon/flowsettings.py:/app/flowsettings.py'
+      - './kotaemon_pipeline_scripts/flowsettings.py:/app/flowsettings.py'
       - './kotaemon/libs:/app/libs'
+      - './kotaemon_pipeline_scripts/.env:/app/.env'
       - './kotaemon/ktem_app_data:/app/ktem_app_data'
-      - './pipeline_scripts/:/app/pipeline_scripts'
+      - './kotaemon_pipeline_scripts/:/app/pipeline_scripts'
       - './taxonomy/:/app/taxonomy'
     depends_on:
       - ollama
```

rag_system/kotaemon/flowsettings.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -82,7 +82,7 @@
     config("KH_FEATURE_USER_MANAGEMENT_PASSWORD", default="admin")
 )
 KH_ENABLE_ALEMBIC = False
-KH_DATABASE = os.getenv("POSTGRESQL_ADDON_URI", None)  # f"sqlite:///{KH_USER_DATA_DIR / 'sql.db'}"
+KH_DATABASE = f"sqlite:///{KH_USER_DATA_DIR / 'sql.db'}"
 # KH_DATABASE = "postgresql://postgres:my_pass@postgres-db:5432/my_db"
 KH_FILESTORAGE_PATH = str(KH_USER_DATA_DIR / "files")
```

rag_system/kotaemon/libs/kotaemon/kotaemon/storages/docstores/lancedb.py

Lines changed: 23 additions & 14 deletions

```diff
@@ -5,10 +5,24 @@
 from kotaemon.base import Document
 
 from .base import BaseDocumentStore
+
+""" # Data4Good config - removed for dev setup
 CELLAR_ADDON_KEY_ID = os.getenv("CELLAR_ADDON_KEY_ID", "")
 CELLAR_ADDON_KEY_SECRET = os.getenv("CELLAR_ADDON_KEY_SECRET", "")
 CELLAR_ADDON_HOST = os.getenv("CELLAR_ADDON_HOST", "cellar-c2.services.clever-cloud.com")
 
+# And add this in the __init__ method:
+self.db_connection = lancedb.connect(
+    "s3://wsl-docstore-prod",
+    storage_options={
+        "region": "us-east-1",
+        "aws_access_key_id": CELLAR_ADDON_KEY_ID,
+        "aws_secret_access_key": CELLAR_ADDON_KEY_SECRET,
+        "endpoint": f"http://{CELLAR_ADDON_HOST}",
+        "allow_http": "true"
+    }
+"""
+
 MAX_DOCS_TO_GET = 10**4
 
 
@@ -25,16 +39,7 @@ def __init__(self, path: str = "lancedb", collection_name: str = "docstore"):
 
         self.db_uri = path
         self.collection_name = collection_name
-        self.db_connection = lancedb.connect(
-            "s3://wsl-docstore-prod",
-            storage_options={
-                "region": "us-east-1",
-                "aws_access_key_id": CELLAR_ADDON_KEY_ID,
-                "aws_secret_access_key": CELLAR_ADDON_KEY_SECRET,
-                "endpoint": f"http://{CELLAR_ADDON_HOST}",
-                "allow_http": "true"
-            }
-        )
+        self.db_connection = lancedb.connect(self.db_uri)  # type: ignore
 
     def add(
         self,
@@ -51,7 +56,7 @@ def add(
                 "text": doc.text,
                 "attributes": json.dumps(doc.metadata),
             }
-            for doc_id, doc in zip(doc_ids, docs, strict=False)
+            for doc_id, doc in zip(doc_ids, docs)
         ]
 
         if self.collection_name not in self.db_connection.table_names():
@@ -126,14 +131,18 @@ def get(self, ids: Union[List[str], str]) -> List[Document]:
             )
         except (ValueError, FileNotFoundError):
             docs = []
-        return [
-            Document(
+
+        # return the documents using the order of original
+        # ids (which were ordered by score)
+        doc_dict = {
+            doc["id"]: Document(
                 id_=doc["id"],
                 text=doc["text"] if doc["text"] else "<empty>",
                 metadata=json.loads(doc["attributes"]),
            )
            for doc in docs
-        ]
+        }
+        return [doc_dict[_id] for _id in ids if _id in doc_dict]
 
     def delete(self, ids: Union[List[str], str], refresh_indices: bool = True):
         """Delete document by id"""
```

rag_system/kotaemon/libs/kotaemon/kotaemon/storages/vectorstores/qdrant.py

Lines changed: 13 additions & 6 deletions

```diff
@@ -1,11 +1,19 @@
-import os
 from typing import Any, List, Optional, cast
 
 from .base import LlamaIndexVectorStore
 
+""" Data4Good config - removed for dev setup - please, try to add this on your settings and not here...
 VECTORSTORE_URL = os.getenv("VECTOSTORE_URL", "")
 default_api_key = os.getenv("API_KEY", "")
 
+# And add this in the __init__ method: (but please, try to add this on your settings and not here...)
+
+self._url = VECTORSTORE_URL
+self._api_key = default_api_key
+
+"""
+
 class QdrantVectorStore(LlamaIndexVectorStore):
     _li_class = None
 
@@ -31,16 +39,15 @@ def __init__(
         **kwargs: Any,
     ):
         self._collection_name = collection_name
-        self._url = VECTORSTORE_URL
-        self._api_key = default_api_key
+        self._url = url
+        self._api_key = api_key
         self._client_kwargs = client_kwargs
         self._kwargs = kwargs
-        print(f"url: {self._url}")
 
         super().__init__(
             collection_name=collection_name,
-            url=VECTORSTORE_URL,
-            api_key=default_api_key,
+            url=url,
+            api_key=api_key,
             client_kwargs=client_kwargs,
             **kwargs,
         )
```
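With this change the store no longer reads module-level environment globals; callers pass the endpoint explicitly. A hedged usage sketch (the parameter names follow the constructor above; the env var names and values are placeholders):

```python
# Hedged usage sketch: QdrantVectorStore now takes url/api_key explicitly
# instead of reading module-level globals. Env var names are placeholders.
import os

from kotaemon.storages.vectorstores.qdrant import QdrantVectorStore

store = QdrantVectorStore(
    collection_name="index_1",  # placeholder collection name
    url=os.getenv("QDRANT_URL", "http://qdrant:6333"),
    api_key=os.getenv("QDRANT_API_KEY", ""),
)
```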

rag_system/kotaemon/libs/ktem/ktem/index/file/index.py

Lines changed: 1 addition & 1 deletion

```diff
@@ -6,7 +6,7 @@
 from ktem.components import filestorage_path, get_docstore, get_vectorstore
 from ktem.db.engine import engine
 from ktem.index.base import BaseIndex
-from sqlalchemy import JSON, Column, DateTime, Integer, String
+from sqlalchemy import JSON, Column, DateTime, Integer, String, UniqueConstraint
 from sqlalchemy.ext.declarative import declarative_base
 from sqlalchemy.ext.mutable import MutableDict
 from theflow.settings import settings as flowsettings
```

rag_system/kotaemon/libs/ktem/ktem/index/file/pipelines.py

Lines changed: 6 additions & 2 deletions

```diff
@@ -154,7 +154,7 @@ def run(
         # do first round top_k extension
         retrieval_kwargs["do_extend"] = True
         retrieval_kwargs["scope"] = chunk_ids
-        retrieval_kwargs["filters"] = MetadataFilters(
+        """retrieval_kwargs["filters"] = MetadataFilters(
             filters=[
                 MetadataFilter(
                     key="file_id",
@@ -163,7 +163,7 @@ def run(
                 )
             ],
             condition=FilterCondition.OR,
-        )
+        )"""
 
         if self.mmr:
             # TODO: double check that llama-index MMR works correctly
@@ -173,6 +173,10 @@ def run(
         # rerank
         s_time = time.time()
         print(f"retrieval_kwargs: {retrieval_kwargs.keys()}")
+
+        import pdb
+        pdb.set_trace()
+
         docs = self.vector_retrieval(text=text, top_k=self.top_k, **retrieval_kwargs)
         print("retrieval step took", time.time() - s_time)
```
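The hunk above stubs out the `MetadataFilters`-based `file_id` filter in favour of the `scope` of chunk ids set just before it; this is the retrieval fix named in the commit message. A hedged, generic illustration of scope-based restriction (not kotaemon's internal implementation):

```python
# Generic illustration of scope-based restriction (not kotaemon's internals):
# keep only hits whose chunk id belongs to the allowed scope.
from typing import Iterable, List, Tuple


def restrict_to_scope(
    hits: Iterable[Tuple[str, float]],  # (chunk_id, score) pairs
    scope: Iterable[str],               # allowed chunk ids, e.g. chunk_ids
) -> List[Tuple[str, float]]:
    allowed = set(scope)
    return [(cid, score) for cid, score in hits if cid in allowed]


hits = [("c1", 0.91), ("c7", 0.88), ("c3", 0.75)]
print(restrict_to_scope(hits, scope=["c1", "c3"]))
# -> [('c1', 0.91), ('c3', 0.75)]
```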

File renamed without changes.
Lines changed: 1 addition & 0 deletions

```diff
@@ -0,0 +1 @@
+.env
```
