# Chat Sufficiency

Folder organization:
```
- old: previous work based on Kotaemon
- backend: Python backend using FastAPI and implementing RAG
- frontend: TypeScript frontend using SvelteKit
```

## Quick start (dev setup)
```
# start backend
cd backend
cp .env.example .env
# fill in correct values in .env
uv run uvicorn app.main:app --reload

# start frontend (dev)
cd frontend
npm install
cp .env.example .env
npm run dev
```

## Pipeline and architecture
Here is the pipeline and its main architectural elements:
1. The **SvelteKit frontend** sends a query to the **FastAPI backend** via a POST request.
2. The backend calls a **generative AI API (Scaleway)** to determine whether the query is on-topic. If it is, it rewrites the query for retrieval; if not, it answers directly.
3. The rewritten query is embedded on the server's CPU with a small model using **sentence-transformers**. It would be more cost-efficient to use an API and a smaller instance, but the embedding model we started with, **Qwen3-embedding-0.6B**, isn't available on any commercial API.
4. The server sends the query to **Qdrant** (the vector database), which returns the top $k_{vector}$ matches (configurable).
5. It then reranks the results locally using **flashrank**. Again, a more mature version might use e.g. Cohere's API.
6. The top $k_{rerank}$ chunks are then used to build the context. If the FETCH_PUBS env var is true (the default), we use the OpenAlex ID of the chunks to fetch the corresponding publications from **OpenAlex's API**. We build the context from the title, abstract, and the retrieved chunks.
7. The context is passed along with the original query to the generative API, and the backend streams the response back to the frontend.
8. The backend saves the messages and intermediary results to **Postgres**.
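The orchestration of steps 2 to 7 can be sketched as follows. This is a minimal illustration, not the actual backend code: the helper functions are hypothetical stand-ins for the real services (Scaleway generative API, sentence-transformers, Qdrant, flashrank), and the constants are placeholder values.

```python
K_VECTOR = 20   # top matches returned by the vector search (configurable)
K_RERANK = 5    # chunks kept after reranking (configurable)

def classify_and_rewrite(query: str) -> tuple[bool, str]:
    """Stand-in for the generative-API call that checks whether the
    query is on-topic and rewrites it for retrieval."""
    on_topic = "sufficiency" in query.lower()
    return on_topic, f"retrieval query for: {query}"

def embed(text: str) -> list[float]:
    """Stand-in for the local CPU sentence-transformers embedding."""
    return [float(ord(c)) for c in text[:8]]

def vector_search(vector: list[float], k: int) -> list[dict]:
    """Stand-in for the Qdrant top-k vector search."""
    return [{"text": f"chunk {i}", "openalex_id": f"W{i}"} for i in range(k)]

def rerank(query: str, chunks: list[dict], k: int) -> list[dict]:
    """Stand-in for local flashrank reranking."""
    return chunks[:k]

def answer(query: str) -> str:
    on_topic, rewritten = classify_and_rewrite(query)
    if not on_topic:
        # Off-topic queries are answered directly, without retrieval.
        return "direct answer (off-topic)"
    hits = vector_search(embed(rewritten), K_VECTOR)
    context = "\n".join(c["text"] for c in rerank(rewritten, hits, K_RERANK))
    # The real backend streams the generative API's answer built from `context`.
    return f"answer grounded in:\n{context}"
```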
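Step 6 fetches each chunk's publication from OpenAlex. A hedged sketch of what that lookup might look like is below; note that OpenAlex returns the abstract as an inverted index, which has to be reconstructed word by word. The function names are illustrative, not the backend's actual code.

```python
import json
import urllib.request

def abstract_from_inverted_index(inv: dict[str, list[int]]) -> str:
    """Rebuild the plain-text abstract from OpenAlex's inverted index,
    which maps each word to the positions where it occurs."""
    positions = [(p, word) for word, pos_list in inv.items() for p in pos_list]
    return " ".join(word for _, word in sorted(positions))

def fetch_publication(openalex_id: str) -> tuple[str, str]:
    """Fetch title and abstract for one work, e.g. an ID like 'W2741809807'."""
    url = f"https://api.openalex.org/works/{openalex_id}"
    with urllib.request.urlopen(url) as resp:
        work = json.load(resp)
    abstract = abstract_from_inverted_index(work.get("abstract_inverted_index") or {})
    return work["title"], abstract
```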
SvelteKit is a full-stack framework for Svelte, similar to Next for React or Nuxt for Vue. We could almost have used it as a static site generator, but we use server-side functions to hide the backend's URL from the user.

The schema below illustrates the aforementioned pipeline. Note that the policy analysis retrieval isn't implemented yet.

## CleverCloud deployment
Both applications (front and back) are deployed to CleverCloud under the World Sufficiency Lab organization. CleverCloud handles continuous deployment on each push to the configured branch.

The frontend requires a larger instance to build than to run, so we configured one. CleverCloud also doesn't include a build step by default, so it needs to be configured via env vars.

Due to the local computations (embedding and reranking), the backend currently needs 4 GB of RAM. CleverCloud also doesn't handle uv well, so we need to go through pip. Use:
```
uv pip compile pyproject.toml --output-file requirements.txt
```
to create a `requirements.txt` file from `pyproject.toml`. To avoid installing useless GPU-related libraries, `pyproject.toml` is configured to install torch+cpu. For this to work, don't forget to add this line at the start of `requirements.txt` after generating it:
```
--extra-index-url https://download.pytorch.org/whl/cpu
```
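One way to pin torch to the CPU wheel index in `pyproject.toml` is uv's index configuration. The snippet below is a sketch of that approach; the repo's actual settings may differ:

```toml
# Declare the PyTorch CPU wheel index and route torch through it.
[[tool.uv.index]]
name = "pytorch-cpu"
url = "https://download.pytorch.org/whl/cpu"
explicit = true

[tool.uv.sources]
torch = { index = "pytorch-cpu" }
```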


## TODO
- suggestions (of questions)
- prompt: answer in same language as query
- prompt: don't forget social floor
- hybrid search
- policy analysis
- optimize cost (use APIs for query embedding and reranking to allow a smaller server instance)