Description
name: Feature Request
title: "[FEAT] - RAG Knowledge Base R&D & Evaluation"
labels: feature, backlog
assignees: @Jrodrigo06 @mdeekshita @SHarg9876
Summary
Research and evaluate different RAG configurations — chunking strategies, embedding models, and retrieval approaches — to determine the optimal setup for the ingredient suggestion pipeline. Produce metrics and visualizations to back decisions.
Motivation
The vector store infrastructure and similarity search are already built. Before locking in a RAG configuration for production, we should systematically evaluate our options and have data to back our decisions. This ticket is a research spike — the outputs directly inform the RAG tagging service prompt and configuration.
Requirements
Acceptance Criteria
Sub-task 1: Document Sourcing & Seeding
- Identify and collect candidate source documents for our use case
- Store raw documents in `backend/data/raw/`
- Seed the `knowledge_chunks` table via `backend/scripts/seed_knowledge.py` (I will hopefully have an endpoint soon to add and process PDFs directly, so don't worry about this for now)
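Seeding implies splitting the raw documents into chunks before they land in `knowledge_chunks`. As a minimal sketch, here is one candidate strategy for Sub-task 2: fixed-size character chunks with overlap (function name and defaults are my own, not from the codebase):

```python
def chunk_fixed(text, size=500, overlap=50):
    """Split text into fixed-size character chunks, overlapping by `overlap`
    so sentences cut at a boundary still appear whole in one chunk."""
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # last chunk already reached the end of the text
    return chunks
```

Sentence-based and heading-aware splitting are the obvious alternatives to benchmark against this baseline.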
Sub-task 2: Chunking Strategy Experimentation (Before starting, you will need to come up with a good benchmark prompt for getting food tags or ingredients; I'd recommend the tags. This is a lot, so ask questions!)
- Research and implement at least 3 different chunking strategies — document tradeoffs of each before implementing
- Re-seed the vector store for each strategy
- Run a fixed set of food queries against each and record retrieval results
- Plot precision@k and MRR across strategies
- Document which strategy performs best for autoimmune-specific queries
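Both metrics above take only a few lines of plain Python; a minimal sketch (function names are my own, not from the codebase):

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved chunks that are in the relevant set."""
    top = retrieved_ids[:k]
    return sum(1 for doc in top if doc in relevant_ids) / k

def mean_reciprocal_rank(all_retrieved, all_relevant):
    """Average of 1/rank of the first relevant chunk per query (0 if none)."""
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(all_retrieved)
```

Run the same fixed query set against each strategy's index and feed the per-query results into these before plotting.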
Sub-task 3: Embedding Model Comparison
- Research and select at least 3 embedding models to benchmark — document why each was chosen
- Embed the same knowledge base chunks with each model
- Run the same fixed query set and score top-k results against hand-labeled ground truth
- Produce comparison table and plot of scores per model
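Scoring top-k results per model needs a retrieval step; a dependency-light NumPy sketch (toy vectors stand in for real model embeddings, and the function name is my own):

```python
import numpy as np

def top_k_by_cosine(query_vec, chunk_matrix, k=3):
    """Return indices of the k chunks most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    chunks = chunk_matrix / np.linalg.norm(chunk_matrix, axis=1, keepdims=True)
    sims = chunks @ q              # cosine similarity per chunk
    return np.argsort(-sims)[:k]   # best-first chunk indices
```

Embed the same chunks with each candidate model, run the fixed query set through this, and compare the returned indices against the hand-labeled ground truth (e.g. with the precision@k metric from Sub-task 2).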
Sub-task 4: Embedding Space Exploration (Maybe skip if it's too much, but some real cool work here ngl!)
- Extract all chunk embeddings from pgvector
- Reduce to 2D using UMAP and t-SNE (Other dimensionality reduction techniques may work too!)
- Visualize and label clusters by trigger category
- Analyze whether autoimmune trigger categories naturally separate in embedding space
- Document findings — clean clusters = good model fit, messy = retrieval likely struggling
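UMAP and t-SNE are the tools named above; since other dimensionality reduction techniques are fair game, here is a dependency-free PCA baseline in plain NumPy to sanity-check the pipeline before reaching for `umap-learn` (function name is my own):

```python
import numpy as np

def reduce_to_2d(embeddings):
    """Project embeddings onto their top two principal components (PCA)."""
    X = np.asarray(embeddings, dtype=float)
    X = X - X.mean(axis=0)            # center before SVD
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T               # (n_chunks, 2) coordinates for plotting
```

Scatter-plot the result colored by trigger category; if categories don't separate even roughly under PCA, that's an early signal before the heavier UMAP/t-SNE runs.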
Sub-task 5: Findings Write-up & Slide Deck (MUST HAVE!)
- Short slide deck summarizing: sources chosen, chunking comparison, embedding model comparison, cluster visualizations, and final recommendations
- Recommendations feed directly into RAG tagging service configuration
Out of Scope
- RAG tagging service implementation — separate ticket
- Frontend
Technical Approach
Affected Areas
- `backend/scripts/seed_knowledge.py`
- `backend/data/raw/` (new)
- `backend/scripts/evaluate_retrieval.py` (new): runs the query eval harness
- `backend/notebooks/` (new): Jupyter notebooks for plots and visualizations
Dependencies
- Depends on [FEAT] LLM Rag pipeline for Recommendation #52
- Branch off `feat/llm-rag-pipeline`; the `KnowledgeChunk` model, pgvector, and similarity search live there
- Add `sentence-transformers`, `pypdf`, `umap-learn`, `matplotlib` via `uv add`
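The dependency step above is a single command (assuming a uv-managed project):

```shell
# add the evaluation and visualization dependencies listed above
uv add sentence-transformers pypdf umap-learn matplotlib
```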
Testing Notes
PLEASE TEST AND SHOW IT WORKS. We don't have much time, so testing is crucial to prove things work and to prevent lingering bugs.