# Embedding pipeline chunking #38
Base: `main`
```
@@ -0,0 +1 @@
notes/
```
```
@@ -0,0 +1,84 @@
```

# Smart Notes - Local Q&A (RAG MVP)

This is a minimal, local-first MVP that allows users to ask natural-language questions over their markdown notes.

## Features (Current MVP)

- Loads markdown files from a local `notes/` directory
- Supports natural-language questions (e.g., "what is AI", "where is AI used")
- Returns sentence-level answers from notes
- Shows the source note filename
- Interactive CLI loop (type `exit` to quit)

This is a starter implementation intended to be extended with embeddings and vector search in future iterations.

---

## How it works

1. Notes are loaded from the local `notes/` directory.
2. Question words (what, where, who, when, etc.) are filtered out.
3. Notes are split into sentences.
4. Relevant sentences are returned based on keyword matching.

---
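The four steps above can be sketched end to end. This is a minimal illustration, not the module's exact API; the stopword list is abridged and the function name is illustrative:

```python
import re

# Abridged stopword list for illustration; the real CLI filters more words.
QUESTION_WORDS = {"what", "where", "who", "when", "is", "are", "the", "a"}

def search(query: str, note_text: str) -> list:
    # Steps 1-2: drop question/stop words from the query.
    keywords = [w.lower() for w in query.split() if w.lower() not in QUESTION_WORDS]
    # Step 3: split the note into sentences on terminal punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", note_text)
    # Step 4: keep sentences containing any remaining keyword.
    return [s for s in sentences if any(k in s.lower() for k in keywords)]

note = "AI is the simulation of human intelligence. Bread rises with yeast."
print(search("what is AI", note))  # ['AI is the simulation of human intelligence.']
```

Note that matching is plain substring containment, so short keywords can match inside unrelated words.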
## How to run

```bash
python smart-notes/rag_mvp/qa_cli.py
```

Example session:

```
>> what is AI

[1] From test.md:
Artificial Intelligence (AI) is the simulation of human intelligence in machines.
```

Other example queries:

- what is machine learning
- how is machine learning used
- difference between AI and ML
# Smart Notes - RAG MVP (Embeddings & FAISS)

This project is a simple **Retrieval-Augmented Generation (RAG)** pipeline for Smart Notes.
It allows users to store notes, convert them into embeddings, and search relevant notes using vector similarity.

---

## Features

- Convert notes into embeddings using Sentence Transformers
- Store and search embeddings using FAISS (CPU)
- CLI tool to ask questions about your notes
- Simple chunking for text files
- Works fully offline after model download

---

## Tech Stack

- Python 3.10+
- sentence-transformers
- FAISS (faiss-cpu)
- HuggingFace Transformers

---

## Project Structure

```bash
smart-notes/
├── rag_mvp/
│   ├── embed.py      # Embedding logic
│   ├── index.py      # FAISS index creation
│   ├── qa_cli.py     # CLI for asking questions
│   └── utils.py      # Helper functions
├── notes/            # Put your .txt notes here
├── requirements.txt
└── README.md
```
**Comment on lines +75 to +84** (Contributor)

Project structure in the README doesn't match the actual file layout. The documented structure references … Please update the project structure to reflect the real file layout and close the code fence.
```
@@ -0,0 +1,31 @@
```

```python
"""
Chunking utilities for splitting long notes into overlapping chunks.
This helps embeddings capture local context.
"""

from typing import List


def chunk_text(text: str, max_length: int = 500, overlap: int = 50) -> List[str]:
    if not text:
        return []

    chunks = []
    start = 0
    text = text.strip()

    while start < len(text):
        end = start + max_length
        chunk = text[start:end].strip()

        if chunk:
            chunks.append(chunk)

        if end >= len(text):
            break

        start = end - overlap
        if start < 0:
            start = 0

    return chunks
```

**Comment on lines +9 to +29** (Contributor)

Infinite loop when `overlap >= max_length`: if `end < len(text)`, the window never advances, because `start = end - overlap` moves backwards or stays put, so the same chunk is appended forever. Proposed fix:

```diff
 def chunk_text(text: str, max_length: int = 500, overlap: int = 50) -> List[str]:
     if not text:
         return []
+    if overlap >= max_length:
+        raise ValueError("overlap must be less than max_length")
     chunks = []
     start = 0
```
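To make the overlap behaviour concrete, here is a small self-contained run of the same sliding-window logic (the function is re-declared in condensed form so the snippet stands alone):

```python
def chunk_text(text: str, max_length: int = 500, overlap: int = 50) -> list:
    # Same sliding-window logic as the module above, condensed for the demo.
    if not text:
        return []
    text = text.strip()
    chunks, start = [], 0
    while start < len(text):
        end = start + max_length
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end >= len(text):
            break
        start = max(end - overlap, 0)
    return chunks

text = "0123456789" * 10  # 100 characters
chunks = chunk_text(text, max_length=40, overlap=10)
print(len(chunks))     # 3 windows: [0:40], [30:70], [60:100]
print(chunks[1][:3])   # '012' - the second chunk starts at offset 30
```

With these arguments each chunk shares its last `overlap` characters with the next one, so the window advances by `max_length - overlap` characters per step.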
```
@@ -0,0 +1,30 @@
```

```python
"""
Embedding wrapper for converting text chunks into vectors.
Supports pluggable embedding backends later (Ollama, OpenAI, SentenceTransformers).
"""

from typing import List
import numpy as np

try:
    from sentence_transformers import SentenceTransformer
except ImportError:
    SentenceTransformer = None


class Embedder:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        if SentenceTransformer is None:
            raise ImportError(
                "sentence-transformers not installed. Run: pip install sentence-transformers"
            )

        self.model_name = model_name
        self.model = SentenceTransformer(model_name)

    def embed(self, texts: List[str]) -> np.ndarray:
        if not texts:
            return np.array([])

        embeddings = self.model.encode(texts, convert_to_numpy=True)
        return embeddings
```
```
@@ -0,0 +1,41 @@
```

```python
"""
Simple vector indexer using FAISS for similarity search.
"""

from typing import List
import numpy as np

try:
    import faiss
except ImportError:
    faiss = None


class VectorIndexer:
    def __init__(self, dim: int):
        if faiss is None:
            raise ImportError("faiss not installed. Run: pip install faiss-cpu")

        self.dim = dim
        self.index = faiss.IndexFlatL2(dim)
        self.texts: List[str] = []

    def add(self, embeddings: np.ndarray, chunks: List[str]):
        if len(embeddings) == 0:
            return

        self.index.add(embeddings)
        self.texts.extend(chunks)

    def search(self, query_embedding: np.ndarray, k: int = 3):
        if self.index.ntotal == 0:
            return []

        distances, indices = self.index.search(query_embedding.reshape(1, -1), k)
        results = []

        for idx in indices[0]:
            if idx < len(self.texts):
                results.append(self.texts[idx])

        return results
```

**Comment on lines +34 to +39** (Contributor)

Bug: FAISS returns `-1` for missing neighbours. When the index has fewer vectors than `k`, `search` pads `indices` with `-1`; since `-1 < len(self.texts)` is true, `self.texts[-1]` (the last chunk) is returned incorrectly. Proposed fix:

```diff
-        distances, indices = self.index.search(query_embedding.reshape(1, -1), k)
+        _distances, indices = self.index.search(query_embedding.reshape(1, -1), k)
         results = []
         for idx in indices[0]:
-            if idx < len(self.texts):
+            if 0 <= idx < len(self.texts):
                 results.append(self.texts[idx])
```

Ruff (0.15.0), RUF059: unpacked variable `distances` is never used; prefix it with an underscore or any other dummy variable pattern.
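For readers without `faiss-cpu` installed, the exhaustive L2 lookup that `IndexFlatL2` performs can be sketched in pure Python. This toy version (illustrative names, toy 2-D vectors) also shows why a plain top-k over the stored vectors never needs a `-1` sentinel:

```python
def l2_topk(vectors: list, query: list, k: int = 3) -> list:
    """Return indices of the k nearest vectors by squared L2 distance."""
    order = sorted(
        range(len(vectors)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(vectors[i], query)),
    )
    # At most len(vectors) results, so no -1 padding is ever produced.
    return order[:k]

vecs = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
print(l2_topk(vecs, [0.9, 0.1], k=2))  # [1, 0]
print(l2_topk(vecs, [0.9, 0.1], k=5))  # [1, 0, 2] - k capped at index size
```

FAISS instead always returns exactly `k` slots and pads the missing ones with `-1`, which is the behaviour the review comment above guards against.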
```
@@ -0,0 +1,47 @@
```

```python
# rag_mvp/pipelines/embedding_pipeline.py

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np


class EmbeddingPipeline:
    def __init__(self, model_name="all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name, cache_folder="D:/models_cache")
        self.index = None
        self.chunks = []

    def chunk_text(self, text, max_length=300, overlap=50):
        chunks = []
        start = 0

        while start < len(text):
            end = start + max_length
            chunk = text[start:end]
            chunks.append(chunk)
            start = end - overlap

        return chunks

    def build_index(self, chunks):
        embeddings = self.model.encode(chunks)
        embeddings = np.array(embeddings).astype("float32")

        dim = embeddings.shape[1]
        self.index = faiss.IndexFlatL2(dim)
        self.index.add(embeddings)

        return embeddings

    def process_notes(self, text):
        self.chunks = self.chunk_text(text)
        embeddings = self.build_index(self.chunks)
        return self.chunks, embeddings

    def semantic_search(self, query, top_k=3):
        query_vec = self.model.encode([query])
        query_vec = np.array(query_vec).astype("float32")

        distances, indices = self.index.search(query_vec, top_k)
        results = [self.chunks[i] for i in indices[0]]
        return results
```

**Comment on lines +3 to +4** (Contributor, refactor suggestion, major)

Direct imports without graceful error handling, unlike the sibling modules (`embed.py` and `index.py` wrap their imports in `try/except ImportError`). More broadly, this pipeline class reimplements functionality already provided by those modules.

**Comment** (Contributor)

Hardcoded Windows-specific cache path breaks portability. Proposed fix:

```diff
-        self.model = SentenceTransformer(model_name, cache_folder="D:/models_cache")
+        self.model = SentenceTransformer(model_name)
```

**Comment on lines +14 to +24** (Contributor, refactor suggestion, major)

Duplicate `chunk_text`: this is a copy of the logic in the chunking utility module, minus its empty-input and whitespace safeguards, and with the same infinite-loop risk when `overlap >= max_length`. Reuse the shared helper instead.

**Comment on lines +41 to +47** (Contributor)

Same `-1` padding issue as in `index.py`: when the index holds fewer than `top_k` vectors, FAISS pads `indices` with `-1`, so `self.chunks[-1]` is returned incorrectly. Proposed fix:

```diff
     def semantic_search(self, query, top_k=3):
         query_vec = self.model.encode([query])
         query_vec = np.array(query_vec).astype("float32")
-        distances, indices = self.index.search(query_vec, top_k)
-        results = [self.chunks[i] for i in indices[0]]
+        _distances, indices = self.index.search(query_vec, top_k)
+        results = [self.chunks[i] for i in indices[0] if 0 <= i < len(self.chunks)]
         return results
```

Ruff (0.15.0), RUF059: unpacked variable `distances` is never used; prefix it with an underscore.
```
@@ -0,0 +1,109 @@
```

```python
import os
import re

# ------------------- embedding-pipeline chunking concept
from rag_mvp.pipelines.embedding_pipeline import EmbeddingPipeline
```
**Comment** (Contributor, repository: AOSSIE-Org/Info)

Absolute import will fail when the script is run as documented (`python smart-notes/rag_mvp/qa_cli.py`): `rag_mvp` is not importable from that invocation, so this line raises `ModuleNotFoundError`. Use relative imports (…)
```python
def demo_embeddings_pipeline():
    pipeline = EmbeddingPipeline()

    note_text = """
    Python is a programming language.
    It is widely used in AI and machine learning projects.
    Smart Notes helps users organize knowledge using embeddings.
    """

    chunks, embeddings = pipeline.process_notes(note_text)

    print("\n--- Chunks Created ---")
    for i, c in enumerate(chunks):
        print(f"[{i}] {c}")

    query = "What is Python used for?"
    results = pipeline.semantic_search(query)

    print("\n--- Search Results ---")
    for r in results:
        print("-", r)
# -------------------------------------------------


QUESTION_WORDS = {
    "what", "where", "who", "when", "which",
    "is", "are", "was", "were", "the", "a", "an",
    "of", "to", "in", "on", "for"
}

NOTES_DIR = "notes"
```
**Comment** (Contributor)

`NOTES_DIR = "notes"` is resolved relative to the current working directory, so the CLI only finds notes when launched from the right directory. Proposed fix: resolve relative to the script location.

```diff
-NOTES_DIR = "notes"
+NOTES_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "..", "..", "notes")
```
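To see what the proposed script-relative fix resolves to, here is a small stand-alone check (the helper name and the example path are hypothetical):

```python
import os

def notes_dir(script_file: str) -> str:
    # Mirrors the proposed fix: notes/ is looked up two directories
    # above the script, regardless of the current working directory.
    return os.path.normpath(
        os.path.join(os.path.dirname(os.path.abspath(script_file)), "..", "..", "notes")
    )

# With a hypothetical layout /repo/smart-notes/rag_mvp/qa_cli.py:
print(notes_dir("/repo/smart-notes/rag_mvp/qa_cli.py"))  # /repo/notes on POSIX
```

Note the two `".."` components place `notes/` beside `smart-notes/`, one level above where the README's project tree shows it; whether that is intended is worth confirming.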
```python
def load_notes():
    notes = []
    if not os.path.exists(NOTES_DIR):
        print(f"Notes directory '{NOTES_DIR}' not found.")
        return notes

    for file in os.listdir(NOTES_DIR):
        if file.endswith(".md"):
            path = os.path.join(NOTES_DIR, file)
            with open(path, "r", encoding="utf-8") as f:
                notes.append({
                    "filename": file,
                    "content": f.read()
                })
    return notes


def split_sentences(text):
    return re.split(r'(?<=[.!?])\s+', text)


def search_notes(query, notes):
    results = []

    query_words = [
        word.lower()
        for word in query.split()
        if word.lower() not in QUESTION_WORDS
    ]

    for note in notes:
        sentences = split_sentences(note["content"])
        for sentence in sentences:
            sentence_lower = sentence.lower()
            if any(word in sentence_lower for word in query_words):
                results.append({
                    "filename": note["filename"],
                    "sentence": sentence.strip()
                })

    return results


if __name__ == "__main__":

    demo_embeddings_pipeline()  # Temporary demo for embeddings pipeline
```
**Comment on lines +85 to +87** (Contributor)

This forces the SentenceTransformer model to load (and potentially download) every time a user launches the CLI, even if they only want the keyword-based Q&A. This adds significant startup latency. Consider making the embedding demo opt-in (e.g., via a CLI flag) or removing it from the default flow.
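One way to make the demo opt-in, as the comment suggests, is a boolean flag; the flag name is illustrative, not part of the PR:

```python
import argparse

parser = argparse.ArgumentParser(description="Smart Notes Q&A CLI")
parser.add_argument(
    "--demo-embeddings",
    action="store_true",
    help="run the embedding pipeline demo (downloads/loads the model)",
)
# An empty argv is parsed here for illustration; real code would call parse_args().
args = parser.parse_args([])
print(args.demo_embeddings)  # False - the model is only loaded when requested
```

The unconditional `demo_embeddings_pipeline()` call would then run only when the flag is passed, keeping the default keyword-based CLI fast to start.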
```python
    notes = load_notes()

    print("Ask questions about your notes (type 'exit' to quit)\n")

    while True:
        query = input(">> ").strip()

        if query.lower() == "exit":
            print("Goodbye")
            break

        matches = search_notes(query, notes)

        if not matches:
            print("No relevant notes found.\n")
        else:
            print("\n--- Answers ---\n")
            for i, m in enumerate(matches, 1):
                print(f"[{i}] From {m['filename']}:")
                print(m["sentence"])
                print()
```
**Comment** (Contributor)

Malformed Markdown: an unclosed code block bleeds into the rest of the document. The code block opened at line 28 is never properly closed, so the example CLI output (lines 33-43) and everything after it gets swallowed into the code fence, making the second half of the README render as a raw code block rather than formatted documentation. Close the code block after the CLI example and before the second section heading.