Local Markdown storage Architecture #44
Conversation
📝 Walkthrough

This PR introduces a local markdown storage system with retrieval-augmented generation (RAG) capabilities. It includes architecture design documentation, a UI design specification, a Python implementation of embedding-based semantic search using sentence-transformers and FAISS vector indexing, and an interactive CLI for Q&A functionality.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as User/CLI
    participant Pipeline as EmbeddingPipeline
    participant Embedder as Embedder
    participant Chunker as Chunker
    participant Indexer as VectorIndexer
    participant FAISS as FAISS Index
    participant Results as Retrieved Chunks
    User->>Pipeline: process_notes(text)
    Pipeline->>Chunker: chunk_text(text)
    Chunker-->>Pipeline: chunks[]
    Pipeline->>Embedder: embed(chunks)
    Embedder-->>Pipeline: embeddings[]
    Pipeline->>Indexer: build_index(embeddings, chunks)
    Indexer->>FAISS: add(embeddings)
    FAISS-->>Indexer: indexed
    User->>Pipeline: semantic_search(query)
    Pipeline->>Embedder: embed(query)
    Embedder-->>Pipeline: query_embedding
    Pipeline->>Indexer: search(query_embedding, k=3)
    Indexer->>FAISS: search_neighbors()
    FAISS-->>Indexer: nearest indices
    Indexer-->>Pipeline: matched_chunks[]
    Pipeline-->>Results: return chunks
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 13
Note
Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.
🤖 Fix all issues with AI agents
In `@smart-notes/rag_mvp/embeddings/chunker.py`:
- Around line 9-31: Remove the duplicated logic by replacing
EmbeddingPipeline.chunk_text with a call to the single utility chunk_text in
chunker.py: import chunk_text and use it instead of reimplementing the loop;
ensure you pass the desired max_length (align defaults or explicitly pass 300
where pipeline relied on that) and keep the strip/empty-check behavior by
relying on the utility's implementation, then delete the
EmbeddingPipeline.chunk_text method to avoid divergence and update any callers
to use the imported chunk_text or the pipeline wrapper that delegates to it.
- Around line 9-29: The chunk_text function can hang when overlap >= max_length;
add input validation at the top of chunk_text to ensure 0 <= overlap <
max_length (raise ValueError or clamp overlap to max_length-1), and/or ensure
the loop always makes forward progress by computing next_start = max(end -
overlap, start + 1) and assigning start = next_start (or breaking if next_start
<= start). Update the chunk_text function to enforce the overlap constraint and
guarantee start increases each iteration to prevent infinite loops.
In `@smart-notes/rag_mvp/embeddings/indexer.py`:
- Around line 23-28: In the add method of the indexer (function add), validate
that embeddings and chunks have the same length before calling self.index.add
and self.texts.extend; if len(embeddings) != len(chunks) raise a clear
ValueError (or handle by truncating with an explicit log) so the FAISS index
(self.index.add) and the parallel list (self.texts.extend) cannot get out of
sync; locate the checks around the calls to self.index.add and self.texts.extend
and enforce this validation early in the method.
- Around line 37-39: FAISS may return -1 for missing neighbors so the current
loop "for idx in indices[0]" with check "if idx < len(self.texts)" will allow
negative indices and return the last item; change the guard to only accept
non-negative, in-range indices (e.g., ensure 0 <= idx < len(self.texts)) and
coerce numpy types to Python ints if needed before indexing; update the loop
that builds results (which iterates over indices[0] and accesses self.texts) to
skip any idx < 0.
In `@smart-notes/rag_mvp/pipelines/embedding_pipeline.py`:
- Around line 3-4: The module currently hard-imports sentence_transformers and
faiss which will crash at import time; update EmbeddingPipeline in
embedding_pipeline.py to delegate to the existing embeddings package utilities
(Embedder, VectorIndexer, chunk_text) instead of directly importing
sentence_transformers or faiss. Replace direct use of SentenceTransformer and
faiss with calls to Embedder and VectorIndexer (and use chunk_text for text
splitting), so optional-dependency handling in those classes is reused and
importing embedding_pipeline no longer fails when optional libs are absent.
- Line 10: Replace the hardcoded Windows cache path in the SentenceTransformer
instantiation (self.model = SentenceTransformer(model_name,
cache_folder="D:/models_cache")) with a platform-agnostic approach: import os,
read an environment variable like SMARTNOTES_MODEL_CACHE (or fallback to
None/expanduser-based path) and pass that value as cache_folder, or omit the
cache_folder argument entirely to use the library default; update the
SentenceTransformer call and ensure the new os import is added where
SentenceTransformer is constructed.
- Around line 14-24: Remove the duplicate chunk_text method from class in
embedding_pipeline.py and instead call/import the canonical function from
embeddings.chunker (e.g., from embeddings.chunker import chunk_text) to ensure
consistent behavior and defaults; update any usage inside the class (references
to self.chunk_text) to call chunk_text(text, max_length=500, overlap=50) or omit
overrides so the canonical default is used, and rely on the chunker
implementation's strip/empty-check guards rather than reimplementing them here.
- Around line 41-46: semantic_search currently assumes self.index is initialized
and that FAISS returns valid indices; guard against calling semantic_search
before process_notes by checking self.index is not None (and raise a clear
exception or return an empty list), and sanitize FAISS results by filtering out
negative indices and any indices out of range before using them to index
self.chunks (e.g., drop -1 entries and limit results to available chunks when
top_k > number of vectors). Update the semantic_search method to validate
self.index, validate/clean indices from self.index.search, and return only safe,
existing chunk entries instead of indexing with -1 or None.
In `@smart-notes/rag_mvp/qa_cli.py`:
- Around line 63-80: The current search_notes implementation uses substring
matching ("word in sentence_lower") which yields false positives (e.g., "AI"
matching "wait"); update search logic in search_notes to perform word-boundary,
case-insensitive matches by compiling a regex for each query word using word
boundaries (and escaping the query word) and testing that regex against each
sentence (e.g., build re.compile(r'\b' + re.escape(q) + r'\b', re.IGNORECASE)
for each q in query_words) before appending to results; ensure you reference
query_words, split_sentences, and the results append block when making this
change.
- Around line 4-5: Top-level import of EmbeddingPipeline causes the CLI to fail
when ML deps (sentence_transformers/faiss) are missing; move the import line
"from rag_mvp.pipelines.embedding_pipeline import EmbeddingPipeline" into the
demo_embeddings_pipeline() function so the heavy modules are only loaded when
that demo runs, and correct the comment typo "emedding" → "embedding"; update
any references inside demo_embeddings_pipeline() to use the locally imported
EmbeddingPipeline.
In `@smart-notes/rag_mvp/README.md`:
- Around line 46-84: The README contains two concatenated README documents
starting at the duplicate header "Smart Notes – RAG MVP (Embeddings & FAISS)";
fix by either merging the duplicate sections into a single cohesive README
(combine Features, Tech Stack, and Project Structure under one header and remove
repeated text) or split the second document into a separate file (e.g.,
README_RAG_MVP.md) and update any top-level README links/TOC accordingly; ensure
the remaining README references the correct filenames and that the unique header
"Smart Notes – RAG MVP (Embeddings & FAISS)" appears only once.
- Around line 75-84: Update the README project structure to match the actual
repo layout: replace references to embed.py, index.py, utils.py, and notes/*.txt
with the real modules and paths such as embeddings/chunker.py,
embeddings/embedder.py, embeddings/indexer.py, pipelines/embedding_pipeline.py,
and notes/*.md (or explicitly map the old names to these new files if you want
to preserve conceptual equivalence); ensure the tree snippet and any text
describing where to put notes reflect notes/*.md and the embeddings/ and
pipelines/ directories so contributors are not confused.
- Around line 28-44: The README has an unclosed triple-backtick fence started
with "```bash" (around the python run example) causing the rest of the file to
be rendered as code; close that fence immediately after the command invocation
(add a terminating "```" right after the `python smart-notes/rag_mvp/qa_cli.py`
line) and move the example CLI interaction (the prompt/response lines like ">>
what is AI" and the output "[1] From test.md: ...") outside the fenced code
block as a separate fenced block or plain text so the rest of the document
renders normally.
🟡 Minor comments (9)
smart-notes/rag_mvp/embeddings/embedder.py-25-27 (1)
25-27: ⚠️ Potential issue | 🟡 Minor

Empty return shape is inconsistent with normal return.

`model.encode(...)` returns a 2D array `(n, dim)`, but the empty fallback `np.array([])` returns a 1D array with shape `(0,)`. Callers that inspect `.shape[1]` or pass this to FAISS will get an `IndexError`. Return a properly shaped 2D array instead.

Proposed fix

```diff
 def embed(self, texts: List[str]) -> np.ndarray:
     if not texts:
-        return np.array([])
+        return np.empty((0, self.model.get_sentence_embedding_dimension()))
     embeddings = self.model.encode(texts, convert_to_numpy=True)
     return embeddings
```

smart-notes/rag_mvp/qa_cli.py-39-39 (1)
39-39: ⚠️ Potential issue | 🟡 Minor

Relative `NOTES_DIR` depends on the working directory.

`NOTES_DIR = "notes"` resolves relative to the CWD, not to the script's location. Running the CLI from a different directory will fail to find notes. Consider resolving relative to the script.

Proposed fix

```diff
-NOTES_DIR = "notes"
+NOTES_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "..", "..", "notes")
```

Or use a configurable path via an environment variable as a fallback.
Design/local markdown storage.md-24-28 (1)
24-28: ⚠️ Potential issue | 🟡 Minor

Fix heading spacing and list formatting.

The heading has extra spaces after the hash marks, and the list items should use proper markdown bullet syntax for consistency with standard markdown formatting.

📝 Proposed fix

```diff
-###  3. Non-Goals
+### 3. Non-Goals

-Cloud sync
-Real-time collaboration
-AI processing (handled in separate layers later)
+- Cloud sync
+- Real-time collaboration
+- AI processing (handled in separate layers later)
```

Design/local markdown storage.md-5-5 (1)
5-5: ⚠️ Potential issue | 🟡 Minor

Fix excessive whitespace.

The phrase "offline reliability" contains multiple consecutive spaces that should be reduced to a single space for proper formatting.

📝 Proposed fix

```diff
-Most note-taking apps store user data in cloud services or proprietary formats, which reduces user control and offline    reliability.
+Most note-taking apps store user data in cloud services or proprietary formats, which reduces user control and offline reliability.
```

Design/local markdown storage.md-13-16 (1)
13-16: ⚠️ Potential issue | 🟡 Minor

Improve list formatting and fix typo.

This section has inconsistent formatting and contains a typo:

- The list items lack proper markdown bullet syntax
- "Serives" should be "Services" on line 15
- The colon placement is unconventional

The content would be clearer with standard markdown list formatting.

📝 Proposed fix

```diff
-Offline-first experience : - NO Internet, - NO Cloud Serives, - NO Server
+Offline-first experience:
+- NO Internet
+- NO Cloud Services
+- NO Server
```

Design/local markdown storage.md-1-3 (1)
1-3: ⚠️ Potential issue | 🟡 Minor

Fix heading hierarchy.

The document starts with an h1 heading (line 1) but then jumps directly to h3 headings (line 3). Markdown heading levels should increment by one level at a time. All section headings should be h2 (`##`) to maintain proper hierarchy.

📝 Proposed fix

Change all section headings from `###` to `##`:

```diff
 # Local Markdown Storage Architecture

-### 1. Problem Statement
+## 1. Problem Statement
```

Apply this change to all numbered sections (lines 3, 11, 24, 30, 40, 55, 71, 91, 109, 128, 165, 171).
Design/local markdown storage.md-101-107 (1)
101-107: ⚠️ Potential issue | 🟡 Minor

Format example as a proper code block.

The frontmatter example should be enclosed in a markdown code block with proper syntax highlighting for better readability and to prevent it from being interpreted as actual frontmatter.

📝 Proposed fix

````diff
-* Example :
----
-title: My First Note
-tags: [gsoc, ideas]
-created: 2026-02-12
----
-This is my note content.
+**Example:**
+
+```markdown
+---
+title: My First Note
+tags: [gsoc, ideas]
+created: 2026-02-12
+---
+This is my note content.
+```
````

Design/local markdown storage.md-120-126 (1)
120-126: ⚠️ Potential issue | 🟡 Minor

Use consistent heading style.

The document mixes atx-style headings (`###`) with setext-style headings (underlined with `---`). Stick to atx-style headings throughout for consistency, as indicated by the markdown linter.

📝 Proposed fix

```diff
 Avoid reloading everything on every file change
----
-Why this helps:
+
+**Why this helps:**
 Faster app startup
```

Design/local markdown storage.md-128-163 (1)
128-163: ⚠️ Potential issue | 🟡 Minor

Standardize heading styles and improve section markers.

This section uses setext-style heading underlines (dashes) instead of atx-style headings. Use consistent `###` or `####` heading markers throughout. Also, the horizontal rule separators (`-------`) are excessive and inconsistent.

📝 Proposed fix for heading styles

Replace setext-style headings with atx-style:

```diff
-* TRADE-OFFS
+**Trade-offs:**
 Using plain markdown files keeps things simple and transparent...
 Parsing frontmatter on every file adds a small performance cost. Very large folders can make the first scan slower when the app starts.
----------------------------------------
-* RISKS
+
+**Risks:**
 Files may get out of sync if the user edits...
 Performance may degrade for extremely large note collections (10k+ files).
-----------------------------------------
-* MITIGATIONS
+
+**Mitigations:**
 Handle file and parse errors gracefully...
 Fall back to a full re-scan if the file watcher misses events.
--------------------------------------------
```
🧹 Nitpick comments (9)
smart-notes/rag_mvp/embeddings/indexer.py (1)
34-34: Prefix unused `distances` with underscore.

Per static analysis (Ruff RUF059), `distances` is unpacked but never used.

Proposed fix

```diff
-        distances, indices = self.index.search(query_embedding.reshape(1, -1), k)
+        _distances, indices = self.index.search(query_embedding.reshape(1, -1), k)
```

smart-notes/rag_mvp/qa_cli.py (1)
85-109: Unguarded `demo_embeddings_pipeline()` call and missing `KeyboardInterrupt` handling.

The demo is called unconditionally on every CLI start (Line 87), adding startup latency from model loading even when the user only wants keyword search. Additionally, `Ctrl+C` during the input loop will print a raw traceback.

♻️ Proposed fix

```diff
 if __name__ == "__main__":
-
-    demo_embeddings_pipeline()  # Temporary demo for embeddings pipeline
-
     notes = load_notes()
     print("Ask questions about your notes (type 'exit' to quit)\n")
-    while True:
-        query = input(">> ").strip()
-
-        if query.lower() == "exit":
-            print("Goodbye 👋")
-            break
-
-        matches = search_notes(query, notes)
-
-        if not matches:
-            print("No relevant notes found.\n")
-        else:
-            print("\n--- Answers ---\n")
-            for i, m in enumerate(matches, 1):
-                print(f"[{i}] From {m['filename']}:")
-                print(m["sentence"])
-                print()
+    try:
+        while True:
+            query = input(">> ").strip()
+
+            if query.lower() == "exit":
+                break
+
+            matches = search_notes(query, notes)
+
+            if not matches:
+                print("No relevant notes found.\n")
+            else:
+                print("\n--- Answers ---\n")
+                for i, m in enumerate(matches, 1):
+                    print(f"[{i}] From {m['filename']}:")
+                    print(m["sentence"])
+                    print()
+    except (KeyboardInterrupt, EOFError):
+        pass
+
+    print("Goodbye 👋")
```

Design/local markdown storage.md (7)
42-42: Clarify "Editor folder" terminology.

The term "Editor folder" is unclear and doesn't appear elsewhere in the document. Consider using a more descriptive term like "workspace folder" or "notes directory" for consistency with the rest of the architecture description.

📝 Suggested clarification

```diff
-The app treats a Editor folder as a "notes workspace."
+The app treats a designated folder as a "notes workspace."
```

Or be more specific:

```diff
-The app treats a Editor folder as a "notes workspace."
+The app designates a folder (e.g., Notes/) as the "notes workspace."
```
47-47: Fix inconsistent quotation marks and formatting.

The quotation marks around technical terms are inconsistently applied and the formatting needs improvement. Use consistent formatting for technical terms throughout the document.

📝 Proposed fix

```diff
- "in-memory index " and "local db(SQLite)
+- An in-memory index and local database (SQLite)
```
65-69: Consider using code block for directory structure.

The directory structure would be more readable and properly formatted using a markdown code block instead of custom arrow symbols.

📝 Suggested improvement

````diff
-* STRUCTURE :
-
-Notes/
-|--->Note1.md
-|--->Note2.md
+**Structure:**
+
+```
+Notes/
+├── Note1.md
+└── Note2.md
+```
````
73-74: Improve grammar and clarity.

The explanation has grammar issues and could be more concise. Also, tool names should be properly capitalized.

📝 Proposed fix

```diff
-WHY needed?
- User may write notes outside the app like (vs code, notepad) then app automatically detect changes and update by using FOLDER WATCHER
+**Why needed?**
+
+Users may edit notes outside the app (e.g., VS Code, Notepad). The folder watcher automatically detects these external changes and updates the app accordingly.
```
40-53: Consider documenting the SQLite database schema and purpose.

The architecture mentions both an "in-memory index" and a "local db (SQLite)" but doesn't explain what each stores or why both are needed. Future implementers would benefit from understanding:
- What data goes in SQLite vs. in-memory?
- Is SQLite used for persistence of the index?
- What happens if SQLite DB is corrupted or deleted?
This clarification would strengthen the architecture document.
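As a purely illustrative sketch (the table and column names below are hypothetical, not part of this PR), one shape such a persisted index could take is a single metadata table that mirrors the in-memory index:

```python
import sqlite3

# Hypothetical schema sketch: persist parsed note metadata so the in-memory
# index can be rebuilt at startup without re-reading every markdown file.
conn = sqlite3.connect(":memory:")  # a real app would use a file-backed DB
conn.execute("""
    CREATE TABLE IF NOT EXISTS notes_index (
        path    TEXT PRIMARY KEY,  -- path relative to the notes workspace
        title   TEXT,
        tags    TEXT,              -- e.g. a JSON-encoded list from frontmatter
        created TEXT,
        mtime   REAL               -- modification time, to skip unchanged files
    )
""")
conn.execute(
    "INSERT OR REPLACE INTO notes_index VALUES (?, ?, ?, ?, ?)",
    ("Note1.md", "My First Note", '["gsoc", "ideas"]', "2026-02-12", 0.0),
)
title = conn.execute(
    "SELECT title FROM notes_index WHERE path = ?", ("Note1.md",)
).fetchone()[0]
```

If the DB is only a cache of what the markdown files already contain, corruption or deletion is recoverable by a full re-scan, and documenting that explicitly would answer the third question above.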
55-63: Consider documenting file encoding and character set assumptions.

The storage model section doesn't specify file encoding expectations. Consider explicitly stating that:
- All markdown files should be UTF-8 encoded
- How the app handles files with different encodings or BOM markers
- How line ending differences across platforms (CRLF vs LF) are handled
This would prevent implementation issues and edge cases.
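If the document settles on a UTF-8-first policy, one tolerant way to read notes (a sketch, not code from this PR) is the stdlib `utf-8-sig` codec, which decodes plain UTF-8 and also strips a leading BOM:

```python
import os
import tempfile
from pathlib import Path

def read_note(path: str) -> str:
    # "utf-8-sig" decodes plain UTF-8 and silently drops a UTF-8 BOM if present
    return Path(path).read_text(encoding="utf-8-sig")

# Demonstration with a BOM-prefixed file, as Windows editors sometimes produce
fd, tmp = tempfile.mkstemp(suffix=".md")
with os.fdopen(fd, "wb") as f:
    f.write(b"\xef\xbb\xbf# Hello")
content = read_note(tmp)  # BOM removed on read
os.unlink(tmp)
```

Line endings need no special handling for indexing, since Python's text mode normalizes CRLF to LF on read.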
128-163: Missing discussion: handling binary attachments and images.

The document focuses on markdown text files but doesn't address how images, PDFs, or other binary attachments referenced in markdown notes should be stored and managed. Consider adding:
- Where attachments are stored relative to markdown files
- How the app handles broken image links
- Whether attachments are watched for changes
- Size limits or recommendations for attachments
This is relevant given the "thousands of files" scalability goal.
```python
def chunk_text(text: str, max_length: int = 500, overlap: int = 50) -> List[str]:
    if not text:
        return []

    chunks = []
    start = 0
    text = text.strip()

    while start < len(text):
        end = start + max_length
        chunk = text[start:end].strip()

        if chunk:
            chunks.append(chunk)

        if end >= len(text):
            break

        start = end - overlap
        if start < 0:
            start = 0

    return chunks
```
🛠️ Refactor suggestion | 🟠 Major
Duplicate chunk_text logic exists in embedding_pipeline.py.
EmbeddingPipeline.chunk_text (in smart-notes/rag_mvp/pipelines/embedding_pipeline.py, Lines 14-24) reimplements the same chunking logic with different defaults (max_length=300) and without the strip/empty-check guards. The pipeline should import and reuse this utility instead of maintaining a divergent copy.
🤖 Prompt for AI Agents
In `@smart-notes/rag_mvp/embeddings/chunker.py` around lines 9 - 31, Remove the
duplicated logic by replacing EmbeddingPipeline.chunk_text with a call to the
single utility chunk_text in chunker.py: import chunk_text and use it instead of
reimplementing the loop; ensure you pass the desired max_length (align defaults
or explicitly pass 300 where pipeline relied on that) and keep the
strip/empty-check behavior by relying on the utility's implementation, then
delete the EmbeddingPipeline.chunk_text method to avoid divergence and update
any callers to use the imported chunk_text or the pipeline wrapper that
delegates to it.
```python
def chunk_text(text: str, max_length: int = 500, overlap: int = 50) -> List[str]:
    if not text:
        return []

    chunks = []
    start = 0
    text = text.strip()

    while start < len(text):
        end = start + max_length
        chunk = text[start:end].strip()

        if chunk:
            chunks.append(chunk)

        if end >= len(text):
            break

        start = end - overlap
        if start < 0:
            start = 0
```
Infinite loop when overlap >= max_length.
If overlap >= max_length, start never advances (or regresses), causing an infinite loop. The guard on Line 28 only clamps to 0 but doesn't break the cycle. Add input validation or ensure forward progress.
🐛 Proposed fix
```diff
 def chunk_text(text: str, max_length: int = 500, overlap: int = 50) -> List[str]:
-    if not text:
+    if not text or max_length <= 0:
         return []
+
+    if overlap >= max_length:
+        raise ValueError("overlap must be less than max_length")
     chunks = []
     start = 0
```

🤖 Prompt for AI Agents
start = 0🤖 Prompt for AI Agents
In `@smart-notes/rag_mvp/embeddings/chunker.py` around lines 9 - 29, The
chunk_text function can hang when overlap >= max_length; add input validation at
the top of chunk_text to ensure 0 <= overlap < max_length (raise ValueError or
clamp overlap to max_length-1), and/or ensure the loop always makes forward
progress by computing next_start = max(end - overlap, start + 1) and assigning
start = next_start (or breaking if next_start <= start). Update the chunk_text
function to enforce the overlap constraint and guarantee start increases each
iteration to prevent infinite loops.
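Putting the suggested validation together, a complete version of `chunk_text` might read as follows (a sketch along the lines of the proposed fix, not the committed code):

```python
from typing import List

def chunk_text(text: str, max_length: int = 500, overlap: int = 50) -> List[str]:
    if not text or max_length <= 0:
        return []
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must satisfy 0 <= overlap < max_length")

    text = text.strip()
    chunks = []
    start = 0
    while start < len(text):
        end = start + max_length
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end >= len(text):
            break
        # Forward progress is guaranteed: overlap < max_length implies
        # end - overlap > start on every iteration.
        start = end - overlap
    return chunks
```

With the constraint enforced up front, the `start < 0` clamp in the original loop becomes unnecessary.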
```python
def add(self, embeddings: np.ndarray, chunks: List[str]):
    if len(embeddings) == 0:
        return

    self.index.add(embeddings)
    self.texts.extend(chunks)
```
Missing validation: embeddings and chunks length mismatch.
If len(embeddings) != len(chunks), the parallel self.texts list silently drifts out of sync with the FAISS index, causing search to return wrong text for a given vector.
🛡️ Proposed fix

```diff
 def add(self, embeddings: np.ndarray, chunks: List[str]):
     if len(embeddings) == 0:
         return
+    if len(embeddings) != len(chunks):
+        raise ValueError(
+            f"embeddings length ({len(embeddings)}) != chunks length ({len(chunks)})"
+        )
+
     self.index.add(embeddings)
     self.texts.extend(chunks)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def add(self, embeddings: np.ndarray, chunks: List[str]):
    if len(embeddings) == 0:
        return
    if len(embeddings) != len(chunks):
        raise ValueError(
            f"embeddings length ({len(embeddings)}) != chunks length ({len(chunks)})"
        )

    self.index.add(embeddings)
    self.texts.extend(chunks)
```
🤖 Prompt for AI Agents
In `@smart-notes/rag_mvp/embeddings/indexer.py` around lines 23 - 28, In the add
method of the indexer (function add), validate that embeddings and chunks have
the same length before calling self.index.add and self.texts.extend; if
len(embeddings) != len(chunks) raise a clear ValueError (or handle by truncating
with an explicit log) so the FAISS index (self.index.add) and the parallel list
(self.texts.extend) cannot get out of sync; locate the checks around the calls
to self.index.add and self.texts.extend and enforce this validation early in the
method.
```python
for idx in indices[0]:
    if idx < len(self.texts):
        results.append(self.texts[idx])
```
Bug: FAISS returns -1 for missing neighbors, which passes the bounds check.
When the index has fewer than k entries, FAISS pads results with index -1. In Python, -1 < len(self.texts) is always True, so self.texts[-1] silently returns the last chunk — a wrong result.
🐛 Proposed fix

```diff
 for idx in indices[0]:
-    if idx < len(self.texts):
+    if 0 <= idx < len(self.texts):
         results.append(self.texts[idx])
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
for idx in indices[0]:
    if 0 <= idx < len(self.texts):
        results.append(self.texts[idx])
```
🤖 Prompt for AI Agents
In `@smart-notes/rag_mvp/embeddings/indexer.py` around lines 37 - 39, FAISS may
return -1 for missing neighbors so the current loop "for idx in indices[0]" with
check "if idx < len(self.texts)" will allow negative indices and return the last
item; change the guard to only accept non-negative, in-range indices (e.g.,
ensure 0 <= idx < len(self.texts)) and coerce numpy types to Python ints if
needed before indexing; update the loop that builds results (which iterates over
indices[0] and accesses self.texts) to skip any idx < 0.
```python
from sentence_transformers import SentenceTransformer
import faiss
```
Hard imports of optional dependencies will crash at import time.
Unlike embedder.py and indexer.py which use try/except to give a friendly error at instantiation, this module fails entirely if sentence_transformers or faiss are missing. This also prevents importing the module for testing or documentation purposes.
Proposed fix — use the existing utility modules
The embeddings/ package already provides Embedder, VectorIndexer, and chunk_text with proper optional-dependency handling. Consider refactoring EmbeddingPipeline to delegate to those modules instead of duplicating their logic and imports.
🤖 Prompt for AI Agents
In `@smart-notes/rag_mvp/pipelines/embedding_pipeline.py` around lines 3 - 4, The
module currently hard-imports sentence_transformers and faiss which will crash
at import time; update EmbeddingPipeline in embedding_pipeline.py to delegate to
the existing embeddings package utilities (Embedder, VectorIndexer, chunk_text)
instead of directly importing sentence_transformers or faiss. Replace direct use
of SentenceTransformer and faiss with calls to Embedder and VectorIndexer (and
use chunk_text for text splitting), so optional-dependency handling in those
classes is reused and importing embedding_pipeline no longer fails when optional
libs are absent.
```python
#-------------------emedding-pipeline-chunking concept
from rag_mvp.pipelines.embedding_pipeline import EmbeddingPipeline
```
Top-level import of EmbeddingPipeline breaks the CLI when ML dependencies are absent.
This import eagerly loads sentence_transformers and faiss (which are hard-imported in embedding_pipeline.py). If those heavy dependencies aren't installed, the entire CLI — including the dependency-free keyword search — fails to start. Move this import inside demo_embeddings_pipeline().
Also, minor typo: "emedding" → "embedding".
🐛 Proposed fix
🐛 Proposed fix

```diff
-#-------------------emedding-pipeline-chunking concept
-from rag_mvp.pipelines.embedding_pipeline import EmbeddingPipeline
-
 def demo_embeddings_pipeline():
+    # embedding-pipeline-chunking concept
+    from rag_mvp.pipelines.embedding_pipeline import EmbeddingPipeline
+
     pipeline = EmbeddingPipeline()
```

🤖 Prompt for AI Agents
In `@smart-notes/rag_mvp/qa_cli.py` around lines 4 - 5, Top-level import of
EmbeddingPipeline causes the CLI to fail when ML deps
(sentence_transformers/faiss) are missing; move the import line "from
rag_mvp.pipelines.embedding_pipeline import EmbeddingPipeline" into the
demo_embeddings_pipeline() function so the heavy modules are only loaded when
that demo runs, and correct the comment typo "emedding" → "embedding"; update
any references inside demo_embeddings_pipeline() to use the locally imported
EmbeddingPipeline.
```python
def search_notes(query, notes):
    results = []

    query_words = [
        word.lower()
        for word in query.split()
        if word.lower() not in QUESTION_WORDS
    ]

    for note in notes:
        sentences = split_sentences(note["content"])
        for sentence in sentences:
            sentence_lower = sentence.lower()
            if any(word in sentence_lower for word in query_words):
                results.append({
                    "filename": note["filename"],
                    "sentence": sentence.strip()
                })
```
Substring matching produces false positives.
word in sentence_lower is a substring check, not a word-boundary check. For example, querying "AI" will match sentences containing "wait", "train", "brain", etc.
🐛 Proposed fix — use word-boundary regex
```diff
+import re
+
 def search_notes(query, notes):
     results = []
     query_words = [
         word.lower()
         for word in query.split()
         if word.lower() not in QUESTION_WORDS
     ]
     for note in notes:
         sentences = split_sentences(note["content"])
         for sentence in sentences:
             sentence_lower = sentence.lower()
-            if any(word in sentence_lower for word in query_words):
+            if any(re.search(r'\b' + re.escape(word) + r'\b', sentence_lower) for word in query_words):
                 results.append({
                     "filename": note["filename"],
                     "sentence": sentence.strip()
                 })
```

🤖 Prompt for AI Agents
In `@smart-notes/rag_mvp/qa_cli.py` around lines 63 - 80, The current search_notes
implementation uses substring matching ("word in sentence_lower") which yields
false positives (e.g., "AI" matching "wait"); update search logic in
search_notes to perform word-boundary, case-insensitive matches by compiling a
regex for each query word using word boundaries (and escaping the query word)
and testing that regex against each sentence (e.g., build re.compile(r'\b' +
re.escape(q) + r'\b', re.IGNORECASE) for each q in query_words) before appending
to results; ensure you reference query_words, split_sentences, and the results
append block when making this change.
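The false positive is easy to reproduce in isolation (an illustrative check, not code from the PR):

```python
import re

sentence = "please wait for the train"
# Substring matching: the query word "ai" matches inside "wait" and "train"
assert "ai" in sentence.lower()
# Word-boundary matching rejects it, as the proposed fix intends
assert re.search(r"\b" + re.escape("ai") + r"\b", sentence.lower()) is None
# ...while still matching a genuine standalone occurrence
assert re.search(r"\b" + re.escape("ai") + r"\b", "ai is everywhere") is not None
```

`re.escape` matters here: without it, query words containing regex metacharacters (e.g. "c++") would raise or match incorrectly.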
````markdown
```bash
python smart-notes/rag_mvp/qa_cli.py

>> what is AI

[1] From test.md:
Artificial Intelligence (AI) is the simulation of human intelligence in machines.

>> what is machine learning
how is machine learning used
difference between AI and ML
````
**Broken markdown: an unclosed code block swallows the rest of the document.**

The `` ```bash `` block opened at line 28 is never closed. The example CLI interaction (lines 33-42) and all subsequent content render as part of the code block. Add a closing `` ``` `` after the run command and format the example interaction separately.
🤖 Prompt for AI Agents
In `@smart-notes/rag_mvp/README.md` around lines 28 - 44, The README has an
unclosed triple-backtick fence started with "```bash" (around the python run
example) causing the rest of the file to be rendered as code; close that fence
immediately after the command invocation (add a terminating "```" right after
the `python smart-notes/rag_mvp/qa_cli.py` line) and move the example CLI
interaction (the prompt/response lines like ">> what is AI" and the output "[1]
From test.md: ...") outside the fenced code block as a separate fenced block or
plain text so the rest of the document renders normally.
````markdown
# Smart Notes – RAG MVP (Embeddings & FAISS)

This project is a simple **Retrieval-Augmented Generation (RAG)** pipeline for Smart Notes.
It allows users to store notes, convert them into embeddings, and search relevant notes using vector similarity.

---

## 🚀 Features

- Convert notes into embeddings using Sentence Transformers
- Store and search embeddings using FAISS (CPU)
- CLI tool to ask questions about your notes
- Simple chunking for text files
- Works fully offline after model download

---

## 🧠 Tech Stack

- Python 3.10+
- sentence-transformers
- FAISS (faiss-cpu)
- HuggingFace Transformers

---

## 📁 Project Structure

```bash
smart-notes/
├── rag_mvp/
│   ├── embed.py     # Embedding logic
│   ├── index.py     # FAISS index creation
│   ├── qa_cli.py    # CLI for asking questions
│   └── utils.py     # Helper functions
├── notes/           # Put your .txt notes here
├── requirements.txt
└── README.md
```
````
**Two separate README documents appear concatenated into one file.**

Lines 46-84 introduce a second "Smart Notes – RAG MVP (Embeddings & FAISS)" header with its own Features, Tech Stack, and Project Structure sections. Either merge these cohesively into a single README or split them into separate files, one per MVP track.
🤖 Prompt for AI Agents
In `@smart-notes/rag_mvp/README.md` around lines 46 - 84, The README contains two
concatenated README documents starting at the duplicate header "Smart Notes –
RAG MVP (Embeddings & FAISS)"; fix by either merging the duplicate sections into
a single cohesive README (combine Features, Tech Stack, and Project Structure
under one header and remove repeated text) or split the second document into a
separate file (e.g., README_RAG_MVP.md) and update any top-level README
links/TOC accordingly; ensure the remaining README references the correct
filenames and that the unique header "Smart Notes – RAG MVP (Embeddings &
FAISS)" appears only once.
````markdown
```bash
smart-notes/
├── rag_mvp/
│   ├── embed.py     # Embedding logic
│   ├── index.py     # FAISS index creation
│   ├── qa_cli.py    # CLI for asking questions
│   └── utils.py     # Helper functions
├── notes/           # Put your .txt notes here
├── requirements.txt
└── README.md
```
````
**Documented project structure doesn't match the actual files.**

The structure references `embed.py`, `index.py`, `utils.py`, and `notes/*.txt`, but the PR actually adds `embeddings/chunker.py`, `embeddings/embedder.py`, `embeddings/indexer.py`, `pipelines/embedding_pipeline.py`, and `notes/*.md`. This will confuse contributors.
🤖 Prompt for AI Agents
In `@smart-notes/rag_mvp/README.md` around lines 75 - 84, Update the README
project structure to match the actual repo layout: replace references to
embed.py, index.py, utils.py, and notes/*.txt with the real modules and paths
such as embeddings/chunker.py, embeddings/embedder.py, embeddings/indexer.py,
pipelines/embedding_pipeline.py, and notes/*.md (or explicitly map the old names
to these new files if you want to preserve conceptual equivalence); ensure the
tree snippet and any text describing where to put notes reflect notes/*.md and
the embeddings/ and pipelines/ directories so contributors are not confused.
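Based on the file list this comment cites (and the run command `python smart-notes/rag_mvp/qa_cli.py` shown earlier), a corrected structure section might look roughly like the following; the exact layout is an assumption and should be confirmed against the repository:

```bash
smart-notes/
├── rag_mvp/
│   ├── embeddings/
│   │   ├── chunker.py              # Text chunking
│   │   ├── embedder.py             # Sentence-transformers embeddings
│   │   └── indexer.py              # FAISS index build/search
│   ├── pipelines/
│   │   └── embedding_pipeline.py   # End-to-end embed + search pipeline
│   ├── qa_cli.py                   # CLI for asking questions
│   └── README.md
├── notes/                          # Put your .md notes here
└── requirements.txt
```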
**📌 What this PR does**

Adds documentation for the Local Markdown Storage Architecture: how notes are stored locally as markdown files, how metadata and folder watching work, and how the system scales to large note collections.

**🎯 Why this change**

Gives contributors and mentors a clear view of the storage design, key decisions, and trade-offs, and lays the foundation for future features such as search and an AI layer.

**🧩 What's included**

- Local-first storage model (filesystem as source of truth)
- Metadata handling with frontmatter
- Folder watching strategy
- Performance considerations
- Trade-offs, risks, and mitigations
- Future extension points

**🧪 How to review**

- Read the added README/architecture section (`Design/local markdown storage.md`)
- Check that it aligns with project goals (offline-first, privacy-first)
- Leave feedback on edge cases or missing scenarios
**Summary by CodeRabbit**

- New Features
- Documentation