
Local Markdown Storage Architecture #44

Open
tubajavedd wants to merge 4 commits into AOSSIE-Org:main from tubajavedd:Local-markdown-storage-arch-design

Conversation


@tubajavedd tubajavedd commented Feb 12, 2026


📌 What this PR does

Adds documentation for the Local Markdown Storage Architecture, explaining how notes are stored locally as markdown files, how metadata and folder watching work, and how the system scales for large note collections.
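
For a quick sense of the frontmatter-based metadata involved, here is a minimal sketch of reading one note's metadata in Python; PyYAML and the example path are assumptions for illustration, not part of this PR:

```python
import yaml  # PyYAML, assumed for this illustration only
from pathlib import Path

def read_note(path: Path) -> tuple[dict, str]:
    """Split a markdown note into (frontmatter dict, body text)."""
    text = path.read_text(encoding="utf-8")
    if text.startswith("---"):
        # The frontmatter block sits between the first two '---' delimiters.
        _, frontmatter, body = text.split("---", 2)
        return yaml.safe_load(frontmatter) or {}, body.lstrip()
    return {}, text

meta, body = read_note(Path("Notes/Note1.md"))  # example path from the design doc
print(meta.get("title"), meta.get("tags"))
```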

🎯 Why this change

This gives contributors and mentors a clear view of the storage design, key decisions, and trade-offs, and sets a foundation for future features like search and an AI layer.

🧩 What’s included

Local-first storage model (filesystem as source of truth)

Metadata handling with frontmatter

Folder watching strategy

Performance considerations

Trade-offs, risks, and mitigations

Future extension points

🧪 How to review

Read the added architecture document: `Design/local markdown storage.md`

Check if it aligns with project goals (offline-first, privacy-first)

Leave feedback on edge cases or missing scenarios

Summary by CodeRabbit

  • New Features

    • Interactive CLI for natural-language queries over markdown notes
    • Semantic search capability powered by local embeddings
    • Landing page UI design with privacy-first and offline-by-default approach
  • Documentation

    • Local markdown storage architecture guide
    • RAG MVP project documentation with setup and usage examples


coderabbitai bot commented Feb 12, 2026

📝 Walkthrough


This PR introduces a local markdown storage system with retrieval-augmented generation (RAG) capabilities. It includes architecture design documentation, UI design specification, Python implementation of embedding-based semantic search using sentence-transformers and FAISS vector indexing, and an interactive CLI for Q&A functionality.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Configuration & Design Architecture**: `.gitignore`, `Design/local markdown storage.md`, `smart-notes-design/README.md` | Adds ignore rule for notes/ directory; introduces comprehensive design document for offline-first local markdown storage with privacy preservation; includes UI design README for landing page with Bento-style card layout and privacy-first approach. |
| **RAG Embedding Infrastructure**: `smart-notes/rag_mvp/embeddings/chunker.py`, `smart-notes/rag_mvp/embeddings/embedder.py`, `smart-notes/rag_mvp/embeddings/indexer.py` | Implements text chunking with overlap, sentence-transformers wrapper for embedding generation, and FAISS-based vector indexing with nearest-neighbor search. |
| **RAG Pipeline & Query Interface**: `smart-notes/rag_mvp/pipelines/embedding_pipeline.py`, `smart-notes/rag_mvp/qa_cli.py` | Introduces EmbeddingPipeline orchestrating chunking, embedding, and semantic search; provides interactive CLI with note loading, sentence splitting, keyword matching, and demo functionality. |
| **Documentation**: `smart-notes/rag_mvp/README.md` | Comprehensive README documenting both Local Q&A (keyword-based) and RAG MVP (embeddings-based) implementations with features, workflow, run instructions, and project structure. |

Sequence Diagram(s)

sequenceDiagram
    participant User as User/CLI
    participant Pipeline as EmbeddingPipeline
    participant Embedder as Embedder
    participant Chunker as Chunker
    participant Indexer as VectorIndexer
    participant FAISS as FAISS Index
    participant Results as Retrieved Chunks

    User->>Pipeline: process_notes(text)
    Pipeline->>Chunker: chunk_text(text)
    Chunker-->>Pipeline: chunks[]
    Pipeline->>Embedder: embed(chunks)
    Embedder-->>Pipeline: embeddings[]
    Pipeline->>Indexer: build_index(embeddings, chunks)
    Indexer->>FAISS: add(embeddings)
    FAISS-->>Indexer: indexed

    User->>Pipeline: semantic_search(query)
    Pipeline->>Embedder: embed(query)
    Embedder-->>Pipeline: query_embedding
    Pipeline->>Indexer: search(query_embedding, k=3)
    Indexer->>FAISS: search_neighbors()
    FAISS-->>Indexer: nearest indices
    Indexer-->>Pipeline: matched_chunks[]
    Pipeline-->>Results: return chunks
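
Read alongside the diagram, here is a compressed sketch of the same flow using the underlying libraries directly; the model name and example texts are illustrative, not taken from the PR:

```python
from sentence_transformers import SentenceTransformer
import faiss

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model; the PR may use another
chunks = ["Notes are stored as local markdown files.",
          "A folder watcher picks up external edits."]

# build_index: embed the chunks and add them to a flat L2 FAISS index
embeddings = model.encode(chunks, convert_to_numpy=True)
index = faiss.IndexFlatL2(int(embeddings.shape[1]))
index.add(embeddings)

# semantic_search: embed the query and fetch the nearest chunks
query_vec = model.encode(["where are notes stored?"], convert_to_numpy=True)
_dists, ids = index.search(query_vec, 2)
matched = [chunks[int(i)] for i in ids[0] if 0 <= i < len(chunks)]
print(matched)
```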

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A markdown garden grows so fine,
With embeddings that softly shine,
Chunks and vectors dance in place,
Local storage, privacy's grace!
Now queries find their matching home. 🎩✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: it introduces the Local Markdown Storage Architecture design document, which is the primary focus of this PR. |


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 13

Note

Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.

🤖 Fix all issues with AI agents
In `@smart-notes/rag_mvp/embeddings/chunker.py`:
- Around line 9-31: Remove the duplicated logic by replacing
EmbeddingPipeline.chunk_text with a call to the single utility chunk_text in
chunker.py: import chunk_text and use it instead of reimplementing the loop;
ensure you pass the desired max_length (align defaults or explicitly pass 300
where pipeline relied on that) and keep the strip/empty-check behavior by
relying on the utility's implementation, then delete the
EmbeddingPipeline.chunk_text method to avoid divergence and update any callers
to use the imported chunk_text or the pipeline wrapper that delegates to it.
- Around line 9-29: The chunk_text function can hang when overlap >= max_length;
add input validation at the top of chunk_text to ensure 0 <= overlap <
max_length (raise ValueError or clamp overlap to max_length-1), and/or ensure
the loop always makes forward progress by computing next_start = max(end -
overlap, start + 1) and assigning start = next_start (or breaking if next_start
<= start). Update the chunk_text function to enforce the overlap constraint and
guarantee start increases each iteration to prevent infinite loops.

In `@smart-notes/rag_mvp/embeddings/indexer.py`:
- Around line 23-28: In the add method of the indexer (function add), validate
that embeddings and chunks have the same length before calling self.index.add
and self.texts.extend; if len(embeddings) != len(chunks) raise a clear
ValueError (or handle by truncating with an explicit log) so the FAISS index
(self.index.add) and the parallel list (self.texts.extend) cannot get out of
sync; locate the checks around the calls to self.index.add and self.texts.extend
and enforce this validation early in the method.
- Around line 37-39: FAISS may return -1 for missing neighbors so the current
loop "for idx in indices[0]" with check "if idx < len(self.texts]" will allow
negative indices and return the last item; change the guard to only accept
non-negative, in-range indices (e.g., ensure 0 <= idx < len(self.texts)) and
coerce numpy types to Python ints if needed before indexing; update the loop
that builds results (which iterates over indices[0] and accesses self.texts) to
skip any idx < 0.

In `@smart-notes/rag_mvp/pipelines/embedding_pipeline.py`:
- Around line 3-4: The module currently hard-imports sentence_transformers and
faiss which will crash at import time; update EmbeddingPipeline in
embedding_pipeline.py to delegate to the existing embeddings package utilities
(Embedder, VectorIndexer, chunk_text) instead of directly importing
sentence_transformers or faiss. Replace direct use of SentenceTransformer and
faiss with calls to Embedder and VectorIndexer (and use chunk_text for text
splitting), so optional-dependency handling in those classes is reused and
importing embedding_pipeline no longer fails when optional libs are absent.
- Line 10: Replace the hardcoded Windows cache path in the SentenceTransformer
instantiation (self.model = SentenceTransformer(model_name,
cache_folder="D:/models_cache")) with a platform-agnostic approach: import os,
read an environment variable like SMARTNOTES_MODEL_CACHE (or fallback to
None/expanduser-based path) and pass that value as cache_folder, or omit the
cache_folder argument entirely to use the library default; update the
SentenceTransformer call and ensure the new os import is added where
SentenceTransformer is constructed.
- Around line 14-24: Remove the duplicate chunk_text method from class in
embedding_pipeline.py and instead call/import the canonical function from
embeddings.chunker (e.g., from embeddings.chunker import chunk_text) to ensure
consistent behavior and defaults; update any usage inside the class (references
to self.chunk_text) to call chunk_text(text, max_length=500, overlap=50) or omit
overrides so the canonical default is used, and rely on the chunker
implementation's strip/empty-check guards rather than reimplementing them here.
- Around line 41-46: semantic_search currently assumes self.index is initialized
and that FAISS returns valid indices; guard against calling semantic_search
before process_notes by checking self.index is not None (and raise a clear
exception or return an empty list), and sanitize FAISS results by filtering out
negative indices and any indices out of range before using them to index
self.chunks (e.g., drop -1 entries and limit results to available chunks when
top_k > number of vectors). Update the semantic_search method to validate
self.index, validate/clean indices from self.index.search, and return only safe,
existing chunk entries instead of indexing with -1 or None.

In `@smart-notes/rag_mvp/qa_cli.py`:
- Around line 63-80: The current search_notes implementation uses substring
matching ("word in sentence_lower") which yields false positives (e.g., "AI"
matching "wait"); update search logic in search_notes to perform word-boundary,
case-insensitive matches by compiling a regex for each query word using word
boundaries (and escaping the query word) and testing that regex against each
sentence (e.g., build re.compile(r'\b' + re.escape(q) + r'\b', re.IGNORECASE)
for each q in query_words) before appending to results; ensure you reference
query_words, split_sentences, and the results append block when making this
change.
- Around line 4-5: Top-level import of EmbeddingPipeline causes the CLI to fail
when ML deps (sentence_transformers/faiss) are missing; move the import line
"from rag_mvp.pipelines.embedding_pipeline import EmbeddingPipeline" into the
demo_embeddings_pipeline() function so the heavy modules are only loaded when
that demo runs, and correct the comment typo "emedding" → "embedding"; update
any references inside demo_embeddings_pipeline() to use the locally imported
EmbeddingPipeline.

In `@smart-notes/rag_mvp/README.md`:
- Around line 46-84: The README contains two concatenated README documents
starting at the duplicate header "Smart Notes – RAG MVP (Embeddings & FAISS)";
fix by either merging the duplicate sections into a single cohesive README
(combine Features, Tech Stack, and Project Structure under one header and remove
repeated text) or split the second document into a separate file (e.g.,
README_RAG_MVP.md) and update any top-level README links/TOC accordingly; ensure
the remaining README references the correct filenames and that the unique header
"Smart Notes – RAG MVP (Embeddings & FAISS)" appears only once.
- Around line 75-84: Update the README project structure to match the actual
repo layout: replace references to embed.py, index.py, utils.py, and notes/*.txt
with the real modules and paths such as embeddings/chunker.py,
embeddings/embedder.py, embeddings/indexer.py, pipelines/embedding_pipeline.py,
and notes/*.md (or explicitly map the old names to these new files if you want
to preserve conceptual equivalence); ensure the tree snippet and any text
describing where to put notes reflect notes/*.md and the embeddings/ and
pipelines/ directories so contributors are not confused.
- Around line 28-44: The README has an unclosed triple-backtick fence started
with "```bash" (around the python run example) causing the rest of the file to
be rendered as code; close that fence immediately after the command invocation
(add a terminating "```" right after the `python smart-notes/rag_mvp/qa_cli.py`
line) and move the example CLI interaction (the prompt/response lines like ">>
what is AI" and the output "[1] From test.md: ...") outside the fenced code
block as a separate fenced block or plain text so the rest of the document
renders normally.
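
Two of the items above (the hardcoded cache folder and the eager import in qa_cli.py) have no inline diff further down, so here is a minimal sketch of one way they could be applied; the SMARTNOTES_MODEL_CACHE variable name is the one suggested above, not an existing setting:

```python
import os

def demo_embeddings_pipeline():
    # Imported lazily so the keyword-only CLI still starts when
    # sentence-transformers / faiss are not installed.
    from rag_mvp.pipelines.embedding_pipeline import EmbeddingPipeline
    pipeline = EmbeddingPipeline()
    pipeline.process_notes("Notes are stored as plain markdown files on disk.")
    print(pipeline.semantic_search("where are notes stored?"))

def resolve_model_cache():
    # Inside EmbeddingPipeline.__init__ this value would be passed as
    # SentenceTransformer(model_name, cache_folder=...); None keeps the library default.
    return os.environ.get("SMARTNOTES_MODEL_CACHE")
```
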
🟡 Minor comments (9)
smart-notes/rag_mvp/embeddings/embedder.py-25-27 (1)

25-27: ⚠️ Potential issue | 🟡 Minor

Empty return shape is inconsistent with normal return.

model.encode(...) returns a 2D array (n, dim), but the empty fallback np.array([]) returns a 1D array with shape (0,). Callers that inspect .shape[1] or pass this to FAISS will get an IndexError. Return a properly shaped 2D array instead.

Proposed fix
     def embed(self, texts: List[str]) -> np.ndarray:
         if not texts:
-            return np.array([])
+            return np.empty((0, self.model.get_sentence_embedding_dimension()))
 
         embeddings = self.model.encode(texts, convert_to_numpy=True)
         return embeddings
smart-notes/rag_mvp/qa_cli.py-39-39 (1)

39-39: ⚠️ Potential issue | 🟡 Minor

Relative NOTES_DIR depends on the working directory.

NOTES_DIR = "notes" resolves relative to CWD, not to the script's location. Running the CLI from a different directory will fail to find notes. Consider resolving relative to the script.

Proposed fix
-NOTES_DIR = "notes"
+NOTES_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "..", "..", "notes")

Or use a configurable path via an environment variable as a fallback.

Design/local markdown storage.md-24-28 (1)

24-28: ⚠️ Potential issue | 🟡 Minor

Fix heading spacing and list formatting.

The heading has extra spaces after the hash marks, and the list items should use proper markdown bullet syntax for consistency with standard markdown formatting.

📝 Proposed fix
-###  3. Non-Goals
+### 3. Non-Goals

-Cloud sync
-Real-time collaboration
-AI processing (handled in separate layers later)
+- Cloud sync
+- Real-time collaboration
+- AI processing (handled in separate layers later)
Design/local markdown storage.md-5-5 (1)

5-5: ⚠️ Potential issue | 🟡 Minor

Fix excessive whitespace.

The phrase "offline reliability" contains multiple consecutive spaces that should be reduced to a single space for proper formatting.

📝 Proposed fix
-Most note-taking apps store user data in cloud services or proprietary formats, which reduces user control and offline      reliability.
+Most note-taking apps store user data in cloud services or proprietary formats, which reduces user control and offline reliability.
Design/local markdown storage.md-13-16 (1)

13-16: ⚠️ Potential issue | 🟡 Minor

Improve list formatting and fix typo.

This section has inconsistent formatting and contains a typo:

  1. The list items lack proper markdown bullet syntax
  2. "Serives" should be "Services" on line 15
  3. The colon placement is unconventional

The content would be clearer with standard markdown list formatting.

📝 Proposed fix
-Offline-first experience : 
-    NO Internet,
-    NO Cloud Serives,
-    NO Server
+Offline-first experience:
+- NO Internet
+- NO Cloud Services
+- NO Server
Design/local markdown storage.md-1-3 (1)

1-3: ⚠️ Potential issue | 🟡 Minor

Fix heading hierarchy.

The document starts with an h1 heading (line 1) but then jumps directly to h3 headings (line 3). Markdown heading levels should increment by one level at a time. All section headings should be h2 (##) to maintain proper hierarchy.

📝 Proposed fix

Change all section headings from ### to ##:

 # Local Markdown Storage Architecture
 
-### 1. Problem Statement
+## 1. Problem Statement

Apply this change to all numbered sections (lines 3, 11, 24, 30, 40, 55, 71, 91, 109, 128, 165, 171).

Design/local markdown storage.md-101-107 (1)

101-107: ⚠️ Potential issue | 🟡 Minor

Format example as a proper code block.

The frontmatter example should be enclosed in a markdown code block with proper syntax highlighting for better readability and to prevent it from being interpreted as actual frontmatter.

📝 Proposed fix
-* Example :
----
-title: My First Note
-tags: [gsoc, ideas]
-created: 2026-02-12
----
-This is my note content.
+**Example:**
+
+```markdown
+---
+title: My First Note
+tags: [gsoc, ideas]
+created: 2026-02-12
+---
+This is my note content.
+```
Design/local markdown storage.md-120-126 (1)

120-126: ⚠️ Potential issue | 🟡 Minor

Use consistent heading style.

The document mixes atx-style headings (###) with setext-style headings (underlined with ---). Stick to atx-style headings throughout for consistency, as indicated by the markdown linter.

📝 Proposed fix
 -Avoid reloading everything on every file change
----
-Why this helps:
+
+**Why this helps:**
 
 Faster app startup
Design/local markdown storage.md-128-163 (1)

128-163: ⚠️ Potential issue | 🟡 Minor

Standardize heading styles and improve section markers.

This section uses setext-style heading underlines (dashes) instead of atx-style headings. Use consistent ### or #### heading markers throughout. Also, the horizontal rule separators (-------) are excessive and inconsistent.

📝 Proposed fix for heading styles

Replace setext-style headings with atx-style:

-* TRADE-OFFS
+**Trade-offs:**
 
 Using plain markdown files keeps things simple and transparent...
-Parsing frontmatter on every file adds a small performance cost.
 
 Very large folders can make the first scan slower when the app starts.
----------------------------------------
-* RISKS
+
+**Risks:**
 
 Files may get out of sync if the user edits...
-Performance may degrade for extremely large note collections (10k+ files).
-----------------------------------------
-* MITIGATIONS
+
+**Mitigations:**
 
 Handle file and parse errors gracefully...
-Fall back to a full re-scan if the file watcher misses events.
---------------------------------------------
🧹 Nitpick comments (9)
smart-notes/rag_mvp/embeddings/indexer.py (1)

34-34: Prefix unused distances with underscore.

Per static analysis (Ruff RUF059), distances is unpacked but never used.

Proposed fix
-        distances, indices = self.index.search(query_embedding.reshape(1, -1), k)
+        _distances, indices = self.index.search(query_embedding.reshape(1, -1), k)
smart-notes/rag_mvp/qa_cli.py (1)

85-109: Unguarded demo_embeddings_pipeline() call and missing KeyboardInterrupt handling.

The demo is called unconditionally on every CLI start (Line 87), adding startup latency from model loading even when the user only wants keyword search. Additionally, Ctrl+C during the input loop will print a raw traceback.

♻️ Proposed fix
 if __name__ == "__main__":
-
-    demo_embeddings_pipeline()      # Temporary demo for embeddings pipeline
-
     notes = load_notes()
 
     print("Ask questions about your notes (type 'exit' to quit)\n")
 
-    while True:
-        query = input(">> ").strip()
-
-        if query.lower() == "exit":
-            print("Goodbye 👋")
-            break
-
-        matches = search_notes(query, notes)
-
-        if not matches:
-            print("No relevant notes found.\n")
-        else:
-            print("\n--- Answers ---\n")
-            for i, m in enumerate(matches, 1):
-                print(f"[{i}] From {m['filename']}:")
-                print(m["sentence"])
-                print()
+    try:
+        while True:
+            query = input(">> ").strip()
+
+            if query.lower() == "exit":
+                break
+
+            matches = search_notes(query, notes)
+
+            if not matches:
+                print("No relevant notes found.\n")
+            else:
+                print("\n--- Answers ---\n")
+                for i, m in enumerate(matches, 1):
+                    print(f"[{i}] From {m['filename']}:")
+                    print(m["sentence"])
+                    print()
+    except (KeyboardInterrupt, EOFError):
+        pass
+
+    print("Goodbye 👋")
Design/local markdown storage.md (7)

42-42: Clarify "Editor folder" terminology.

The term "Editor folder" is unclear and doesn't appear elsewhere in the document. Consider using a more descriptive term like "workspace folder" or "notes directory" for consistency with the rest of the architecture description.

📝 Suggested clarification
-The app treats a Editor folder as a "notes workspace."
+The app treats a designated folder as a "notes workspace."

Or be more specific:

-The app treats a Editor folder as a "notes workspace."
+The app designates a folder (e.g., Notes/) as the "notes workspace."

47-47: Fix inconsistent quotation marks and formatting.

The quotation marks around technical terms are inconsistently applied and the formatting needs improvement. Use consistent formatting for technical terms throughout the document.

📝 Proposed fix
-  "in-memory index " and "local db(SQLite) 
+- An in-memory index and local database (SQLite)

65-69: Consider using code block for directory structure.

The directory structure would be more readable and properly formatted using a markdown code block instead of custom arrow symbols.

📝 Suggested improvement
-* STRUCTURE :
-
-   Notes/
-    |--->Note1.md
-    |--->Note2.md
+**Structure:**
+
+```
+Notes/
+├── Note1.md
+└── Note2.md
+```

73-74: Improve grammar and clarity.

The explanation has grammar issues and could be more concise. Also, tool names should be properly capitalized.

📝 Proposed fix
-WHY needed?
-    User may write notes outside the app like (vs code, notepad) then app automatically detect changes  and update by using FOLDER WATCHER
+**Why needed?**
+
+Users may edit notes outside the app (e.g., VS Code, Notepad). The folder watcher automatically detects these external changes and updates the app accordingly.
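
As a concrete illustration of that folder-watcher behaviour (the design doc does not name a library; the watchdog package is an assumption here):

```python
# Sketch: watch a notes workspace for external edits to .md files.
import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class NoteChangeHandler(FileSystemEventHandler):
    def on_modified(self, event):
        if not event.is_directory and event.src_path.endswith(".md"):
            # The app would re-parse this note and update its in-memory index here.
            print(f"Re-index {event.src_path}")

observer = Observer()
observer.schedule(NoteChangeHandler(), "Notes/", recursive=True)
observer.start()
try:
    while True:
        time.sleep(1)
finally:
    observer.stop()
    observer.join()
```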

40-53: Consider documenting the SQLite database schema and purpose.

The architecture mentions both an "in-memory index" and a "local db (SQLite)" but doesn't explain what each stores or why both are needed. Future implementers would benefit from understanding:

  • What data goes in SQLite vs. in-memory?
  • Is SQLite used for persistence of the index?
  • What happens if SQLite DB is corrupted or deleted?

This clarification would strengthen the architecture document.
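
To make the question concrete, one purely hypothetical split would be: SQLite persists per-note metadata across restarts, while the in-memory index is rebuilt from it for fast lookups. A sketch of such a schema (not part of the PR):

```python
import sqlite3

# Hypothetical persistence schema; the design doc does not define one yet.
conn = sqlite3.connect("notes_index.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS notes (
        path     TEXT PRIMARY KEY,   -- relative path inside the workspace
        title    TEXT,
        tags     TEXT,               -- JSON-encoded list from frontmatter
        created  TEXT,
        mtime    REAL                -- used to detect stale entries on startup
    )
""")
conn.commit()
```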


55-63: Consider documenting file encoding and character set assumptions.

The storage model section doesn't specify file encoding expectations. Consider explicitly stating that:

  • All markdown files should be UTF-8 encoded
  • How the app handles files with different encodings or BOM markers
  • How line ending differences across platforms (CRLF vs LF) are handled

This would prevent implementation issues and edge cases.


128-163: Missing discussion: handling binary attachments and images.

The document focuses on markdown text files but doesn't address how images, PDFs, or other binary attachments referenced in markdown notes should be stored and managed. Consider adding:

  • Where attachments are stored relative to markdown files
  • How the app handles broken image links
  • Whether attachments are watched for changes
  • Size limits or recommendations for attachments

This is relevant given the "thousands of files" scalability goal.

Comment on lines +9 to +31
def chunk_text(text: str, max_length: int = 500, overlap: int = 50) -> List[str]:
    if not text:
        return []

    chunks = []
    start = 0
    text = text.strip()

    while start < len(text):
        end = start + max_length
        chunk = text[start:end].strip()

        if chunk:
            chunks.append(chunk)

        if end >= len(text):
            break

        start = end - overlap
        if start < 0:
            start = 0

    return chunks

🛠️ Refactor suggestion | 🟠 Major

Duplicate chunk_text logic exists in embedding_pipeline.py.

EmbeddingPipeline.chunk_text (in smart-notes/rag_mvp/pipelines/embedding_pipeline.py, Lines 14-24) reimplements the same chunking logic with different defaults (max_length=300) and without the strip/empty-check guards. The pipeline should import and reuse this utility instead of maintaining a divergent copy.

🤖 Prompt for AI Agents
In `@smart-notes/rag_mvp/embeddings/chunker.py` around lines 9 - 31, Remove the
duplicated logic by replacing EmbeddingPipeline.chunk_text with a call to the
single utility chunk_text in chunker.py: import chunk_text and use it instead of
reimplementing the loop; ensure you pass the desired max_length (align defaults
or explicitly pass 300 where pipeline relied on that) and keep the
strip/empty-check behavior by relying on the utility's implementation, then
delete the EmbeddingPipeline.chunk_text method to avoid divergence and update
any callers to use the imported chunk_text or the pipeline wrapper that
delegates to it.

Comment on lines +9 to +29
def chunk_text(text: str, max_length: int = 500, overlap: int = 50) -> List[str]:
    if not text:
        return []

    chunks = []
    start = 0
    text = text.strip()

    while start < len(text):
        end = start + max_length
        chunk = text[start:end].strip()

        if chunk:
            chunks.append(chunk)

        if end >= len(text):
            break

        start = end - overlap
        if start < 0:
            start = 0

⚠️ Potential issue | 🟠 Major

Infinite loop when overlap >= max_length.

If overlap >= max_length, start never advances (or regresses), causing an infinite loop. The guard on Line 28 only clamps to 0 but doesn't break the cycle. Add input validation or ensure forward progress.

🐛 Proposed fix
 def chunk_text(text: str, max_length: int = 500, overlap: int = 50) -> List[str]:
-    if not text:
+    if not text or max_length <= 0:
+        return []
+
+    if overlap >= max_length:
+        raise ValueError("overlap must be less than max_length")
-        return []
 
     chunks = []
     start = 0
🤖 Prompt for AI Agents
In `@smart-notes/rag_mvp/embeddings/chunker.py` around lines 9 - 29, The
chunk_text function can hang when overlap >= max_length; add input validation at
the top of chunk_text to ensure 0 <= overlap < max_length (raise ValueError or
clamp overlap to max_length-1), and/or ensure the loop always makes forward
progress by computing next_start = max(end - overlap, start + 1) and assigning
start = next_start (or breaking if next_start <= start). Update the chunk_text
function to enforce the overlap constraint and guarantee start increases each
iteration to prevent infinite loops.
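
Combining the validation and forward-progress suggestions, a corrected chunk_text could look like this sketch (not the committed code):

```python
from typing import List

def chunk_text(text: str, max_length: int = 500, overlap: int = 50) -> List[str]:
    if not text or max_length <= 0:
        return []
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must satisfy 0 <= overlap < max_length")

    text = text.strip()
    chunks, start = [], 0
    while start < len(text):
        end = start + max_length
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end >= len(text):
            break
        # max(..., start + 1) guarantees forward progress on every iteration
        start = max(end - overlap, start + 1)
    return chunks
```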

Comment on lines +23 to +28
    def add(self, embeddings: np.ndarray, chunks: List[str]):
        if len(embeddings) == 0:
            return

        self.index.add(embeddings)
        self.texts.extend(chunks)

⚠️ Potential issue | 🟠 Major

Missing validation: embeddings and chunks length mismatch.

If len(embeddings) != len(chunks), the parallel self.texts list silently drifts out of sync with the FAISS index, causing search to return wrong text for a given vector.

🛡️ Proposed fix
     def add(self, embeddings: np.ndarray, chunks: List[str]):
         if len(embeddings) == 0:
             return
 
+        if len(embeddings) != len(chunks):
+            raise ValueError(
+                f"embeddings length ({len(embeddings)}) != chunks length ({len(chunks)})"
+            )
+
         self.index.add(embeddings)
         self.texts.extend(chunks)
🤖 Prompt for AI Agents
In `@smart-notes/rag_mvp/embeddings/indexer.py` around lines 23 - 28, In the add
method of the indexer (function add), validate that embeddings and chunks have
the same length before calling self.index.add and self.texts.extend; if
len(embeddings) != len(chunks) raise a clear ValueError (or handle by truncating
with an explicit log) so the FAISS index (self.index.add) and the parallel list
(self.texts.extend) cannot get out of sync; locate the checks around the calls
to self.index.add and self.texts.extend and enforce this validation early in the
method.

Comment on lines +37 to +39
        for idx in indices[0]:
            if idx < len(self.texts):
                results.append(self.texts[idx])

⚠️ Potential issue | 🔴 Critical

Bug: FAISS returns -1 for missing neighbors, which passes the bounds check.

When the index has fewer than k entries, FAISS pads results with index -1. In Python, -1 < len(self.texts) is always True, so self.texts[-1] silently returns the last chunk — a wrong result.

🐛 Proposed fix
         for idx in indices[0]:
-            if idx < len(self.texts):
+            if 0 <= idx < len(self.texts):
                 results.append(self.texts[idx])
🤖 Prompt for AI Agents
In `@smart-notes/rag_mvp/embeddings/indexer.py` around lines 37 - 39, FAISS may
return -1 for missing neighbors so the current loop "for idx in indices[0]" with
check "if idx < len(self.texts]" will allow negative indices and return the last
item; change the guard to only accept non-negative, in-range indices (e.g.,
ensure 0 <= idx < len(self.texts)) and coerce numpy types to Python ints if
needed before indexing; update the loop that builds results (which iterates over
indices[0] and accesses self.texts) to skip any idx < 0.
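
A tiny standalone demonstration of the padding behaviour and the corrected guard (assumes faiss-cpu and numpy are installed):

```python
import faiss
import numpy as np

vectors = np.random.rand(2, 8).astype("float32")   # only 2 vectors in the index
index = faiss.IndexFlatL2(8)
index.add(vectors)

_d, ids = index.search(vectors[:1], 5)             # ask for more neighbours than exist
print(ids[0])                                      # padded with -1 for missing neighbours

texts = ["chunk A", "chunk B"]
safe = [texts[int(i)] for i in ids[0] if 0 <= i < len(texts)]
print(safe)                                        # ['chunk A', 'chunk B']
```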

Comment on lines +3 to +4
from sentence_transformers import SentenceTransformer
import faiss

⚠️ Potential issue | 🟠 Major

Hard imports of optional dependencies will crash at import time.

Unlike embedder.py and indexer.py which use try/except to give a friendly error at instantiation, this module fails entirely if sentence_transformers or faiss are missing. This also prevents importing the module for testing or documentation purposes.

Proposed fix — use the existing utility modules

The embeddings/ package already provides Embedder, VectorIndexer, and chunk_text with proper optional-dependency handling. Consider refactoring EmbeddingPipeline to delegate to those modules instead of duplicating their logic and imports.

🤖 Prompt for AI Agents
In `@smart-notes/rag_mvp/pipelines/embedding_pipeline.py` around lines 3 - 4, The
module currently hard-imports sentence_transformers and faiss which will crash
at import time; update EmbeddingPipeline in embedding_pipeline.py to delegate to
the existing embeddings package utilities (Embedder, VectorIndexer, chunk_text)
instead of directly importing sentence_transformers or faiss. Replace direct use
of SentenceTransformer and faiss with calls to Embedder and VectorIndexer (and
use chunk_text for text splitting), so optional-dependency handling in those
classes is reused and importing embedding_pipeline no longer fails when optional
libs are absent.

Comment on lines +4 to +5
#-------------------emedding-pipeline-chunking concept
from rag_mvp.pipelines.embedding_pipeline import EmbeddingPipeline

⚠️ Potential issue | 🔴 Critical

Top-level import of EmbeddingPipeline breaks the CLI when ML dependencies are absent.

This import eagerly loads sentence_transformers and faiss (which are hard-imported in embedding_pipeline.py). If those heavy dependencies aren't installed, the entire CLI — including the dependency-free keyword search — fails to start. Move this import inside demo_embeddings_pipeline().

Also, minor typo: "emedding" → "embedding".

🐛 Proposed fix
-#-------------------emedding-pipeline-chunking concept
-from rag_mvp.pipelines.embedding_pipeline import EmbeddingPipeline
-
 def demo_embeddings_pipeline():
+    # embedding-pipeline-chunking concept
+    from rag_mvp.pipelines.embedding_pipeline import EmbeddingPipeline
+
     pipeline = EmbeddingPipeline()
🤖 Prompt for AI Agents
In `@smart-notes/rag_mvp/qa_cli.py` around lines 4 - 5, Top-level import of
EmbeddingPipeline causes the CLI to fail when ML deps
(sentence_transformers/faiss) are missing; move the import line "from
rag_mvp.pipelines.embedding_pipeline import EmbeddingPipeline" into the
demo_embeddings_pipeline() function so the heavy modules are only loaded when
that demo runs, and correct the comment typo "emedding" → "embedding"; update
any references inside demo_embeddings_pipeline() to use the locally imported
EmbeddingPipeline.

Comment on lines +63 to +80
def search_notes(query, notes):
    results = []

    query_words = [
        word.lower()
        for word in query.split()
        if word.lower() not in QUESTION_WORDS
    ]

    for note in notes:
        sentences = split_sentences(note["content"])
        for sentence in sentences:
            sentence_lower = sentence.lower()
            if any(word in sentence_lower for word in query_words):
                results.append({
                    "filename": note["filename"],
                    "sentence": sentence.strip()
                })

⚠️ Potential issue | 🟠 Major

Substring matching produces false positives.

word in sentence_lower is a substring check, not a word-boundary check. For example, querying "AI" will match sentences containing "wait", "train", "brain", etc.

🐛 Proposed fix — use word-boundary regex
+import re
+
 def search_notes(query, notes):
     results = []
 
     query_words = [
         word.lower()
         for word in query.split()
         if word.lower() not in QUESTION_WORDS
     ]
 
     for note in notes:
         sentences = split_sentences(note["content"])
         for sentence in sentences:
             sentence_lower = sentence.lower()
-            if any(word in sentence_lower for word in query_words):
+            if any(re.search(r'\b' + re.escape(word) + r'\b', sentence_lower) for word in query_words):
                 results.append({
                     "filename": note["filename"],
                     "sentence": sentence.strip()
                 })
🤖 Prompt for AI Agents
In `@smart-notes/rag_mvp/qa_cli.py` around lines 63 - 80, The current search_notes
implementation uses substring matching ("word in sentence_lower") which yields
false positives (e.g., "AI" matching "wait"); update search logic in
search_notes to perform word-boundary, case-insensitive matches by compiling a
regex for each query word using word boundaries (and escaping the query word)
and testing that regex against each sentence (e.g., build re.compile(r'\b' +
re.escape(q) + r'\b', re.IGNORECASE) for each q in query_words) before appending
to results; ensure you reference query_words, split_sentences, and the results
append block when making this change.
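
A quick check of the difference, using the pattern suggested in the fix:

```python
import re

sentence = "We had to wait for the training run to finish."
query_word = "ai"

print(query_word in sentence.lower())     # True: substring match, a false positive
pattern = re.compile(r"\b" + re.escape(query_word) + r"\b", re.IGNORECASE)
print(bool(pattern.search(sentence)))     # False: word-boundary match rejects it
```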

Comment on lines +28 to +44
```bash
python smart-notes/rag_mvp/qa_cli.py



>> what is AI

[1] From test.md:
Artificial Intelligence (AI) is the simulation of human intelligence in machines.


>> what is machine learning
how is machine learning used
difference between AI and ML




⚠️ Potential issue | 🟠 Major

Broken markdown: unclosed code block swallows the rest of the document.

The ```bash block opened at Line 28 is never closed. The example CLI interaction (Lines 33-42) and all subsequent content renders as part of the code block. Add a closing ``` after the run command and format the example interaction separately.

🤖 Prompt for AI Agents
In `@smart-notes/rag_mvp/README.md` around lines 28 - 44, The README has an
unclosed triple-backtick fence started with "```bash" (around the python run
example) causing the rest of the file to be rendered as code; close that fence
immediately after the command invocation (add a terminating "```" right after
the `python smart-notes/rag_mvp/qa_cli.py` line) and move the example CLI
interaction (the prompt/response lines like ">> what is AI" and the output "[1]
From test.md: ...") outside the fenced code block as a separate fenced block or
plain text so the rest of the document renders normally.

Comment on lines +46 to +84

# Smart Notes – RAG MVP (Embeddings & FAISS)

This project is a simple **Retrieval-Augmented Generation (RAG)** pipeline for Smart Notes.
It allows users to store notes, convert them into embeddings, and search relevant notes using vector similarity.

---

## 🚀 Features

- Convert notes into embeddings using Sentence Transformers
- Store and search embeddings using FAISS (CPU)
- CLI tool to ask questions about your notes
- Simple chunking for text files
- Works fully offline after model download

---

## 🧠 Tech Stack

- Python 3.10+
- sentence-transformers
- FAISS (faiss-cpu)
- HuggingFace Transformers

---

## 📁 Project Structure

```bash
smart-notes/
├── rag_mvp/
│ ├── embed.py # Embedding logic
│ ├── index.py # FAISS index creation
│ ├── qa_cli.py # CLI for asking questions
│ └── utils.py # Helper functions
├── notes/ # Put your .txt notes here
├── requirements.txt
└── README.md

⚠️ Potential issue | 🟠 Major

Two separate README documents appear concatenated into one file.

Lines 46-84 introduce a second "Smart Notes – RAG MVP (Embeddings & FAISS)" header with its own features, tech stack, and project structure sections. This should either be merged cohesively into a single README or split into separate files for each MVP track.

🤖 Prompt for AI Agents
In `@smart-notes/rag_mvp/README.md` around lines 46 - 84, The README contains two
concatenated README documents starting at the duplicate header "Smart Notes –
RAG MVP (Embeddings & FAISS)"; fix by either merging the duplicate sections into
a single cohesive README (combine Features, Tech Stack, and Project Structure
under one header and remove repeated text) or split the second document into a
separate file (e.g., README_RAG_MVP.md) and update any top-level README
links/TOC accordingly; ensure the remaining README references the correct
filenames and that the unique header "Smart Notes – RAG MVP (Embeddings &
FAISS)" appears only once.

Comment on lines +75 to +84
```bash
smart-notes/
├── rag_mvp/
│ ├── embed.py # Embedding logic
│ ├── index.py # FAISS index creation
│ ├── qa_cli.py # CLI for asking questions
│ └── utils.py # Helper functions
├── notes/ # Put your .txt notes here
├── requirements.txt
└── README.md

⚠️ Potential issue | 🟠 Major

Documented project structure doesn't match actual files.

The structure references embed.py, index.py, utils.py, and notes/*.txt, but the actual PR adds embeddings/chunker.py, embeddings/embedder.py, embeddings/indexer.py, pipelines/embedding_pipeline.py, and notes/*.md. This will confuse contributors.

🤖 Prompt for AI Agents
In `@smart-notes/rag_mvp/README.md` around lines 75 - 84, Update the README
project structure to match the actual repo layout: replace references to
embed.py, index.py, utils.py, and notes/*.txt with the real modules and paths
such as embeddings/chunker.py, embeddings/embedder.py, embeddings/indexer.py,
pipelines/embedding_pipeline.py, and notes/*.md (or explicitly map the old names
to these new files if you want to preserve conceptual equivalence); ensure the
tree snippet and any text describing where to put notes reflect notes/*.md and
the embeddings/ and pipelines/ directories so contributors are not confused.
