📝 Walkthrough

This PR introduces an embedding-based Retrieval-Augmented Generation (RAG) MVP system for semantic search over markdown notes. It adds text chunking utilities, a sentence-transformer embedder wrapper, a FAISS-backed vector indexer, an orchestrating pipeline, and an interactive CLI for querying notes. Accompanying documentation describes the architecture and workflow.
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant QA_CLI as QA CLI
    participant Pipeline as EmbeddingPipeline
    participant Embedder as Embedder
    participant Indexer as VectorIndexer
    participant FAISS
    User->>QA_CLI: Enter query
    QA_CLI->>Pipeline: process_notes(notes_text)
    Pipeline->>Pipeline: chunk_text(text)
    Pipeline->>Embedder: embed(chunks)
    Embedder-->>Pipeline: embeddings (NumPy array)
    Pipeline->>Indexer: add(embeddings, chunks)
    Indexer->>FAISS: index.add(vectors)
    FAISS-->>Indexer: index built
    User->>QA_CLI: semantic_search(query)
    QA_CLI->>Pipeline: semantic_search(query)
    Pipeline->>Embedder: embed(query)
    Embedder-->>Pipeline: query_embedding
    Pipeline->>Indexer: search(query_embedding, k)
    Indexer->>FAISS: search(query_vector)
    FAISS-->>Indexer: top-k indices
    Indexer-->>Pipeline: matching_chunks
    Pipeline-->>QA_CLI: results
    QA_CLI-->>User: display matches
```

Estimated Code Review Effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 11
🤖 Fix all issues with AI agents
In `@smart-notes/rag_mvp/embeddings/chunker.py`:
- Around line 9-29: The chunk_text function can enter an infinite loop when
overlap >= max_length; add input validation at the start of chunk_text to ensure
max_length is a positive integer and 0 <= overlap < max_length (or raise a
ValueError with a clear message), and reject or normalize invalid inputs before
the while loop; reference the chunk_text function and the start/end/overlap
logic so the check runs immediately after parameters are received and before any
slicing logic.
In `@smart-notes/rag_mvp/embeddings/indexer.py`:
- Around line 34-39: The loop over FAISS search results in indexer.py
incorrectly treats FAISS sentinel -1 as a valid index; update the post-search
filtering (after self.index.search(...) that assigns distances, indices) to
ignore any idx values that are negative (e.g., check idx >= 0) and also ensure
idx < len(self.texts) before appending self.texts[idx], so negative padded
results are not used; reference the variables indices, distances, and self.texts
in the change.
In `@smart-notes/rag_mvp/pipelines/embedding_pipeline.py`:
- Around line 14-24: The chunk_text implementation in embedding_pipeline.py
duplicates the logic from embeddings/chunker.py and lacks the end>=len(text)
guard (causing an infinite loop when overlap >= max_length); replace the local
method with a call to the canonical chunk_text from embeddings.chunker (import
chunk_text and delegate to it) so you reuse the tested implementation and its
break guard; if for some reason you cannot import, modify the existing
chunk_text to include the same guard (check if end >= len(text) then append
final chunk and break) and keep the same signature (chunk_text(self, text,
max_length=300, overlap=50)) so callers remain compatible.
- Around line 3-4: This file directly imports SentenceTransformer and faiss
which will raise raw ModuleNotFoundError; wrap those imports in try/except
ImportError and raise a helpful message matching sibling modules (e.g.,
instructing to pip install sentence-transformers/faiss) or fallback gracefully.
Also refactor the pipeline to reuse existing embeddings package code instead of
duplicating logic: import and call Embedder, VectorIndexer, and chunk_text
(instead of reimplementing embedding/indexing/chunking) so embedding_pipeline.py
delegates to those classes/functions for model loading, embedding generation,
and FAISS indexing.
- Around line 41-47: The semantic_search method should guard against invalid
indices returned by self.index.search (e.g., -1 or out-of-range) before indexing
into self.chunks; update semantic_search (and the handling of distances,
indices) to iterate over indices[0] and distances[0], filter out any index < 0
or >= len(self.chunks), and build results and corresponding distances only from
valid entries (returning fewer than top_k if needed) so attempts to access
self.chunks[i] never use invalid positions.
- Line 10: The SentenceTransformer instantiation uses a hardcoded Windows-only
cache folder; update the constructor call in embedding_pipeline (the place where
self.model = SentenceTransformer(...)) to remove the fixed "D:/models_cache" and
instead compute a platform-agnostic cache directory: prefer honoring environment
vars like HF_HOME or SENTENCE_TRANSFORMERS_HOME (via os.getenv) and fall back to
a project-relative cache (e.g., os.path.join(project_root, ".cache",
"sentence_transformers")) or the default by omitting cache_folder; ensure you
import os and construct the path using os.path.join so the code is portable
across OSes.
In `@smart-notes/rag_mvp/qa_cli.py`:
- Around line 85-87: The demo_embeddings_pipeline() is invoked unconditionally
in the __main__ block causing the SentenceTransformer to load on every CLI
start; make the demo opt-in by adding a CLI flag (e.g., --embeddings-demo or
--demo-embeddings) via argparse/typer and only call demo_embeddings_pipeline()
when that flag is set (leave existing keyword-based Q&A flow unchanged), or
remove the call entirely if you prefer no demo; update the __main__ block to
check the new flag before invoking demo_embeddings_pipeline() so startup latency
is avoided unless the user explicitly requests the demo.
- Line 39: NOTES_DIR is a relative path that depends on the current working
directory; change it to be resolved relative to the script location by
constructing NOTES_DIR from the module file path (e.g., using
Path(__file__).parent / "notes") so functions that reference NOTES_DIR always
point to the project's notes directory regardless of CWD; update any imports to
use pathlib.Path if necessary and ensure downstream code expects a Path or str
consistently (reference: NOTES_DIR in qa_cli.py).
- Line 5: The import in qa_cli.py uses an absolute path that will fail when run
directly; update the import to a relative import (e.g., replace "from
rag_mvp.pipelines.embedding_pipeline import EmbeddingPipeline" with a relative
import like "from .pipelines.embedding_pipeline import EmbeddingPipeline") so
EmbeddingPipeline is resolved when running python smart-notes/rag_mvp/qa_cli.py,
or alternatively add an __init__.py to smart-notes/rag_mvp and adjust run
instructions to execute the package form; ensure the change targets the import
line in qa_cli.py referencing EmbeddingPipeline.
In `@smart-notes/rag_mvp/README.md`:
- Around line 28-44: The README.md has an unclosed code fence around the CLI
example (the block starting with the ```bash before "python
smart-notes/rag_mvp/qa_cli.py"), so close that code block immediately after the
shown CLI output (after the lines that show the prompts and responses like ">>
what is AI" and the subsequent output) to prevent the rest of the document from
rendering as code; locate the code fence in the section containing the qa_cli.py
example and add the closing ``` on its own line before the next section heading
or normal text.
- Around line 75-84: Update the README project structure to match the actual
files introduced in the PR: replace references to embed.py, index.py, and
utils.py with the new modules embeddings/chunker.py, embeddings/embedder.py,
embeddings/indexer.py and include pipelines/embedding_pipeline.py under the
rag_mvp/ tree; also close the open Markdown code fence at the end of the example
block. Ensure the README lists the correct filenames and paths exactly as in the
diff (embeddings/chunker.py, embeddings/embedder.py, embeddings/indexer.py,
pipelines/embedding_pipeline.py) and that the triple backtick that starts the
code block is properly terminated.
🧹 Nitpick comments (1)

.gitignore (1)

1-1: Consider adding standard Python ignore patterns. This `.gitignore` only covers the `notes/` directory. For a Python project, you'd typically also ignore `__pycache__/`, `*.pyc`, `.env`, `*.egg-info/`, virtual environment directories, and any local model cache directories.
```python
def chunk_text(text: str, max_length: int = 500, overlap: int = 50) -> List[str]:
    if not text:
        return []

    chunks = []
    start = 0
    text = text.strip()

    while start < len(text):
        end = start + max_length
        chunk = text[start:end].strip()

        if chunk:
            chunks.append(chunk)

        if end >= len(text):
            break

        start = end - overlap
        if start < 0:
            start = 0
```
Infinite loop when overlap >= max_length.
If overlap >= max_length, start never advances (it gets clamped to 0 or stays the same), causing an infinite loop. Add input validation.
Proposed fix:

```diff
 def chunk_text(text: str, max_length: int = 500, overlap: int = 50) -> List[str]:
     if not text:
         return []
+    if overlap >= max_length:
+        raise ValueError("overlap must be less than max_length")
     chunks = []
     start = 0
```
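Putting the suggested validation together, a self-contained sketch of a guarded chunker (illustrative only — the signature matches the snippet above, the rest is an assumption):

```python
from typing import List

def chunk_text(text: str, max_length: int = 500, overlap: int = 50) -> List[str]:
    # Validate parameters up front so the loop below is guaranteed to advance.
    if max_length <= 0:
        raise ValueError("max_length must be a positive integer")
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must satisfy 0 <= overlap < max_length")
    if not text:
        return []

    chunks: List[str] = []
    start = 0
    text = text.strip()
    while start < len(text):
        end = start + max_length
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end >= len(text):
            break
        # overlap < max_length guarantees start strictly increases here
        start = end - overlap
    return chunks
```

Termination follows directly from the validation: each iteration advances `start` by `max_length - overlap > 0`.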
```python
distances, indices = self.index.search(query_embedding.reshape(1, -1), k)
results = []

for idx in indices[0]:
    if idx < len(self.texts):
        results.append(self.texts[idx])
```
Bug: FAISS returns -1 for unfilled result slots, which passes the bounds check.
When the index has fewer vectors than k, FAISS pads results with index -1. Since -1 < len(self.texts) is always True in Python, this silently returns self.texts[-1] (the last chunk) instead of being filtered out.
Proposed fix:

```diff
-distances, indices = self.index.search(query_embedding.reshape(1, -1), k)
+_distances, indices = self.index.search(query_embedding.reshape(1, -1), k)
 results = []
 for idx in indices[0]:
-    if idx < len(self.texts):
+    if 0 <= idx < len(self.texts):
         results.append(self.texts[idx])
```
🧰 Tools — 🪛 Ruff (0.15.0)

[warning] 34-34: Unpacked variable `distances` is never used. Prefix it with an underscore or any other dummy variable pattern. (RUF059)
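The sentinel handling can be isolated into a small helper; this sketch mirrors the proposed fix (names are illustrative, and FAISS itself is not needed to demonstrate the filtering):

```python
import numpy as np

def filter_search_results(indices: np.ndarray, texts: list) -> list:
    """Map a FAISS result row to stored texts, skipping -1 padding.

    FAISS pads each result row with -1 when the index holds fewer than k
    vectors; a plain `idx < len(texts)` check would wrongly resolve those
    sentinels to texts[-1].
    """
    results = []
    for idx in indices[0]:
        if 0 <= idx < len(texts):  # drop -1 sentinels and out-of-range ids
            results.append(texts[idx])
    return results
```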
```python
from sentence_transformers import SentenceTransformer
import faiss
```
🛠️ Refactor suggestion | 🟠 Major
Direct imports without graceful error handling, unlike sibling modules.
embedder.py and indexer.py use try/except ImportError to provide helpful messages when dependencies are missing. This file imports SentenceTransformer and faiss directly, which will produce a raw ModuleNotFoundError instead.
More broadly, this pipeline class reimplements functionality already provided by Embedder, VectorIndexer, and chunk_text from the embeddings/ package. Consider reusing those modules instead of duplicating logic.
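One way to match the sibling modules' behavior without repeating the try/except twice is a small import helper — a sketch, where `require` and the pip package names are assumptions rather than existing project code:

```python
import importlib

def require(module: str, pip_name: str):
    """Import `module`, or raise an ImportError naming the pip package to install."""
    try:
        return importlib.import_module(module)
    except ImportError as exc:
        raise ImportError(
            f"'{module}' is required by the embedding pipeline; "
            f"install it with: pip install {pip_name}"
        ) from exc

# Hypothetical usage at the top of embedding_pipeline.py:
# SentenceTransformer = require("sentence_transformers", "sentence-transformers").SentenceTransformer
# faiss = require("faiss", "faiss-cpu")
```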
```python
class EmbeddingPipeline:
    def __init__(self, model_name="all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name, cache_folder="D:/models_cache")
```
Hardcoded Windows-specific cache path — breaks portability.
cache_folder="D:/models_cache" will fail on non-Windows systems and on other developers' machines. Remove this or use a platform-agnostic default (e.g., a project-relative directory or respect HF_HOME/SENTENCE_TRANSFORMERS_HOME env vars).
Proposed fix:

```diff
-        self.model = SentenceTransformer(model_name, cache_folder="D:/models_cache")
+        self.model = SentenceTransformer(model_name)
```
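A portable replacement could resolve the cache directory from the environment first — a sketch; the function name and fallback location are suggestions, not existing code:

```python
import os

def resolve_cache_dir(project_root: str) -> str:
    """Pick a sentence-transformers cache directory portably.

    Honors SENTENCE_TRANSFORMERS_HOME, then HF_HOME, then falls back to a
    project-relative .cache directory. (Omitting cache_folder entirely and
    letting the library choose its default is also a valid option.)
    """
    for var in ("SENTENCE_TRANSFORMERS_HOME", "HF_HOME"):
        value = os.getenv(var)
        if value:
            return value
    return os.path.join(project_root, ".cache", "sentence_transformers")

# Hypothetical usage in the constructor:
# self.model = SentenceTransformer(model_name, cache_folder=resolve_cache_dir(root))
```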
```python
def chunk_text(self, text, max_length=300, overlap=50):
    chunks = []
    start = 0

    while start < len(text):
        end = start + max_length
        chunk = text[start:end]
        chunks.append(chunk)
        start = end - overlap

    return chunks
```
🛠️ Refactor suggestion | 🟠 Major
Duplicate chunk_text — also susceptible to infinite loop.
This is a copy of the logic in embeddings/chunker.py with a different default max_length (300 vs 500) and without the end >= len(text) break guard. If overlap >= max_length, this version loops forever.
Reuse chunk_text from embeddings/chunker.py instead.
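If importing the canonical chunker is not an option, the in-place fix described above amounts to adding the break guard — a sketch with the same signature (class trimmed to the relevant method):

```python
class EmbeddingPipeline:
    def chunk_text(self, text, max_length=300, overlap=50):
        # Same signature as before; the end >= len(text) guard (missing in
        # the duplicated version) ensures the loop terminates.
        if overlap >= max_length:
            raise ValueError("overlap must be less than max_length")
        chunks = []
        start = 0
        while start < len(text):
            end = start + max_length
            chunks.append(text[start:end])
            if end >= len(text):
                break
            start = end - overlap
        return chunks
```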
```python
import re

#-------------------emedding-pipeline-chunking concept
from rag_mvp.pipelines.embedding_pipeline import EmbeddingPipeline
```
🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
# Check package configuration and structure
fd -t f "__init__.py" smart-notes/
fd -t f "setup.py|pyproject.toml|setup.cfg" smart-notes/
```

Repository: AOSSIE-Org/Info — length of output: 143

🏁 Script executed:

```bash
#!/bin/bash
# Check for __init__.py at rag_mvp level
ls -la smart-notes/rag_mvp/__init__.py 2>&1
# Find README
fd -t f "README|readme" . --max-depth 3
# Check for config files at repo root
fd -t f "setup.py|pyproject.toml|setup.cfg|poetry.lock" . --max-depth 1
```

Length of output: 181

🏁 Script executed:

```bash
#!/bin/bash
# Check the rag_mvp README which is most relevant
cat smart-notes/rag_mvp/README.md
# Also check root README for context
echo "=== ROOT README ==="
head -100 ./README.md
```

Length of output: 5181
Absolute import will fail when script is run as documented.
from rag_mvp.pipelines.embedding_pipeline import EmbeddingPipeline requires rag_mvp to be a package discoverable from sys.path. Running python smart-notes/rag_mvp/qa_cli.py from the repo root (as the README suggests) adds only smart-notes/rag_mvp to sys.path, not smart-notes. Additionally, there is no __init__.py at smart-notes/rag_mvp/, so rag_mvp is not recognized as a package. This will fail with ModuleNotFoundError.
Use relative imports (from .pipelines.embedding_pipeline import ...), add an __init__.py at smart-notes/rag_mvp/, or update the execution instructions to use a method that properly configures the module path.
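The path-bootstrap variant of the fix can be expressed as a helper — names are illustrative; the key point is that the directory containing `rag_mvp` (i.e. `smart-notes/`) must be on `sys.path`:

```python
import sys
from pathlib import Path

def add_package_root(script_file: str) -> str:
    """Prepend the grandparent of `script_file` to sys.path.

    For smart-notes/rag_mvp/qa_cli.py this is smart-notes/, which makes
    `from rag_mvp.pipelines.embedding_pipeline import EmbeddingPipeline`
    resolvable when the script is run directly.
    """
    root = str(Path(script_file).resolve().parent.parent)
    if root not in sys.path:
        sys.path.insert(0, root)
    return root
```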
```python
    "of", "to", "in", "on", "for"
}

NOTES_DIR = "notes"
```
NOTES_DIR is a relative path — behavior depends on the working directory.
"notes" resolves relative to the CWD, not relative to the script or project root. This will silently find no notes if the user runs the CLI from a different directory.
Proposed fix — resolve relative to the script location:

```diff
-NOTES_DIR = "notes"
+NOTES_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "..", "..", "notes")
```
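The pathlib version of the fix reduces to a helper that anchors on a module file — names are illustrative:

```python
from pathlib import Path

def notes_dir_for(module_file: str) -> Path:
    """Resolve a notes/ directory that sits next to the given module file,
    so lookups do not depend on the caller's current working directory."""
    return Path(module_file).resolve().parent / "notes"

# Hypothetical usage in qa_cli.py:
# NOTES_DIR = notes_dir_for(__file__)
```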
```python
if __name__ == "__main__":

    demo_embeddings_pipeline()  # Temporary demo for embeddings pipeline
```
demo_embeddings_pipeline() runs unconditionally on every CLI start.
This forces the SentenceTransformer model to load (and potentially download) every time a user launches the CLI, even if they only want the keyword-based Q&A. This adds significant startup latency. Consider making the embedding demo opt-in (e.g., via a CLI flag) or removing it from the default flow.
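An argparse-based opt-in flag could look like this — the flag name `--demo-embeddings` is a suggestion, not part of the existing CLI:

```python
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Smart Notes Q&A CLI")
    parser.add_argument(
        "--demo-embeddings",
        action="store_true",
        help="run the (slow) embedding-pipeline demo before the Q&A loop",
    )
    return parser.parse_args(argv)

# Hypothetical __main__ block:
# args = parse_args()
# if args.demo_embeddings:
#     demo_embeddings_pipeline()  # only loads SentenceTransformer on request
```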
README excerpt (note the opening ```bash fence is never closed):

````markdown
```bash
python smart-notes/rag_mvp/qa_cli.py

>> what is AI

[1] From test.md:
Artificial Intelligence (AI) is the simulation of human intelligence in machines.

>> what is machine learning
how is machine learning used
difference between AI and ML
````
Malformed Markdown — unclosed code block bleeds into the rest of the document.
The code block opened at line 28 is never properly closed. The example CLI output (lines 33–43) and everything after it gets swallowed into the code fence, making the second half of the README render as a raw code block rather than formatted documentation.
Close the code block after the CLI example and before the second section heading.
README excerpt (again with an unclosed ```bash fence):

````markdown
```bash
smart-notes/
├── rag_mvp/
│   ├── embed.py        # Embedding logic
│   ├── index.py        # FAISS index creation
│   ├── qa_cli.py       # CLI for asking questions
│   └── utils.py        # Helper functions
├── notes/              # Put your .txt notes here
├── requirements.txt
└── README.md
````
Project structure in README doesn't match actual file layout.
The documented structure references embed.py, index.py, and utils.py, but the actual PR introduces embeddings/chunker.py, embeddings/embedder.py, embeddings/indexer.py, and pipelines/embedding_pipeline.py. Also, the trailing code block is never closed.
Please update the project structure to reflect the real file layout and close the code fence.
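Based on the filenames cited in this review, a corrected structure block might read (placement of `notes/` and `requirements.txt` is assumed from the original tree):

```text
smart-notes/
├── rag_mvp/
│   ├── embeddings/
│   │   ├── chunker.py            # Text chunking utilities
│   │   ├── embedder.py           # Sentence-transformer wrapper
│   │   └── indexer.py            # FAISS-backed vector index
│   ├── pipelines/
│   │   └── embedding_pipeline.py # Orchestrating pipeline
│   └── qa_cli.py                 # CLI for asking questions
├── notes/                        # Put your notes here
├── requirements.txt
└── README.md
```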