Embedding pipeline chunking #38

Open
tubajavedd wants to merge 2 commits into AOSSIE-Org:main from tubajavedd:embedding-pipeline-chunking

Conversation


@tubajavedd tubajavedd commented Feb 10, 2026

Smart Notes – RAG MVP (Embeddings & FAISS)

This project is a simple Retrieval-Augmented Generation (RAG) pipeline for Smart Notes.
It lets users store notes, convert them into embeddings, and retrieve relevant notes using vector similarity.
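
As a rough illustration of that flow, a minimal sketch using the libraries listed below (the note texts and query are just examples):

```python
from sentence_transformers import SentenceTransformer
import faiss

notes = [
    "Artificial Intelligence (AI) is the simulation of human intelligence in machines.",
    "FAISS is a library for efficient similarity search over dense vectors.",
]

# embed the notes (all-MiniLM-L6-v2 produces 384-dim float32 vectors)
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(notes)

# build an exact L2 index and add the note vectors
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

# embed the query and retrieve the closest note
distances, indices = index.search(model.encode(["what is AI"]), 1)
print(notes[indices[0][0]])
```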


🚀 Features

  • Convert notes into embeddings using Sentence Transformers
  • Store and search embeddings using FAISS (CPU)
  • CLI tool to ask questions about your notes
  • Simple chunking for text files
  • Works fully offline after model download

🧠 Tech Stack

  • Python 3.10+
  • sentence-transformers
  • FAISS (faiss-cpu)
  • HuggingFace Transformers

📁 Project Structure

smart-notes/
├── rag_mvp/
│   ├── embed.py        # Embedding logic
│   ├── index.py        # FAISS index creation
│   ├── qa_cli.py       # CLI for asking questions
│   └── utils.py        # Helper functions
├── notes/              # Put your .txt notes here
├── requirements.txt
└── README.md



## Summary by CodeRabbit

## Release Notes

* **New Features**
  * Added semantic search functionality for querying notes using vector embeddings
  * Introduced interactive Q&A command-line interface for local markdown notes
  * Added text chunking with configurable overlap for document processing

* **Documentation**
  * Updated README documenting RAG MVP variants and embedding-based workflow

* **Chores**
  * Updated .gitignore to exclude notes directory


Contributor

coderabbitai bot commented Feb 10, 2026

📝 Walkthrough

This PR introduces an embedding-based Retrieval-Augmented Generation (RAG) MVP system for semantic search over markdown notes. It adds text chunking utilities, a sentence-transformer embedder wrapper, a FAISS-backed vector indexer, an orchestrating pipeline, and an interactive CLI for querying notes. Accompanying documentation describes the architecture and workflow.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Git Configuration**<br>`.gitignore` | Added pattern to ignore the `notes/` directory. |
| **Documentation**<br>`smart-notes/rag_mvp/README.md` | Added comprehensive documentation covering RAG MVP variants (Local Q&A and embedding-based), features, question processing workflow, sample CLI run, embedding stack, and project structure. |
| **Embeddings Infrastructure**<br>`smart-notes/rag_mvp/embeddings/chunker.py`, `smart-notes/rag_mvp/embeddings/embedder.py`, `smart-notes/rag_mvp/embeddings/indexer.py` | Introduced three new modules: `chunker.py` with a `chunk_text()` function for overlapping text chunks; `embedder.py` with an `Embedder` class wrapping sentence-transformers for text-to-embedding conversion; `indexer.py` with a `VectorIndexer` class providing FAISS-backed similarity search with configurable dimension and top-k retrieval. |
| **Pipeline & CLI**<br>`smart-notes/rag_mvp/pipelines/embedding_pipeline.py`, `smart-notes/rag_mvp/qa_cli.py` | Added `EmbeddingPipeline` orchestrating chunking, embedding generation, and FAISS index construction for semantic search; added `qa_cli.py`, an interactive CLI for loading markdown notes, processing queries, and returning relevant text chunks via semantic search or keyword matching. |

Sequence Diagram

sequenceDiagram
    participant User
    participant QA_CLI as QA CLI
    participant Pipeline as EmbeddingPipeline
    participant Embedder as Embedder
    participant Indexer as VectorIndexer
    participant FAISS

    User->>QA_CLI: Enter query
    QA_CLI->>Pipeline: process_notes(notes_text)
    Pipeline->>Pipeline: chunk_text(text)
    Pipeline->>Embedder: embed(chunks)
    Embedder-->>Pipeline: embeddings (NumPy array)
    Pipeline->>Indexer: add(embeddings, chunks)
    Indexer->>FAISS: index.add(vectors)
    FAISS-->>Indexer: index built
    
    User->>QA_CLI: semantic_search(query)
    QA_CLI->>Pipeline: semantic_search(query)
    Pipeline->>Embedder: embed(query)
    Embedder-->>Pipeline: query_embedding
    Pipeline->>Indexer: search(query_embedding, k)
    Indexer->>FAISS: search(query_vector)
    FAISS-->>Indexer: top-k indices
    Indexer-->>Pipeline: matching_chunks
    Pipeline-->>QA_CLI: results
    QA_CLI-->>User: display matches
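
Translated into code, the flow in the diagram would look roughly like the following; the method names are taken from the diagram and walkthrough above, so treat this as an illustrative sketch rather than the exact API:

```python
from rag_mvp.pipelines.embedding_pipeline import EmbeddingPipeline

pipeline = EmbeddingPipeline()

# build phase: chunk the notes, embed the chunks, add them to the FAISS index
pipeline.process_notes(
    "Artificial Intelligence (AI) is the simulation of human intelligence in machines."
)

# query phase: embed the query and return the top-k matching chunks
print(pipeline.semantic_search("what is AI", top_k=1))
```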

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Hops with joy
Chunks and embeddings in perfect sync,
FAISS indexes dance without a blink!
Semantic search now guides the way,
My fuzzy notes won't go astray! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is insufficient; the required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Embedding pipeline chunking' is partially related to the changeset; it refers to a real component (chunking functionality in the embedding pipeline) but misses the broader scope of the PR, which introduces a complete RAG MVP system with embedding pipeline, FAISS indexer, CLI tool, and documentation. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
  • 🧪 Generate unit tests (beta)
    • Create PR with unit tests
    • Post copyable unit tests in a comment

Tip

Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 11

🤖 Fix all issues with AI agents
In `@smart-notes/rag_mvp/embeddings/chunker.py`:
- Around line 9-29: The chunk_text function can enter an infinite loop when
overlap >= max_length; add input validation at the start of chunk_text to ensure
max_length is a positive integer and 0 <= overlap < max_length (or raise a
ValueError with a clear message), and reject or normalize invalid inputs before
the while loop; reference the chunk_text function and the start/end/overlap
logic so the check runs immediately after parameters are received and before any
slicing logic.

In `@smart-notes/rag_mvp/embeddings/indexer.py`:
- Around line 34-39: The loop over FAISS search results in indexer.py
incorrectly treats FAISS sentinel -1 as a valid index; update the post-search
filtering (after self.index.search(...) that assigns distances, indices) to
ignore any idx values that are negative (e.g., check idx >= 0) and also ensure
idx < len(self.texts) before appending self.texts[idx], so negative padded
results are not used; reference the variables indices, distances, and self.texts
in the change.

In `@smart-notes/rag_mvp/pipelines/embedding_pipeline.py`:
- Around line 14-24: The chunk_text implementation in embedding_pipeline.py
duplicates the logic from embeddings/chunker.py and lacks the end>=len(text)
guard (causing an infinite loop when overlap >= max_length); replace the local
method with a call to the canonical chunk_text from embeddings.chunker (import
chunk_text and delegate to it) so you reuse the tested implementation and its
break guard; if for some reason you cannot import, modify the existing
chunk_text to include the same guard (check if end >= len(text) then append
final chunk and break) and keep the same signature (chunk_text(self, text,
max_length=300, overlap=50)) so callers remain compatible.
- Around line 3-4: This file directly imports SentenceTransformer and faiss
which will raise raw ModuleNotFoundError; wrap those imports in try/except
ImportError and raise a helpful message matching sibling modules (e.g.,
instructing to pip install sentence-transformers/faiss) or fallback gracefully.
Also refactor the pipeline to reuse existing embeddings package code instead of
duplicating logic: import and call Embedder, VectorIndexer, and chunk_text
(instead of reimplementing embedding/indexing/chunking) so embedding_pipeline.py
delegates to those classes/functions for model loading, embedding generation,
and FAISS indexing.
- Around line 41-47: The semantic_search method should guard against invalid
indices returned by self.index.search (e.g., -1 or out-of-range) before indexing
into self.chunks; update semantic_search (and the handling of distances,
indices) to iterate over indices[0] and distances[0], filter out any index < 0
or >= len(self.chunks), and build results and corresponding distances only from
valid entries (returning fewer than top_k if needed) so attempts to access
self.chunks[i] never use invalid positions.
- Line 10: The SentenceTransformer instantiation uses a hardcoded Windows-only
cache folder; update the constructor call in embedding_pipeline (the place where
self.model = SentenceTransformer(...)) to remove the fixed "D:/models_cache" and
instead compute a platform-agnostic cache directory: prefer honoring environment
vars like HF_HOME or SENTENCE_TRANSFORMERS_HOME (via os.getenv) and fall back to
a project-relative cache (e.g., os.path.join(project_root, ".cache",
"sentence_transformers")) or the default by omitting cache_folder; ensure you
import os and construct the path using os.path.join so the code is portable
across OSes.

In `@smart-notes/rag_mvp/qa_cli.py`:
- Around line 85-87: The demo_embeddings_pipeline() is invoked unconditionally
in the __main__ block causing the SentenceTransformer to load on every CLI
start; make the demo opt-in by adding a CLI flag (e.g., --embeddings-demo or
--demo-embeddings) via argparse/typer and only call demo_embeddings_pipeline()
when that flag is set (leave existing keyword-based Q&A flow unchanged), or
remove the call entirely if you prefer no demo; update the __main__ block to
check the new flag before invoking demo_embeddings_pipeline() so startup latency
is avoided unless the user explicitly requests the demo.
- Line 39: NOTES_DIR is a relative path that depends on the current working
directory; change it to be resolved relative to the script location by
constructing NOTES_DIR from the module file path (e.g., using
Path(__file__).parent / "notes") so functions that reference NOTES_DIR always
point to the project's notes directory regardless of CWD; update any imports to
use pathlib.Path if necessary and ensure downstream code expects a Path or str
consistently (reference: NOTES_DIR in qa_cli.py).
- Line 5: The import in qa_cli.py uses an absolute path that will fail when run
directly; update the import to a relative import (e.g., replace "from
rag_mvp.pipelines.embedding_pipeline import EmbeddingPipeline" with a relative
import like "from .pipelines.embedding_pipeline import EmbeddingPipeline") so
EmbeddingPipeline is resolved when running python smart-notes/rag_mvp/qa_cli.py,
or alternatively add an __init__.py to smart-notes/rag_mvp and adjust run
instructions to execute the package form; ensure the change targets the import
line in qa_cli.py referencing EmbeddingPipeline.

In `@smart-notes/rag_mvp/README.md`:
- Around line 28-44: The README.md has an unclosed code fence around the CLI
example (the block starting with the ```bash before "python
smart-notes/rag_mvp/qa_cli.py"), so close that code block immediately after the
shown CLI output (after the lines that show the prompts and responses like ">>
what is AI" and the subsequent output) to prevent the rest of the document from
rendering as code; locate the code fence in the section containing the qa_cli.py
example and add the closing ``` on its own line before the next section heading
or normal text.
- Around line 75-84: Update the README project structure to match the actual
files introduced in the PR: replace references to embed.py, index.py, and
utils.py with the new modules embeddings/chunker.py, embeddings/embedder.py,
embeddings/indexer.py and include pipelines/embedding_pipeline.py under the
rag_mvp/ tree; also close the open Markdown code fence at the end of the example
block. Ensure the README lists the correct filenames and paths exactly as in the
diff (embeddings/chunker.py, embeddings/embedder.py, embeddings/indexer.py,
pipelines/embedding_pipeline.py) and that the triple backtick that starts the
code block is properly terminated.
🧹 Nitpick comments (1)
.gitignore (1)

1-1: Consider adding standard Python ignore patterns.

This .gitignore only covers the notes/ directory. For a Python project, you'd typically also ignore __pycache__/, *.pyc, .env, *.egg-info/, virtual environment directories, and any local model cache directories.
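
For illustration, a .gitignore along the lines this comment suggests (every entry beyond notes/ is the reviewer's suggestion, not part of the PR):

```gitignore
notes/

# standard Python ignores
__pycache__/
*.pyc
.env
*.egg-info/
venv/
.venv/

# local model caches
.cache/
models_cache/
```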

Comment on lines +9 to +29
def chunk_text(text: str, max_length: int = 500, overlap: int = 50) -> List[str]:
    if not text:
        return []

    chunks = []
    start = 0
    text = text.strip()

    while start < len(text):
        end = start + max_length
        chunk = text[start:end].strip()

        if chunk:
            chunks.append(chunk)

        if end >= len(text):
            break

        start = end - overlap
        if start < 0:
            start = 0

⚠️ Potential issue | 🟠 Major

Infinite loop when overlap >= max_length.

If overlap >= max_length, start never advances (it gets clamped to 0 or stays the same), causing an infinite loop. Add input validation.

Proposed fix
 def chunk_text(text: str, max_length: int = 500, overlap: int = 50) -> List[str]:
     if not text:
         return []
+    if overlap >= max_length:
+        raise ValueError("overlap must be less than max_length")

     chunks = []
     start = 0
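
For reference, a sketch of the whole function with that validation applied (the body is otherwise unchanged, except that the negative-start clamp becomes redundant and is dropped):

```python
from typing import List

def chunk_text(text: str, max_length: int = 500, overlap: int = 50) -> List[str]:
    if not text:
        return []
    if max_length <= 0:
        raise ValueError("max_length must be a positive integer")
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must satisfy 0 <= overlap < max_length")

    chunks = []
    start = 0
    text = text.strip()

    while start < len(text):
        end = start + max_length
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end >= len(text):
            break
        # overlap < max_length guarantees start strictly advances each pass
        start = end - overlap

    return chunks
```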

Comment on lines +34 to +39
distances, indices = self.index.search(query_embedding.reshape(1, -1), k)
results = []

for idx in indices[0]:
    if idx < len(self.texts):
        results.append(self.texts[idx])

⚠️ Potential issue | 🔴 Critical

Bug: FAISS returns -1 for unfilled result slots, which passes the bounds check.

When the index has fewer vectors than k, FAISS pads results with index -1. Since -1 < len(self.texts) is always True in Python, this silently returns self.texts[-1] (the last chunk) instead of being filtered out.

Proposed fix
-        distances, indices = self.index.search(query_embedding.reshape(1, -1), k)
+        _distances, indices = self.index.search(query_embedding.reshape(1, -1), k)
         results = []
 
         for idx in indices[0]:
-            if idx < len(self.texts):
+            if 0 <= idx < len(self.texts):
                 results.append(self.texts[idx])
🧰 Tools
🪛 Ruff (0.15.0)

[warning] 34-34: Unpacked variable distances is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


Comment on lines +3 to +4
from sentence_transformers import SentenceTransformer
import faiss

🛠️ Refactor suggestion | 🟠 Major

Direct imports without graceful error handling, unlike sibling modules.

embedder.py and indexer.py use try/except ImportError to provide helpful messages when dependencies are missing. This file imports SentenceTransformer and faiss directly, which will produce a raw ModuleNotFoundError instead.

More broadly, this pipeline class reimplements functionality already provided by Embedder, VectorIndexer, and chunk_text from the embeddings/ package. Consider reusing those modules instead of duplicating logic.
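
A sketch of the guarded-import pattern described here; the exact wording used in embedder.py and indexer.py may differ:

```python
try:
    from sentence_transformers import SentenceTransformer
except ImportError as e:
    raise ImportError(
        "sentence-transformers is required: pip install sentence-transformers"
    ) from e

try:
    import faiss
except ImportError as e:
    raise ImportError("faiss is required: pip install faiss-cpu") from e
```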



class EmbeddingPipeline:
    def __init__(self, model_name="all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name, cache_folder="D:/models_cache")

⚠️ Potential issue | 🔴 Critical

Hardcoded Windows-specific cache path — breaks portability.

cache_folder="D:/models_cache" will fail on non-Windows systems and on other developers' machines. Remove this or use a platform-agnostic default (e.g., a project-relative directory or respect HF_HOME/SENTENCE_TRANSFORMERS_HOME env vars).

Proposed fix
-        self.model = SentenceTransformer(model_name, cache_folder="D:/models_cache")
+        self.model = SentenceTransformer(model_name)
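
One way to approximate the env-var approach (note that HF_HOME is the general Hugging Face home directory, so passing it directly as cache_folder is itself an approximation):

```python
import os
from sentence_transformers import SentenceTransformer

class EmbeddingPipeline:
    def __init__(self, model_name="all-MiniLM-L6-v2"):
        # honor standard cache env vars; None falls back to the library default
        cache_dir = os.getenv("SENTENCE_TRANSFORMERS_HOME") or os.getenv("HF_HOME")
        self.model = SentenceTransformer(model_name, cache_folder=cache_dir)
```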

Comment on lines +14 to +24
def chunk_text(self, text, max_length=300, overlap=50):
    chunks = []
    start = 0

    while start < len(text):
        end = start + max_length
        chunk = text[start:end]
        chunks.append(chunk)
        start = end - overlap

    return chunks

🛠️ Refactor suggestion | 🟠 Major

Duplicate chunk_text — also susceptible to infinite loop.

This is a copy of the logic in embeddings/chunker.py with a different default max_length (300 vs 500) and without the end >= len(text) break guard. If overlap >= max_length, this version loops forever.

Reuse chunk_text from embeddings/chunker.py instead.


import re

#-------------------emedding-pipeline-chunking concept
from rag_mvp.pipelines.embedding_pipeline import EmbeddingPipeline

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check package configuration and structure
fd -t f "__init__.py" smart-notes/
fd -t f "setup.py|pyproject.toml|setup.cfg" smart-notes/

Repository: AOSSIE-Org/Info

Length of output: 143


🏁 Script executed:

#!/bin/bash
# Check for __init__.py at rag_mvp level
ls -la smart-notes/rag_mvp/__init__.py 2>&1

# Find README
fd -t f "README|readme" . --max-depth 3

# Check for config files at repo root
fd -t f "setup.py|pyproject.toml|setup.cfg|poetry.lock" . --max-depth 1

Repository: AOSSIE-Org/Info

Length of output: 181


🏁 Script executed:

#!/bin/bash
# Check the rag_mvp README which is most relevant
cat smart-notes/rag_mvp/README.md

# Also check root README for context
echo "=== ROOT README ===" 
head -100 ./README.md

Repository: AOSSIE-Org/Info

Length of output: 5181


Absolute import will fail when script is run as documented.

from rag_mvp.pipelines.embedding_pipeline import EmbeddingPipeline requires rag_mvp to be a package discoverable from sys.path. Running python smart-notes/rag_mvp/qa_cli.py from the repo root (as the README suggests) adds only smart-notes/rag_mvp to sys.path, not smart-notes. Additionally, there is no __init__.py at smart-notes/rag_mvp/, so rag_mvp is not recognized as a package. This will fail with ModuleNotFoundError.

Use relative imports (from .pipelines.embedding_pipeline import ...), add an __init__.py at smart-notes/rag_mvp/, or update the execution instructions to use a method that properly configures the module path.


"of", "to", "in", "on", "for"
}

NOTES_DIR = "notes"

⚠️ Potential issue | 🟡 Minor

NOTES_DIR is a relative path — behavior depends on the working directory.

"notes" resolves relative to the CWD, not relative to the script or project root. This will silently find no notes if the user runs the CLI from a different directory.

Proposed fix — resolve relative to the script location
-NOTES_DIR = "notes"
+NOTES_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "..", "..", "notes")
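
A pathlib variant of the same fix; note it anchors notes/ next to the module itself, whereas the os.path version above climbs to the repo root, so pick whichever matches the intended layout:

```python
from pathlib import Path

# resolve notes/ relative to this file instead of the current working directory
NOTES_DIR = Path(__file__).resolve().parent / "notes"
```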

Comment on lines +85 to +87
if __name__ == "__main__":

    demo_embeddings_pipeline()  # Temporary demo for embeddings pipeline

⚠️ Potential issue | 🟠 Major

demo_embeddings_pipeline() runs unconditionally on every CLI start.

This forces the SentenceTransformer model to load (and potentially download) every time a user launches the CLI, even if they only want the keyword-based Q&A. This adds significant startup latency. Consider making the embedding demo opt-in (e.g., via a CLI flag) or removing it from the default flow.
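
A minimal opt-in flag along the lines suggested here (the flag name is illustrative):

```python
import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Smart Notes Q&A CLI")
    parser.add_argument("--demo-embeddings", action="store_true",
                        help="run the embeddings pipeline demo (loads the model)")
    args = parser.parse_args()

    if args.demo_embeddings:
        demo_embeddings_pipeline()  # only load the model when explicitly requested
    # ...existing keyword-based Q&A flow continues here...
```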


Comment on lines +28 to +44
```bash
python smart-notes/rag_mvp/qa_cli.py



>> what is AI

[1] From test.md:
Artificial Intelligence (AI) is the simulation of human intelligence in machines.


>> what is machine learning
how is machine learning used
difference between AI and ML




⚠️ Potential issue | 🟡 Minor

Malformed Markdown — unclosed code block bleeds into the rest of the document.

The code block opened at line 28 is never properly closed. The example CLI output (lines 33–43) and everything after it gets swallowed into the code fence, making the second half of the README render as a raw code block rather than formatted documentation.

Close the code block after the CLI example and before the second section heading.


Comment on lines +75 to +84
```bash
smart-notes/
├── rag_mvp/
│   ├── embed.py        # Embedding logic
│   ├── index.py        # FAISS index creation
│   ├── qa_cli.py       # CLI for asking questions
│   └── utils.py        # Helper functions
├── notes/              # Put your .txt notes here
├── requirements.txt
└── README.md

⚠️ Potential issue | 🟡 Minor

Project structure in README doesn't match actual file layout.

The documented structure references embed.py, index.py, and utils.py, but the actual PR introduces embeddings/chunker.py, embeddings/embedder.py, embeddings/indexer.py, and pipelines/embedding_pipeline.py. Also, the trailing code block is never closed.

Please update the project structure to reflect the real file layout and close the code fence.
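
Based on the files this PR actually introduces, a corrected structure might read:

```
smart-notes/
└── rag_mvp/
    ├── embeddings/
    │   ├── chunker.py              # overlapping text chunks
    │   ├── embedder.py             # sentence-transformers wrapper
    │   └── indexer.py              # FAISS-backed similarity search
    ├── pipelines/
    │   └── embedding_pipeline.py   # chunking + embedding + FAISS orchestration
    ├── qa_cli.py                   # interactive Q&A CLI
    └── README.md
```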

