
GSoC 2026: Add Discord–GitHub Sync Bot Idea#46

Open
tubajavedd wants to merge 8 commits intoAOSSIE-Org:mainfrom
tubajavedd:gsoc-2026-Discord-github-bot

Conversation


@tubajavedd tubajavedd commented Feb 14, 2026

SUMMARY

This PR proposes a new GSoC 2026 project idea: Discord–GitHub Sync Bot.

The goal of this idea is to automate contributor management between Discord and GitHub for large open-source organizations like AOSSIE.

MOTIVATION
As organizations grow, managing:

Discord roles
GitHub issue assignments
PR review requests
Contributor recognition

becomes increasingly manual and time-consuming.

This idea aims to reduce that overhead by building a self-hosted, configurable bot that synchronizes activity between Discord and GitHub.

WHAT THIS IDEA COVERS

Assign Discord roles based on GitHub contributions
Assign issues or review requests based on Discord roles
Track contributions and maintain a scoring system (a rough sketch follows this list)
Provide a configurable system reusable by other organizations
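
To make the scoring item above concrete, here is a rough, hypothetical sketch of how contribution points could be tallied. The point values mirror the Action/Points table proposed later in this idea (+10 merged PR, +5 closed issue, +7 review, +3 documentation PR); the event names and the update_score helper are illustrative only, not part of any existing code.

```python
# Hypothetical sketch only: event names and update_score() are illustrative;
# point values mirror the proposal's Action/Points table.
ACTION_POINTS = {
    "pr_merged": 10,
    "issue_closed": 5,
    "pr_review": 7,
    "documentation_pr": 3,
}

def update_score(scores: dict, contributor: str, action: str) -> dict:
    """Add the points for `action` to `contributor`'s running total."""
    scores[contributor] = scores.get(contributor, 0) + ACTION_POINTS.get(action, 0)
    return scores

scores = {}
update_score(scores, "octocat", "pr_merged")   # +10
update_score(scores, "octocat", "pr_review")   # +7
print(scores)  # {'octocat': 17}
```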

WHY IT MAY BE VALUABLE FOR AOSSIE

Reduces manual coordination work for mentors
Encourages contributor engagement
Improves transparency in recognition
Scalable for growing communities

I would appreciate feedback from mentors regarding scope, feasibility, and alignment with AOSSIE’s goals for GSoC 2026.

Summary by CodeRabbit

  • New Features

    • Interactive Q&A system for querying local Markdown notes with keyword-based and semantic search capabilities
    • Two MVP implementations: basic keyword search and advanced embeddings-based retrieval for intelligent note discovery
  • Documentation

    • Comprehensive local Markdown storage architecture design
    • Smart Notes landing page UI design specifications
    • Updated README with improved structure and contributor guidance
    • Complete MVP documentation with usage instructions
  • Chores

    • Updated .gitignore to exclude notes directory


coderabbitai bot commented Feb 14, 2026

📝 Walkthrough

This PR introduces a comprehensive RAG (Retrieval-Augmented Generation) MVP system for Smart Notes alongside expanded documentation. New Python modules implement text chunking, embedding generation, vector indexing, and a CLI interface. Documentation updates include design specifications for local markdown storage, GSoC project ideas, and reorganized README structures.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Documentation & Configuration**<br>.gitignore, Design/local markdown storage.md, GSoC-Ideas/2026/Discord-GitHub-Sync-Bot.md, README.md, smart-notes-design/README.md | Added notes directory to gitignore. Introduced design document for offline-first local markdown storage system with file watcher and SQLite indexing. Added GSoC project proposal for Discord–GitHub Sync Bot. Restructured main README with role-based sections, relative link normalization, and contribution guidance. |
| **RAG MVP Core Components**<br>smart-notes/rag_mvp/embeddings/chunker.py, smart-notes/rag_mvp/embeddings/embedder.py, smart-notes/rag_mvp/embeddings/indexer.py, smart-notes/rag_mvp/pipelines/embedding_pipeline.py | Introduced text chunking with overlapping windows for context preservation. Added embedding wrapper using sentence-transformers. Implemented FAISS-backed vector indexer for semantic search. Created unified embedding pipeline orchestrating chunking, embedding generation, and index building. |
| **RAG MVP Integration**<br>smart-notes/rag_mvp/README.md, smart-notes/rag_mvp/qa_cli.py | Added comprehensive README documenting two MVPs: Local Q&A and Embeddings & FAISS-based RAG. Introduced interactive QA CLI with note loading, sentence-level search, and embedding pipeline demonstration. |

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant CLI as qa_cli
    participant Pipeline as EmbeddingPipeline
    participant Chunker as chunker
    participant Embedder as embedder
    participant Index as VectorIndexer
    participant FAISS

    User->>CLI: Enter query
    CLI->>CLI: load_notes()
    CLI->>Pipeline: process_notes(text)
    Pipeline->>Chunker: chunk_text(text)
    Chunker-->>Pipeline: chunks[]
    Pipeline->>Embedder: embed(chunks)
    Embedder-->>Pipeline: embeddings[]
    Pipeline->>Index: add(embeddings, chunks)
    Index->>FAISS: add_vectors()
    FAISS-->>Index: index
    Index-->>Pipeline: indexed
    User->>CLI: semantic query
    CLI->>Pipeline: semantic_search(query)
    Pipeline->>Embedder: embed(query)
    Embedder-->>Pipeline: query_embedding
    Pipeline->>Index: search(query_embedding)
    Index->>FAISS: search()
    FAISS-->>Index: top_k results
    Index-->>Pipeline: relevant_chunks[]
    Pipeline-->>CLI: search_results
    CLI-->>User: Display matches
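
For orientation, the flow in the diagram corresponds to a plain sentence-transformers + FAISS script roughly like the sketch below. The model name and chunking parameters are assumptions, not values taken from the PR's modules; the sample note line echoes the test.md example quoted in the MVP README.

```python
# Minimal sketch of the chunk -> embed -> index -> search flow in the diagram,
# using sentence-transformers and FAISS directly (illustrative, not the PR's code).
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

def chunk_text(text: str, max_length: int = 300, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows."""
    step = max(1, max_length - overlap)  # keep the step positive so the loop advances
    return [text[i:i + max_length] for i in range(0, len(text), step)]

notes = "Artificial Intelligence (AI) is the simulation of human intelligence in machines."
chunks = chunk_text(notes)

model = SentenceTransformer("all-MiniLM-L6-v2")          # model name is an assumption
embeddings = np.asarray(model.encode(chunks), dtype="float32")

index = faiss.IndexFlatL2(embeddings.shape[1])           # exact L2 search over chunk vectors
index.add(embeddings)

query_vec = np.asarray(model.encode(["what is AI"]), dtype="float32")
_distances, indices = index.search(query_vec, 3)
results = [chunks[i] for i in indices[0] if i != -1]     # drop FAISS's -1 padding
print(results)
```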

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • Zahnentferner

Poem

🐰 Hop, hop! New embeddings take flight,
Chunks and vectors, indexed just right,
From notes to queries, semantically true,
A RAG MVP—fresh and brand new!

🚥 Pre-merge checks | ✅ 2 | ❌ 2
❌ Failed checks (2 warnings)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Merge Conflict Detection | ⚠️ Warning | ❌ Merge conflicts detected (3 files):<br>⚔️ GSoC-Ideas/2026/MiniChain.md (content)<br>⚔️ GSoC-Ideas/2026/Rein.md (content)<br>⚔️ README.md (content)<br>These conflicts must be resolved before merging into main. | Resolve conflicts locally and push changes to this branch. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title 'GSoC 2026: Add Discord–GitHub Sync Bot Idea' clearly and accurately summarizes the primary change: adding a new GSoC 2026 project proposal for a Discord–GitHub Sync Bot. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
  • 🧪 Generate unit tests (beta)
    • Create PR with unit tests
    • Post copyable unit tests in a comment
  • ⚔️ Resolve merge conflicts (beta)
    • Auto-commit resolved conflicts to branch gsoc-2026-Discord-github-bot
    • Post resolved changes as copyable diffs in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 15

🤖 Fix all issues with AI agents
In `@Design/local markdown storage.md`:
- Around line 13-16: Fix the spelling mistake by replacing "Serives" with
"Services" in the list under the "Offline-first experience" section (the line
currently reading "NO Cloud Serives,"). Ensure the corrected line reads "NO
Cloud Services," so the three bullet items are "NO Internet, NO Cloud Services,
NO Server".
- Line 47: Fix the mismatched quotation in the string fragment by adding the
missing closing double-quote so both quoted items are balanced; update the text
`"in-memory index " and "local db(SQLite)` to `"in-memory index" and "local db
(SQLite)"` (adjust spacing as needed) to ensure both quoted phrases — "in-memory
index" and "local db (SQLite)" — are properly closed.
- Line 170: Section numbering skips 11: locate the markdown headings "### 12.
Summary" and the later "### 13" heading and renumber them so the sequence is
continuous (change "### 12. Summary" to "### 11. Summary" and update the "###
13" heading to "### 12"), ensuring all subsequent section numbers are adjusted
accordingly.

In `@GSoC-Ideas/2026/Discord-GitHub-Sync-Bot.md`:
- Line 1: The document title uses a level-6 heading; change the top-level title
line "###### Discord–GitHub Sync Bot" to a level-1 heading by replacing it with
"# Discord–GitHub Sync Bot" so the file's main title renders correctly as H1;
leave other headings unchanged unless they also need demotion to lower levels.
- Around line 104-108: The "Contribution scoring table" block under the "Action
Points" section is not valid Markdown; replace the plain text lines with a
proper Markdown table by adding a header row and a separator row using pipes and
hyphens (e.g., a header like "Action | Points" and a separator like "--- | ---")
so the entries ("PR merged", "Issue closed", "PR review", "Documentation PR")
appear as rows with their corresponding scores (+10, +5, +7, +3); update the
block where the current "Action	Points" lines appear to use this pipe-separated
table format so it renders correctly in Markdown.

In `@README.md`:
- Around line 11-20: TOC anchor links for emoji-prefixed headings (for example
the anchors "#-guidance-for-admins" and "#-important-notes") may not match
GitHub's generated anchors for headings like "🛡️ Guidance for Admins" and "⚠️
Important Notes"; update the README TOC to use the exact anchors GitHub creates
(or remove the leading emojis from the corresponding headings) so links
resolve—locate the headings "🛡️ Guidance for Admins" and "⚠️ Important Notes"
and either (a) change the heading text to remove the emoji or (b) replace the
TOC entries "- [🛡️ Guidance for Admins](`#-guidance-for-admins`)" and "- [⚠️
Important Notes](`#-important-notes`)" with the correct GitHub anchor strings that
match the rendered headings.

In `@smart-notes/rag_mvp/embeddings/chunker.py`:
- Around line 9-29: The chunk_text function can infinite-loop when overlap >=
max_length because start = end - overlap may not advance; inside chunk_text
validate and guard the overlap parameter (e.g., if overlap >= max_length raise
ValueError or clamp overlap to max(0, max_length-1)) before the loop, or ensure
the loop advances by forcing start = max(end - overlap, start + 1); update the
logic around the start/end calculation in chunk_text to guarantee progress on
each iteration.
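
A minimal sketch of the guard described above, keeping the parameter names from the prompt (text, max_length, overlap); this is not the PR's actual chunker.py code, and the same guard applies to the duplicate chunk_text in pipelines/embedding_pipeline.py noted below.

```python
# Sketch only: a chunk_text with the overlap guard suggested above.
# The max_length=500 default follows the review's description of chunker.py.
def chunk_text(text: str, max_length: int = 500, overlap: int = 50) -> list[str]:
    if overlap >= max_length:
        overlap = max_length - 1  # clamp; raising ValueError is the stricter alternative
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_length, len(text))
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end == len(text):
            break  # reached the tail; stop instead of re-chunking the same suffix
        start = max(end - overlap, start + 1)  # guarantees forward progress
    return chunks
```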

In `@smart-notes/rag_mvp/embeddings/indexer.py`:
- Around line 23-28: The add method currently allows mismatched inputs which
will corrupt the FAISS index; before calling self.index.add and
self.texts.extend (in the add(self, embeddings: np.ndarray, chunks: List[str])
method) validate that embeddings is a 2D array and that embeddings.shape[0] ==
len(chunks); if embeddings is empty keep the existing early return, otherwise
raise a ValueError with a clear message when the row count and chunks length
differ so the index and self.texts cannot go out of sync; keep using the same
symbols (embeddings, chunks, self.index.add, self.texts.extend, search) so
reviewers can locate the change.
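
A stripped-down stand-in for the validation described above; the class shape and the flat L2 FAISS index are assumptions for illustration, not the PR's actual VectorIndexer.

```python
# Sketch of the add() validation described above (illustrative class, not the PR's).
from typing import List
import numpy as np
import faiss

class SafeIndexer:
    def __init__(self, dim: int):
        self.index = faiss.IndexFlatL2(dim)
        self.texts: List[str] = []

    def add(self, embeddings: np.ndarray, chunks: List[str]) -> None:
        if embeddings is None or len(embeddings) == 0:
            return  # mirror the existing early return for empty input
        embeddings = np.asarray(embeddings, dtype="float32")
        if embeddings.ndim != 2 or embeddings.shape[0] != len(chunks):
            raise ValueError(
                f"got embeddings with shape {embeddings.shape} for {len(chunks)} chunks; "
                "refusing to let the FAISS index and stored texts go out of sync"
            )
        self.index.add(embeddings)
        self.texts.extend(chunks)
```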

In `@smart-notes/rag_mvp/pipelines/embedding_pipeline.py`:
- Line 10: The SentenceTransformer instance is using a hardcoded Windows-only
cache_folder ("D:/models_cache") which will break on other OSes; update the
SentenceTransformer initialization in the embedding pipeline (where self.model
is set) to derive the cache folder from a portable source—preferably use an
environment variable (e.g., MODEL_CACHE_DIR) falling back to a user-relative
path (expand ~) or a relative directory using os.path.expanduser/os.path.join or
pathlib.Path—and add the required import for os (or pathlib) at the top of the
file so the cache path is platform-independent and configurable (see the sketch after this list).
- Around line 14-24: The chunk_text method can infinite-loop when overlap >=
max_length and duplicates logic in embeddings/chunker.py; update chunk_text in
class (method name chunk_text) to validate inputs and ensure progress: raise or
clamp overlap to be less than max_length (or set overlap = min(overlap,
max_length-1)) and ensure start increases (e.g., compute next_start = max(start
+ 1, end - overlap)) so the while loop always advances; alternatively, remove
this method and import/reuse the existing chunk_text implementation from
embeddings/chunker.py (which uses max_length=500) to avoid duplication.
- Around line 41-46: The semantic_search method currently assumes self.index and
valid indices; first add a guard at the start of semantic_search to raise a
clear error if self.index is None (hint: reference build_index and process_notes
which populate the index) so callers know to run process_notes/build_index
first, and then sanitize FAISS indices before indexing self.chunks by mapping
the returned indices from self.index.search to results only when index >= 0 and
< len(self.chunks) (for any -1 or out-of-range values, skip or return
None/placeholders) to avoid silently returning the last chunk via negative
indexing.
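
For the first item above (the hardcoded cache folder), a portable initialization might look like the following sketch; MODEL_CACHE_DIR comes from the prompt, while the model name and fallback directory are placeholders.

```python
# Sketch of a portable cache path for the SentenceTransformer model.
# MODEL_CACHE_DIR is the env var suggested in the prompt; the fallback is illustrative.
import os
from pathlib import Path
from sentence_transformers import SentenceTransformer

cache_dir = os.environ.get(
    "MODEL_CACHE_DIR",
    str(Path.home() / ".cache" / "smart_notes_models"),
)
model = SentenceTransformer("all-MiniLM-L6-v2", cache_folder=cache_dir)
```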

In `@smart-notes/rag_mvp/qa_cli.py`:
- Around line 4-5: Fix the typo in the top-of-file comment: change
"emedding-pipeline-chunking concept" to "embedding-pipeline-chunking concept"
near the import of EmbeddingPipeline (class/import reference: EmbeddingPipeline
from rag_mvp.pipelines.embedding_pipeline) so the comment correctly spells
"embedding".
- Around line 63-82: In search_notes, avoid substring matches and silently
returning nothing: instead of checking "word in sentence_lower", perform
word-boundary matching (e.g., build a regex using r'\b{re.escape(word)}\b' for
each word or split sentence_lower into tokens and compare token sets) so "ai"
won't match inside other words; also handle the case where query_words becomes
empty (all tokens in QUESTION_WORDS) by returning an explicit informative result
or raising/returning a sentinel (e.g., a message like "no query terms after
removing stopwords") rather than silently returning an empty list; update
references in this function (search_notes), using QUESTION_WORDS and
split_sentences, and ensure you normalize casing and strip punctuation when
tokenizing.

In `@smart-notes/rag_mvp/README.md`:
- Around line 28-44: The README contains an unclosed fenced code block that
starts with "```bash" around the example run command; close that fence
immediately after the run command by adding a terminating "```" so the following
sections (sample output, the second heading, and the project structure) are not
swallowed into the code block, and move or reformat the sample CLI output (the
lines beginning with ">>" and the "[1] From test.md:" result) into its own
fenced block or plain paragraph so the "How to run" section and other headings
render correctly.
- Around line 75-84: The README's project tree is inaccurate: it lists embed.py,
index.py, and utils.py but the codebase actually contains
embeddings/embedder.py, embeddings/indexer.py, embeddings/chunker.py, and
pipelines/embedding_pipeline.py; update the README to either mirror the real
layout (replace embed.py → embeddings/embedder.py, index.py →
embeddings/indexer.py, add embeddings/chunker.py and
pipelines/embedding_pipeline.py) or provide a brief mapping note explaining
which modules correspond to the listed conceptual files so contributors can find
embedder, indexer, chunker and embedding_pipeline easily.
🧹 Nitpick comments (5)
smart-notes/rag_mvp/embeddings/chunker.py (1)

9-29: Duplicate chunking logic with EmbeddingPipeline.chunk_text.

A nearly identical chunk_text exists in smart-notes/rag_mvp/pipelines/embedding_pipeline.py (lines 13–23) with different defaults (max_length=300) and without the .strip()/empty-check refinements. Consider having the pipeline delegate to this standalone function to avoid divergence.

smart-notes/rag_mvp/embeddings/indexer.py (1)

34-34: Prefix unused distances with underscore.

Per Ruff RUF059, distances is never used.

-        distances, indices = self.index.search(query_embedding.reshape(1, -1), k)
+        _distances, indices = self.index.search(query_embedding.reshape(1, -1), k)
smart-notes/rag_mvp/pipelines/embedding_pipeline.py (1)

8-12: EmbeddingPipeline re-implements chunking, embedding, and indexing already provided by sibling modules.

The PR introduces embeddings/chunker.py, embeddings/embedder.py, and embeddings/indexer.py with the same responsibilities. This pipeline should compose those modules rather than duplicate their logic, to avoid divergence (e.g., different default max_length values: 300 here vs 500 in chunker.py).

smart-notes/rag_mvp/qa_cli.py (2)

85-88: demo_embeddings_pipeline() loads a heavy ML model unconditionally on every CLI start.

This forces a model download/load even when the user only wants the keyword-based note search. Gate it behind a flag or make it opt-in.

Example
 if __name__ == "__main__":
-
-    demo_embeddings_pipeline()      # Temporary demo for embeddings pipeline
+    import sys
+    if "--demo-embeddings" in sys.argv:
+        demo_embeddings_pipeline()

16-16: Unused variable embeddings.

Per Ruff RUF059, prefix with underscore since it's only used for the side effect.

-    chunks, embeddings = pipeline.process_notes(note_text)
+    chunks, _embeddings = pipeline.process_notes(note_text)

Comment on lines +13 to +16
Offline-first experience :
NO Internet,
NO Cloud Serives,
NO Server

⚠️ Potential issue | 🟡 Minor

Typo: "Serives" → "Services" (Line 15).

-    NO Cloud Serives,
+    NO Cloud Services,


The app maintains:

"in-memory index " and "local db(SQLite)

⚠️ Potential issue | 🟡 Minor

Unclosed quotation mark.

The line has a mismatched quote: "local db(SQLite) is missing its closing ".

-  "in-memory index " and "local db(SQLite) 
+  "in-memory index" and "local db (SQLite)"




### 12. Summary

⚠️ Potential issue | 🟡 Minor

Section numbering skips from 10 to 12 — section 11 is missing.

-### 12. Summary
+### 11. Summary

(And renumber section 13 → 12 accordingly on Line 176.)


@@ -0,0 +1,203 @@
###### Discord–GitHub Sync Bot

⚠️ Potential issue | 🟡 Minor

Heading level: use # (H1) for the document title, not ###### (H6).

###### renders as the smallest heading. For a top-level document title, use #.

-###### Discord–GitHub Sync Bot
+# Discord–GitHub Sync Bot

Comment on lines +104 to +108
Action Points
PR merged-> +10
Issue closed-> +5
PR review-> +7
Documentation PR-> +3

⚠️ Potential issue | 🟡 Minor

Contribution scoring table is not valid Markdown.

The table on these lines won't render correctly. Use proper Markdown table syntax.

Proposed fix
-Action	Points
-PR merged-> 	+10
-Issue closed-> 	+5
-PR review-> 	+7
-Documentation PR-> 	+3
+| Action           | Points |
+|------------------|--------|
+| PR merged        | +10    |
+| Issue closed     | +5     |
+| PR review        | +7     |
+| Documentation PR | +3     |

Comment on lines +41 to +46
    def semantic_search(self, query, top_k=3):
        query_vec = self.model.encode([query])
        query_vec = np.array(query_vec).astype("float32")

        distances, indices = self.index.search(query_vec, top_k)
        results = [self.chunks[i] for i in indices[0]]

⚠️ Potential issue | 🟠 Major

semantic_search crashes if called before process_notes, and can index into self.chunks with -1.

Two issues:

  1. self.index is None until build_index is called — calling semantic_search first will raise AttributeError.
  2. FAISS returns -1 for indices when fewer than top_k results exist. self.chunks[-1] silently returns the last chunk instead of indicating "no result."
Proposed fix
     def semantic_search(self, query, top_k=3):
+        if self.index is None:
+            raise RuntimeError("Index not built. Call process_notes() first.")
         query_vec = self.model.encode([query])
         query_vec = np.array(query_vec).astype("float32")
 
-        distances, indices = self.index.search(query_vec, top_k)
-        results = [self.chunks[i] for i in indices[0]]
+        _distances, indices = self.index.search(query_vec, top_k)
+        results = [self.chunks[i] for i in indices[0] if i != -1]
         return results
🧰 Tools
🪛 Ruff (0.15.0)

[warning] 45-45: Unpacked variable distances is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


Comment on lines +4 to +5
#-------------------emedding-pipeline-chunking concept
from rag_mvp.pipelines.embedding_pipeline import EmbeddingPipeline

⚠️ Potential issue | 🟡 Minor

Typo in comment: "emedding" → "embedding".

-#-------------------emedding-pipeline-chunking concept
+#-------------------embedding-pipeline-chunking concept

Comment on lines +63 to +82
def search_notes(query, notes):
    results = []

    query_words = [
        word.lower()
        for word in query.split()
        if word.lower() not in QUESTION_WORDS
    ]

    for note in notes:
        sentences = split_sentences(note["content"])
        for sentence in sentences:
            sentence_lower = sentence.lower()
            if any(word in sentence_lower for word in query_words):
                results.append({
                    "filename": note["filename"],
                    "sentence": sentence.strip()
                })

    return results

⚠️ Potential issue | 🟠 Major

Keyword search uses substring matching, producing false positives; also silently returns nothing if all words are stopwords.

Two issues:

  1. word in sentence_lower matches substrings — e.g., query word "ai" matches "ertainly", "is" matches "this". Use word-boundary matching instead.
  2. If every word in the query is in QUESTION_WORDS, query_words becomes empty, any(...) returns False, and the user gets "No relevant notes found" with no explanation.
Proposed fix sketch
+import re
+
 def search_notes(query, notes):
     results = []
 
     query_words = [
         word.lower()
         for word in query.split()
         if word.lower() not in QUESTION_WORDS
     ]
 
+    if not query_words:
+        return results  # or warn the user
+
     for note in notes:
         sentences = split_sentences(note["content"])
         for sentence in sentences:
             sentence_lower = sentence.lower()
-            if any(word in sentence_lower for word in query_words):
+            if any(re.search(r'\b' + re.escape(word) + r'\b', sentence_lower) for word in query_words):
                 results.append({
                     "filename": note["filename"],
                     "sentence": sentence.strip()
                 })
 
     return results

Comment on lines +28 to +44
```bash
python smart-notes/rag_mvp/qa_cli.py



>> what is AI

[1] From test.md:
Artificial Intelligence (AI) is the simulation of human intelligence in machines.


>> what is machine learning
how is machine learning used
difference between AI and ML




⚠️ Potential issue | 🟠 Major

Broken markdown: unclosed code block merges the "How to run" section with sample output and the second half of the document.

The ```bash block starting at line 28 is never closed. Everything after it — including sample queries, the second README heading, and the project structure — renders as one continuous code block.

Add the closing ``` after the run command and format sample output separately.


Comment on lines +75 to +84
```bash
smart-notes/
├── rag_mvp/
│ ├── embed.py # Embedding logic
│ ├── index.py # FAISS index creation
│ ├── qa_cli.py # CLI for asking questions
│ └── utils.py # Helper functions
├── notes/ # Put your .txt notes here
├── requirements.txt
└── README.md

⚠️ Potential issue | 🟡 Minor

Project structure lists filenames that don't match the actual modules.

The tree references embed.py, index.py, and utils.py, but the actual files are embeddings/embedder.py, embeddings/indexer.py, embeddings/chunker.py, and pipelines/embedding_pipeline.py. This will confuse contributors.

