
GSoC 2026: Add Discord–GitHub Sync Bot Idea#46

Open
tubajavedd wants to merge 8 commits intoAOSSIE-Org:mainfrom
tubajavedd:gsoc-2026-Discord-github-bot

Conversation


@tubajavedd tubajavedd commented Feb 14, 2026

SUMMARY

This PR proposes a new GSoC 2026 project idea: Discord–GitHub Sync Bot.

The goal of this idea is to automate contributor management between Discord and GitHub for large open-source organizations like AOSSIE.

MOTIVATION
As organizations grow, managing:

Discord roles
GitHub issue assignments
PR review requests
Contributor recognition

becomes increasingly manual and time-consuming.

This idea aims to reduce that overhead by building a self-hosted, configurable bot that synchronizes activity between Discord and GitHub.

WHAT THIS IDEA COVERS

Assign Discord roles based on GitHub contributions
Assign issues or review requests based on Discord roles
Track contributions and maintain a scoring system (a rough sketch follows this list)
Provide a configurable system reusable by other organizations
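
To make the scoring item above concrete, here is a rough, hypothetical sketch of how contribution points could be tallied. The point values mirror the Action/Points table proposed later in this idea (+10 merged PR, +5 closed issue, +7 review, +3 documentation PR); the event names and the update_score helper are illustrative only, not part of any existing code.

```python
# Hypothetical sketch only: event names and update_score() are illustrative;
# point values mirror the proposal's Action/Points table.
ACTION_POINTS = {
    "pr_merged": 10,
    "issue_closed": 5,
    "pr_review": 7,
    "documentation_pr": 3,
}

def update_score(scores: dict, contributor: str, action: str) -> dict:
    """Add the points for `action` to `contributor`'s running total."""
    scores[contributor] = scores.get(contributor, 0) + ACTION_POINTS.get(action, 0)
    return scores

scores = {}
update_score(scores, "octocat", "pr_merged")   # +10
update_score(scores, "octocat", "pr_review")   # +7
print(scores)  # {'octocat': 17}
```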

WHY IT MAY BE VALUABLE FOR AOSSIE

Reduces manual coordination work for mentors
Encourages contributor engagement
Improves transparency in recognition
Scalable for growing communities

I would appreciate feedback from mentors regarding scope, feasibility, and alignment with AOSSIE’s goals for GSoC 2026.

Summary by CodeRabbit

  • New Features

    • Interactive Q&A system for querying local Markdown notes with keyword-based and semantic search capabilities
    • Two MVP implementations: basic keyword search and advanced embeddings-based retrieval for intelligent note discovery
  • Documentation

    • Comprehensive local Markdown storage architecture design
    • Smart Notes landing page UI design specifications
    • Updated README with improved structure and contributor guidance
    • Complete MVP documentation with usage instructions
  • Chores

    • Updated .gitignore to exclude notes directory


coderabbitai bot commented Feb 14, 2026

📝 Walkthrough

This PR introduces a comprehensive RAG (Retrieval-Augmented Generation) MVP system for Smart Notes alongside expanded documentation. New Python modules implement text chunking, embedding generation, vector indexing, and a CLI interface. Documentation updates include design specifications for local markdown storage, GSoC project ideas, and reorganized README structures.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Documentation & Configuration**<br>.gitignore, Design/local markdown storage.md, GSoC-Ideas/2026/Discord-GitHub-Sync-Bot.md, README.md, smart-notes-design/README.md | Added notes directory to gitignore. Introduced design document for offline-first local markdown storage system with file watcher and SQLite indexing. Added GSoC project proposal for Discord–GitHub Sync Bot. Restructured main README with role-based sections, relative link normalization, and contribution guidance. |
| **RAG MVP Core Components**<br>smart-notes/rag_mvp/embeddings/chunker.py, smart-notes/rag_mvp/embeddings/embedder.py, smart-notes/rag_mvp/embeddings/indexer.py, smart-notes/rag_mvp/pipelines/embedding_pipeline.py | Introduced text chunking with overlapping windows for context preservation. Added embedding wrapper using sentence-transformers. Implemented FAISS-backed vector indexer for semantic search. Created unified embedding pipeline orchestrating chunking, embedding generation, and index building. |
| **RAG MVP Integration**<br>smart-notes/rag_mvp/README.md, smart-notes/rag_mvp/qa_cli.py | Added comprehensive README documenting two MVPs: Local Q&A and Embeddings & FAISS-based RAG. Introduced interactive QA CLI with note loading, sentence-level search, and embedding pipeline demonstration. |

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant CLI as qa_cli
    participant Pipeline as EmbeddingPipeline
    participant Chunker as chunker
    participant Embedder as embedder
    participant Index as VectorIndexer
    participant FAISS

    User->>CLI: Enter query
    CLI->>CLI: load_notes()
    CLI->>Pipeline: process_notes(text)
    Pipeline->>Chunker: chunk_text(text)
    Chunker-->>Pipeline: chunks[]
    Pipeline->>Embedder: embed(chunks)
    Embedder-->>Pipeline: embeddings[]
    Pipeline->>Index: add(embeddings, chunks)
    Index->>FAISS: add_vectors()
    FAISS-->>Index: index
    Index-->>Pipeline: indexed
    User->>CLI: semantic query
    CLI->>Pipeline: semantic_search(query)
    Pipeline->>Embedder: embed(query)
    Embedder-->>Pipeline: query_embedding
    Pipeline->>Index: search(query_embedding)
    Index->>FAISS: search()
    FAISS-->>Index: top_k results
    Index-->>Pipeline: relevant_chunks[]
    Pipeline-->>CLI: search_results
    CLI-->>User: Display matches
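
For orientation, the flow in the diagram corresponds to a plain sentence-transformers + FAISS script roughly like the sketch below. The model name and chunking parameters are assumptions, not values taken from the PR's modules; the sample note line echoes the test.md example quoted in the MVP README.

```python
# Minimal sketch of the chunk -> embed -> index -> search flow in the diagram,
# using sentence-transformers and FAISS directly (illustrative, not the PR's code).
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

def chunk_text(text: str, max_length: int = 300, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows."""
    step = max(1, max_length - overlap)  # keep the step positive so the loop advances
    return [text[i:i + max_length] for i in range(0, len(text), step)]

notes = "Artificial Intelligence (AI) is the simulation of human intelligence in machines."
chunks = chunk_text(notes)

model = SentenceTransformer("all-MiniLM-L6-v2")          # model name is an assumption
embeddings = np.asarray(model.encode(chunks), dtype="float32")

index = faiss.IndexFlatL2(embeddings.shape[1])           # exact L2 search over chunk vectors
index.add(embeddings)

query_vec = np.asarray(model.encode(["what is AI"]), dtype="float32")
_distances, indices = index.search(query_vec, 3)
results = [chunks[i] for i in indices[0] if i != -1]     # drop FAISS's -1 padding
print(results)
```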

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • Zahnentferner

Poem

🐰 Hop, hop! New embeddings take flight,
Chunks and vectors, indexed just right,
From notes to queries, semantically true,
A RAG MVP—fresh and brand new!

🚥 Pre-merge checks | ✅ 2 | ❌ 2
❌ Failed checks (2 warnings)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Merge Conflict Detection | ⚠️ Warning | ❌ Merge conflicts detected (3 files):<br>⚔️ GSoC-Ideas/2026/MiniChain.md (content)<br>⚔️ GSoC-Ideas/2026/Rein.md (content)<br>⚔️ README.md (content)<br>These conflicts must be resolved before merging into main. | Resolve conflicts locally and push changes to this branch. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title 'GSoC 2026: Add Discord–GitHub Sync Bot Idea' clearly and accurately summarizes the primary change: adding a new GSoC 2026 project proposal for a Discord–GitHub Sync Bot. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
  • 🧪 Generate unit tests (beta)
    • Create PR with unit tests
    • Post copyable unit tests in a comment
  • ⚔️ Resolve merge conflicts (beta)
    • Auto-commit resolved conflicts to branch gsoc-2026-Discord-github-bot
    • Post resolved changes as copyable diffs in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 15

🤖 Fix all issues with AI agents
In `@Design/local markdown storage.md`:
- Around line 13-16: Fix the spelling mistake by replacing "Serives" with
"Services" in the list under the "Offline-first experience" section (the line
currently reading "NO Cloud Serives,"). Ensure the corrected line reads "NO
Cloud Services," so the three bullet items are "NO Internet, NO Cloud Services,
NO Server".
- Line 47: Fix the mismatched quotation in the string fragment by adding the
missing closing double-quote so both quoted items are balanced; update the text
`"in-memory index " and "local db(SQLite)` to `"in-memory index" and "local db
(SQLite)"` (adjust spacing as needed) to ensure both quoted phrases — "in-memory
index" and "local db (SQLite)" — are properly closed.
- Line 170: Section numbering skips 11: locate the markdown headings "### 12.
Summary" and the later "### 13" heading and renumber them so the sequence is
continuous (change "### 12. Summary" to "### 11. Summary" and update the "###
13" heading to "### 12"), ensuring all subsequent section numbers are adjusted
accordingly.

In `@GSoC-Ideas/2026/Discord-GitHub-Sync-Bot.md`:
- Line 1: The document title uses a level-6 heading; change the top-level title
line "###### Discord–GitHub Sync Bot" to a level-1 heading by replacing it with
"# Discord–GitHub Sync Bot" so the file's main title renders correctly as H1;
leave other headings unchanged unless they also need demotion to lower levels.
- Around line 104-108: The "Contribution scoring table" block under the "Action
Points" section is not valid Markdown; replace the plain text lines with a
proper Markdown table by adding a header row and a separator row using pipes and
hyphens (e.g., a header like "Action | Points" and a separator like "--- | ---")
so the entries ("PR merged", "Issue closed", "PR review", "Documentation PR")
appear as rows with their corresponding scores (+10, +5, +7, +3); update the
block where the current "Action	Points" lines appear to use this pipe-separated
table format so it renders correctly in Markdown.

In `@README.md`:
- Around line 11-20: TOC anchor links for emoji-prefixed headings (for example
the anchors "#-guidance-for-admins" and "#-important-notes") may not match
GitHub's generated anchors for headings like "🛡️ Guidance for Admins" and "⚠️
Important Notes"; update the README TOC to use the exact anchors GitHub creates
(or remove the leading emojis from the corresponding headings) so links
resolve—locate the headings "🛡️ Guidance for Admins" and "⚠️ Important Notes"
and either (a) change the heading text to remove the emoji or (b) replace the
TOC entries "- [🛡️ Guidance for Admins](`#-guidance-for-admins`)" and "- [⚠️
Important Notes](`#-important-notes`)" with the correct GitHub anchor strings that
match the rendered headings.

In `@smart-notes/rag_mvp/embeddings/chunker.py`:
- Around line 9-29: The chunk_text function can infinite-loop when overlap >=
max_length because start = end - overlap may not advance; inside chunk_text
validate and guard the overlap parameter (e.g., if overlap >= max_length raise
ValueError or clamp overlap to max(0, max_length-1)) before the loop, or ensure
the loop advances by forcing start = max(end - overlap, start + 1); update the
logic around the start/end calculation in chunk_text to guarantee progress on
each iteration.
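
A minimal sketch of the guard described above, keeping the parameter names from the prompt (text, max_length, overlap); this is not the PR's actual chunker.py code, and the same guard applies to the duplicate chunk_text in pipelines/embedding_pipeline.py noted below.

```python
# Sketch only: a chunk_text with the overlap guard suggested above.
# The max_length=500 default follows the review's description of chunker.py.
def chunk_text(text: str, max_length: int = 500, overlap: int = 50) -> list[str]:
    if overlap >= max_length:
        overlap = max_length - 1  # clamp; raising ValueError is the stricter alternative
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_length, len(text))
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end == len(text):
            break  # reached the tail; stop instead of re-chunking the same suffix
        start = max(end - overlap, start + 1)  # guarantees forward progress
    return chunks
```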

In `@smart-notes/rag_mvp/embeddings/indexer.py`:
- Around line 23-28: The add method currently allows mismatched inputs which
will corrupt the FAISS index; before calling self.index.add and
self.texts.extend (in the add(self, embeddings: np.ndarray, chunks: List[str])
method) validate that embeddings is a 2D array and that embeddings.shape[0] ==
len(chunks); if embeddings is empty keep the existing early return, otherwise
raise a ValueError with a clear message when the row count and chunks length
differ so the index and self.texts cannot go out of sync; keep using the same
symbols (embeddings, chunks, self.index.add, self.texts.extend, search) so
reviewers can locate the change.
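
A stripped-down stand-in for the validation described above; the class shape and the flat L2 FAISS index are assumptions for illustration, not the PR's actual VectorIndexer.

```python
# Sketch of the add() validation described above (illustrative class, not the PR's).
from typing import List
import numpy as np
import faiss

class SafeIndexer:
    def __init__(self, dim: int):
        self.index = faiss.IndexFlatL2(dim)
        self.texts: List[str] = []

    def add(self, embeddings: np.ndarray, chunks: List[str]) -> None:
        if embeddings is None or len(embeddings) == 0:
            return  # mirror the existing early return for empty input
        embeddings = np.asarray(embeddings, dtype="float32")
        if embeddings.ndim != 2 or embeddings.shape[0] != len(chunks):
            raise ValueError(
                f"got embeddings with shape {embeddings.shape} for {len(chunks)} chunks; "
                "refusing to let the FAISS index and stored texts go out of sync"
            )
        self.index.add(embeddings)
        self.texts.extend(chunks)
```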

In `@smart-notes/rag_mvp/pipelines/embedding_pipeline.py`:
- Line 10: The SentenceTransformer instance is using a hardcoded Windows-only
cache_folder ("D:/models_cache") which will break on other OSes; update the
SentenceTransformer initialization in the embedding pipeline (where self.model
is set) to derive the cache folder from a portable source—preferably use an
environment variable (e.g., MODEL_CACHE_DIR) falling back to a user-relative
path (expand ~) or a relative directory using os.path.expanduser/os.path.join or
pathlib.Path—and add the required import for os (or pathlib) at the top of the
file so the cache path is platform-independent and configurable (see the sketch after this list).
- Around line 14-24: The chunk_text method can infinite-loop when overlap >=
max_length and duplicates logic in embeddings/chunker.py; update chunk_text in
class (method name chunk_text) to validate inputs and ensure progress: raise or
clamp overlap to be less than max_length (or set overlap = min(overlap,
max_length-1)) and ensure start increases (e.g., compute next_start = max(start
+ 1, end - overlap)) so the while loop always advances; alternatively, remove
this method and import/reuse the existing chunk_text implementation from
embeddings/chunker.py (which uses max_length=500) to avoid duplication.
- Around line 41-46: The semantic_search method currently assumes self.index and
valid indices; first add a guard at the start of semantic_search to raise a
clear error if self.index is None (hint: reference build_index and process_notes
which populate the index) so callers know to run process_notes/build_index
first, and then sanitize FAISS indices before indexing self.chunks by mapping
the returned indices from self.index.search to results only when index >= 0 and
< len(self.chunks) (for any -1 or out-of-range values, skip or return
None/placeholders) to avoid silently returning the last chunk via negative
indexing.
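
For the first item above (the hardcoded cache folder), a portable initialization might look like the following sketch; MODEL_CACHE_DIR comes from the prompt, while the model name and fallback directory are placeholders.

```python
# Sketch of a portable cache path for the SentenceTransformer model.
# MODEL_CACHE_DIR is the env var suggested in the prompt; the fallback is illustrative.
import os
from pathlib import Path
from sentence_transformers import SentenceTransformer

cache_dir = os.environ.get(
    "MODEL_CACHE_DIR",
    str(Path.home() / ".cache" / "smart_notes_models"),
)
model = SentenceTransformer("all-MiniLM-L6-v2", cache_folder=cache_dir)
```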

In `@smart-notes/rag_mvp/qa_cli.py`:
- Around line 4-5: Fix the typo in the top-of-file comment: change
"emedding-pipeline-chunking concept" to "embedding-pipeline-chunking concept"
near the import of EmbeddingPipeline (class/import reference: EmbeddingPipeline
from rag_mvp.pipelines.embedding_pipeline) so the comment correctly spells
"embedding".
- Around line 63-82: In search_notes, avoid substring matches and silently
returning nothing: instead of checking "word in sentence_lower", perform
word-boundary matching (e.g., build a regex using r'\b{re.escape(word)}\b' for
each word or split sentence_lower into tokens and compare token sets) so "ai"
won't match inside other words; also handle the case where query_words becomes
empty (all tokens in QUESTION_WORDS) by returning an explicit informative result
or raising/returning a sentinel (e.g., a message like "no query terms after
removing stopwords") rather than silently returning an empty list; update
references in this function (search_notes), using QUESTION_WORDS and
split_sentences, and ensure you normalize casing and strip punctuation when
tokenizing.

In `@smart-notes/rag_mvp/README.md`:
- Around line 28-44: The README contains an unclosed fenced code block that
starts with "```bash" around the example run command; close that fence
immediately after the run command by adding a terminating "```" so the following
sections (sample output, the second heading, and the project structure) are not
swallowed into the code block, and move or reformat the sample CLI output (the
lines beginning with ">>" and the "[1] From test.md:" result) into its own
fenced block or plain paragraph so the "How to run" section and other headings
render correctly.
- Around line 75-84: The README's project tree is inaccurate: it lists embed.py,
index.py, and utils.py but the codebase actually contains
embeddings/embedder.py, embeddings/indexer.py, embeddings/chunker.py, and
pipelines/embedding_pipeline.py; update the README to either mirror the real
layout (replace embed.py → embeddings/embedder.py, index.py →
embeddings/indexer.py, add embeddings/chunker.py and
pipelines/embedding_pipeline.py) or provide a brief mapping note explaining
which modules correspond to the listed conceptual files so contributors can find
embedder, indexer, chunker and embedding_pipeline easily.
🧹 Nitpick comments (5)
smart-notes/rag_mvp/embeddings/chunker.py (1)

9-29: Duplicate chunking logic with EmbeddingPipeline.chunk_text.

A nearly identical chunk_text exists in smart-notes/rag_mvp/pipelines/embedding_pipeline.py (lines 13–23) with different defaults (max_length=300) and without the .strip()/empty-check refinements. Consider having the pipeline delegate to this standalone function to avoid divergence.

smart-notes/rag_mvp/embeddings/indexer.py (1)

34-34: Prefix unused distances with underscore.

Per Ruff RUF059, distances is never used.

-        distances, indices = self.index.search(query_embedding.reshape(1, -1), k)
+        _distances, indices = self.index.search(query_embedding.reshape(1, -1), k)
smart-notes/rag_mvp/pipelines/embedding_pipeline.py (1)

8-12: EmbeddingPipeline re-implements chunking, embedding, and indexing already provided by sibling modules.

The PR introduces embeddings/chunker.py, embeddings/embedder.py, and embeddings/indexer.py with the same responsibilities. This pipeline should compose those modules rather than duplicate their logic, to avoid divergence (e.g., different default max_length values: 300 here vs 500 in chunker.py).

smart-notes/rag_mvp/qa_cli.py (2)

85-88: demo_embeddings_pipeline() loads a heavy ML model unconditionally on every CLI start.

This forces a model download/load even when the user only wants the keyword-based note search. Gate it behind a flag or make it opt-in.

Example
 if __name__ == "__main__":
-
-    demo_embeddings_pipeline()      # Temporary demo for embeddings pipeline
+    import sys
+    if "--demo-embeddings" in sys.argv:
+        demo_embeddings_pipeline()

16-16: Unused variable embeddings.

Per Ruff RUF059, prefix with underscore since it's only used for the side effect.

-    chunks, embeddings = pipeline.process_notes(note_text)
+    chunks, _embeddings = pipeline.process_notes(note_text)

Comment on lines +13 to +16
Offline-first experience :
NO Internet,
NO Cloud Serives,
NO Server

⚠️ Potential issue | 🟡 Minor

Typo: "Serives" → "Services" (Line 15).

-    NO Cloud Serives,
+    NO Cloud Services,


The app maintains:

"in-memory index " and "local db(SQLite)

⚠️ Potential issue | 🟡 Minor

Unclosed quotation mark.

The line has a mismatched quote: "local db(SQLite) is missing its closing ".

-  "in-memory index " and "local db(SQLite) 
+  "in-memory index" and "local db (SQLite)"




### 12. Summary

⚠️ Potential issue | 🟡 Minor

Section numbering skips from 10 to 12 — section 11 is missing.

-### 12. Summary
+### 11. Summary

(And renumber section 13 → 12 accordingly on Line 176.)


@@ -0,0 +1,203 @@
###### Discord–GitHub Sync Bot

⚠️ Potential issue | 🟡 Minor

Heading level: use # (H1) for the document title, not ###### (H6).

###### renders as the smallest heading. For a top-level document title, use #.

-###### Discord–GitHub Sync Bot
+# Discord–GitHub Sync Bot

Comment on lines +104 to +108
Action Points
PR merged-> +10
Issue closed-> +5
PR review-> +7
Documentation PR-> +3

⚠️ Potential issue | 🟡 Minor

Contribution scoring table is not valid Markdown.

The table on these lines won't render correctly. Use proper Markdown table syntax.

Proposed fix
-Action	Points
-PR merged-> 	+10
-Issue closed-> 	+5
-PR review-> 	+7
-Documentation PR-> 	+3
+| Action           | Points |
+|------------------|--------|
+| PR merged        | +10    |
+| Issue closed     | +5     |
+| PR review        | +7     |
+| Documentation PR | +3     |

Comment on lines +41 to +46
    def semantic_search(self, query, top_k=3):
        query_vec = self.model.encode([query])
        query_vec = np.array(query_vec).astype("float32")

        distances, indices = self.index.search(query_vec, top_k)
        results = [self.chunks[i] for i in indices[0]]

⚠️ Potential issue | 🟠 Major

semantic_search crashes if called before process_notes, and can index into self.chunks with -1.

Two issues:

  1. self.index is None until build_index is called — calling semantic_search first will raise AttributeError.
  2. FAISS returns -1 for indices when fewer than top_k results exist. self.chunks[-1] silently returns the last chunk instead of indicating "no result."
Proposed fix
     def semantic_search(self, query, top_k=3):
+        if self.index is None:
+            raise RuntimeError("Index not built. Call process_notes() first.")
         query_vec = self.model.encode([query])
         query_vec = np.array(query_vec).astype("float32")
 
-        distances, indices = self.index.search(query_vec, top_k)
-        results = [self.chunks[i] for i in indices[0]]
+        _distances, indices = self.index.search(query_vec, top_k)
+        results = [self.chunks[i] for i in indices[0] if i != -1]
         return results
🧰 Tools
🪛 Ruff (0.15.0)

[warning] 45-45: Unpacked variable distances is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


Comment on lines +4 to +5
#-------------------emedding-pipeline-chunking concept
from rag_mvp.pipelines.embedding_pipeline import EmbeddingPipeline

⚠️ Potential issue | 🟡 Minor

Typo in comment: "emedding" → "embedding".

-#-------------------emedding-pipeline-chunking concept
+#-------------------embedding-pipeline-chunking concept

Comment on lines +63 to +82
def search_notes(query, notes):
    results = []

    query_words = [
        word.lower()
        for word in query.split()
        if word.lower() not in QUESTION_WORDS
    ]

    for note in notes:
        sentences = split_sentences(note["content"])
        for sentence in sentences:
            sentence_lower = sentence.lower()
            if any(word in sentence_lower for word in query_words):
                results.append({
                    "filename": note["filename"],
                    "sentence": sentence.strip()
                })

    return results

⚠️ Potential issue | 🟠 Major

Keyword search uses substring matching, producing false positives; also silently returns nothing if all words are stopwords.

Two issues:

  1. word in sentence_lower matches substrings — e.g., query word "ai" matches "ertainly", "is" matches "this". Use word-boundary matching instead.
  2. If every word in the query is in QUESTION_WORDS, query_words becomes empty, any(...) returns False, and the user gets "No relevant notes found" with no explanation.
Proposed fix sketch
+import re
+
 def search_notes(query, notes):
     results = []
 
     query_words = [
         word.lower()
         for word in query.split()
         if word.lower() not in QUESTION_WORDS
     ]
 
+    if not query_words:
+        return results  # or warn the user
+
     for note in notes:
         sentences = split_sentences(note["content"])
         for sentence in sentences:
             sentence_lower = sentence.lower()
-            if any(word in sentence_lower for word in query_words):
+            if any(re.search(r'\b' + re.escape(word) + r'\b', sentence_lower) for word in query_words):
                 results.append({
                     "filename": note["filename"],
                     "sentence": sentence.strip()
                 })
 
     return results

Comment on lines +28 to +44
```bash
python smart-notes/rag_mvp/qa_cli.py



>> what is AI

[1] From test.md:
Artificial Intelligence (AI) is the simulation of human intelligence in machines.


>> what is machine learning
how is machine learning used
difference between AI and ML




⚠️ Potential issue | 🟠 Major

Broken markdown: unclosed code block merges the "How to run" section with sample output and the second half of the document.

The ```bash block starting at line 28 is never closed. Everything after it — including sample queries, the second README heading, and the project structure — renders as one continuous code block.

Add the closing ``` after the run command and format sample output separately.


Comment on lines +75 to +84
```bash
smart-notes/
├── rag_mvp/
│ ├── embed.py # Embedding logic
│ ├── index.py # FAISS index creation
│ ├── qa_cli.py # CLI for asking questions
│ └── utils.py # Helper functions
├── notes/ # Put your .txt notes here
├── requirements.txt
└── README.md

⚠️ Potential issue | 🟡 Minor

Project structure lists filenames that don't match the actual modules.

The tree references embed.py, index.py, and utils.py, but the actual files are embeddings/embedder.py, embeddings/indexer.py, embeddings/chunker.py, and pipelines/embedding_pipeline.py. This will confuse contributors.

