🔍 Codebase Semantic Search Engine (Vector-Based Retrieval)

💡 Overview

I'm Nishita. This project is a search engine that lets developers query a codebase in natural language. Unlike traditional keyword search (such as grep or basic full-text indexing), it uses vector embeddings to capture the meaning and intent of a query, so it can return semantically relevant code snippets even when they do not contain the exact keywords. The goal is to make unfamiliar code faster to navigate and easier to understand.

✨ Key Features & Impact

- Natural Language Querying: Users can ask questions such as "Where is the user authentication logic handled?" and get back the code snippets that best match the intent of the question.
- Vector-Based Retrieval: Similarity search is implemented with FAISS (Facebook AI Similarity Search) over Sentence-BERT embeddings.
- Performance Improvement: Mean Reciprocal Rank (MRR) improved by 35% over a conventional keyword (TF-IDF) search baseline.
- Low Latency: The retrieval pipeline is optimized for speed, with an average search latency under 500 ms.
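
Mean Reciprocal Rank averages, over a set of queries, the reciprocal of the rank at which the first relevant result appears. The following is a minimal sketch of how such a metric could be computed; the function name `mean_reciprocal_rank` and the toy data are hypothetical and are not the project's evaluation code or results.

```python
from typing import Hashable, Sequence, Set


def mean_reciprocal_rank(
    ranked_results: Sequence[Sequence[Hashable]],
    relevant: Sequence[Set[Hashable]],
) -> float:
    """Average of 1/rank of the first relevant hit per query (0 if none)."""
    if len(ranked_results) != len(relevant):
        raise ValueError("need one relevance set per query")
    if not ranked_results:
        return 0.0
    total = 0.0
    for results, gold in zip(ranked_results, relevant):
        for rank, item in enumerate(results, start=1):
            if item in gold:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_results)


# Toy data (illustrative only): ranked snippet ids for two queries.
toy_rankings = [["auth.py", "db.py", "utils.py"], ["db.py", "token.py"]]
toy_relevant = [{"auth.py"}, {"token.py"}]
toy_mrr = mean_reciprocal_rank(toy_rankings, toy_relevant)  # (1/1 + 1/2) / 2
```

Running the same function on the rankings produced by the keyword baseline and by the semantic search, over the same queries, gives two numbers that can be compared directly.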

🛠️ Technical Stack

| Component | Technology | Purpose |
| --- | --- | --- |
| Backend & API | Python (Flask) | REST API to handle natural language query requests and manage the search pipeline. |
| Semantic Embedding | Sentence-BERT | Used to transform both the codebase segments and the user's natural language query into high-dimensional vectors. |
| Vector Database | FAISS / Chroma | High-speed indexing and nearest-neighbor search over the generated vector embeddings. |
| Initial Indexing | Elasticsearch (Optional) | Used for initial text-based indexing and metadata storage prior to vectorization. |
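
The embedding and nearest-neighbor steps in the stack above can be illustrated without the heavy dependencies. This sketch deliberately swaps in hand-written toy vectors for Sentence-BERT output and an exhaustive cosine-similarity scan for a FAISS index, so it shows the idea only, not the project's implementation. All names (`cosine_similarity`, `top_k`, `toy_index`) are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    index: Dict[str, Sequence[float]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings" (assumed values, for illustration only).
toy_index = {
    "validate_token": [0.9, 0.1, 0.0],
    "connect_db": [0.0, 0.2, 0.9],
    "login_user": [0.7, 0.6, 0.1],
}
toy_query = [1.0, 0.2, 0.0]  # stands in for an embedded natural language query
hits = top_k(toy_query, toy_index, k=2)
```

An exhaustive scan like this costs time linear in the number of fragments per query; a dedicated vector index exists to make the same lookup fast on large codebases.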

🚀 Getting Started

Prerequisites

Python 3.8+
pip

Installation

Clone the repository:

git clone https://github.com/nish941/Codebase-Semantic-Search-Engine-Vector-Based-Retrieval
cd Codebase-Semantic-Search-Engine-Vector-Based-Retrieval

Install dependencies:

pip install -r requirements.txt
# Note: FAISS may need separate installation steps (CPU and GPU builds are packaged separately); refer to its documentation.

Set up the vector index:
- Run the indexing script to parse the source code and generate embeddings:
```
python index_codebase.py --repo-path [path/to/target/code]
```
- This process uses Sentence-BERT to embed code fragments and stores them in a FAISS index file.
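
This README does not describe how `index_codebase.py` splits files into fragments. One plausible approach for Python sources, sketched here as an assumption, is function-level chunking with the standard-library `ast` module; each extracted fragment would then be passed to the embedding model. The helper name `extract_functions` and the sample source are hypothetical.

```python
import ast
from typing import List, Tuple


def extract_functions(source: str) -> List[Tuple[str, str]]:
    """Return (name, source_text) for each function or method in `source`."""
    tree = ast.parse(source)
    fragments = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # get_source_segment (Python 3.8+) recovers the exact source text.
            segment = ast.get_source_segment(source, node)
            if segment is not None:
                fragments.append((node.name, segment))
    return fragments


# Hypothetical input file contents, for illustration only.
sample = '''
def validate_token(token):
    """Check that a token is non-empty."""
    return bool(token)


class Session:
    def refresh(self):
        return "refreshed"
'''

fragments = extract_functions(sample)
for name, text in fragments:
    print(name, len(text))
```

Chunking at function boundaries keeps each embedded fragment self-contained, which tends to suit queries that ask where a particular behavior is implemented.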

Usage

Start the Flask API server:
```
python app.py
```

Send a semantic search request to the API (e.g., using curl or Postman):

curl -X POST http://127.0.0.1:5000/search \
-H "Content-Type: application/json" \
-d '{"query": "functions for handling token validation"}'
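
The same request can be sent from Python using only the standard library. This is a sketch under assumptions: the endpoint and payload come from the curl example above, but the response format is not documented in this README, so the code simply assumes a JSON body and returns it decoded. The function names are hypothetical.

```python
import json
import urllib.request

SEARCH_URL = "http://127.0.0.1:5000/search"  # endpoint from the curl example


def build_search_request(query: str, url: str = SEARCH_URL) -> urllib.request.Request:
    """Build a POST request carrying the query as a JSON body."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query: str, timeout: float = 5.0):
    """Send the request and decode the reply.

    Requires the Flask server to be running; assumes the reply is JSON.
    """
    request = build_search_request(query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server started via `python app.py`, calling `search("functions for handling token validation")` would issue the same request as the curl command.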

🗓️ Project Timeline

Duration: April 2024 – June 2024
Subject Relevance: Natural Language Processing (NLP), Information Retrieval, Machine Learning Engineering
