I'm Nishita. This project develops a cutting-edge search engine that allows developers to query a codebase using natural language. Unlike traditional keyword search (like grep or basic full-text indexing), this system leverages vector embeddings to understand the meaning and intent of a query, returning code snippets that are semantically relevant, even if they don't contain the exact keywords. This significantly boosts developer productivity and code comprehension.
- Natural Language Querying: Enables users to ask questions like "Where is the user authentication logic handled?" and receive highly accurate code matches.
- Vector-Based Retrieval: Implemented high-performance similarity search using FAISS (Facebook AI Similarity Search) and Sentence-BERT embeddings.
- Performance Improvement: Demonstrated a 35% increase in Mean Reciprocal Rank (MRR) compared to a conventional keyword (TF-IDF) search baseline.
- Low Latency: Optimized the retrieval pipeline to ensure fast results, achieving an average search latency of under 500ms.
| Component | Technology | Purpose |
|---|---|---|
| Backend & API | Python (Flask) | REST API to handle natural language query requests and manage the search pipeline. |
| Semantic Embedding | Sentence-BERT | Used to transform both the codebase segments and the user's natural language query into high-dimensional vectors. |
| Vector Database | FAISS / Chroma | High-speed indexing and nearest-neighbor search over the generated vector embeddings. |
| Initial Indexing | Elasticsearch (Optional) | Used for initial text-based indexing and metadata storage prior to vectorization. |
- Python 3.8+
pip
- Clone the repository:
git clone https://github.com/nish941/Codebase-Semantic-Search-Engine-Vector-Based-Retrieval cd codebase-semantic-search-engine - Install dependencies:
pip install -r requirements.txt # Note: FAISS requires specific installation steps, refer to its documentation. - Set up the vector index:
- Run the indexing script to parse the source code and generate embeddings:
python index_codebase.py --repo-path [path/to/target/code]
- This process uses Sentence-BERT to embed code fragments and stores them in a FAISS index file.
- Run the indexing script to parse the source code and generate embeddings:
- Start the Flask API server:
python app.py
- Send a semantic search request to the API (e.g., using
curlor Postman):curl -X POST [http://127.0.0.1:5000/search](http://127.0.0.1:5000/search) \ -H "Content-Type: application/json" \ -d '{"query": "functions for handling token validation"}'
- Duration: April 2024 – June 2024
- Subject Relevance: Natural Language Processing (NLP), Information Retrieval, Machine Learning Engineering