A hybrid information retrieval system for searching Wikipedia articles. It runs a lexical search (BM25) and a semantic search in parallel, then merges the two rankings into a single result list with Reciprocal Rank Fusion (RRF).
- Hybrid Search: Combines BM25 (lexical) and semantic search using Jina embeddings
- Reciprocal Rank Fusion: Merges results from multiple retrieval methods for improved relevance
- FastAPI Backend: RESTful API for search functionality
- Web Interface: Clean, responsive search interface
- Pre-indexed Data: Includes Wikipedia dataset with pre-built search indices
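
To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion in its commonly used form, where each document scores the sum of `1 / (rrf_k + rank)` over every ranking it appears in. It is not taken from this project's code: the function name, the `rrf_k=60` default, and the tie-breaking rule are assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, rrf_k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    rankings: iterable of lists, each ordered best-first.
    rrf_k: smoothing constant (60 is a common default; assumed here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit contributes 1 / (rrf_k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rrf_k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], str(d)))


# Hypothetical document IDs returned by the two retrievers.
bm25_hits = ["a", "b", "c"]
semantic_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([bm25_hits, semantic_hits])
```

Documents that rank well in both lists ("b" and "a" here) rise above documents found by only one retriever, without having to put BM25 scores and embedding similarities on a common scale.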
- Install dependencies: `pip install -r requirements.txt`
- Set up environment variables: create a `.env` file with your Jina API key: `JINA_API_KEY=your_jina_api_key_here`
- Ensure data files are present:
  - `wiki_dataset.csv`: Wikipedia articles dataset
  - `bm25_index_content/`: Pre-built BM25 index
  - `semantic_full.usearch`: Pre-built semantic index
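
Before starting the server, it can help to confirm that the setup is complete. The following is a small preflight sketch, not part of the project: `missing_paths` and `env_file_has_key` are hypothetical helpers, and the `.env` parsing is deliberately simplified (it only looks for a plain `JINA_API_KEY=value` line and ignores quoting and `export` prefixes).

```python
from pathlib import Path

# File and directory names as listed in the setup steps.
REQUIRED_PATHS = ["wiki_dataset.csv", "bm25_index_content", "semantic_full.usearch"]


def missing_paths(base_dir="."):
    """Return the required data files or directories not found under base_dir."""
    base = Path(base_dir)
    return [name for name in REQUIRED_PATHS if not (base / name).exists()]


def env_file_has_key(base_dir=".", key="JINA_API_KEY"):
    """Return True if base_dir/.env has a non-empty `key=value` line."""
    env_path = Path(base_dir) / ".env"
    if not env_path.is_file():
        return False
    for line in env_path.read_text(encoding="utf-8").splitlines():
        name, sep, value = line.strip().partition("=")
        if sep and name.strip() == key and value.strip():
            return True
    return False
```

Running `missing_paths()` and `env_file_has_key()` from the project root should return an empty list and `True` respectively once the setup is complete.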
- Start the API server: `uvicorn retriever:app --reload`. The API will be available at http://localhost:8000
- Serve the web interface: `python -m http.server 3000`. The search interface will be available at http://localhost:3000
- `POST /search`: Search articles. Request body: `{ "query": "your search query", "k": 3 }`
- `GET /document/{doc_id}`: Get document by ID
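
The two endpoints can be called from Python using only the standard library. This is a sketch under assumptions: the base URL is the local development address given above, the request body is sent as JSON, and the responses are assumed to be JSON (FastAPI's default). The response schema is not documented here, so the helpers return the decoded payload unchanged. The function names are illustrative.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "http://localhost:8000"  # local development address


def build_search_request(query, k=3, base_url=API_BASE):
    """Build the POST /search request without sending it."""
    body = json.dumps({"query": query, "k": k}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query, k=3, base_url=API_BASE):
    """Send the search request and return the decoded JSON response."""
    with urllib.request.urlopen(build_search_request(query, k, base_url)) as resp:
        return json.loads(resp.read().decode("utf-8"))


def get_document(doc_id, base_url=API_BASE):
    """Fetch one document via GET /document/{doc_id}."""
    quoted = urllib.parse.quote(str(doc_id), safe="")
    with urllib.request.urlopen(f"{base_url}/document/{quoted}") as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `search("history of computing", k=3)` sends the same payload as the example body above, and `get_document(...)` retrieves a single article by the ID the API uses.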