SankarSubbayya/document_search

PubMed Semantic Search System

A high-performance semantic search system for the PubMed 200k RCT dataset, featuring sub-100ms search latency across 195,000+ medical research abstracts.

Python 3.9+ Streamlit Qdrant Docker

🚀 Quick Start

Prerequisites

  • Python 3.9+
  • Docker Desktop
  • 4GB RAM minimum
  • 2GB disk space

Automated Setup (Recommended)

Simply run the setup script after cloning:

# Clone and setup
git clone <repository-url>
cd document_search
python3 setup.py

The script will automatically:

  • ✅ Check prerequisites
  • ✅ Install Python dependencies
  • ✅ Start Docker and Qdrant
  • ✅ Download PubMed dataset
  • ✅ Create embeddings and index
  • ✅ Verify everything works
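The first step in that list, checking prerequisites, can be sketched with the standard library alone. This is a minimal illustration, not the contents of setup.py: the function name `check_prerequisites` is hypothetical, and the checks cover only the Python version and whether a `docker` executable is on the PATH.

```python
import shutil
import sys

MIN_PYTHON = (3, 9)  # taken from the Prerequisites list above


def check_prerequisites(min_python=MIN_PYTHON):
    """Return a dict mapping each check name to True or False.

    Hypothetical helper for illustration; the real setup script may
    perform different or additional checks.
    """
    return {
        # sys.version_info compares element-wise against a plain tuple
        "python": sys.version_info >= min_python,
        # shutil.which returns None when the executable is not on PATH
        "docker": shutil.which("docker") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

A check like this only confirms that Docker is installed, not that the daemon is running, so a real script would likely need a further step before starting Qdrant.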

Then just run:

streamlit run app.py

Manual Setup

If you prefer step-by-step setup:

# 1. Clone the repository
git clone <repository-url>
cd document_search

# 2. Install dependencies
pip install -e .

# 3. Start Docker and Qdrant
docker-compose up -d qdrant

# 4. Download PubMed dataset
python src/data/download_and_prepare.py --size 20k

# 5. Index documents
python scripts/index_pubmed_data.py

# 6. Launch the web interface
streamlit run app.py

Setup Options

# Skip dependency installation
python3 setup.py --skip-deps

# Skip Docker/Qdrant setup
python3 setup.py --skip-docker

# Force re-indexing
python3 setup.py --force-index

Once the app is running, open http://localhost:8501 in your browser to start searching!

🎯 Features

Core Capabilities

  • ⚡ Fast Search: 2-100ms response time for semantic search
  • 📊 Large Scale: Handles 195,000+ medical research abstracts
  • 🧠 AI-Powered: Sentence transformers for semantic understanding
  • 📱 Web Interface: Beautiful Streamlit UI with real-time results
  • 🔧 CLI Tools: Command-line interface for scripting
  • 🐳 Dockerized: Qdrant runs in Docker for easy deployment

Search Features

  • Semantic similarity search
  • Section-based filtering (Background, Methods, Results, etc.)
  • Relevance scoring and ranking
  • Search history tracking
  • Performance metrics display
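To make the first three items concrete, here is a toy, in-memory sketch of section filtering followed by cosine-similarity ranking. It is an assumption-laden illustration: in this project the vector search is delegated to Qdrant, and the function names, the two-dimensional vectors, and the section labels below are made up for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, docs, section=None, limit=5):
    """Filter by section (if given), score by cosine similarity, rank."""
    candidates = [d for d in docs if section is None or d["section"] == section]
    scored = [{**d, "score": cosine(query_vec, d["vector"])} for d in candidates]
    return sorted(scored, key=lambda d: d["score"], reverse=True)[:limit]


# Made-up documents with tiny 2-D vectors (the real model uses 384 dimensions)
docs = [
    {"id": 1, "section": "METHODS", "vector": [1.0, 0.0]},
    {"id": 2, "section": "RESULTS", "vector": [0.6, 0.8]},
    {"id": 3, "section": "RESULTS", "vector": [0.0, 1.0]},
]

hits = search([1.0, 0.0], docs, section="RESULTS")
print([(h["id"], round(h["score"], 2)) for h in hits])
```

With the section filter set to RESULTS, document 1 is excluded even though it is the closest match to the query vector, which is the intended effect of filtering before ranking.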

📖 Documentation

All documentation is organized in the docs/ folder.

💻 Usage Examples

Web Interface

streamlit run app.py

Command Line Search

# Interactive search
python scripts/fast_search.py --interactive

# Single query
python scripts/fast_search.py "HIV treatment effectiveness"

# Benchmark performance
python scripts/fast_search.py "diabetes" --benchmark

Python API

from scripts.index_pubmed_data import PubMedIndexer

# Initialize
indexer = PubMedIndexer(host="localhost", port=6333)

# Search
results = indexer.search("cancer immunotherapy", limit=5)
for result in results:
    print(f"Score: {result['score']:.4f}")
    print(f"Abstract: {result['abstract_id']}")
    print(f"Content: {result['content'][:200]}...")
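If several hits come from the same abstract, it may be useful to keep only the best-scoring one per abstract. The sketch below assumes only the result shape shown above (dicts with `score`, `abstract_id`, and `content` keys); the helper name `best_hit_per_abstract` and the sample values are hypothetical and not part of the project.

```python
def best_hit_per_abstract(results):
    """Keep the highest-scoring hit for each abstract_id, ranked by score."""
    best = {}
    for result in results:
        current = best.get(result["abstract_id"])
        if current is None or result["score"] > current["score"]:
            best[result["abstract_id"]] = result
    return sorted(best.values(), key=lambda r: r["score"], reverse=True)


# Made-up sample results in the shape printed by the example above
sample = [
    {"score": 0.91, "abstract_id": "A1", "content": "first sentence"},
    {"score": 0.85, "abstract_id": "A2", "content": "second sentence"},
    {"score": 0.80, "abstract_id": "A1", "content": "third sentence"},
]

for hit in best_hit_per_abstract(sample):
    print(f"{hit['abstract_id']}: {hit['score']:.2f}")
```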

🏗️ Architecture

User Query → Embedding Model → Vector Search (Qdrant) → Ranked Results
     ↓             ↓                    ↓                     ↓
  Streamlit   Sentence-BERT      Docker Container        Web UI

  • Vector Database: Qdrant (Dockerized)
  • Embedding Model: all-MiniLM-L6-v2 (384 dimensions)
  • Dataset: PubMed 200k RCT (medical research abstracts)
  • Storage: ~1.5GB for full dataset
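As a rough sanity check on these numbers, the raw size of the embeddings can be estimated from the document count and the vector dimension. The sketch assumes each component is stored as a 4-byte float32, which the source does not state, and uses the document count from the Performance section.

```python
NUM_DOCS = 195_654       # document count from the Performance section
DIMENSIONS = 384         # all-MiniLM-L6-v2 embedding size
BYTES_PER_VALUE = 4      # assumption: float32 storage


def raw_vector_bytes(num_docs, dims, bytes_per_value=BYTES_PER_VALUE):
    """Size of the bare embedding matrix, ignoring payloads and indexes."""
    return num_docs * dims * bytes_per_value


total = raw_vector_bytes(NUM_DOCS, DIMENSIONS)
print(f"{total:,} bytes (~{total / 1e6:.1f} MB)")
# prints "300,524,544 bytes (~300.5 MB)"
```

Under that assumption the vectors alone account for about 300 MB. The ~1.5GB figure is larger, which would be consistent with stored text payloads and index structures, though the source does not break the total down.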

📊 Performance

Metric               Value
Search Latency       2-100ms
Documents            195,654
Indexing Speed       ~550 docs/sec
Model Load Time      ~1.5 seconds
Vector Dimensions    384
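The figures above imply an approximate time to index the full dataset. The calculation below is a back-of-the-envelope estimate that treats ~550 docs/sec as a constant rate, which is an assumption; real throughput will vary with hardware and batch size.

```python
NUM_DOCS = 195_654        # "Documents" value above
DOCS_PER_SECOND = 550     # "Indexing Speed" value above, treated as constant


def estimated_indexing_seconds(num_docs, docs_per_second):
    """Estimated wall-clock seconds to index num_docs at a fixed rate."""
    return num_docs / docs_per_second


seconds = estimated_indexing_seconds(NUM_DOCS, DOCS_PER_SECOND)
print(f"{seconds:.0f} s (~{seconds / 60:.1f} min)")
# prints "356 s (~5.9 min)"
```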

🛠️ Project Structure

document_search/
├── app.py                 # Streamlit web interface
├── scripts/              # Utility scripts
│   ├── fast_search.py    # Optimized search CLI
│   └── index_pubmed_data.py  # Document indexer
├── src/                  # Source code
│   ├── data/            # Data downloaders & processors
│   ├── config/          # Configuration management
│   └── core/            # Core RAG system
├── docs/                # Documentation
├── config.yaml          # Main configuration
└── docker-compose.yml   # Docker services

📦 Key Dependencies

  • qdrant-client - Vector database client
  • sentence-transformers - Embedding generation
  • streamlit - Web interface
  • pandas - Data processing
  • docker - Container management

🤝 Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • PubMed for the 200k RCT dataset
  • Qdrant for the vector database
  • Hugging Face for sentence transformers
  • Streamlit for the web framework

Need help? Check the documentation or open an issue!

About

RAG based semantic search
