A high-performance semantic search system for the PubMed 200k RCT dataset, featuring sub-100ms search latency across 195,000+ medical research abstracts.
- Python 3.9+
- Docker Desktop
- 4GB RAM minimum
- 2GB disk space
Simply run the setup script after cloning:
# Clone and setup
git clone <repository-url>
cd document_search
python3 setup.py

The script will automatically:
- ✅ Check prerequisites
- ✅ Install Python dependencies
- ✅ Start Docker and Qdrant
- ✅ Download PubMed dataset
- ✅ Create embeddings and index
- ✅ Verify everything works
Then just run:
streamlit run app.py

If you prefer step-by-step setup:
# 1. Clone the repository
git clone <repository-url>
cd document_search
# 2. Install dependencies
pip install -e .
# 3. Start Docker and Qdrant
docker-compose up -d qdrant
# 4. Download PubMed dataset
python src/data/download_and_prepare.py --size 20k
# 5. Index documents
python scripts/index_pubmed_data.py
# 6. Launch the web interface
streamlit run app.py

# Skip dependency installation
python3 setup.py --skip-deps
# Skip Docker/Qdrant setup
python3 setup.py --skip-docker
# Force re-indexing
python3 setup.py --force-index

Open http://localhost:8501 in your browser to start searching!
- ⚡ Fast Search: 2-100ms response time for semantic search
- 📊 Large Scale: Handles 195,000+ medical research abstracts
- 🧠 AI-Powered: Sentence transformers for semantic understanding
- 📱 Web Interface: Beautiful Streamlit UI with real-time results
- 🔧 CLI Tools: Command-line interface for scripting
- 🐳 Dockerized: Qdrant runs in Docker for easy deployment
- Semantic similarity search
- Section-based filtering (Background, Methods, Results, etc.)
- Relevance scoring and ranking
- Search history tracking
- Performance metrics display
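
Section-based filtering and relevance ranking can be sketched on plain result dictionaries. This is an illustrative assumption, not the project's code: the `section` key, the `filter_by_section` helper, and the sample records are hypothetical, while `score`, `abstract_id`, and `content` mirror the result fields used in the Python API example. In a real deployment the filter would more likely be applied inside the vector database than after the search.

```python
def filter_by_section(results, section, limit=5):
    """Keep results from one abstract section, best score first.

    Assumes each result is a dict with a 'score' and a (hypothetical)
    'section' label such as 'BACKGROUND', 'METHODS' or 'RESULTS'.
    """
    wanted = section.strip().lower()
    matching = [r for r in results if r.get("section", "").lower() == wanted]
    # Higher score means more relevant, so sort in descending order.
    matching.sort(key=lambda r: r["score"], reverse=True)
    return matching[:limit]


# Placeholder records for illustration only, not real search output.
sample_results = [
    {"abstract_id": "a1", "section": "METHODS", "score": 0.61, "content": "..."},
    {"abstract_id": "a2", "section": "RESULTS", "score": 0.74, "content": "..."},
    {"abstract_id": "a3", "section": "METHODS", "score": 0.82, "content": "..."},
]

methods_only = filter_by_section(sample_results, "Methods")
for result in methods_only:
    print(f"{result['abstract_id']}: {result['score']:.2f}")
```

The comparison is case-insensitive, so `"Methods"` and `"METHODS"` select the same records.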
All documentation is organized in the docs/ folder:
- Getting Started Guide - Setup and first steps
- Architecture Overview - System design and components
- API Reference - Python API documentation
- Configuration Guide - config.yaml settings
- File Documentation - Detailed file descriptions
streamlit run app.py

# Interactive search
python scripts/fast_search.py --interactive
# Single query
python scripts/fast_search.py "HIV treatment effectiveness"
# Benchmark performance
python scripts/fast_search.py "diabetes" --benchmark

from scripts.index_pubmed_data import PubMedIndexer
# Initialize
indexer = PubMedIndexer(host="localhost", port=6333)
# Search
results = indexer.search("cancer immunotherapy", limit=5)
for result in results:
    print(f"Score: {result['score']:.4f}")
    print(f"Abstract: {result['abstract_id']}")
    print(f"Content: {result['content'][:200]}...")

User Query → Embedding Model → Vector Search (Qdrant) → Ranked Results
    ↓               ↓                    ↓                    ↓
Streamlit     Sentence-BERT       Docker Container          Web UI
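
The flow above can be illustrated with a minimal, dependency-free sketch. It is not the project's implementation: the three-dimensional toy vectors stand in for real sentence embeddings, and the brute-force cosine ranking stands in for the vector search that Qdrant performs. The names `cosine_similarity`, `rank`, and `toy_corpus` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def rank(query_vector, corpus, limit=5):
    """Score every document against the query and return the top matches."""
    scored = [
        {"abstract_id": doc_id, "score": cosine_similarity(query_vector, vector)}
        for doc_id, vector in corpus.items()
    ]
    scored.sort(key=lambda r: r["score"], reverse=True)
    return scored[:limit]


# Toy vectors for illustration; a real embedding model would produce these.
toy_corpus = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.0, 1.0, 0.0],
    "doc-c": [0.7, 0.7, 0.0],
}

results = rank([1.0, 0.1, 0.0], toy_corpus, limit=2)
for result in results:
    print(f"{result['abstract_id']}: {result['score']:.4f}")
```

A dedicated vector database avoids scoring every document on each query, which is what makes low-latency search feasible at the scale of the full dataset.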
- Vector Database: Qdrant (Dockerized)
- Embedding Model: all-MiniLM-L6-v2 (384 dimensions)
- Dataset: PubMed 200k RCT (medical research abstracts)
- Storage: ~1.5GB for full dataset
| Metric | Value |
|---|---|
| Search Latency | 2-100ms |
| Documents | 195,654 |
| Indexing Speed | ~550 docs/sec |
| Model Load Time | ~1.5 seconds |
| Vector Dimensions | 384 |
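
The figures in the table allow two back-of-the-envelope estimates. The sketch below assumes that vectors are stored as 4-byte floats (an assumption; the storage format is not stated here) and that the indexing rate stays constant over the whole run.

```python
DOCUMENTS = 195_654        # from the table
DOCS_PER_SECOND = 550      # approximate indexing speed from the table
VECTOR_DIMENSIONS = 384    # from the table
BYTES_PER_FLOAT32 = 4      # assumption: vectors are stored as float32

# Time to index every document at a constant rate.
indexing_seconds = DOCUMENTS / DOCS_PER_SECOND

# Size of the raw vectors alone, excluding text payloads and index overhead.
raw_vector_bytes = DOCUMENTS * VECTOR_DIMENSIONS * BYTES_PER_FLOAT32

print(f"Estimated indexing time: {indexing_seconds / 60:.1f} minutes")
print(f"Raw vector payload: {raw_vector_bytes / 1e6:.1f} MB")
```

Under these assumptions a full index takes roughly six minutes, and the raw vectors account for about 300 MB of the ~1.5GB storage figure.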
document_search/
├── app.py # Streamlit web interface
├── scripts/ # Utility scripts
│ ├── fast_search.py # Optimized search CLI
│ └── index_pubmed_data.py # Document indexer
├── src/ # Source code
│ ├── data/ # Data downloaders & processors
│ ├── config/ # Configuration management
│ └── core/ # Core RAG system
├── docs/ # Documentation
├── config.yaml # Main configuration
└── docker-compose.yml # Docker services
- qdrant-client - Vector database client
- sentence-transformers - Embedding generation
- streamlit - Web interface
- pandas - Data processing
- docker - Container management
Contributions are welcome! Please feel free to submit issues or pull requests.
This project is licensed under the MIT License - see the LICENSE file for details.
- PubMed for the 200k RCT dataset
- Qdrant for the vector database
- Hugging Face for sentence transformers
- Streamlit for the web framework
Need help? Check the documentation or open an issue!