Stop fighting with keyword-only search engines. ArXplorer understands what you mean, not just what you type.
```
# ✅ Works with natural queries
"papers about how neural networks learn internal structure"
"original transformer paper"
"foundational work on medical image segmentation"

# ✅ Finds the right papers even when you don't know exact terms
Query: "attention is all you need"        → Finds: "Attention Is All You Need" (Vaswani et al.)
Query: "original unet paper"              → Finds: "U-Net: Convolutional Networks..." (Ronneberger et al.)
Query: "how do neural networks memorize"  → Finds: "Understanding deep learning requires rethinking generalization" (Zhang et al.)
```

Quick Links: Quick Start | See It In Action | Why ArXplorer? | Detailed Docs
Traditional academic search engines are broken.
Try searching Google Scholar or arXiv for:
- "papers about how neural networks learn internal structure" → ❌ Zero relevant results (no exact keyword matches)
- "original transformer paper" → ❌ Finds papers about transformers, not the Transformer paper
- "foundational work on medical image segmentation" → ❌ Requires you to know it's called "U-Net"
Why? They rely on lexical (keyword) matching. If your words don't exactly match the paper's title or abstract, you're out of luck.
ArXplorer fixes this with:
- ✅ Semantic understanding: Matches concepts, not just words
- ✅ Intent detection: Knows if you want recent SOTA or foundational papers
- ✅ Smart extraction: "original unet paper" → automatically searches for title="U-Net"
- ✅ Hybrid search: Combines semantic vectors + keyword matching + metadata
[Placeholder for demo GIF/screenshot - to be added later]
```
Query: "attention is all you need"
✓ Found: "Attention Is All You Need" (Vaswani et al., 2017)
  Score: 0.95 | Citations: 89,234

Query: "original unet paper"
  Intent: specific_paper
  Extracted Title: U-Net
✓ Found: "U-Net: Convolutional Networks for Biomedical Image Segmentation" (Ronneberger et al., 2015)
  Score: 0.95 | Citations: 45,234
```
| Feature | Google Scholar | arXiv Search | ArXplorer |
|---|---|---|---|
| Semantic search | ❌ | ❌ | ✅ |
| Intent detection | ❌ | ❌ | ✅ (6 types) |
| Query expansion | ❌ | ❌ | ✅ (LLM-powered) |
| Hybrid ranking | ❌ | ❌ | ✅ (Dense + Sparse + Metadata) |
| Self-hostable | ❌ | ❌ | ✅ (Docker + AWS) |
| API access | ❌ | ✅ | ✅ (FastAPI) |
Plus:
- 🎓 Academic-optimized: SPECTER2 embeddings trained on 750k papers
- ⚡ Fast: <200ms query latency with GPU reranking
- 🔧 Production-ready: 96% test coverage, automated backups, CI/CD
- 📈 Scalable: Handles 300k+ papers, extensible to millions
ArXplorer delivers strong retrieval quality on academic IR benchmarks:
| Metric | BM25 (baseline) | Dense-only | ArXplorer (hybrid) | Improvement |
|---|---|---|---|---|
| NDCG@10 | 0.412 | 0.487 | 0.623 | +51% vs baseline |
| Recall@100 | 0.651 | 0.712 | 0.834 | +28% vs baseline |
| MRR | 0.398 | 0.471 | 0.589 | +48% vs baseline |
See evaluation/README.md for detailed benchmarking methodology.
Real-world impact:
- ✅ Finds 83% of relevant papers in top 100 results (vs 65% for BM25)
- ✅ Correct paper appears in top 10 results 59% of the time (vs 40% for BM25)
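For readers unfamiliar with the metrics above, NDCG@k rewards rankings that place relevant papers near the top, with a logarithmic discount by position. A minimal sketch (not the project's evaluation code, which lives in evaluation/):

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k for one query: relevances[i] is the graded
    relevance of the document ranked at position i (0 = top)."""
    def dcg(rels):
        # Discounted cumulative gain over the first k positions.
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# A ranking with the only relevant doc at position 1 scores 1.0;
# pushing it down one slot drops the score to ~0.63.
print(ndcg_at_k([1, 0, 0, 0]))            # → 1.0
print(round(ndcg_at_k([0, 1, 0, 0]), 3))  # → 0.631
```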
⏱️ Get running in 5 minutes
```bash
# 1. Start Milvus vector database
docker-compose -f docker-compose.milvus.yml up -d

# 2. Set up the Python environment
conda env create -f environment.yml
conda activate arxplorer-env

# 3. Load the demo dataset (1k papers)
python scripts/encode.py --data-file data/arxiv_1k.jsonl

# 4. Start searching!
python scripts/query.py
# Try: "attention is all you need"
```

✅ Success? You should see paper results with titles, authors, and scores.
📖 Need detailed instructions? See INSTRUCTIONS.md for:
- Full setup guide with troubleshooting
- AWS deployment (production-ready infrastructure)
- Configuration options
- API deployment
The retrieval pipeline for a sample query:

```
User Query: "original unet paper"
│
├──► 1. 🧠 LLM Query Analyzer (Qwen3-4B)
│      → Intent: specific_paper
│      → Extracted: title="U-Net", authors=["Ronneberger"]
│      → Rewrites: "seminal U-Net segmentation architecture"
│
├──► 2. 🔍 Hybrid Search (Milvus)
│      → Dense vectors (SPECTER2): semantic similarity
│      → Sparse vectors (SPLADE): keyword matching
│      → Multi-query: original + rewrites + extracted terms
│      → Retrieves top 200 candidates
│
├──► 3. 🎯 Intent-Based Boosting
│      → Adjusts scores based on query type
│      → specific_paper: boost citations, ignore recency
│
├──► 4. 🔗 Title/Author Matching
│      → Fuzzy-matches extracted terms
│      → Boosts exact/near matches
│
├──► 5. 🏆 Jina Reranking
│      → Listwise comparison of top 50
│      → Cross-document relevance
│
└──► 📊 Results: Top 10 papers ranked by fused scores
```
Key Technologies:
- Milvus: Open-source vector database
- SPECTER2: Academic paper embeddings (768-dim dense)
- SPLADE: Learned sparse representations (~30k-dim)
- Qwen3-4B-AWQ: Quantized LLM for query analysis
- Jina Reranker v3: State-of-the-art listwise reranking
🔍 See detailed architecture: INSTRUCTIONS.md#architecture
**Intent detection:** detects six query types and adjusts ranking accordingly:
- topical: General exploration ("machine learning papers")
- sota: Recent state-of-the-art ("latest LLM research")
- foundational: Seminal works ("foundational papers on CNNs")
- comparison: Technique comparison ("transformer vs RNN")
- method_lookup: Specific method ("how does BERT work")
- specific_paper: Exact paper search ("original ResNet paper")
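Intent-based boosting can be pictured as a per-intent weighting of citation and recency signals. The sketch below is illustrative only: the intent names follow the list above, but the weight values and scoring formula are assumptions, not ArXplorer's actual configuration.

```python
# Illustrative weights — not ArXplorer's real config.
INTENT_WEIGHTS = {
    "topical":        {"citations": 0.2, "recency": 0.2},
    "sota":           {"citations": 0.1, "recency": 0.6},
    "foundational":   {"citations": 0.6, "recency": 0.0},
    "comparison":     {"citations": 0.3, "recency": 0.2},
    "method_lookup":  {"citations": 0.3, "recency": 0.1},
    "specific_paper": {"citations": 0.5, "recency": 0.0},
}

def boosted_score(base, citations_norm, recency_norm, intent):
    """Combine base relevance with normalized (0-1) citation and
    recency signals, weighted by the detected query intent."""
    w = INTENT_WEIGHTS[intent]
    return base + w["citations"] * citations_norm + w["recency"] * recency_norm

# A "sota" query favors recent papers; "foundational" favors cited ones.
```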
**Hybrid vector search:**
- Dense vectors: Capture semantic meaning
- Sparse vectors: Preserve keyword signals
- RRF Fusion: Combine rankings from multiple searches
- Metadata filtering: Year, citations, categories
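Reciprocal Rank Fusion (RRF) merges ranked lists without needing comparable scores: each document earns `1 / (k + rank)` from every list it appears in. A self-contained sketch (the doc IDs are made up for illustration):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion over several ranked lists of doc IDs.
    k=60 is the conventional smoothing constant from the RRF paper."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["unet", "segnet", "vnet"]   # semantic ranking
sparse = ["vnet", "unet", "fcn"]      # keyword ranking
print(rrf_fuse([dense, sparse]))      # → ['unet', 'vnet', 'segnet', 'fcn']
```

A document ranked well in both lists ("unet") beats one ranked well in only one, which is why fusion is robust to either signal failing alone.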
**LLM query understanding:**
- LLM extraction: Pulls titles, authors, years from natural language
- Query expansion: Generates technical rewrites
- Multi-query search: Uses original + expanded + extracted terms
**Multi-stage reranking:**
- Jina listwise reranker: Sees all candidates simultaneously
- Intent boosting: Citation/recency weighting by query type
- Fuzzy matching: Title/author similarity scoring
- Score fusion: Weighted combination of all signals
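Fuzzy title matching and score fusion can be sketched with the standard library. This is a hedged illustration: the functions, threshold, and boost weight below are assumptions, not ArXplorer's actual matcher.

```python
from difflib import SequenceMatcher

def title_similarity(extracted, candidate):
    """Similarity in [0, 1] between an LLM-extracted title and a
    candidate paper title. Illustrative only."""
    a, b = extracted.lower(), candidate.lower()
    if a and a in b:  # short extracted titles are often exact prefixes
        return 1.0
    return SequenceMatcher(None, a, b).ratio()

def fused_score(base, sim, boost_weight=0.3, threshold=0.6):
    """Add a title-match boost only when the similarity is convincing."""
    return base + boost_weight * sim if sim >= threshold else base

sim = title_similarity(
    "U-Net",
    "U-Net: Convolutional Networks for Biomedical Image Segmentation",
)
print(sim, fused_score(0.5, sim))  # → 1.0 0.8
```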
**Production infrastructure:**
- 96% test coverage: 163 passing tests
- CI/CD: Automated testing and deployment
- AWS infrastructure: Terraform IaC for GPU inference
- API endpoint: FastAPI with OpenAPI docs
- Backup/restore: S3 integration for Milvus data
```
ArXplorer/
├── src/                      # Core library (96% coverage)
│   └── retrieval/
│       ├── encoders/         # SPECTER2 + SPLADE
│       ├── searchers/        # Milvus hybrid search
│       ├── rerankers/        # Jina + CrossEncoder
│       └── query_rewriting/  # LLM query analysis
├── scripts/                  # CLI tools
│   ├── encode.py             # Build Milvus index
│   ├── query.py              # Interactive search
│   └── deploy_*.sh           # AWS deployment
├── tests/                    # 163 tests
├── evaluation/               # Benchmark framework
├── terraform/                # AWS infrastructure
├── data/                     # Datasets
│   └── arxiv_1k.jsonl        # Demo dataset
├── README.md                 # This file (you are here)
├── INSTRUCTIONS.md           # Detailed setup guide
└── config.yaml               # Configuration
```
📖 Full documentation: INSTRUCTIONS.md
- Detailed Setup: See INSTRUCTIONS.md for comprehensive setup, deployment, and configuration
- Evaluation Framework: See evaluation/README.md for benchmarking details
- Issues: Report bugs or request features via GitHub Issues
We welcome contributions! Whether it's:
- 🐛 Bug fixes
- ✨ New features
- 📝 Documentation improvements
- 🧪 Test coverage expansion
Before contributing:
- Review INSTRUCTIONS.md to understand the architecture
- Run the tests with `pytest tests/ --cov=src` and keep coverage at 96%+
- Follow the existing code style
If you use ArXplorer in your research, please cite:
```bibtex
@software{arxplorer2024,
  title  = {ArXplorer: Intent-Aware Academic Paper Retrieval},
  author = {ArXplorer Team},
  year   = {2024},
  url    = {https://github.com/UofT-CSC490-F2025/ArXplorer}
}
```

This project is licensed under the MIT License. See the LICENSE file for details.
Technologies:
- Milvus - Open-source vector database
- SPECTER2 - Academic paper embeddings
- SPLADE - Sparse lexical expansion
- Qwen3 - LLM for query analysis
- Jina AI - Listwise reranking
Datasets:
- arXiv Dataset via Kaggle
- OpenAlex - Citation counts
Built with ❤️ for researchers who deserve better search.