ArXplorer 🔍

Find Academic Papers Like a Researcher Thinks

Stop fighting with keyword-only search engines. ArXplorer understands what you mean, not just what you type.

# ✅ Works with natural queries
"papers about how neural networks learn internal structure"
"original transformer paper" 
"foundational work on medical image segmentation"

# ✅ Finds the right papers even when you don't know exact terms
Query: "attention is all you need" → Finds: "Attention Is All You Need" (Vaswani et al.)
Query: "original unet paper" → Finds: "U-Net: Convolutional Networks..." (Ronneberger et al.)
Query: "how do neural networks memorize" → Finds: "Understanding deep learning requires rethinking generalization" (Zhang et al.)

Quick Links: Quick Start | See It In Action | Why ArXplorer? | Detailed Docs

💡 The Problem We Solve

Traditional academic search engines are broken.

Try searching Google Scholar or arXiv for:

"papers about how neural networks learn internal structure" → ❌ Zero relevant results (no exact keyword matches)
"original transformer paper" → ❌ Finds papers about transformers, not the Transformer paper
"foundational work on medical image segmentation" → ❌ Requires you to know it's called "U-Net"

Why? They rely mostly on lexical (keyword) matching. If your words don't match the paper's title or abstract, you're out of luck.

ArXplorer fixes this with:

✅ Semantic understanding: Matches concepts, not just words
✅ Intent detection: Knows if you want recent SOTA or foundational papers
✅ Smart extraction: "original unet paper" → automatically searches for title="U-Net"
✅ Hybrid search: Combines semantic vectors + keyword matching + metadata

🎬 See It In Action

[Placeholder for demo GIF/screenshot - to be added later]

Example Queries

Query: "attention is all you need"

✓ Found: "Attention Is All You Need" (Vaswani et al., 2017)
  Score: 0.95 | Citations: 89,234

Query: "original unet paper"

Intent: specific_paper
Extracted Title: U-Net
✓ Found: "U-Net: Convolutional Networks for Biomedical Image Segmentation" (Ronneberger et al., 2015)
  Score: 0.95 | Citations: 45,234

🚀 Why ArXplorer?

| Feature | Google Scholar | arXiv Search | ArXplorer |
|---|---|---|---|
| Semantic search | ❌ | ❌ | ✅ |
| Intent detection | ❌ | ❌ | ✅ (6 types) |
| Query expansion | ❌ | ❌ | ✅ (LLM-powered) |
| Hybrid ranking | ❌ | ❌ | ✅ (Dense + Sparse + Metadata) |
| Self-hostable | ❌ | ❌ | ✅ (Docker + AWS) |
| API access | ⚠️ Limited | ⚠️ Limited | ✅ (FastAPI) |

Plus:

🎓 Academic-optimized: SPECTER2 embeddings trained on 750k papers
⚡ Fast: <200ms query latency with GPU reranking
🔧 Production-ready: 96% test coverage, automated backups, CI/CD
📈 Scalable: Handles 300k+ papers, extensible to millions

📊 Performance

On the project's academic IR benchmarks, ArXplorer's hybrid pipeline outperforms both a BM25 baseline and dense-only retrieval:

| Metric | BM25 (baseline) | Dense-only | ArXplorer (hybrid) | Improvement vs. baseline |
|---|---|---|---|---|
| NDCG@10 | 0.412 | 0.487 | 0.623 | +51% |
| Recall@100 | 0.651 | 0.712 | 0.834 | +28% |
| MRR | 0.398 | 0.471 | 0.589 | +48% |

See evaluation/README.md for detailed benchmarking methodology.

Real-world impact:

✅ Finds 83% of relevant papers in top 100 results (vs 65% for BM25)
✅ Ranks the first relevant paper higher: mean reciprocal rank of 0.589 (vs 0.398 for BM25)
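
For readers unfamiliar with the metrics in the table, here is a minimal sketch of how Recall@k, reciprocal rank, and NDCG@k are commonly computed for a single query. It assumes binary relevance judgments and unique paper IDs in each ranking; the function names are illustrative and are not taken from the evaluation/ framework, which may differ in detail. MRR is the mean of `reciprocal_rank` over all queries.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant papers that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for pid in ranked_ids[:k] if pid in relevant_ids)
    return hits / len(relevant_ids)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant paper, or 0.0 if none is retrieved."""
    for rank, pid in enumerate(ranked_ids, start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, pid in enumerate(ranked_ids[:k], start=1)
        if pid in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def relative_improvement(new, baseline):
    """Relative gain over the baseline, e.g. 0.51 means +51%."""
    return (new - baseline) / baseline


# Toy example with made-up paper IDs (not benchmark data).
ranked = ["p3", "p1", "p7", "p2"]
relevant = {"p1", "p2"}
print(recall_at_k(ranked, relevant, k=2))     # one of two relevant papers in the top 2
print(reciprocal_rank(ranked, relevant))      # first relevant paper is at rank 2
print(round(relative_improvement(0.623, 0.412), 2))  # NDCG@10 row of the table
```

The last line reproduces the "Improvement" column: (0.623 - 0.412) / 0.412 rounds to 0.51, i.e. +51%.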

🚀 Quick Start

⏱️ Get running in 5 minutes

# 1. Start Milvus vector database
docker-compose -f docker-compose.milvus.yml up -d

# 2. Setup Python environment
conda env create -f environment.yml
conda activate arxplorer-env

# 3. Load demo dataset (1k papers)
python scripts/encode.py --data-file data/arxiv_1k.jsonl

# 4. Start searching!
python scripts/query.py
# Try: "attention is all you need"

✅ Success? You should see paper results with titles, authors, and scores.

📖 Need detailed instructions? See INSTRUCTIONS.md for:

Full setup guide with troubleshooting
AWS deployment (production-ready infrastructure)
Configuration options
API deployment

🏗️ How It Works

The retrieval pipeline for a sample query is as follows:

User Query: "original unet paper"
    │
    ├──► 1. 🧠 LLM Query Analyzer (Qwen3-4B)
    │       → Intent: specific_paper
    │       → Extracted: title="U-Net", authors=["Ronneberger"]
    │       → Rewrites: "seminal U-Net segmentation architecture"
    │
    ├──► 2. 🔍 Hybrid Search (Milvus)
    │       → Dense vectors (SPECTER2): semantic similarity
    │       → Sparse vectors (SPLADE): keyword matching
    │       → Multi-query: original + rewrites + extracted terms
    │       → Retrieves top 200 candidates
    │
    ├──► 3. 🎯 Intent-Based Boosting
    │       → Adjust scores based on query type
    │       → specific_paper: boost citations, ignore recency
    │
    ├──► 4. 🔗 Title/Author Matching
    │       → Fuzzy match extracted terms
    │       → Boost exact/near matches
    │
    ├──► 5. 🏆 Jina Reranking
    │       → Listwise comparison of top 50
    │       → Cross-document relevance
    │
    └──► 📊 Results: Top 10 papers ranked by fused scores
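
The candidate funnel in steps 2 to 5 (200 retrieved, top 50 reranked, top 10 returned) can be sketched as follows. This is an illustration only: `run_pipeline` and its callable parameters are hypothetical stand-ins, not the project's actual interfaces, and the final ordering here is simply the reranker's order rather than a full fusion of all scores.

```python
from typing import Callable, List, Tuple

Candidate = Tuple[str, float]  # (paper_id, score)


def run_pipeline(
    query: str,
    retrieve: Callable[[str, int], List[Candidate]],
    boost: Callable[[str, float], float],
    rerank: Callable[[str, List[Candidate]], List[Candidate]],
    retrieve_k: int = 200,
    rerank_k: int = 50,
    final_k: int = 10,
) -> List[Candidate]:
    """Run retrieval, boosting, and reranking with shrinking candidate sets."""
    # Step 2: hybrid search returns a broad candidate set.
    candidates = retrieve(query, retrieve_k)
    # Steps 3-4: adjust each score (intent boosts, title/author matches).
    boosted = [(pid, boost(pid, score)) for pid, score in candidates]
    boosted.sort(key=lambda c: c[1], reverse=True)
    # Step 5: only the head of the list goes to the (expensive) reranker.
    reranked = rerank(query, boosted[:rerank_k])
    return reranked[:final_k]
```

Restricting the reranker to the top 50 keeps the most expensive stage small, while the wider first-stage retrieval protects recall.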

Key Technologies:

Milvus: Open-source vector database
SPECTER2: Academic paper embeddings (768-dim dense)
SPLADE: Learned sparse representations (~30k-dim)
Qwen3-4B-AWQ: Quantized LLM for query analysis
Jina Reranker v3: State-of-the-art listwise reranking

🔍 See detailed architecture: INSTRUCTIONS.md#architecture

✨ Key Features

🧠 Intent-Aware Search

Detects 6 query types and adjusts ranking:

topical: General exploration ("machine learning papers")
sota: Recent state-of-the-art ("latest LLM research")
foundational: Seminal works ("foundational papers on CNNs")
comparison: Technique comparison ("transformer vs RNN")
method_lookup: Specific method ("how does BERT work")
specific_paper: Exact paper search ("original ResNet paper")

🔍 Multi-Vector Hybrid Search

Dense vectors: Capture semantic meaning
Sparse vectors: Preserve keyword signals
RRF Fusion: Combine rankings from multiple searches
Metadata filtering: Year, citations, categories
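
Reciprocal Rank Fusion (RRF) merges several ranked lists by summing 1 / (k + rank) for each paper across the lists. A minimal sketch follows; the smoothing constant k=60 is a commonly used default and is an assumption here, as is the alphabetical tie-break. ArXplorer's actual fusion runs inside its Milvus search layer and may be configured differently.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of paper IDs; higher fused score ranks first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, paper_id in enumerate(ranking, start=1):
            scores[paper_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID for deterministic output.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


# Toy example: one dense ranking and one sparse ranking.
dense = ["a", "b", "c"]
sparse = ["b", "d", "a"]
print(reciprocal_rank_fusion([dense, sparse]))  # ['b', 'a', 'd', 'c']
```

Because RRF uses only ranks, it does not need dense and sparse scores to be on the same scale, which is why it suits combining SPECTER2 and SPLADE results.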

🎯 Smart Query Processing

LLM extraction: Pulls titles, authors, years from natural language
Query expansion: Generates technical rewrites
Multi-query search: Uses original + expanded + extracted terms
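
As a rough sketch of how the "original + expanded + extracted terms" query set could be assembled from the LLM's output: the `QueryAnalysis` dataclass and `build_search_queries` function below are hypothetical names introduced for illustration, and the field layout is an assumption modeled on the pipeline example above (intent, title, authors, rewrites).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class QueryAnalysis:
    """Assumed shape of the LLM analyzer's output."""
    intent: str
    title: Optional[str] = None
    authors: List[str] = field(default_factory=list)
    rewrites: List[str] = field(default_factory=list)


def build_search_queries(original: str, analysis: QueryAnalysis) -> List[str]:
    """Return the original query, LLM rewrites, and extracted terms, de-duplicated."""
    candidates = [original, *analysis.rewrites]
    extracted = " ".join(part for part in [analysis.title, *analysis.authors] if part)
    if extracted:
        candidates.append(extracted)
    seen = set()
    queries = []
    for query in candidates:
        key = query.strip().lower()
        if key and key not in seen:  # skip blanks and case-insensitive repeats
            seen.add(key)
            queries.append(query.strip())
    return queries


analysis = QueryAnalysis(
    intent="specific_paper",
    title="U-Net",
    authors=["Ronneberger"],
    rewrites=["seminal U-Net segmentation architecture"],
)
print(build_search_queries("original unet paper", analysis))
```

Each resulting string would be searched separately, with the per-query rankings then combined by the fusion step.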

🏆 Advanced Reranking

Jina listwise reranker: Sees all candidates simultaneously
Intent boosting: Citation/recency weighting by query type
Fuzzy matching: Title/author similarity scoring
Score fusion: Weighted combination of all signals
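
The fuzzy-matching and score-fusion bullets can be illustrated with Python's standard-library `difflib`. This is a simplified sketch under stated assumptions: the project's real matcher, signal names, and weights are not shown in this README, so `title_similarity`, `fuse_scores`, and the example weights are hypothetical.

```python
from difflib import SequenceMatcher


def title_similarity(extracted: str, title: str) -> float:
    """Similarity in [0, 1] between an extracted title and a candidate title.

    Compares against both the full title and the part before the first
    colon, so "U-Net" matches "U-Net: Convolutional Networks ...".
    """
    query = extracted.lower().strip()
    full = title.lower().strip()
    if not query or not full:
        return 0.0
    main = full.split(":", 1)[0].strip()
    return max(
        SequenceMatcher(None, query, full).ratio(),
        SequenceMatcher(None, query, main).ratio(),
    )


def fuse_scores(signals: dict, weights: dict) -> float:
    """Weighted average of named signals; missing signals count as 0.0."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * signals.get(name, 0.0) for name, w in weights.items()) / total


# Illustrative weights only; signals are assumed to be normalized to [0, 1].
weights = {"rerank": 0.5, "retrieval": 0.3, "title": 0.2}
title_score = title_similarity(
    "U-Net", "U-Net: Convolutional Networks for Biomedical Image Segmentation"
)
print(fuse_scores({"rerank": 0.9, "retrieval": 0.6, "title": title_score}, weights))
```

Normalizing by the weight total keeps the fused score in the same 0 to 1 range as its inputs, which makes scores comparable across queries that use different weight sets.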

🚀 Production-Ready

96% test coverage: 163 passing tests
CI/CD: Automated testing and deployment
AWS infrastructure: Terraform IaC for GPU inference
API endpoint: FastAPI with OpenAPI docs
Backup/restore: S3 integration for Milvus data

📁 Project Structure

ArXplorer/
├── src/                     # Core library (96% coverage)
│   ├── retrieval/
│   │   ├── encoders/       # SPECTER2 + SPLADE
│   │   ├── searchers/      # Milvus hybrid search
│   │   ├── rerankers/      # Jina + CrossEncoder
│   │   └── query_rewriting/ # LLM query analysis
├── scripts/                 # CLI tools
│   ├── encode.py           # Build Milvus index
│   ├── query.py            # Interactive search
│   └── deploy_*.sh         # AWS deployment
├── tests/                   # 163 tests
├── evaluation/              # Benchmark framework
├── terraform/               # AWS infrastructure
├── data/                    # Datasets
│   └── arxiv_1k.jsonl      # Demo dataset
├── README.md               # This file (you are here)
├── INSTRUCTIONS.md         # Detailed setup guide
└── config.yaml             # Configuration

📖 Full documentation: INSTRUCTIONS.md

🆘 Getting Help

Detailed Setup: See INSTRUCTIONS.md for comprehensive setup, deployment, and configuration
Evaluation Framework: See evaluation/README.md for benchmarking details
Issues: Report bugs or request features via GitHub Issues

🤝 Contributing

We welcome contributions of all kinds, including:

🐛 Bug fixes
✨ New features
📝 Documentation improvements
🧪 Test coverage expansion

Before contributing:

Review INSTRUCTIONS.md to understand the architecture
Run tests: pytest tests/ --cov=src
Ensure test coverage stays at 96% or higher
Follow existing code style

📚 Citation

If you use ArXplorer in your research, please cite:

@software{arxplorer2024,
  title = {ArXplorer: Intent-Aware Academic Paper Retrieval},
  author = {ArXplorer Team},
  year = {2024},
  url = {https://github.com/UofT-CSC490-F2025/ArXplorer}
}

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Technologies:

Milvus - Open-source vector database
SPECTER2 - Academic paper embeddings
SPLADE - Sparse lexical expansion
Qwen3 - LLM for query analysis
Jina AI - Listwise reranking

Datasets:

arXiv Dataset via Kaggle
OpenAlex - Citation counts

Built with ❤️ for researchers who deserve better search.
