ArXplorer 🔍

Find Academic Papers Like a Researcher Thinks

Stop fighting with keyword-only search engines. ArXplorer understands what you mean, not just what you type.

# ✅ Works with natural queries
"papers about how neural networks learn internal structure"
"original transformer paper" 
"foundational work on medical image segmentation"

# ✅ Finds the right papers even when you don't know exact terms
Query: "attention is all you need" → Finds: "Attention Is All You Need" (Vaswani et al.)
Query: "original unet paper" → Finds: "U-Net: Convolutional Networks..." (Ronneberger et al.)
Query: "how do neural networks memorize" → Finds: "Understanding deep learning requires rethinking generalization" (Zhang et al.)

Quick Links: Quick Start | See It In Action | Why ArXplorer? | Detailed Docs


💡 The Problem We Solve

Traditional academic search engines are broken.

Try searching Google Scholar or arXiv for:

  • "papers about how neural networks learn internal structure" → ❌ Zero relevant results (no exact keyword matches)
  • "original transformer paper" → ❌ Finds papers about transformers, not the Transformer paper
  • "foundational work on medical image segmentation" → ❌ Requires you to know it's called "U-Net"

Why? They rely on lexical matching (keyword matching). If your words don't exactly match the paper's title/abstract, you're out of luck.

ArXplorer fixes this with:

  • Semantic understanding: Matches concepts, not just words
  • Intent detection: Knows if you want recent SOTA or foundational papers
  • Smart extraction: "original unet paper" → automatically searches for title="U-Net"
  • Hybrid search: Combines semantic vectors + keyword matching + metadata

🎬 See It In Action

[Placeholder for demo GIF/screenshot - to be added later]

Example Queries

Query: "attention is all you need"

✓ Found: "Attention Is All You Need" (Vaswani et al., 2017)
  Score: 0.95 | Citations: 89,234

Query: "original unet paper"

Intent: specific_paper
Extracted Title: U-Net
✓ Found: "U-Net: Convolutional Networks for Biomedical Image Segmentation" (Ronneberger et al., 2015)
  Score: 0.95 | Citations: 45,234

🚀 Why ArXplorer?

| Feature | Google Scholar | arXiv Search | ArXplorer |
|---|---|---|---|
| Semantic search | | | ✅ |
| Intent detection | | | ✅ (6 types) |
| Query expansion | | | ✅ (LLM-powered) |
| Hybrid ranking | | | ✅ (Dense + Sparse + Metadata) |
| Self-hostable | | | ✅ (Docker + AWS) |
| API access | ⚠️ Limited | ⚠️ Limited | ✅ (FastAPI) |

Plus:

  • 🎓 Academic-optimized: SPECTER2 embeddings trained on 750k papers
  • Fast: <200ms query latency with GPU reranking
  • 🔧 Production-ready: 96% test coverage, automated backups, CI/CD
  • 📈 Scalable: Handles 300k+ papers, extensible to millions

📊 Performance

On the project's academic IR benchmarks, ArXplorer's hybrid pipeline outperforms both the BM25 and dense-only baselines:

| Metric | BM25 (baseline) | Dense-only | ArXplorer (hybrid) | Improvement |
|---|---|---|---|---|
| NDCG@10 | 0.412 | 0.487 | 0.623 | +51% vs baseline |
| Recall@100 | 0.651 | 0.712 | 0.834 | +28% vs baseline |
| MRR | 0.398 | 0.471 | 0.589 | +48% vs baseline |

See evaluation/README.md for detailed benchmarking methodology.

Real-world impact:

  • ✅ Finds 83% of relevant papers in top 100 results (vs 65% for BM25)
  • ✅ Ranks the first relevant paper higher: mean reciprocal rank (MRR) of 0.59 (vs 0.40 for BM25)
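
For readers unfamiliar with the metrics above, here is a minimal sketch of how Recall@k, reciprocal rank (averaged over queries to get MRR), and NDCG@k are commonly computed. It assumes binary relevance judgments; the function names are illustrative and are not taken from the evaluation/ framework, which may define the metrics differently.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant ids that appear in the top-k ranked ids."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant id, or 0.0 if none was retrieved.

    MRR is the mean of this value over all evaluation queries.
    """
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """NDCG@k with binary gains (assumption: each relevant paper has gain 1)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    # Ideal ordering places every relevant paper at the top of the list.
    ideal = sum(
        1.0 / math.log2(rank + 1)
        for rank in range(1, min(len(relevant), k) + 1)
    )
    return dcg / ideal if ideal else 0.0
```

Note that an MRR of 0.5 means the first relevant paper sits at rank 2 on average in reciprocal terms; it is not the same as a top-10 hit rate.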

🚀 Quick Start

⏱️ Get running in 5 minutes

# 1. Start Milvus vector database
docker-compose -f docker-compose.milvus.yml up -d

# 2. Setup Python environment
conda env create -f environment.yml
conda activate arxplorer-env

# 3. Load demo dataset (1k papers)
python scripts/encode.py --data-file data/arxiv_1k.jsonl

# 4. Start searching!
python scripts/query.py
# Try: "attention is all you need"

✅ Success? You should see paper results with titles, authors, and scores.

📖 Need detailed instructions? See INSTRUCTIONS.md for:

  • Full setup guide with troubleshooting
  • AWS deployment (production-ready infrastructure)
  • Configuration options
  • API deployment

🏗️ How It Works

The retrieval pipeline for a sample query is as follows:

User Query: "original unet paper"
    │
    ├──► 1. 🧠 LLM Query Analyzer (Qwen3-4B)
    │       → Intent: specific_paper
    │       → Extracted: title="U-Net", authors=["Ronneberger"]
    │       → Rewrites: "seminal U-Net segmentation architecture"
    │
    ├──► 2. 🔍 Hybrid Search (Milvus)
    │       → Dense vectors (SPECTER2): semantic similarity
    │       → Sparse vectors (SPLADE): keyword matching
    │       → Multi-query: original + rewrites + extracted terms
    │       → Retrieves top 200 candidates
    │
    ├──► 3. 🎯 Intent-Based Boosting
    │       → Adjust scores based on query type
    │       → specific_paper: boost citations, ignore recency
    │
    ├──► 4. 🔗 Title/Author Matching
    │       → Fuzzy match extracted terms
    │       → Boost exact/near matches
    │
    ├──► 5. 🏆 Jina Reranking
    │       → Listwise comparison of top 50
    │       → Cross-document relevance
    │
    └──► 📊 Results: Top 10 papers ranked by fused scores

Key Technologies:

  • Milvus: Open-source vector database
  • SPECTER2: Academic paper embeddings (768-dim dense)
  • SPLADE: Learned sparse representations (~30k-dim)
  • Qwen3-4B-AWQ: Quantized LLM for query analysis
  • Jina Reranker v3: State-of-the-art listwise reranking

🔍 See detailed architecture: INSTRUCTIONS.md#architecture


✨ Key Features

🧠 Intent-Aware Search

Detects 6 query types and adjusts ranking:

  • topical: General exploration ("machine learning papers")
  • sota: Recent state-of-the-art ("latest LLM research")
  • foundational: Seminal works ("foundational papers on CNNs")
  • comparison: Technique comparison ("transformer vs RNN")
  • method_lookup: Specific method ("how does BERT work")
  • specific_paper: Exact paper search ("original ResNet paper")
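
One way to picture intent-dependent ranking is a per-intent table of boost weights applied to the base retrieval score. The sketch below is illustrative only: the weight values, the `BoostWeights` and `boosted_score` names, and the citation/recency formulas are assumptions, not ArXplorer's actual configuration. It reflects only what is stated here: `specific_paper` boosts citations and ignores recency, while `sota` favours recent work.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class BoostWeights:
    citations: float  # how strongly citation count lifts the score
    recency: float    # how strongly a recent publication year lifts the score


# Hypothetical weights, chosen only to illustrate the idea.
INTENT_WEIGHTS = {
    "topical": BoostWeights(citations=0.1, recency=0.1),
    "sota": BoostWeights(citations=0.0, recency=0.3),
    "foundational": BoostWeights(citations=0.3, recency=0.0),
    "comparison": BoostWeights(citations=0.1, recency=0.1),
    "method_lookup": BoostWeights(citations=0.2, recency=0.0),
    "specific_paper": BoostWeights(citations=0.3, recency=0.0),
}


def boosted_score(base, intent, citations, year, current_year=2025):
    """Scale a base relevance score by intent-specific metadata signals."""
    w = INTENT_WEIGHTS[intent]
    # Log-scaled citation signal, capped at 1.0 (assumed normalisation).
    citation_signal = min(math.log1p(citations) / math.log1p(100_000), 1.0)
    # Recency decays with paper age in years (assumed formula).
    age = max(current_year - year, 0)
    recency_signal = 1.0 / (1.0 + age)
    return base * (1.0 + w.citations * citation_signal + w.recency * recency_signal)
```

With these weights, two `specific_paper` candidates that differ only in publication year receive the same score, whereas under `sota` the newer one ranks higher.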

🔍 Multi-Vector Hybrid Search

  • Dense vectors: Capture semantic meaning
  • Sparse vectors: Preserve keyword signals
  • RRF Fusion: Combine rankings from multiple searches
  • Metadata filtering: Year, citations, categories
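
Reciprocal Rank Fusion (RRF) merges several ranked lists by summing `1 / (k + rank)` for each document across the lists. The following is a minimal standalone sketch of the technique, not ArXplorer's code; `rrf_fuse` is a hypothetical name, and `k=60` is the commonly used default rather than a value confirmed by this project.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of document ids with Reciprocal Rank Fusion.

    rankings: iterable of lists, each ordered best-first.
    Returns document ids ordered by fused score, best first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["p1", "p2", "p3"]   # e.g. results from the dense search
sparse_hits = ["p2", "p4", "p1"]  # e.g. results from the sparse search
print(rrf_fuse([dense_hits, sparse_hits]))  # ['p2', 'p1', 'p4', 'p3']
```

Because RRF uses only ranks, it needs no score normalisation between the dense and sparse searches, which is why it suits combining results whose raw scores are on different scales.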

🎯 Smart Query Processing

  • LLM extraction: Pulls titles, authors, years from natural language
  • Query expansion: Generates technical rewrites
  • Multi-query search: Uses original + expanded + extracted terms
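
As a rough illustration of multi-query search, the sketch below assembles the list of query strings from the original query, the LLM rewrites, and the extracted terms. The `QueryAnalysis` structure and `build_search_queries` function are hypothetical names for this example, and issuing extracted authors as separate queries is an assumption; the real analyzer output may be shaped differently.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class QueryAnalysis:
    """Assumed shape of the LLM analyzer's output (illustrative only)."""
    intent: str
    title: Optional[str] = None
    authors: List[str] = field(default_factory=list)
    rewrites: List[str] = field(default_factory=list)


def build_search_queries(original, analysis):
    """Return original + rewrites + extracted terms, de-duplicated in order."""
    candidates = [original, *analysis.rewrites]
    if analysis.title:
        candidates.append(analysis.title)
    candidates.extend(analysis.authors)

    seen = set()
    queries = []
    for query in candidates:
        key = query.strip().lower()
        if key and key not in seen:  # skip blanks and case-insensitive repeats
            seen.add(key)
            queries.append(query.strip())
    return queries


analysis = QueryAnalysis(
    intent="specific_paper",
    title="U-Net",
    authors=["Ronneberger"],
    rewrites=["seminal U-Net segmentation architecture"],
)
print(build_search_queries("original unet paper", analysis))
```

Each resulting string would then be searched separately, with the per-query rankings fused afterwards.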

🏆 Advanced Reranking

  • Jina listwise reranker: Sees all candidates simultaneously
  • Intent boosting: Citation/recency weighting by query type
  • Fuzzy matching: Title/author similarity scoring
  • Score fusion: Weighted combination of all signals

🚀 Production-Ready

  • 96% test coverage: 163 passing tests
  • CI/CD: Automated testing and deployment
  • AWS infrastructure: Terraform IaC for GPU inference
  • API endpoint: FastAPI with OpenAPI docs
  • Backup/restore: S3 integration for Milvus data

📁 Project Structure

ArXplorer/
├── src/                     # Core library (96% coverage)
│   ├── retrieval/
│   │   ├── encoders/       # SPECTER2 + SPLADE
│   │   ├── searchers/      # Milvus hybrid search
│   │   ├── rerankers/      # Jina + CrossEncoder
│   │   └── query_rewriting/ # LLM query analysis
├── scripts/                 # CLI tools
│   ├── encode.py           # Build Milvus index
│   ├── query.py            # Interactive search
│   └── deploy_*.sh         # AWS deployment
├── tests/                   # 163 tests
├── evaluation/              # Benchmark framework
├── terraform/               # AWS infrastructure
├── data/                    # Datasets
│   └── arxiv_1k.jsonl      # Demo dataset
├── README.md               # This file (you are here)
├── INSTRUCTIONS.md         # Detailed setup guide
└── config.yaml             # Configuration

📖 Full documentation: INSTRUCTIONS.md


🆘 Getting Help


🤝 Contributing

We welcome contributions of all kinds, including:

  • 🐛 Bug fixes
  • ✨ New features
  • 📝 Documentation improvements
  • 🧪 Test coverage expansion

Before contributing:

  1. Review INSTRUCTIONS.md to understand the architecture
  2. Run tests: pytest tests/ --cov=src
  3. Ensure test coverage stays at 96% or higher
  4. Follow existing code style

📚 Citation

If you use ArXplorer in your research, please cite:

@software{arxplorer2024,
  title = {ArXplorer: Intent-Aware Academic Paper Retrieval},
  author = {ArXplorer Team},
  year = {2024},
  url = {https://github.com/UofT-CSC490-F2025/ArXplorer}
}

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


🙏 Acknowledgments

Technologies:

  • Milvus - Open-source vector database
  • SPECTER2 - Academic paper embeddings
  • SPLADE - Sparse lexical expansion
  • Qwen3 - LLM for query analysis
  • Jina AI - Listwise reranking

Datasets:


Built with ❤️ for researchers who deserve better search.
